Friday, August 7, 2009

RSS feeds as sitemaps

One good way to optimize your blog or news site for Google is to submit a sitemap. You can do this through Google Webmaster Tools.
There are several sitemap formats, as described in the sitemap protocol. Probably the most natural and the easiest to submit is the RSS sitemap format. Since your site probably already publishes its content as an RSS feed that your users can subscribe to, reusing that same feed for the search engines takes little additional effort.

If your site publishes several different RSS feeds, probably one for each category that your site covers, you could of course submit all of them to Google. In my case I have over 3000 different RSS URLs, so I decided to first give it a try with a single feed: the one that publishes all my new articles regardless of their category.

Before submitting the RSS sitemap to a search engine, I recommend that you first check the validity of the feed with the W3C RSS validator. This can save you from wondering why Google rejects your sitemap. For example, one thing that lots of CMSs get wrong when generating a feed is the pubDate format. The W3C validator kept insisting that my BST (British Summer Time) pubDate was implausible, i.e. that it was in the future. I worked around this by changing the format from a time zone abbreviation to a numeric offset from GMT. Namely, I switched from
Fri, 07 Aug 2009 20:44:19 BST
to
Fri, 07 Aug 2009 20:44:19 +0100
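
If your CMS lets you control how pubDate is generated, the numeric-offset form can be produced directly. Here is a minimal Python sketch, assuming you know the local UTC offset in hours. The helper name rss_pub_date and its parameters are made up for illustration and are not part of any CMS.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def rss_pub_date(local_time: datetime, utc_offset_hours: int) -> str:
    """Format a naive local time as an RFC 822 style date with a numeric offset.

    Hypothetical helper: the offset is passed in explicitly instead of
    being looked up from a time zone database.
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    # Attaching a fixed-offset tzinfo makes format_datetime emit "+0100"
    # rather than a zone abbreviation such as "BST".
    return format_datetime(local_time.replace(tzinfo=tz))


# British Summer Time is one hour ahead of GMT.
print(rss_pub_date(datetime(2009, 8, 7, 20, 44, 19), 1))
```

Passing the offset explicitly keeps the sketch free of any time zone database. A real site would more likely derive it from its configured time zone.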

Another peculiarity is that the W3C validator recommends inserting an extra element in the channel section of your RSS feed. When your feed satisfies this recommendation, the W3C validator suggests placing a Valid RSS image on your site, so I assume it is perfectly happy with the feed. And here comes the discrepancy between W3C and Google. In simple words: Google just doesn't like the element there. Google didn't mark my feed sitemap as containing errors, but it did give me a warning with the not very meaningful message "Invalid XML: too many tags". This misleadingly suggests that I have duplicate tags inside my feed, which was not the case. Removing the element fixed the issue in Google's eyes, and my sitemap turned green in Google Webmaster Tools. So it is up to you to decide whether to satisfy W3C or Google entirely. I personally chose the search engine.
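
If you cannot change the feed template itself, one option is to serve Google a filtered copy of the feed. The following is only a sketch using Python's standard xml.etree.ElementTree module. The function name drop_channel_children and the tag name extra-element are hypothetical placeholders, so substitute whichever element Google complains about in your feed.

```python
import xml.etree.ElementTree as ET


def drop_channel_children(rss_xml: str, qualified_tag: str) -> str:
    """Return the feed with every direct <channel> child named qualified_tag removed.

    For a namespaced element, ElementTree expects the tag in the form
    "{namespace-uri}localname".
    """
    root = ET.fromstring(rss_xml)
    channel = root.find("channel")
    if channel is None:
        # Not an RSS 2.0 style document; leave it untouched.
        return rss_xml
    for child in list(channel):  # copy the list, since we mutate channel
        if child.tag == qualified_tag:
            channel.remove(child)
    return ET.tostring(root, encoding="unicode")


# "extra-element" is a made-up tag used only to demonstrate the filtering.
feed = (
    '<rss version="2.0"><channel><title>Example</title>'
    "<extra-element/>"
    "<item><title>Post</title></item></channel></rss>"
)
cleaned = drop_channel_children(feed, "extra-element")
print(cleaned)
```

Only direct children of channel are removed, so elements inside each item are left alone. Keep in mind that ElementTree may rewrite namespace prefixes when it serializes the feed again, so run the filtered feed through the validator once more before submitting it.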

To finish this post: don't expect Googlebot to start crawling and indexing your pages right after it reads your sitemap(s). This will likely take some time, so don't push it.
