Thursday, 27 December 2012

How To Import/Index Large Blogs On Blogger?

Sharing some of the experience behind creating the Find-Word blog series.

In order to create those blogs, I first tried the newer Google APIs, but unfortunately they are not well suited to standalone Java applications. The authentication mechanism is complete overkill compared to the actual needs, and it does not seem to have been tested properly with standalone applications. I then moved to Google's older Data APIs, and they worked like a charm. The only issue was the limit on the number of authorized posts per day (50, after which a captcha is displayed). I finally decided to reverse engineer Blogger's import/export functionality.

About importing posts:
  • One can split posts into multiple import files, and load them one by one.
  • Blogger won't be able to load files bigger than 30 MB (it will crash).
  • Once loaded, the posts have the imported status.
  • One can publish at most 1800 posts on the first day, then about 500 the next day.
  • Crossing those limits flags your blog as potential spam.
  • Publishing is then blocked and a review request must be submitted, or else the blog will be deleted.
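As a sketch of the splitting step: the Blogger export file is a standard Atom feed in which each post is an <entry> element, so one can copy the feed-level metadata into each chunk and distribute the entries across several files that each stay well under the 30 MB limit. The helper below is my own illustration (the function name and the entries-per-file threshold are assumptions, not anything Blogger prescribes):

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)  # serialize without an ns0: prefix


def split_export(export_path, out_prefix, entries_per_file=1000):
    """Split a Blogger Atom export into several smaller import files.

    Each output file keeps the feed-level metadata (title, author, ...)
    and carries at most `entries_per_file` <entry> elements.
    """
    tree = ET.parse(export_path)
    feed = tree.getroot()
    entries = feed.findall(f"{{{ATOM}}}entry")
    for e in entries:  # strip all entries; `feed` becomes the template
        feed.remove(e)

    paths = []
    for i in range(0, len(entries), entries_per_file):
        chunk = entries[i:i + entries_per_file]
        for e in chunk:
            feed.append(e)
        path = f"{out_prefix}-{i // entries_per_file + 1}.xml"
        ET.ElementTree(feed).write(path, encoding="utf-8",
                                   xml_declaration=True)
        for e in chunk:  # reset the template for the next chunk
            feed.remove(e)
        paths.append(path)
    return paths
```

Each resulting file can then be loaded through Blogger's regular import dialog, one by one. If posts are very long, lower `entries_per_file` until every chunk is comfortably below 30 MB.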
About indexing:
  • Of course, it is not possible to force Google to index all blog posts. One can only have Google crawl your pages. Google will then decide what goes into the Index.
  • The default Blogger sitemap only exposes the latest 26 posts. So unless there are external or internal links to the remaining posts, Google will not be able to reach them: they will be neither crawled nor indexed.
  • The solution is to add Atom sitemaps (for example, /atom.xml?redirect=false&start-index=1&max-results=500 for the first 500 posts) using Google Webmaster Tools' Optimization > Sitemaps page. Add multiple Atom sitemaps if necessary:
[Screenshot: Atom sitemaps submitted in the Sitemaps page]
  • If your blog posts are lengthy and Google cannot process your sitemaps, configure your blog to serve short blog feeds in Settings > Other. If your blog does not allow feeds at all, Google will not be able to process the sitemaps either!
  • The Sitemaps page tells you whether the sitemaps have been processed successfully.
  • Google Webmaster Tools has a page indicating which links have been indexed, under Health > Index Status. It is updated only about once per week, which is too slow for useful feedback.
  • The solution is to use the site:myblog.blogspot.com query on the Google search page to get an estimate of how many of your blog's pages Google knows about. It also tells you whether crawling is succeeding or not.
  • Google Webmaster Tools has a Crawl Stats page which tells you (with a two-day delay) how many pages have been crawled per day. This is also a good indicator of the reachability of posts.
  • Be patient!
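Since the paged feed URLs all follow the same pattern, they can be generated mechanically for any number of posts; here is a minimal sketch (the helper name is my own, and the page size of 500 matches the example URL above):

```python
def atom_sitemaps(total_posts, page_size=500):
    """Relative sitemap URLs to submit in Webmaster Tools, one per
    `page_size` posts, using Blogger's paged Atom feed."""
    urls = []
    start = 1
    while start <= total_posts:
        urls.append(
            f"/atom.xml?redirect=false"
            f"&start-index={start}&max-results={page_size}"
        )
        start += page_size
    return urls
```

For a blog with 1800 posts this yields four URLs (start-index 1, 501, 1001, and 1501); prefix each with your blog's address when submitting it as a sitemap.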
That's it !!!