Google Tips for Getting Crawled Faster
For your site to be found easily in a search engine, it first needs to be crawled and indexed by a web crawler, or spider. A few steps can make this process as easy as possible for the search engine and, in turn, for your webmaster. Google recently posted a presentation with tips on how to optimize crawling and indexing. Since Google is the dominant search engine, it makes sense to listen to what it has to say about how to improve your site’s crawlability and what to stay away from.
Here are some of the highlights from the presentation, in the words of Google Webmaster Trends Analyst Susan Moskwa. “Google has a finite number of resources, so when faced with the nearly-infinite quantity of content that’s available online, Googlebot is only able to find and crawl a percentage of that content. Then, of the content we’ve crawled, we’re only able to index a portion. URLs are like the bridges between your website and a search engine’s crawler: crawlers need to be able to find and cross those bridges (i.e., find and crawl your URLs) in order to get to your site’s content,” Moskwa says. “If your URLs are complicated or redundant, crawlers are going to spend time tracing and retracing their steps; if your URLs are organized and lead directly to distinct content, crawlers can spend their time accessing your content rather than crawling through empty pages, or crawling the same content over and over via different URLs.”
To get crawled faster by Google, start by removing user-specific details from your URLs: take the URL parameters that don’t change the content of the page and store them in a cookie instead. This speeds up crawling by reducing the number of URLs that point to the same content. Google also says that infinite spaces waste time and bandwidth for everyone; keep this in mind if you have calendars that link to an unlimited number of past or future dates, each with its own unique URL.
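As a rough illustration of the first tip, here is a minimal Python sketch that strips parameters from a URL when they don’t change the page content. The parameter names in USER_SPECIFIC_PARAMS and the example.com URLs are assumptions made up for this example; the presentation does not give a list, so substitute whatever your own site appends per visitor.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameter names that identify the visitor, not the content.
# Replace these with the ones your own site actually uses.
USER_SPECIFIC_PARAMS = {"sessionid", "sid", "ref"}


def strip_user_params(url, drop=USER_SPECIFIC_PARAMS):
    """Return the URL with user-specific query parameters removed."""
    parts = urlsplit(url)
    # Keep only the parameters that affect what the page shows.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in drop
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    print(strip_user_params("http://www.example.com/shoes?color=red&sessionid=abc123"))
```

With a helper like this, two visitors who arrive with different session values end up on the same URL, so a crawler sees one page rather than many copies. The stripped values would still need to be stored somewhere such as a cookie, which this sketch does not do.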
Be sure to tell Google to ignore pages it can’t crawl. These include pages that require users to perform actions crawlers can’t perform themselves, such as login pages, contact forms and shopping carts. You can do this with the robots.txt file. Finally, avoid duplicate content. Google prefers one URL for each piece of content. Because of content management systems and similar tools, this isn’t always possible, which is why the canonical link element exists: it lets you specify the preferred URL for a particular piece of content.
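To make these last two tips concrete, the Python sketch below checks a sample robots.txt with the standard library’s urllib.robotparser and builds a canonical link element. The paths /login, /contact and /cart, the example.com URLs, and the helper names build_parser and canonical_link are all assumptions chosen for illustration; your own site’s paths will differ.

```python
from html import escape
from urllib.robotparser import RobotFileParser

# A sample robots.txt with hypothetical paths for pages a crawler cannot use.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login
Disallow: /contact
Disallow: /cart
"""


def build_parser(robots_txt):
    """Parse robots.txt text so its rules can be checked locally."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def canonical_link(preferred_url):
    """Build the canonical link element for a page's <head>."""
    return '<link rel="canonical" href="{}">'.format(escape(preferred_url, quote=True))


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for path in ("/login", "/cart", "/products/shoes"):
        allowed = rules.can_fetch("Googlebot", "http://www.example.com" + path)
        print(path, "crawlable" if allowed else "blocked")
    print(canonical_link("http://www.example.com/products/shoes"))
```

Checking the rules locally like this is one way to confirm that a robots.txt blocks what you intend before you publish it. The canonical link element goes in the head of every duplicate page and points at the one URL you prefer.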
The specifics can be viewed in the slideshow on Google’s Webmaster Blog.




