
Crawling and Spidering the Web

If you imagine the internet as a series of pages, each of which is linked to from other pages and in turn links on to further pages, you can see the problem that search engines face in creating their databases. They build their sets of results by crawling the web. Crawling is the process of following links, both from one page to another on the same website and from one website to another, and then gathering the contents of these pages for storage in the search engine's database.

Crawling the internet can start from a single point (for example a popular website containing lots of links, such as DMOZ) or from an existing, older index of websites. The crawler (also known as a web robot, bot or web spider) is a software program that downloads web content (mainly web pages but also, in some cases, images, documents and other files) and then follows the links within those pages to download the linked content, which can be on the same site or on a different website. The crawl continues until it reaches a logical stopping point, such as a dead end with no further links or a pre-set limit on the number of levels within the website's link structure. It goes without saying that if a website is not linked to from any other page on the internet, the bot will be unable to locate it and include it in the search engine's database. Conversely, if a page has just a single link pointing to it, there is a chance that it will be found. Therefore, if a website is new and has no links from other sites, it has to be submitted to each of the search engines for crawling, although it is better to have links from other sites.
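As a rough illustration of how this works, the sketch below (in Python, assuming the third-party requests and beautifulsoup4 libraries are installed; the starting URL is only a placeholder) downloads a page, extracts its links and follows them breadth-first up to a fixed number of levels :-

# A minimal crawler sketch: fetch a page, pull out its links and follow
# them up to a fixed depth. Real search engine crawlers are far more
# sophisticated (politeness, robots.txt, duplicate detection, scheduling),
# but the basic loop is the same.
from collections import deque
from urllib.parse import urljoin

import requests                     # third-party: pip install requests
from bs4 import BeautifulSoup       # third-party: pip install beautifulsoup4

def crawl(start_url, max_depth=3):
    seen = {start_url}
    queue = deque([(start_url, 0)])          # (url, clicks from the start page)
    while queue:
        url, depth = queue.popleft()
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                          # dead link: a natural stopping point
        print(depth, url)
        if depth >= max_depth:
            continue                          # pre-set level limit reached
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"]).split("#")[0]   # resolve relative links
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))

# Example with a placeholder address: crawl("http://www.example.com/", max_depth=2)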

An efficient webbot can crawl many websites at the same time, collecting billions of pages (Google used to claim that it searched 8,058,044,651 web pages) as frequently as it can. Some sites, such as news and media sites, are crawled more frequently (possibly every hour or so) by the major search engines such as Google, so that up-to-date news and content appear in their search results, whilst other "less important" sites may be spidered on a daily, weekly or even monthly basis.

Although the webbot visits your site to collect as many pages as possible, a well-written one should not flood a single website with a high volume of simultaneous requests, but should spread the crawl over a period of time so that the website does not crash from trying to serve too many pages at once.
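A very simple way of building that sort of politeness into the sketch above is to pause between successive requests to the same host (the one-second delay here is purely illustrative, not a figure published by any search engine) :-

# Pause between successive requests to the same host so the crawl is
# spread over time rather than arriving as a burst of simultaneous hits.
import time
from urllib.parse import urlparse

last_request = {}   # host -> time of the previous request to that host

def polite_wait(url, delay=1.0):
    host = urlparse(url).netloc
    elapsed = time.time() - last_request.get(host, 0)
    if elapsed < delay:
        time.sleep(delay - elapsed)
    last_request[host] = time.time()

# Call polite_wait(url) immediately before each requests.get(url) in the crawl loop above.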

Usually search engines crawl only a few (three or four) levels deep from the home page of a website. Note that this is not the number of directory levels the page sits in but the number of clicks needed to reach the page from the home page. If you have a large site, one way of ensuring that all of your pages are crawled is to build a site map that contains a link to every page on your website and to link to this site map from your home page; the crawlers will then find the site map on their next visit, and every page stays within the generally accepted three-click rule (Home Page - click - Site Map - click - any page on the site = 2 clicks).
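Because "levels" here means clicks rather than directories, you can check your own site by recording the depth at which a crawl first reaches each page; any page first reached more than two or three clicks from the home page is a candidate for a link from the site map. A small variation on the earlier sketch, again purely as an illustration :-

# Report how many clicks each page is from the home page, using the same
# breadth-first crawl as before but restricted to a single site.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def click_depths(home_url):
    site = urlparse(home_url).netloc
    depths = {home_url: 0}
    queue = deque([home_url])
    while queue:
        url = queue.popleft()
        try:
            soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        except requests.RequestException:
            continue
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"]).split("#")[0]
            if urlparse(link).netloc == site and link not in depths:
                depths[link] = depths[url] + 1   # one more click than the linking page
                queue.append(link)
    return depths

# Pages deeper than three clicks are the ones a linked site map would bring
# within easy reach of the crawlers, e.g.:
# for url, depth in click_depths("http://www.example.com/").items():
#     if depth > 3: print(depth, url)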

You may see the term 'deep crawl' used on SEO (search engine optimisation) forums and websites. It means that the crawler or spider can index pages that are many levels deep within a site's link structure, not that it is better at reading pages in deep sub-directories. Google and MSN are examples of deep crawlers.

Controlling Crawlers

Well-behaved crawlers or web robots follow the guidelines specified for them by the website owner using the robots exclusion protocol (robots.txt). The robots.txt file can specify the files or folders that the owner does not want the crawler to index in its database. Details of the format of the robots.txt file are available on many websites, including the Search Engine World site.
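As an illustration of the format, the sketch below embeds a small robots.txt (the disallowed paths and bot name are made-up examples) and uses Python's standard urllib.robotparser module, which is the kind of check a well-behaved crawler might make before fetching a URL :-

# Parse a small robots.txt and check whether a given crawler is allowed
# to fetch particular URLs. The paths and the "BadBot" name are examples only.
from urllib.robotparser import RobotFileParser

robots_txt = """
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/

User-agent: BadBot
Disallow: /
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(robots_txt)

print(parser.can_fetch("Googlebot", "http://www.example.com/private/notes.html"))  # False
print(parser.can_fetch("Googlebot", "http://www.example.com/index.html"))          # True
print(parser.can_fetch("BadBot", "http://www.example.com/index.html"))             # False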

Spotting Crawlers

If you have access to your website's log files you can easily spot the majority of crawlers as they access your website. For example, extracts from a recent website log contained the following (the crawler is identified by the user-agent string at the end of each line; a short script for picking these entries out of a log is sketched after the extracts) :-

66.249.64.13 - - [06/Mar/2005:05:21:04 +0000] "GET /search-engine-article1.html HTTP/1.0" 200 16056 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
66.249.65.13 - - [13/Jul/2006:12:13:31 +0100] "GET /dell-desktop.html HTTP/1.1" 200 15418 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.65.13 - - [13/Jul/2006:12:15:27 +0100] "GET /credit-control.html HTTP/1.1" 200 19423 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

and later :-

64.4.8.116 - - [13/Jul/2006:13:08:30 +0100] "GET /company-credit-checking.html HTTP/1.0" 200 30996 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
64.4.8.116 - - [13/Jul/2006:15:17:13 +0100] "GET /search-positions.html HTTP/1.0" 200 18252 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
65.214.44.192 - - [13/Jul/2006:16:09:58 +0100] "GET /dwodp/index.php/Regional/Europe/United_Kingdom/Business_and_Economy/ HTTP/1.0" 200 23722 "-" "Mozilla/2.0 (compatible; Ask Jeeves/Teoma; +http://sp.ask.com/docs/about/tech_crawling.html)"

and later still :-
64.4.8.116 - - [13/Jul/2006:21:15:38 +0100] "GET /business-insurance.html HTTP/1.0" 200 21131 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
64.4.8.116 - - [13/Jul/2006:21:15:41 +0100] "GET /business-types.html HTTP/1.0" 200 19011 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
64.4.8.116 - - [13/Jul/2006:21:15:52 +0100] "GET /working-at-home.html HTTP/1.0" 200 18707 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
64.4.8.116 - - [13/Jul/2006:21:15:53 +0100] "GET /accountants.php HTTP/1.0" 200 16297 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
64.4.8.116 - - [13/Jul/2006:21:15:56 +0100] "GET /dell-handhelds.htm HTTP/1.0" 200 14925 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
64.4.8.116 - - [13/Jul/2006:21:15:56 +0100] "GET /dell-pdas.htm HTTP/1.0" 200 14480 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
64.4.8.116 - - [13/Jul/2006:21:16:06 +0100] "GET /small-business-bank-accounts.html HTTP/1.0" 200 19241 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
72.30.98.136 - - [13/Jul/2006:21:16:57 +0100] "GET /dwodp/index.php/Regional/Europe/United_Kingdom/Business_and_Economy/ HTTP/1.0" 200 23904 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
64.4.8.116 - - [13/Jul/2006:21:17:11 +0100] "GET /office-types.html HTTP/1.0" 200 23444 "-" msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
64.4.8.116 - - [13/Jul/2006:21:17:11 +0100] "GET /ink-cartridges.html HTTP/1.0" 200 15748 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
64.4.8.116 - - [13/Jul/2006:21:17:12 +0100] "GET /credit-control.html HTTP/1.0" 200 19355 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
64.4.8.116 - - [13/Jul/2006:21:17:13 +0100] "GET /document-data-entry-scanning.html HTTP/1.0" 200 16242 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
64.4.8.116 - - [13/Jul/2006:21:17:26 +0100] "GET /company-credit-checking.html HTTP/1.0" 200 30996 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
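
A short sketch for pulling entries like these out of a combined-format access log; the file name and the list of bot names are examples only :-

# Scan an Apache-style access log and print the lines whose user-agent
# string mentions one of the well-known crawlers.
import re

BOT_NAMES = ("Googlebot", "msnbot", "Slurp", "Ask Jeeves/Teoma")   # example list

# In the combined log format the user-agent is the final quoted field.
user_agent = re.compile(r'"([^"]*)"\s*$')

with open("access.log") as log:                 # example file name
    for line in log:
        match = user_agent.search(line)
        if match and any(bot in match.group(1) for bot in BOT_NAMES):
            print(line.rstrip())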

