Crawling and Spidering the Web
If you imagine the internet as a series of pages, each of which is linked
to from other pages and in turn links on to further pages, you can see the
problem that search engines have in creating their databases. They manage
to build their sets of results by crawling the web. Crawling is the method
of following links on the web, both from one page to another on the same
website and from one site to another, and then gathering the contents of
these pages for storage in the search engine's databases.
Crawling the internet can start from a single point (such as a popular
website containing lots of links, for example DMOZ) or from an existing,
older index of websites. The crawler (also known as a web robot, bot, or web
spider) is a software program that can download web content (mainly web pages
but also, in some cases, images, documents and other files) and then follow
links within these web pages to download the linked contents. The linked
contents can be on the same site or on a different website. The crawling
continues until it reaches a logical stop, such as a dead end with no links
or a pre-set number of levels inside the website's link structure. It goes
without saying that if a website is not linked from any other page on the
internet the bot will be unable to locate it and include it in the search
engine database. Conversely, if you have a page with just a single link into
it there is a chance that it will be found. Therefore, if the website is new
and has no links from other sites, it has to be submitted to each of the
search engines for crawling, although it is better to have links from other
sites.
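The crawl described above can be sketched in a few lines of Python. This is
only an illustrative sketch, not how any particular search engine works: the
miniature "web" in PAGES, the page paths and the crawl function are all
invented for the example, and pages are read from a dictionary rather than
downloaded. It follows links breadth-first, stops at dead ends or at a pre-set
number of levels, and shows that a page nobody links to is never found.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(pages, start, max_depth):
    """Breadth-first crawl of `pages` (URL -> HTML); returns {url: level}."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        depth = seen[url]
        html = pages.get(url)
        if html is None or depth == max_depth:
            continue  # broken link, or the pre-set level limit was reached
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen[link] = depth + 1
                queue.append(link)
    return seen


# Hypothetical miniature "web": /orphan exists but nothing links to it.
PAGES = {
    "/": '<a href="/about">About</a> <a href="/products">Products</a>',
    "/about": '<a href="/">Home</a>',
    "/products": '<a href="/products/widget">Widget</a>',
    "/products/widget": "No links here, so this is a dead end.",
    "/orphan": '<a href="/">Home</a>',
}

found = crawl(PAGES, "/", max_depth=3)
print(sorted(found))          # the pages the crawler discovered
print("/orphan" in found)     # the unlinked page is never discovered
```

A real crawler would also have to resolve relative links, cope with
redirects and errors, and fetch the pages over HTTP; those parts are left
out here to keep the link-following logic visible.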
Because a webbot can crawl many websites at the same time, it can collect
billions of web pages (Google used to claim that it searches 8,058,044,651
web pages) and revisit them as frequently as it can. Some sites, such as news
and media sites, are crawled more frequently (possibly every hour or so) by
advanced search engines like Google in order to deliver updated news and
content in their search results, whilst other "less important" sites may be
spidered on a daily, weekly or even monthly basis.
Although the webbot is visiting your site to get as many pages as possible,
if it is well written it should not flood a single website with a high volume
of requests at the same time, but should spread the crawling over a period of
time so that the website does not crash from trying to serve too many pages
at once.
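One simple way a crawler might spread its requests is to keep a minimum gap
between two requests to the same host. The sketch below is an assumption about
how that could be done, not a description of any real crawler: the schedule
function, the two-second delay and the example.com / example.org URLs are all
made up for illustration. It only works out when each request could be sent
and does not download anything.

```python
from urllib.parse import urlsplit


def schedule(urls, delay_seconds):
    """Return (start_time, url) pairs so that requests to the same host
    are at least `delay_seconds` apart. Times are seconds from the start."""
    next_free = {}  # host -> earliest time the next request may be sent
    plan = []
    for url in urls:
        host = urlsplit(url).netloc
        start = next_free.get(host, 0.0)
        plan.append((start, url))
        next_free[host] = start + delay_seconds
    # Sort by start time; requests to different hosts can share a time slot.
    return sorted(plan, key=lambda item: item[0])


# Placeholder URLs on two hypothetical hosts.
urls = [
    "http://example.com/a.html",
    "http://example.com/b.html",
    "http://example.com/c.html",
    "http://example.org/x.html",
]

plan = schedule(urls, delay_seconds=2.0)
for start, url in plan:
    print(f"{start:4.1f}s  {url}")
```

The three pages on the first host are spaced two seconds apart, while the
page on the second host can be requested straight away because it puts no
extra load on the first server.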
Usually search engines crawl only a few (three or four) levels deep from the
homepage of a website. Note that this is not the number of directory levels
that the page sits in but the 'number of clicks' needed to get to the page
from the home page. One way of ensuring that all of your pages are crawled if
you have a large site is to build a site map that contains a link to every
page on your website and link to this site map from your home page. The
crawlers will find the site map on their next visit, and every page is then
within the generally accepted 3 click rule (Home Page - Click - Site Map -
Click - Any page on the site = 2 clicks).
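The effect of a site map on the 'number of clicks' can be checked with a
small breadth-first search over the site's links. The site below is a made-up
worst case (each page links only to the next one), and the page names and the
click_depths function are assumptions for the sake of the example.

```python
from collections import deque


def click_depths(links, home):
    """Fewest clicks from `home` to each reachable page (breadth-first)."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical site arranged as a chain: each page links only to the next.
site = {
    "home": ["page1"],
    "page1": ["page2"],
    "page2": ["page3"],
    "page3": ["page4"],
    "page4": ["page5"],
}
before = click_depths(site, "home")

# Add a site map that links to every page, and link to it from the home page.
with_map = dict(site)
with_map["sitemap"] = ["page1", "page2", "page3", "page4", "page5"]
with_map["home"] = ["page1", "sitemap"]
after = click_depths(with_map, "home")

print(before["page5"], after["page5"])  # prints: 5 2
```

Without the site map the last page is five clicks from the home page, which
is deeper than the three or four levels mentioned above; with the site map no
page is more than two clicks away.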
You may see the term 'deep crawl' used in SEO (search engine optimisation)
forums and websites. The term is used to denote that the crawler or spider
can index pages that are many levels (clicks) deep, not that the spider is
more capable of reading pages in deep sub-directories. Google and MSN are
examples of deep crawlers.
Controlling Crawlers
Well-behaved crawlers or web robots follow guidelines specified for them by
the website owner using the robots exclusion protocol (robots.txt). The
robots.txt file can specify the files or folders that the owner does not want
the crawler to index in its database. Details of the format of the robots.txt
file are available on many websites, including the Search Engine World site.
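As a rough illustration of how a crawler might interpret these guidelines,
the sketch below feeds a small robots.txt to Python's standard
urllib.robotparser module. The rules, the paths and the www.example.com
domain are invented for the example; only the user-agent names Googlebot and
msnbot are taken from real crawlers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the paths are invented for illustration.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /

User-agent: *
Disallow: /private/
Disallow: /drafts/report.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

SITE = "http://www.example.com"  # placeholder domain


def allowed(agent, path):
    """True if `agent` may fetch `path` according to the rules above."""
    return parser.can_fetch(agent, SITE + path)


print(allowed("Googlebot", "/index.html"))             # not excluded
print(allowed("Googlebot", "/private/accounts.html"))  # inside /private/
print(allowed("msnbot", "/index.html"))                # msnbot is shut out
```

Here every crawler is asked to stay out of the /private/ folder and one
draft file, and msnbot is asked not to crawl the site at all. The file is only
a request: it keeps out crawlers that choose to honour it.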
Spotting Crawlers
If you have access to your website's log files you can easily spot the
majority of crawlers as they access your website. For example, extracts from
a recent website log contained the following (the crawler is identified by
the user-agent string at the end of each line) :-
66.249.64.13 - - [06/Mar/2005:05:21:04 +0000] "GET /search-engine-article1.html HTTP/1.0" 200 16056 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
66.249.65.13 - - [13/Jul/2006:12:13:31 +0100] "GET /dell-desktop.html HTTP/1.1" 200 15418 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.65.13 - - [13/Jul/2006:12:15:27 +0100] "GET /credit-control.html HTTP/1.1" 200 19423 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
and later :-
64.4.8.116 - - [13/Jul/2006:13:08:30 +0100] "GET /company-credit-checking.html HTTP/1.0" 200 30996 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
64.4.8.116 - - [13/Jul/2006:15:17:13 +0100] "GET /search-positions.html HTTP/1.0" 200 18252 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
65.214.44.192 - - [13/Jul/2006:16:09:58 +0100] "GET /dwodp/index.php/Regional/Europe/United_Kingdom/Business_and_Economy/ HTTP/1.0" 200 23722 "-" "Mozilla/2.0 (compatible; Ask Jeeves/Teoma; +http://sp.ask.com/docs/about/tech_crawling.html)"
and later still :-
64.4.8.116 - - [13/Jul/2006:21:15:38 +0100] "GET /business-insurance.html HTTP/1.0" 200 21131 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
64.4.8.116 - - [13/Jul/2006:21:15:41 +0100] "GET /business-types.html HTTP/1.0" 200 19011 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
64.4.8.116 - - [13/Jul/2006:21:15:52 +0100] "GET /working-at-home.html HTTP/1.0" 200 18707 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
64.4.8.116 - - [13/Jul/2006:21:15:53 +0100] "GET /accountants.php HTTP/1.0" 200 16297 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
64.4.8.116 - - [13/Jul/2006:21:15:56 +0100] "GET /dell-handhelds.htm HTTP/1.0" 200 14925 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
64.4.8.116 - - [13/Jul/2006:21:15:56 +0100] "GET /dell-pdas.htm HTTP/1.0" 200 14480 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
64.4.8.116 - - [13/Jul/2006:21:16:06 +0100] "GET /small-business-bank-accounts.html HTTP/1.0" 200 19241 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
72.30.98.136 - - [13/Jul/2006:21:16:57 +0100] "GET /dwodp/index.php/Regional/Europe/United_Kingdom/Business_and_Economy/ HTTP/1.0" 200 23904 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
64.4.8.116 - - [13/Jul/2006:21:17:11 +0100] "GET /office-types.html HTTP/1.0" 200 23444 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
64.4.8.116 - - [13/Jul/2006:21:17:11 +0100] "GET /ink-cartridges.html HTTP/1.0" 200 15748 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
64.4.8.116 - - [13/Jul/2006:21:17:12 +0100] "GET /credit-control.html HTTP/1.0" 200 19355 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
64.4.8.116 - - [13/Jul/2006:21:17:13 +0100] "GET /document-data-entry-scanning.html HTTP/1.0" 200 16242 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
64.4.8.116 - - [13/Jul/2006:21:17:26 +0100] "GET /company-credit-checking.html HTTP/1.0" 200 30996 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm
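Spotting crawlers by eye soon becomes tedious on a busy site, so the same
check can be scripted. The sketch below assumes the log uses the layout seen
in the extracts above (IP address, date, request, status, size, referrer, user
agent) and looks for the four crawler names that appear in them. The regular
expression, the CRAWLERS table and the identify_crawler function are
illustrative assumptions, and the sample lines are copied from the extracts.

```python
import re
from collections import Counter

# One log line: ip, identity, user, [time], "request", status, size,
# "referrer", "user agent".
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Substrings taken from the user-agent strings in the extracts above.
CRAWLERS = {
    "Googlebot": "Google",
    "msnbot": "MSN",
    "Yahoo! Slurp": "Yahoo",
    "Ask Jeeves": "Ask",
}


def identify_crawler(line):
    """Return the search engine name for a log line, or None."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None  # malformed or truncated line
    agent = match.group("agent")
    for marker, name in CRAWLERS.items():
        if marker in agent:
            return name
    return None  # an ordinary visitor, or a crawler not in the table


SAMPLE = [
    '66.249.64.13 - - [06/Mar/2005:05:21:04 +0000] "GET /search-engine-article1.html HTTP/1.0" 200 16056 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '64.4.8.116 - - [13/Jul/2006:13:08:30 +0100] "GET /company-credit-checking.html HTTP/1.0" 200 30996 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
    '65.214.44.192 - - [13/Jul/2006:16:09:58 +0100] "GET /dwodp/index.php/Regional/Europe/United_Kingdom/Business_and_Economy/ HTTP/1.0" 200 23722 "-" "Mozilla/2.0 (compatible; Ask Jeeves/Teoma; +http://sp.ask.com/docs/about/tech_crawling.html)"',
    '64.4.8.116 - - [13/Jul/2006:21:15:38 +0100] "GET /business-insurance.html HTTP/1.0" 200 21131 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
    '72.30.98.136 - - [13/Jul/2006:21:16:57 +0100] "GET /dwodp/index.php/Regional/Europe/United_Kingdom/Business_and_Economy/ HTTP/1.0" 200 23904 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"',
]

names = [identify_crawler(line) for line in SAMPLE]
counts = Counter(name for name in names if name is not None)
print(counts)
```

Matching on the user-agent string is only a first pass, since any visitor can
send whatever user-agent it likes. Lines that do not fit the expected layout
(for example a line cut off before its closing quote) are skipped rather than
guessed at.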