robots.txt and search engine spiders
What To Do When A Search Engine or Robot Knocks At Your Door.
Do you know who is visiting your website at night in the dark shadows?
Search engines use programs called spiders, crawlers or robots to visit your site and gather information from your web pages. These robots leave evidence behind in your access logs, just as human visitors do. If you know what to look for, you can tell when a spider has come to call, which saves you from worrying that you haven't been visited. You can tell exactly what a robot has recorded or failed to record. You can also spot robots that are making a large number of requests, which can inflate your page impression statistics or even burden your server.
How do you identify a spider?
Those from the major search engines can sometimes be identified from their host names. These often incorporate part of the search engine's name or the company's name. For example, one of AltaVista's host names is scooter.pa-x.dec.com.
Below is a chart with some of the names search engines use. Keep in mind that these names change from time to time, but the chart will give you an idea of what to look for.
| Search Engine | Agent Name | Host Name |
|---|---|---|
| AltaVista (normal spider) | Scooter/2.0 G.R.A.B. X2.0 or Scooter/1.0 scooter@pa.dec.com | scooter.pa-x.dec.com or scooter*.av.pa-x.dec.com, such as: scooter3.av.pa-x.dec.com |
| AltaVista (instant spider) | Scooter/1.0 | add-url.altavista.digital.com or ww2.altavista.digital.com |
| Euroseek | Arachnoidea (arachnoidea@euroseek.com) | *.euroseek.net, such as: infra.euroseek.net |
| Google (Experimental search engine) | BackRub/2.1 backrub@google.stanford.edu | *.stanford.edu, such as: hake.stanford.edu |
| Inktomi (powers HotBot, others) | Slurp/2.0 (slurp@inktomi.com; http://www.inktomi.com/slurp.html) | *.inktomi.com, such as: j2001.inktomi.com or j10.inktomi.com |
| Infoseek (normal spider) | InfoSeek Sidewinder/0.9 | *.infoseek.com, such as: wilbur-bbn.infoseek.com, or IP number, such as: 204.162.98.90 |
| Infoseek (instant spider) | Mozilla/3.01 (Win95; I) | as above |
| Lycos (regular spider) | Lycos_Spider_(T-Rex) | lycosidae.lycos.com or *.pgh.lycos.com, such as: spider3.srv.pgh.lycos.com |
| Lycos (Add URL spider) | Lycos_Spider_(T-Rex) | *.sjc.lycos.com, such as: sjc-fe4-1.sjc.lycos.com |
| Northern Light | Gulliver/1.2 | taz.northernlight.com |
| WebCrawler | Served by Excite spiders | Served by Excite spiders |
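If you would rather not eyeball every host name, the wildcard patterns in the chart can be checked by a small script. The sketch below is only an illustration: the dictionary layout and the function name `identify_spider` are assumptions made for this example, only some of the chart's patterns are included, and the patterns themselves will go stale as the engines rename their machines.

```python
from fnmatch import fnmatchcase

# Host-name patterns copied from the chart above. The data structure and
# the function name are hypothetical, chosen for this example only.
SPIDER_HOSTS = {
    "AltaVista": ["scooter.pa-x.dec.com", "scooter*.av.pa-x.dec.com"],
    "Euroseek": ["*.euroseek.net"],
    "Inktomi": ["*.inktomi.com"],
    "Infoseek": ["*.infoseek.com"],
    "Lycos": ["lycosidae.lycos.com", "*.pgh.lycos.com", "*.sjc.lycos.com"],
    "Northern Light": ["taz.northernlight.com"],
}


def identify_spider(host):
    """Return the search engine whose pattern matches host, or None."""
    host = host.lower()  # host names are case-insensitive
    for engine, patterns in SPIDER_HOSTS.items():
        if any(fnmatchcase(host, pattern) for pattern in patterns):
            return engine
    return None
```

Calling `identify_spider` on each host name in your log gives you a quick first pass; anything it returns `None` for still deserves a look by hand.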
Your Best Clue: robots.txt
Start your search with a review of requests for the robots.txt file. This is a file that tells robots what they may and may not index within a site. Not all spiders follow the robots.txt convention, but most do. Anything requesting this file is almost certainly a spider, robot or an agent.
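To see how a well-behaved robot reads such a file, here is a minimal sketch using Python's standard `urllib.robotparser` module. The rules in `EXAMPLE_ROBOTS_TXT`, the paths and the helper name `may_fetch` are made up for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration: one rule set for AltaVista's
# Scooter, and a default rule set for every other robot.
EXAMPLE_ROBOTS_TXT = """\
User-agent: Scooter
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
"""


def may_fetch(agent, path, robots_txt=EXAMPLE_ROBOTS_TXT):
    """Return True if the rules allow the named agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)
```

With these example rules, `may_fetch("Scooter/2.0", "/private/page.html")` is refused, while a robot with no section of its own, such as `Gulliver/1.2`, falls back to the `*` rules and is only kept out of `/cgi-bin/`.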
By reviewing the requests, you can usually spot spiders from the major search engines by their host names, which in turn tells you the latest agent names. You'll probably be surprised to see how many smaller search engines, personal agents and other robots are also accessing your site.
Review this information and you will start to see trends for the search engines that pay you regular visits.
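The review described above can be automated with a short script. This is a rough sketch, not a finished log analyzer: it assumes your server writes the common Apache/NCSA log format (optionally with the referrer and agent fields of the "combined" format), and the function name `robots_txt_visitors` is hypothetical. If your log looks different, the regular expression will need adjusting.

```python
import re
from collections import Counter

# Assumption: Apache/NCSA log lines of the form
#   host ident user [date] "METHOD /path PROTO" status bytes "referrer" "agent"
# The last two quoted fields are optional (plain "common" format omits them).
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \d{3} \S+'
    r'(?: "[^"]*" "(?P<agent>[^"]*)")?'
)


def robots_txt_visitors(lines):
    """Count robots.txt requests per (host, agent) pair."""
    visitors = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").split("?")[0] == "/robots.txt":
            agent = match.group("agent") or "-"  # "-" when no agent logged
            visitors[(match.group("host"), agent)] += 1
    return visitors
```

You could feed it an open log file, for example `robots_txt_visitors(open("access.log"))` (the file name is a placeholder), and then call `.most_common()` on the result to see which hosts and agents ask for robots.txt most often.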
Alou Web Toolkit - Copyright © 2006 - All Rights Reserved
Produced and Published by Alou Web Design