December 07, 2006
How to Identify Legitimate Search Robots
by Richard Drawhorn
Search engines use search robots to locate and index content, and webmasters should certainly allow them to do so. However, not all robots are harmless, so how do you validate that a robot is authentic? This topic has come up recently on the search blogs for both Google and MSN, and in this brief post I'll summarize their advice on how to identify their robots.
When viewing your server logs, you'll find an entry for each visit to your web site, along with the visitor's IP address. If the visitor was a search robot, the entry will carry a user-agent such as Googlebot or MSNBot. Each search engine has its own user-agent, but the user-agent alone is not sufficient to identify the robot, because any spammer can name their robot Googlebot if they so choose.
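As a rough illustration of that first step, the sketch below pulls the IP address and user-agent out of log lines and keeps only the visits that claim to be a known robot. It assumes the common Apache "combined" log format, which may not match your server's configuration; the names `LOG_PATTERN`, `ROBOT_TOKENS`, and `claimed_robots` are hypothetical and exist only for this example.

```python
import re

# Assumption: Apache "combined" log format, i.e.
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# User-agent substrings mentioned in this post; extend as needed.
ROBOT_TOKENS = ("googlebot", "msnbot")


def claimed_robots(lines):
    """Yield (ip, user_agent) for lines whose user-agent claims to be a robot.

    A match only means the visitor *claims* to be a search robot;
    the claim still has to be checked against DNS.
    """
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # line is not in the assumed format
        agent = match.group("agent")
        if any(token in agent.lower() for token in ROBOT_TOKENS):
            yield match.group("ip"), agent
```

Passing an open log file to `claimed_robots` gives you the list of IP addresses worth verifying; everything it yields is still unverified at this point.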
I took a look at the documentation posted by Google, MSN and Ask.com, and they all agree that the best way to identify a search robot is as follows:
- First, do a Reverse DNS Lookup to confirm the hostname associated with the robot's IP address. There are several free tools on the web that can do reverse DNS lookups, but you'll probably find it easier to use command line tools to gather this information.
- Verify that the hostname is correct:
  - For Google, it should be in the googlebot.com domain (for example, crawl-66-249-66-1.googlebot.com).
  - For MSN, it should be in the search.live.com domain (for example, livebot-207-46-98-149.search.live.com).
  - For Yahoo!, it should be in the inktomisearch.com domain (for example, ab1164.inktomisearch.com).
- Finally, do a Forward DNS Lookup on the hostname to confirm that it matches the IP address. This last step will verify that the name itself is accurate.
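The three steps above can be sketched in a few lines of Python using the standard `socket` module. This is a minimal illustration, not an official tool from any search engine: the domain list is copied from this post and may change over time, the function and variable names are hypothetical, and the forward lookup shown here only handles IPv4 addresses.

```python
import socket

# Domains taken from the list above (assumption: still current).
# The leading dot prevents look-alikes such as "fakegooglebot.com".
TRUSTED_SUFFIXES = {
    "google": (".googlebot.com",),
    "msn": (".search.live.com",),
    "yahoo": (".inktomisearch.com",),
}


def hostname_in_domain(hostname, suffixes):
    """Return True if hostname ends with one of the trusted suffixes."""
    host = hostname.rstrip(".").lower()
    return any(host.endswith(suffix) for suffix in suffixes)


def reverse_lookup(ip):
    """Step 1: reverse DNS lookup, IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname):
    """Step 3: forward DNS lookup, hostname to list of IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def verify_robot(ip, engine, reverse=reverse_lookup, forward=forward_lookup):
    """Return True only if all three checks pass for the given IP.

    `reverse` and `forward` can be swapped out, for instance to test
    the logic without touching the network.
    """
    try:
        hostname = reverse(ip)
    except OSError:  # no reverse record, or the lookup failed
        return False
    # Step 2: the hostname must belong to the engine's domain.
    if not hostname_in_domain(hostname, TRUSTED_SUFFIXES[engine]):
        return False
    try:
        addresses = forward(hostname)
    except OSError:
        return False
    # The forward lookup must lead back to the original IP address.
    return ip in addresses
```

For example, `verify_robot("66.249.66.1", "google")` performs live DNS queries, so its result depends on the records published at the time you run it. The forward lookup matters because whoever controls an IP address also controls its reverse DNS record and can make it say anything; only the owner of googlebot.com can make a name in that domain resolve back to the same address.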
If you do find a robot that has been disguising itself as a legitimate search engine crawler, you'll probably want to block that robot's access to your web site, which you can do by configuring your web server appropriately.
Conclusion
Verifying that robots visiting your web site are authentic search robots is in any webmaster's best interest. Many illegitimate robots do not follow the conventions defined by the robots.txt file and can cause performance issues for a web site if allowed to spider it unimpeded.
