
September 20, 2005

Controlling Search Engine Robots With robots.txt and Other Methods.

By Scott Goodyear

So what is a "robots.txt" file?

When a search engine visits a web site, whether through a submission or by following a link from one site to another, the search engine robot (also known as a "spider") will look for a text file called robots.txt. The file normally resides in the root directory of the site, such as "www.site.com/robots.txt". This file gives instructions to spiders that visit your site regarding which folders or files they may or may not visit. With a correctly set up robots.txt file in place, files that are available to a normal web surfer can sometimes be kept hidden from a search spider. This can be useful if you are trying to conserve bandwidth (data transfer), since some engines will completely skip files and folders indicated in robots.txt; if you need to keep certain private files, like databases or stock images, from being indexed; or if you want to link to another site without promoting it for ranking purposes.

Creating a robots.txt file:

You need not agonize over how to create a robots.txt file, as they are extremely simple to make and implement.

First off, a robots.txt file is really just a simple text file that can be created with the standard notepad.exe text editor in Windows, or TextEdit in plain text mode on a Mac. You can even create a robots.txt file from a Unix command line. In any of these cases, make sure that the file is named robots.txt (all lowercase) and that it is saved as plain text. Using a more complex program, such as a word processor, will result in a file that also includes formatting information such as font type, font size, etc., which is not needed or read by a search engine.

The three most common items you will find in a robots.txt file are:

  • allow
  • disallow
  • and the wildcard or asterisk: "*"

Normally you would use the "disallow" command so that an engine does not index certain areas of your site, while the "allow" command is largely redundant since spiders will usually follow any link that you have not prohibited. Finally, the wildcard on a User-agent line indicates all engines. Thus, if you had a folder called "images" under the main directory, such as "www.site.com/images/", you might use the following coding if you wished to disallow all spiders from that folder:

User-agent: *
Disallow: /images/

If you wanted to disallow a robot from a particular set of folders, you would use a robot's name rather than a *. You can even specify individual files. For example:

User-agent: MSNBot
Disallow: /gopher/solutions/

User-agent: Googlebot
Disallow: /beta/private/new_widget_ideas.asp
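
If you want to check how a set of rules will be read before you upload the file, one option is a short script. The following is only a sketch using Python's standard urllib.robotparser module; the helper name "allowed" and the robot name "SomeOtherBot" are made up for illustration, and real spiders may interpret rules slightly differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The wildcard example from the article: every spider is kept out of /images/.
WILDCARD_RULES = """\
User-agent: *
Disallow: /images/
"""

# The per-robot example from the article.
NAMED_RULES = """\
User-agent: MSNBot
Disallow: /gopher/solutions/

User-agent: Googlebot
Disallow: /beta/private/new_widget_ideas.asp
"""


def allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # "SomeOtherBot" is a hypothetical robot name used only for this demo.
    print(allowed(WILDCARD_RULES, "SomeOtherBot", "http://www.site.com/images/logo.gif"))
    print(allowed(NAMED_RULES, "MSNBot", "http://www.site.com/gopher/solutions/index.html"))
    print(allowed(NAMED_RULES, "Googlebot", "http://www.site.com/gopher/solutions/index.html"))
```

In this sketch, a robot that is not named in the second set of rules is allowed everywhere, because that set has no wildcard record to fall back on.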

Tip 1

Pay special attention to the slashes used in the disallow line. A disallow rule matches any path that begins with the text you give it, so files such as "images.jpg" or "images.html" would also be omitted if the disallow line were:

Disallow: /images

when you really meant to block a folder, not individual files, as in:

Disallow: /images/
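
The difference the trailing slash makes can be tried out with a small script. This is a sketch only, again leaning on Python's urllib.robotparser; the helper name "blocked" and the file name "photo.jpg" are assumptions made up for the demonstration.

```python
from urllib.robotparser import RobotFileParser


def blocked(disallow_rule, path):
    """Return True if `path` is blocked for all robots by a single Disallow rule."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", "Disallow: " + disallow_rule])
    return not parser.can_fetch("*", "http://www.site.com" + path)


if __name__ == "__main__":
    for rule in ("/images", "/images/"):
        for path in ("/images.jpg", "/images.html", "/images/photo.jpg"):
            print(rule, path, "blocked" if blocked(rule, path) else "allowed")
```

Without the trailing slash, all three paths are blocked; with it, only the file inside the folder is.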

 

Tip 2

Keep an eye on your site's log files. Some engines have multiple spiders that index for different functions of their site. For example, you may disallow Googlebot from indexing images from your /images/ folder so that they will not show up under a www.google.com search, but did you also disallow the robot called Googlebot-Image, which collects images for Google's image search at http://www.google.com/imghp?
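
One way to see which spiders are actually visiting is to tally the user-agent field in your access log. The sketch below assumes the common "combined" log format, where the user agent is the last quoted field on each line, and assumes that robots can be spotted by the word "bot" in their name. The sample lines are invented for illustration, and your server's log format may differ.

```python
import re
from collections import Counter

# In the combined log format, the user agent is the last quoted field on the line.
USER_AGENT_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_robots(log_lines, keyword="bot"):
    """Count visits per robot name, for user agents containing `keyword`."""
    counts = Counter()
    for line in log_lines:
        match = USER_AGENT_FIELD.search(line)
        if match and keyword in match.group(1).lower():
            # Keep only the product name before the version, e.g. "Googlebot".
            counts[match.group(1).split("/")[0]] += 1
    return counts


# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.1 - - [20/Sep/2005:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 68 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.2 - - [20/Sep/2005:10:00:05 +0000] "GET /images/a.jpg HTTP/1.1" 200 5120 "-" "Googlebot-Image/1.0"',
    '192.0.2.3 - - [20/Sep/2005:10:00:09 +0000] "GET /index.html HTTP/1.1" 200 2048 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]

if __name__ == "__main__":
    for name, visits in count_robots(SAMPLE_LOG).items():
        print(name, visits)
```

Any robot name that shows up here but not in your robots.txt is a candidate for its own User-agent record.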

 

Other methods of controlling robot access:

In some situations, you might not have the ability to create a robots.txt file. There are many affiliate programs where you pay for an 'in the box' solution where, in reality, your company may simply be a folder off of someone else's domain. An example of this would be "www.somecompany.com/your_businessname/", where your 'site' is the folder "/your_businessname/" rather than a standalone domain like "www.your_businessname.com". Since search engine robots normally use only the robots.txt file found at the root of the domain, rather than one in a folder such as "www.somecompany.com/your_businessname/robots.txt", you may need to use a bit of HTML to accomplish the same goal. However, the code will only control how the engine treats the single page in question.
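
To make the point about location concrete, here is a small sketch of how a robot might work out where to look for robots.txt, given any page address. The function name "robots_txt_url" and the page name "index.html" are assumptions for illustration; the idea is simply that the folder part of the address is thrown away.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt address a robot would check for the given page."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the whole path with /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("http://www.somecompany.com/your_businessname/index.html"))
```

For a page inside "/your_businessname/", the result still points at the robots.txt of www.somecompany.com, which you may not control.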

Meta tags

In the meta tags area of your HTML, you can add coding such as:

<META name="ROBOTS" content="NOINDEX, NOFOLLOW">
This tells the robots not to index the page or follow links from that page. However, if a robot finds the linked pages some other way (through links from other areas of your site, a submission, a link from another site, etc.), any of those pages that do not include the meta tag may still be indexed.



<META name="ROBOTS" content="NOINDEX">
This tells all robots to not index the page.

<META name="ROBOTS" content="NOFOLLOW">
This tells all robots not to follow any links on the page.

<META name="ROBOTS" content="NOINDEX, FOLLOW">
This tells robots to follow the links on the page but not to index the page itself. You might want this for a site map, where the robot should follow the links but not index the site map web page.

<META name="ROBOTS" content="INDEX, NOFOLLOW">
This tells robots to index the page but not to follow its links. In unique situations, you might want to link to a site but not want the search engine to see your site as 'promoting' the site you are linking to.
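
To double-check what a page's robots meta tag actually says, you could read it back with a script. The following is a rough sketch using Python's standard html.parser module; the class and function names are made up for this example, and it assumes that a page with no robots meta tag is treated as "index, follow".

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_policy(html):
    """Return whether a robot is asked to index the page and follow its links."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return {
        "index": "noindex" not in parser.directives,
        "follow": "nofollow" not in parser.directives,
    }


if __name__ == "__main__":
    print(robots_policy('<META name="ROBOTS" content="NOINDEX, FOLLOW">'))
```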

HTML

Since the meta tag examples above control the whole page, you can instead leave the robots meta tag out of your HTML and micro-manage all links or individual links on a page using the rel="nofollow" attribute.

This is a normal link to someothersite.com, where your site can be counted as "voting" for or approving of the other site's page in order to improve the rank of the destination page for the term "blue widgets":

<a href="http://www.someothersite.com/blue.htm">Blue widgets sold here.</a>

In this example, you might link to the other site, but you do not want a search engine robot to count the link for the sake of rankings:

<a href="http://www.yetanothersite.com/yelllow.htm" rel="nofollow">The truth about yellow widgets.</a>

This is especially useful if you run a forum or have an internet diary or blog. While many of these types of sites welcome comments and allow links to relevant sites, there has been a huge number of comment spammers as of late who do not contribute in a positive or constructive way. They often leave a message about their product/service/site and a URL rather than adding to the conversation on the site or page. So this can be a powerful tool to discourage the spammers. In some cases you can remove the attribute for accounts that register with your site or for those that you trust, or you can keep all posters from receiving link rankings from your site.
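
As a sketch of how a forum or blog script might apply this, the function below builds the link for a comment and adds rel="nofollow" unless the poster is trusted. The function name "comment_link" and the "trusted" flag are assumptions for illustration; how you decide who is trusted is up to your own software.

```python
from html import escape


def comment_link(url, text, trusted=False):
    """Build an HTML link for a comment; untrusted posters get rel="nofollow"."""
    rel = "" if trusted else ' rel="nofollow"'
    # Escape the URL and text so a poster cannot inject their own HTML.
    return '<a href="{}"{}>{}</a>'.format(escape(url, quote=True), rel, escape(text))


if __name__ == "__main__":
    print(comment_link("http://www.yetanothersite.com/yelllow.htm",
                       "The truth about yellow widgets."))
    print(comment_link("http://www.someothersite.com/blue.htm",
                       "Blue widgets sold here.", trusted=True))
```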

Can these techniques be used to "game" the engines?

In the past, it was commonplace for site owners to create sets of documents that were nearly identical in nature but geared toward one engine or another, since search engines have different ideas about what info a 'relevant' page should contain. Thus they might block one engine from seeing duplicated content that was geared toward another engine. This was a smart way of doing things in the past. Unfortunately, while search engines attempt to play by the rules, all of these control methods are completely voluntary. While most engines will honor your request to not index the pages or files you do not want indexed, some may still view the content and store it for their own reasons. For example, they may choose to compare pages on your site to look for duplicate content, and thus cut down potentially spammy content or drop a whole site altogether.

In Summary

Use robots.txt, the robots meta tag, and rel="nofollow" with a bit of care and you can control how robots interact with your site quite a bit. There were ways in the past to potentially trick the engines; however, this is becoming less and less worthwhile as the search engines evolve. While some sites have been able to get away with these actions (since no engine is perfect, yet), we recommend against taking them because the potential risks outweigh the potential gains. Instead, we say, play by the rules. Pick and choose who you promote to, which pages you promote to whom, and create themed/grouped areas of your site in order to attain better rankings. If you need to cross link to different areas of your own site which have vastly different topics, e.g. if you sold both "maple solid acoustic guitars" and "green silk plants", you might use the nofollow tag. This is a better way to provide good navigation to your customers without confusing the engines about which topics/keywords/phrases the different areas of your site are trying to promote and rank on. And if you really, truly do not want an engine to see, index, visit, etc. certain areas or files on your site, password protect them or keep them offline.
