June 10, 2006
The Google Supplemental Index
By Scott Goodyear
A couple of years ago, Danny Sullivan of Search Engine Watch noted that, amid the competition over who had the largest index of web sites, Google had added a new feature called the "supplemental index". This supplemental index contains many pages that will never see the light of day during a typical search. These pages are found only when a search is so unusual or narrow that only pages from the supplemental index match the query. Since the "Big Daddy" update began in November 2005, many webmasters are finding that, although Google continues to crawl their pages and updates to those pages do appear in Google's cache, more and more of their pages are being relegated to the supplemental index.
Officially, the supplemental index contains pages from sites that "...are part of Google's auxiliary index. We're able to place fewer restraints on sites that we crawl for this supplemental index than we do on sites that are crawled for our main index. For example, the number of parameters in a URL might exclude a site from being crawled for inclusion in our main index; however, it could still be crawled and added to our supplemental index..." Basically, sites or pages that fall into the supplemental index are those that do not meet the criteria that Google uses for its main index.
As SEO Jim Boykin of webuildpages.com explains on his blog, these sites often have duplicate content taken from other sites; no real content (he gives the example of fake 'directory' sites that generate a million virtually generic pages); or what he calls 'orphaned pages', which are pages that are not linked to from the rest of the website in question or from other sites on the web.
Based on the customers that have been contacting WebPosition seeking assistance, we can agree with Boykin's observations. Quite often, 'turn key' web sites, like those in real estate, nutrition, work from home, or other industries that rely heavily on duplicated cookie-cutter web sites, have always had a difficult time obtaining rankings, and even more so since this recent Google update. Often these sites have an address like:
myaccountname.someturnkeywebcompany.com,
someturnkeywebcompany.com/myaccountname,
or sometimes even - myaccountname.com*.
(* But if you were to check they actually use a URL forwarding service or frames that redirect a visitor's browser behind the scenes to myaccountname.someturnkeywebcompany.com
or someturnkeywebcompany.com/myaccountname.)
Where these sites once relied heavily on inbound links and link exchange services to artificially bolster their rankings, it appears that, with more emphasis now placed on a site's actual content, the lack of unique content really hurts them when they compete with sites that have original content and several relevant inbound links.
This does make a bit of sense. When you go on the web to look for a book by a given author, with a specific title, you don't want to hit 600 affiliates of a single online book store, each showing exactly the same page content about the same book. You really want to see a variety of information, such as a few book stores, a few reviews of the book, perhaps information about related books and movies, a bio of the author, etc. So an attempt to clear out as many duplicates as Google can find can be useful for the end user, but perhaps painful for those who are trying to sell an item or promote a service with little to no hands-on customization of their web site. And thus many sites are pushed into the supplemental index because they do not offer anything that is compellingly different.
You can often compare the number of pages listed in Google with the number of pages that Google lists without many of the supplementals by using the following two searches:
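The article does not say how Google decides that two pages are duplicates. As a rough way to check your own pages, here is a minimal sketch of one common text-comparison technique (word "shingles" compared with Jaccard similarity). The function names and the three-word shingle size are assumptions chosen for illustration, not anything Google has published.

```python
def shingles(text, size=3):
    """Split text into a set of overlapping runs of `size` words."""
    words = text.lower().split()
    # max(..., 1) keeps very short texts from producing an empty set
    return {tuple(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}


def similarity(page_a, page_b):
    """Jaccard similarity of two texts: 1.0 means identical shingles, 0.0 means none shared."""
    a, b = shingles(page_a), shingles(page_b)
    return len(a & b) / len(a | b)


# Hypothetical page texts: an affiliate page that copies the store's description
store = "A gripping novel about a detective who returns to his home town"
affiliate = "A gripping novel about a detective who returns to his home town Buy now"
print(similarity(store, affiliate))
```

A score close to 1.0 suggests that a page offers little beyond what the other page already says, which is the situation described above for cookie-cutter affiliate pages.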
search 1)
site:www.thenameofyoursite.com
search 2)
site:www.thenameofyoursite.com inurl:www.thenameofyoursite.com
See which pages are listed in the first search but not in the second. Work on those pages that are omitted from the second search. If listed in the supplemental index, they will have a look similar to:
Title of the Page
The generic description of the page, product, service, etc. on the page that is extremely similar or identical to other pages on the web. ...
somesite.com/pages/somethingnotveryoriginal.htm - 12k - Supplemental Result - Cached - Similar pages
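If your site has more than a handful of pages, comparing the two result lists by eye gets tedious. The sketch below assumes you have copied the URLs from each search into two lists by hand (the article does not describe any way to export them); the function names and sample URLs are hypothetical.

```python
def normalize(url):
    """Strip the scheme and any trailing slash so the same page compares equal."""
    url = url.strip()
    for prefix in ("http://", "https://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
    return url.rstrip("/")


def likely_supplemental(site_results, inurl_results):
    """Return URLs found by search 1 (site:) but missing from search 2 (site: plus inurl:)."""
    second = {normalize(u) for u in inurl_results}
    return sorted({normalize(u) for u in site_results} - second)


# Hypothetical results copied from the two searches
search_1 = [
    "http://www.thenameofyoursite.com/",
    "http://www.thenameofyoursite.com/products.htm",
    "http://www.thenameofyoursite.com/pages/somethingnotveryoriginal.htm",
]
search_2 = [
    "http://www.thenameofyoursite.com/",
    "http://www.thenameofyoursite.com/products.htm",
]

for url in likely_supplemental(search_1, search_2):
    print(url)
```

The pages printed are the ones to review first for duplicated or thin content.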
As we've mentioned in the past, it is also a good idea to make sure that your site is consistent in how your pages are presented. For example, you should choose either the www or the non-www version of your links, use it consistently throughout your site, and allow only that one form to exist. A 301 redirect can help with this. While it may seem beneficial to show the same web page whether your customer types in:
thenameofyoursite.com
www.thenameofyoursite.com
thenameofyoursite.com/index.htm
www.thenameofyoursite.com/index.html
thenameofyoursite.com/home.asp
etc.
Search engines sometimes see these as separate pages and can apply a duplicate content penalty to your pages.
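How you set up a 301 redirect depends on your web server, so the following is only a sketch of the decision behind it: given a requested address, work out the single preferred address it should redirect to. It assumes the www form is the one you have chosen and that index.htm, index.html, and home.asp all serve the home page; the function and constant names are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.thenameofyoursite.com"  # assumption: the www form is preferred
BARE_HOST = "thenameofyoursite.com"
HOME_PAGES = {"/index.htm", "/index.html", "/home.asp"}  # assumed aliases of the home page


def redirect_target(url):
    """Return the address to 301-redirect to, or None if url is already the preferred form."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.netloc.lower()
    if host not in (CANONICAL_HOST, BARE_HOST):
        return None  # not one of our hostnames, so leave it alone
    raw_path = parts.path or "/"
    path = "/" if raw_path.lower() in HOME_PAGES else raw_path
    if host == CANONICAL_HOST and path == raw_path:
        return None  # nothing to change
    return urlunsplit((parts.scheme, CANONICAL_HOST, path, parts.query, ""))


for address in ["thenameofyoursite.com",
                "www.thenameofyoursite.com/index.html",
                "thenameofyoursite.com/home.asp"]:
    print(address, "->", redirect_target(address))
```

Each of the sample addresses maps to the same preferred home page address, so an engine following the redirects would end up with one page instead of several copies.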
From time to time there can be a legitimate need to provide duplicate content. You may notice that at MarketPosition.com we often quote information from other sites, reprint entire articles (with the permission of the content owners), etc. There is certainly an invisible line in the sand, with exceptions to the rule that boggle the mind, but the general theory is that you can present duplicate content provided that your site's content, overall, is largely unique, with information that other sites do not offer. This is why sites that mainly serve as aggregators or hubs, with little to no original content, bounce around a bit more in the rankings. A search engine can try to check the age of a document, using facts such as when content resembling XYZ was first indexed, which site the content was first attributed to, and other factors. In going through this process, sometimes the original author is credited as the first publisher, and at other times it is the aggregator, depending on who appears to have published first.
If you are still in the supplementals and you are positive that none of these factors are affecting your pages, there are a few initial reports of other possible causes: that Google is mistakenly merging data from old and new listings; that Google may have had a 'bad data push' that makes site: queries incorrect; and that titles and other information about some pages have been incorrectly merged from much older listings in Google (perhaps a data merge between old and new databases?) rather than taken from even an older DMOZ title. Any of these could be causing a temporary problem, or this may simply be an indexing experiment that has affected some sites negatively. For more on this, check out:
Seroundtable.com
Webmasterworld.com
Searchenginewatch.com
Jimboykin.com
In summary, make sure that your content is as unique as it can be. Do this not only for your listings but also for your site visitors. Check for problem pages by comparing your listed pages through the two site: commands. Finally, if you use duplicated content, try to weigh things so that the majority of your content falls on the original side of the scale. These tips can help your site overall and may help you to avoid finding your pages in the supplemental index.
