October 12, 2006

Basic Security Considerations for SEO

by Curtis Friedl and Scott Goodyear
www.marketposition.com

When everyday business owners search for a way to improve sales, they often start by examining their sales force. Salespeople are trained on the latest sales techniques, or given incentives to push higher-margin products to customers. While many owners overlook their web site as their best salesperson, SEOs understand that keywords, weighting, and placement, along with links and excellent navigation, are just as important to their virtual salesperson: their website.

This SEO effort sometimes overlooks the need for information security. Business plans, photos, product descriptions, and customer information are often left less protected than they should be. As part of creating and training our best sales tool, we need to look at how much information websites give to search engines, and ask, "Are our sites giving out information that they should not? Are they providing mission-critical information to the competition?" In this article we intend to provide a few examples of basic security considerations that every SEO or business owner should keep in mind.

HTTPS:// vs. HTTP://

As we've said in the past, you can use a robots.txt file to help control search engines. However, some engines, like Live.com's robot (MSN), actually require that you have a robots.txt file for both the secure https:// and unsecured http:// versions of your site (robots.txt is read from the root of each protocol and host, so each hostname or subdomain that an engine can visit may need its own file). While we've heard some rumblings that search engines other than MSN can index sites in both secure and unsecured fashions, it is not entirely clear whether this could cause a duplicate content issue. Since the same content served from both the secure and unsecured versions of your site could plausibly be treated as 'duplicate content', it is wise to play it safe. This is especially true of Live.com, as it does not appear to have an obvious supplemental index the way Google does.
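To make the per-protocol idea concrete, here is a minimal sketch in Python using the standard library's robots.txt parser. The two robots.txt bodies, the example.com host, and the "ExampleBot" user agent are all hypothetical; the sketch only shows how a crawler that keeps separate rules for http:// and https:// would reach different answers for the same path.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file served at http://www.example.com/robots.txt
HTTP_ROBOTS = """\
User-agent: *
Disallow: /checkout/
"""

# Hypothetical file served at https://www.example.com/robots.txt
HTTPS_ROBOTS = """\
User-agent: *
Disallow: /
"""


def build_parser(body):
    """Parse a robots.txt body held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(body.splitlines())
    return parser


http_rules = build_parser(HTTP_ROBOTS)
https_rules = build_parser(HTTPS_ROBOTS)


def may_crawl(url, agent="ExampleBot"):
    """Pick the rule set that matches the URL's protocol, then ask it."""
    rules = https_rules if url.startswith("https://") else http_rules
    return rules.can_fetch(agent, url)


for url in (
    "http://www.example.com/products/",
    "http://www.example.com/checkout/cart",
    "https://www.example.com/products/",
):
    print(url, may_crawl(url))
```

With these assumed files, the unsecured product pages stay crawlable while everything under https:// is off limits, which is one way to sidestep the duplicate content question.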

Some engines like Yahoo state that they will not index information from secure areas of your site:

"...There are several ways to prevent our crawler from indexing your site or portions of your site:
* create a "robots.txt" file on your web site to prevent our crawler from indexing your site
* add a "noindex" meta tag to your documents
* remove the original document from your web site
* host the document on an access restricted section of your web site..."

On some of the regional versions of Yahoo the statements clarify this a bit by stating:
"...* host the document on a secure section of your web site (HTTPS or login)..."

How does Google treat secure https:// pages? At this time, they are indexed:

Google currently indexes secure web pages

So, as you can see, there is a problem with secure pages. If you place content on them that you do want indexed, this content may not be indexed by all engines. The opposite is true as well: just because your page is secure does not mean that your secure content is kept out of every engine's results.

Some engines treat secure pages as an entirely different site. Some engines index secure pages. Secure pages and the use of the https protocol are important for Webmasters and site owners to consider. Protecting the privacy of your customers' information is of paramount importance for any business; however, it does not remove the need for the site to be indexed by the search engines. Lest we forget: if a customer cannot find you, you will not have a customer.

Controlling Access

Is your site really 'secure'? Even sites thought to be 'secure' are vulnerable to accidental intrusion by search engine robots. A North Carolina school district found out the hard way that their site had both secure and non-secure information available in a semi-unsecured area online. Social security numbers and test scores appeared in web results after Google was able to index their site's secure content. Originally this was sensationalized as if Google might have 'hacked' the website. Like Search Engine Watch, we think it is more likely that a student had logged in to view their test scores online. The student then posted a direct link to this 'secure' area, perhaps on a personal website, which Google subsequently spidered. They might have avoided this issue by using a robots.txt file to keep spiders from indexing content in particular areas of their site. Alternatively, they could have used .htaccess and rewrite engine techniques to change a URL such as user:password@somesite.com/logged/in/testscores to something more like www.site.com/testscore-login. A URL like "www.site.com/testscore-login" would prevent direct linking to the scores pages unless a cookie, active session, etc. was present in the browser attempting to view the page.
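The session check described above can be sketched in a few lines of Python. This is only an illustration of the logic, not an .htaccess rule: the `route` function, the `/logged/in/` prefix, and the `has_session` flag are hypothetical stand-ins for whatever your server or application actually uses.

```python
LOGIN_URL = "/testscore-login"    # public entry point from the example above
PROTECTED_PREFIX = "/logged/in/"  # hypothetical private area


def route(path, has_session):
    """Return (status, location) for a request.

    Requests for the private area without an active session are
    redirected to the login page, so a pasted direct link (or a
    spider following it) never sees the protected content.
    """
    if path.startswith(PROTECTED_PREFIX) and not has_session:
        return (302, LOGIN_URL)
    return (200, path)


# A spider following a posted link has no session cookie:
print(route("/logged/in/testscores", has_session=False))
# A logged-in student does:
print(route("/logged/in/testscores", has_session=True))
```

The key design choice is that the decision depends on the visitor's session rather than on the URL being hard to guess.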

Bandwidth Security

In the 1990s, bandwidth constraints were a major issue for businesses large and small. Today, bandwidth can still be a problem for many small businesses, as many pay for usage that exceeds a certain level per day, week, or month. An .htaccess file can help prevent bandwidth theft. We often hear from site owners who are upset that their site's graphics and other files are used as icons for forums on other websites, to decorate MySpace.com pages, etc. You may wish to read up on using .htaccess to prevent image bandwidth theft. Preventing this theft can save you some cash, but cost is not the only risk: heavy hotlinking can slow a popular site down or even push it offline, depending on its web hosting resources. If a search engine spider visits a site while it is offline, this can throw your hard-earned rankings out the window. When your site is back online, most engines will revisit and self-correct the issue; however, whether this takes days, weeks, or months is up to the engines. Can you afford to wait on this?

Additionally, server configuration tools (like mod_rewrite) can help by displaying one URL to site visitors while the actual content exists elsewhere on your site. Thus an attempt by others to link to and display this content off site may be thwarted. Depending on your hosting package, you may even have an easily configurable 'leech protection' script to turn on or off.
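Hotlink protection usually comes down to checking the Referer header that the browser sends with an image request. The Python sketch below shows that check in isolation; it is not an .htaccess rule, and the `is_hotlink` function and the list of allowed hosts are hypothetical. Note that the Referer header can be missing or spoofed, so this kind of check deters casual bandwidth theft rather than guaranteeing protection.

```python
from urllib.parse import urlparse

# Hypothetical hosts that are allowed to embed our images.
ALLOWED_HOSTS = {"www.example.com", "example.com"}


def is_hotlink(referer, allowed_hosts=ALLOWED_HOSTS):
    """Return True when an image request appears to come from another site."""
    if not referer:
        # Many browsers and proxies omit the header; let these through.
        return False
    host = urlparse(referer).hostname
    if host is None:
        # The header is present but is not a usable URL; treat it as foreign.
        return True
    return host.lower() not in allowed_hosts


print(is_hotlink("http://www.example.com/products/"))     # our own page
print(is_hotlink("http://forum.example.org/thread/123"))  # someone else's forum
print(is_hotlink(None))                                   # no header sent
```

A server that applied this test could answer hotlinked requests with a 403 or a small placeholder image instead of the full file.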

The Secure Server and the Certificates

One part of showing your visitors that your site is secure is creating and installing a secure server certificate. This security certificate on its own does not make your site secure, but it can be reassuring to customers when they go from an unsecured to a secure area of your site, or move from a normal website to a third-party site like PayPal, Yahoo Stores, or similar to complete a secure transaction. The purpose of the certificate is to certify that the website being viewed, or the organization that employs the certificate, is who it claims to be. The certificate contains information about the owner, the expiration date, and how it can be validated with the issuing party.
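As a small illustration of the certificate fields mentioned above, the Python sketch below reads the owner and expiration date from a certificate that has already been decoded into the dictionary form returned by `ssl.SSLSocket.getpeercert()`. The sample certificate values are made up, and the helper names `subject_common_name` and `days_until_expiry` are our own.

```python
import ssl

# Made-up certificate in the dictionary shape returned by getpeercert().
SAMPLE_CERT = {
    "subject": ((("organizationName", "Example Store Inc."),),
                (("commonName", "www.example.com"),)),
    "issuer": ((("organizationName", "Example Certificate Authority"),),),
    "notAfter": "Jan  5 09:34:43 2018 GMT",
}


def subject_common_name(cert):
    """Return the host name the certificate was issued to, if present."""
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None


def days_until_expiry(cert, now):
    """Days between `now` (seconds since the epoch) and the expiration date."""
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires - now) / 86400


# Pretend "today" is exactly 30 days before the certificate expires.
today = ssl.cert_time_to_seconds(SAMPLE_CERT["notAfter"]) - 30 * 86400
print(subject_common_name(SAMPLE_CERT), days_until_expiry(SAMPLE_CERT, today))
```

A check like this, run on a schedule, can warn you before an expired certificate starts showing browser warnings to your customers.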

There are two methods by which you can obtain a secure server certificate. The first, and recommended, method is to buy one through a company like Verisign or its subsidiary, Thawte. The second is to create your own; this is not very difficult, but it does involve signing the certificate yourself. This raises additional issues, and will require additional steps before all browsers view your certificate in the same light as those issued by these two companies.

In addition to the server certificate, you need a mechanism in place to provide a secure interface between the originating web server and the client. Examples are Apache Stronghold, Apache SSL, Windows SSL, etc.

Conclusion

Remember from our discussion above that Yahoo states it does not index content under the secure protocol, while Google does. SEOs need to be conscious of where both public and private content is stored on their site, and of how links to that content, from within the site and from outside it, can affect security. Reassuring customers that they are still dealing with the site they intended to transact with can build confidence in your site.

Each of these solutions will vary in ease of use; each can provide you with some basic starting points in both securing your content and helping to keep the right content publicly accessible to the search engines. The main focus of search engine optimization is to understand search engines and work to create or adjust content so that web pages rank well. Success in this endeavor often determines whether a business succeeds or fails online. But going the extra step to make sure that the right content is made public can be just as important.