Controlling how search engines access and index your website

1/26/2007 11:36:00 AM

Posted by Dan Crow, Product Manager
I'm often asked how Google and other search engines work. One key  
question is: how does Google know which parts of a website the site  
owner wants to show up in search results? Can publishers specify  
that some parts of the site should be private and non-searchable? The  
good news is that those who publish on the web have a lot of control  
over which pages appear in search results.
The key is a simple file called robots.txt that has been an industry  
standard for many years. It lets a site owner control how search  
engines access their website. With robots.txt you can control access  
at multiple levels: the entire site, individual directories, pages of  
a specific type, even individual pages. Effective use of robots.txt  
gives you a lot of control over how your site is searched, but it's  
not always obvious how to achieve exactly what you want. This is the  
first of a series of posts on how to use robots.txt to control access  
to your content.
What does robots.txt do?

The web is big. Really big. You just won't believe how vastly hugely  
mind-bogglingly big it is. I mean, you might think it's a lot of work  
maintaining your website, but that's just peanuts to the whole web.  
(with profound apologies to Douglas Adams)
Search engines like Google read through all this information and  
create an index of it. The index allows a search engine to take a  
user's query and show all the pages on the web that match it.
In order to do this, Google has a set of computers that continually  
crawl the web. They have a list of all the websites that Google knows  
about and read all the pages on each of those sites. Together these  
machines are known as the Googlebot. In general, you want Googlebot to  
access your site so your web pages can be found by people searching on  
Google.
However, you may have a few pages on your site you don't want in  
Google's index. For example, you might have a directory that contains  
internal logs, or you may have news articles that require payment to  
access. You can exclude pages from Google's crawler by creating a text  
file called robots.txt and placing it in the root directory of your  
site. The robots.txt file contains a list of the pages that search  
engines shouldn't access. Creating a robots.txt file is  
straightforward, and it gives you a sophisticated level of control  
over how search engines access your website.
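As a concrete sketch (example.com is a placeholder domain), crawlers look for the file at one fixed, well-known location at the top of the site; a robots.txt placed anywhere else is simply never read:

http://www.example.com/robots.txt        <- crawlers fetch this
http://www.example.com/logs/robots.txt   <- ignored: not in the root directory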
Fine-grained control
In addition to the robots.txt file -- which allows you to concisely  
specify instructions for a large number of files on your website --  
you can use the robots META tag for fine-grained control over  
individual pages on your site. To implement this, simply add specific  
META tags to HTML pages to control how each individual page is  
indexed. Together, robots.txt and META tags give you the flexibility  
to express complex access policies relatively easily.
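For instance (the directory and page names are invented for illustration), you might use robots.txt to block a whole directory in one place, while a single page elsewhere opts out individually through a tag in its own <head> section:

# in http://www.example.com/robots.txt: block a whole directory
User-Agent: *
Disallow: /internal/

<!-- in the <head> of one specific page: block just that page -->
<meta name="robots" content="noindex">

The name="robots" form applies to all cooperating crawlers; name="googlebot", used later in this post, targets Googlebot specifically.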
A simple example
Here is a simple example of a robots.txt file.

User-Agent: Googlebot
Disallow: /logs
The User-Agent line specifies that the next section is a set of  
instructions just for the Googlebot. All the major search engines read  
and obey the instructions you put in robots.txt, and you can specify  
different rules for different search engines if you want to. The  
Disallow line tells Googlebot not to access files in the logs sub-
directory of your site. The contents of the pages you put into the  
logs directory will not show up in Google search results.
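If you do want different rules for different search engines, add one section per User-Agent, typically with a catch-all * section for everyone else (the /beta path below is invented for illustration):

# rules that only Googlebot follows
User-Agent: Googlebot
Disallow: /logs

# rules for every other crawler
User-Agent: *
Disallow: /logs
Disallow: /beta

Note that a crawler obeys only the most specific section that matches it, so in this sketch Googlebot follows just the first block and would still be allowed into /beta.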
Preventing access to a file
If you have a news article on your site that is only accessible by  
registered users, you'll want it excluded from Google's results. To do  
this, simply add a META tag near the top of the HTML file:

<meta name="googlebot" content="noindex">
This stops Google from indexing this file. META tags are particularly  
useful if you have permission to edit the individual files but not the  
site-wide robots.txt. They also allow you to specify complex access-
control policies on a page-by-page basis.
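The content attribute can also carry several directives at once, separated by commas, and an all-crawlers rule can sit alongside a Googlebot-specific one on the same page. A sketch (support for individual directives varies by search engine, though Google documents all three shown here):

<!-- keep this page out of the index and don't follow its links (all crawlers) -->
<meta name="robots" content="noindex, nofollow">
<!-- additionally, ask Googlebot not to show a cached copy of the page -->
<meta name="googlebot" content="noarchive">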
Learn more
You can find out more about robots.txt at Google's Webmaster help  
center, which contains lots of helpful information, including:

*    How to create a robots.txt file
*    Descriptions of each user-agent that Google uses
*    How to use pattern matching
*    How often we recrawl your robots.txt file
We've also done several posts in our webmaster blog about robots.txt  
that you may find useful, such as:

*    Using robots.txt files
*    All about Googlebot
There is also a useful list of the bots used by the major search  
engines.
Next time...
Coming soon: a post detailing the use of the robots META tags, and  
another with specific examples for common cases.
Update: Added a sentence to paragraph 9 on access-control policies.
