Robots.txt help

I have a site that uses PHP session IDs. I know that eliminating these from the URL entirely is what is recommended for optimal bot crawling, and I am working on that, but is there any way, for now, to include a line in robots.txt that would ignore the "PHPSESSID" parameter?

For example, the site works just fine when you visit this page:

But by default it generates a URL like this: ...

What can be done right now so that Google doesn't crawl these session
IDs, store them, and keep coming back to them? Thanks in advance for
your help. BTW, I don't want to disallow all "search_details.php"
URLs..

Re: Robots.txt help

__/ [ danish ] on Monday 18 September 2006 17:05 \__


Hi, this would probably be handled best by altering the generation of URLs
in the CMS, either by omitting these duplicates or by moving them to a
(virtual) directory structure so that robots.txt can exclude them (robots.txt
can't/shouldn't use wildcards, although Google is pushing towards
breaking/'extending' the standards and conventions).
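As a sketch of the directory-based approach: if the CMS could be made to serve session-tagged URLs under a single path (the directory name "/sessions/" below is hypothetical, just for illustration), the exclusion needs only standard robots.txt directives:

```
# robots.txt -- standard directives only, no wildcards required
User-agent: *
# Block the (virtual) directory holding session-tagged URLs
Disallow: /sessions/
```

This stays within the original robots.txt convention, so it should be honoured by all well-behaved crawlers, not just Googlebot.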

Session IDs are tricky. Are you sure bots are being assigned a cookie? I
know that spyware-type tools will be passed such URLs, but I don't think
search engines will browse (crawl) with a cookie. There were similar
questions in this newsgroup before (session IDs and duplicates), so it's
definitely worth browsing the archive. It's also worth looking at the logs,
filtering by crawler type (or IP address), to see what is going on beneath
the surface. Another possibility is to view the cache, e.g. using

Best wishes,


Roy S. Schestowitz      |    /earth: file system full  |    SuSE Linux     |     PGP-Key: 0x74572E8E
  5:20pm  up 60 days  5:32,  7 users,  load average: 0.40, 0.54, 0.64 - Open Source knowledge engine project

Re: Robots.txt help

__/ [ Roy Schestowitz ] on Monday 18 September 2006 17:29 \__


Addendum: the following has just been published.

                Session ID's Make Ecommerce Difficult

It might help.

Roy S. Schestowitz      |    Community is code, code is community  |    SuSE Linux     |     PGP-Key: 0x74572E8E
  6:20pm  up 60 days  6:32,  7 users,  load average: 1.06, 0.82, 0.77 - Open Source knowledge engine project

Re: Robots.txt help

danish wrote:


Why not get rid of the session IDs ASAP?

Get rid of session IDs with your .htaccess (if using mod_php, I believe), or
with a php.ini file (i.e., use_only_cookies).
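A sketch of the two approaches mentioned, using PHP's standard session settings (the .htaccess variant assumes Apache with mod_php, where php_flag can set boolean INI directives):

```apache
# .htaccess (mod_php): stop PHP from putting PHPSESSID in URLs
php_flag session.use_only_cookies on
# Also disable transparent SID rewriting of links
php_flag session.use_trans_sid off
```

```ini
; php.ini equivalent of the above
session.use_only_cookies = 1
session.use_trans_sid = 0
```

With these set, sessions are carried only by cookies, so crawlers that reject cookies simply get URLs with no session ID at all.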

Google allows you to block spidering of dynamic URLs with robots.txt, but I
don't know if the other search engines obey it.  I don't think that would
work because your normal URLs are dynamic.

Re: Robots.txt help

__/ [ z ] on Monday 18 September 2006 19:36 \__


In Google's Guidelines for Webmasters, they mention a possible substitution of
symbols to avoid duplication (maybe the ampersand?). But it's certainly not
standardised, and requirements from different SEs can differ. That's why, as
you say, it's better to go for a universal solution.

Roy S. Schestowitz      |    "Double your drive space - delete Windows"  |  Open Prospects        PGP-Key: 0x74572E8E
Tasks: 111 total,   2 running, 108 sleeping,   0 stopped,   1 zombie - knowledge engine, not a search engine

Re: Robots.txt help


Hi Danish Iqbal,
Yes and no. No for a standard robots.txt file, but yes if you're only
worried about Googlebot. Googlebot supports wildcards in robots.txt:

Note that this is their own extension to the robots.txt standard, so it's
pretty likely that at least some of the other major Web spiders do *not*
respect wildcards.
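A sketch of what such a Googlebot-only rule might look like; since the `*` wildcard is Google's extension, it's addressed to Googlebot specifically rather than to all user agents:

```
# robots.txt -- Googlebot section using Google's wildcard extension
User-agent: Googlebot
# Block any URL containing a PHPSESSID parameter anywhere in its path/query
Disallow: /*PHPSESSID
```

Crawlers that don't implement the extension would read "/*PHPSESSID" as a literal path prefix and ignore it, which is why this cannot substitute for fixing the URLs themselves.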


Philip /
Whole-site HTML validation, link checking and more
