
Warning: robots.txt unreliable in Apache servers

Posted by Anonymous, quoting Philip Ronan on October 30, 2005, 1:10 am

Subject: Warning: robots.txt unreliable in Apache servers
Newsgroups: alt.internet.search-engines
Date: Sat, 29 Oct 2005 23:07:46 GMT

Hi,

I recently discovered that robots.txt files aren't necessarily any use on
Apache servers.

For some reason, the Apache developers decided to treat multiple consecutive
forward slashes in a request URI as a single forward slash. So for example,
<http://apache.org/foundation/> and <http://apache.org//////foundation/>
both resolve to the same page.
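
To make the equivalence concrete, here is a minimal Python sketch of the normalization described above. It only mimics the effect (a run of slashes in the path is treated as one slash); `collapse_slashes` is a hypothetical helper name chosen for illustration, not Apache's actual code.

```python
import re
from urllib.parse import urlsplit


def collapse_slashes(path: str) -> str:
    """Replace every run of two or more '/' with a single '/'."""
    return re.sub(r"/{2,}", "/", path)


# The two URLs from the post, reduced to their path components.
a = urlsplit("http://apache.org/foundation/").path
b = urlsplit("http://apache.org//////foundation/").path

# As literal strings the two paths differ...
print(a == b)
# ...but after collapsing slashes they name the same resource.
print(collapse_slashes(a) == collapse_slashes(b))
```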

Let's suppose the Apache website owners want to stop search engine robots
crawling through their "foundation" pages. They could put this rule in their
robots.txt file:

Disallow: /foundation/

But if I post a link to //////foundation/ somewhere, the search engines
will be quite happy to index it, because that literal path isn't covered by
this rule.
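
The mismatch can be sketched in a few lines of Python. This assumes a robot that compares each Disallow value against the start of the literal request path and does no slash normalization of its own; real crawlers may behave differently. `is_disallowed` is a hypothetical name used only for this illustration.

```python
def is_disallowed(path: str, disallow_rules: list[str]) -> bool:
    """Return True if the path starts with any non-empty Disallow value."""
    return any(rule and path.startswith(rule) for rule in disallow_rules)


# The single rule from the robots.txt example above.
rules = ["/foundation/"]

for path in ("/foundation/", "//////foundation/"):
    verdict = "blocked" if is_disallowed(path, rules) else "allowed"
    print(path, verdict)
```

Under that assumption the first path is blocked and the second is not, even though the server hands back the same page for both.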

As a result of all this, Google is currently indexing a page on my website
that I specifically asked it to stay away from :-(

You might want to check the behaviour of your servers to see if you're
vulnerable to the same sort of problem.

If anyone's interested, I've put together a .htaccess rule and a PHP script
that seem to sort things out.

Phil

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/


Posted by Nick Kew on October 30, 2005, 9:34 am

Anonymous wrote:

> For some reason, the Apache developers decided to treat multiple consecutive
> forward slashes in a request URI as a single forward slash. So for example,
> <http://apache.org/foundation/> and <http://apache.org//////foundation/>
> both resolve to the same page.

Yep. If you apply filesystem semantics to that, you have a whopping
great security hole. Of course you could just return "bad request",
but that just transfers the risk, leaving server admins to shoot
themselves in the foot.

There was a story in The Register a couple of weeks ago about someone
who got a criminal conviction (for attempted unauthorized access)
after he requested a URL like that and it triggered an intrusion
detection alarm.

If you have links to things like "////" and dumb robots, put the
paths in your robots.txt. Don't forget that robots.txt is only
advisory and is commonly ignored by evil and/or broken robots.

--
Nick Kew

Posted by Stan Brown on October 30, 2005, 8:06 am

Sun, 30 Oct 2005 09:34:36 +0000 from Nick Kew
> If you have links to things like "////" and dumb robots, put the
> paths in your robots.txt. Don't forget that robots.txt is only
> advisory and is commonly ignored by evil and/or broken robots.

Wouldn't it be more effective to have any URL containing http://.*//
return a 403 Forbidden or a 404 Not Found? This could be done in
.htaccess or perhaps httpd.conf. I may be having a failure of
imagination, but I can't think of any legitimate reason for such a
link.
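
As a rough illustration of this suggestion, the Python sketch below applies the pattern quoted above to a full URL and picks a status code. It is not an .htaccess or httpd.conf rule; `status_for` and the choice of 404 over 403 are assumptions made for the example.

```python
import re

# The pattern from the post: "http://" followed later by another "//".
DOUBLE_SLASH = re.compile(r"http://.*//")


def status_for(url: str) -> int:
    """Return 404 if the URL has a '//' after the scheme, else 200."""
    return 404 if DOUBLE_SLASH.search(url) else 200


print(status_for("http://apache.org/foundation/"))
print(status_for("http://apache.org//////foundation/"))
```

Note that the pattern also matches a "//" that appears inside a query string, which may or may not be what you want.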

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2.1 spec: http://www.w3.org/TR/CSS21/
validator: http://jigsaw.w3.org/css-validator/
Why We Won't Help You:
http://diveintomark.org/archives/2003/05/05/why_we_wont_help_you

Posted by Philip Ronan on October 30, 2005, 2:44 pm

In comp.infosystems.www.authoring.html, "Stan Brown" wrote:

> Wouldn't it be more effective to have any URL containing http://.*//
> return a 403 Forbidden or a 404 Not Found? This could be done in
> .htaccess or perhaps httpd.conf. I may be having a failure of
> imagination, but I can't think of any legitimate reason for such a
> link.

That would also be effective, but maybe it's better to do something useful
with the URL if you can.

Most servers will redirect to a URL with a trailing slash when the name of a
directory is requested. Why not treat multiple slashes in a similar way?
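
A minimal Python sketch of that idea follows. It is not the .htaccess rule or PHP script mentioned earlier in the thread (neither is shown here); `canonical_redirect` is a hypothetical helper, and using a 301 status is an assumption.

```python
import re
from typing import Optional


def canonical_redirect(path: str) -> Optional[tuple[int, str]]:
    """Return (301, cleaned_path) if the path has repeated slashes, else None."""
    cleaned = re.sub(r"/{2,}", "/", path)
    if cleaned != path:
        # Send the client (or robot) to the single-slash form of the path.
        return 301, cleaned
    # Already canonical: serve the page as usual.
    return None


print(canonical_redirect("//////foundation/"))
print(canonical_redirect("/foundation/"))
```

A robot that follows the redirect ends up requesting /foundation/, which the existing Disallow rule does cover.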

Besides, it might help in terms of page rank.

[[Crossposted to alt.internet.search-engines, with apologies to Nick]]

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/


Posted by Philip Ronan on October 30, 2005, 11:15 am

"Nick Kew" wrote:

> If you have links to things like "////" and dumb robots, put the
> paths in your robots.txt. Don't forget that robots.txt is only
> advisory and is commonly ignored by evil and/or broken robots.

But retroactively adding to the robots.txt file every time someone posts a
bad link to your site just isn't a practical solution. I realize not all
robots bother with the robots.txt protocol, but if even the legitimate
spiders can be misdirected then the whole point of having a robots.txt file
goes out the window.

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

