
blocking robots.txt from non-robots

Subject: blocking robots.txt from non-robots
Author: Joe Fox
Date: 02-20-2008
Posted by Big Bill on February 21, 2008, 6:54 am
wrote:

>
>>
>>>
>>>>
>>>>> I'm using a robots.txt file to control what is and is not crawled
>>>>> by search engine bots, but I'd like to make sure that anything that
>>>>> isn't a known search engine bot doesn't get the file I'm feeding to
>>>>> google, yahoo and the others.
>>>>
>>>> Why?
>>>>
>>>> I can imagine that you want to block your entire site for any bot
>>>> that's known to be abusive though, but those probably don't check
>>>> your robots.txt anyway.
>>>>
>>>
>>> Perhaps I didn't say it right. I'm wanting to block the robots.txt
>>> that I'm feeding search engines from being given to anybody else.
>>
>> Why? If the reason is that you want to "protect" some folders: it's
>> not secure and bound to fail sooner or later. Remember that not all
>> bots honor the robots.txt, especially not the ones that you don't want
>> on your site in the first place.
>
>I want to keep certain humans from reading the robots.txt that I give to
>search engines because it's none of their bloody business what pages I
>tell SE's not to index, and there are a few that might have mind enough to
>look at robots.txt. They will not, however, expect to be handed a tailored
>version of it.
>
>>> I
>>> realize that they *could* spoof the SE's user agent or something, but
>>> the people I'm concerned about are bright enough to look for robots.txt
>>> but not bright enough to expect to be handed a phoney.
>>
>> You want to hide the key under the doormat which has in 5 languages
>> "The key is hidden nearby" written on top...
>
>Not really, or is it possible that they could also get my .htaccess? I
>didn't think that was possible. If they ask for a robots.txt and get one
>that's got nothing more than a pointer to a sitemap that will satisfy
>'em.

Essentially you'd need to cloak: feed different content to different
requests. Ask Fantomaster.

BB
--

http://www.kruse.co.uk/
http://www.fat-odin.com/
http://www.here-be-posters.co.uk/

Posted by John Bokma on February 21, 2008, 10:55 am

> Not really, or is it possible that they could also get my .htaccess?
> I didn't think that was possible. If they ask for a robots.txt and
> get one that's got nothing more than a pointer to a sitemap that will
> satisfy 'em.

Let's assume for argument's sake that those people *want* to see your
robots.txt. If you feed Google something different than them, they will
notice as soon as they check Google, because if you disallow Google some
directories, while your robots.txt says allow, they will wonder why all
pages in some directory don't show up in Google, but are available on your
site.

I really don't get why you want to hide your robots.txt.

--
John Bokma http://johnbokma.com/

Posted by Joe Fox on February 21, 2008, 5:00 pm

>
>> Not really, or is it possible that they could also get my .htaccess?
>> I didn't think that was possible. If they ask for a robots.txt and
>> get one that's got nothing more than a pointer to a sitemap that will
>> satisfy 'em.
>
> Let's assume for argument's sake that those people *want* to see your
> robots.txt. If you feed Google something different than them, they
> will notice as soon as they check Google, because if you disallow
> Google some directories, while your robots.txt says allow, they will
> wonder why all pages in some directory don't show up in Google, but
> are available on your site.

They may wonder but that's it.

>
> I really don't get why you want to hide your robots.txt.
>

How about this idea?

Given: I want to hide robots.txt from the public in general and a certain
group in particular.

Given: You and others here don't see the need or point in my doing this.

Given: Ultimately, if there's a potential penalty for doing this then
that's mine to risk.

Can we simply agree to disagree and save discussion of *why* for another
time and go into some details about *how*?

Thanks

Posted by John Bokma on February 21, 2008, 6:58 pm

> Can we simply agree to disagree and save discussion of *why* for
> another time and go into some details about *how*?

Welcome to Usenet. Remember people try to help you in *their* spare time,
for *free*.

That being said: there are two ways that might do what you want:

1 IP address based: you have to find out the IP address ranges of
each bot you want to allow.
2 UserAgent string based: you have to find out each UA string for
each bot you want to allow.

In .htaccess you can redirect internally using either 1 or 2 to the right
robots.txt.
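
A minimal sketch of the decision behind options 1 and 2, written in Python rather than as .htaccess rules so the logic can be tried out on its own. The UA tokens, the IP ranges (reserved documentation networks, not real crawler ranges) and the two file names are assumptions for illustration only; none of them come from this thread.

```python
import ipaddress

# Assumed substrings of crawler User-Agent headers; check each engine's
# own documentation before relying on them.
ALLOWED_UA_TOKENS = ("googlebot", "slurp", "msnbot")

# Placeholder networks taken from the reserved documentation ranges.
# Real crawler ranges have to be looked up per search engine.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

# Hypothetical file names for the two versions of robots.txt.
BOT_FILE = "robots-bots.txt"
PUBLIC_FILE = "robots-public.txt"


def ip_is_allowed(remote_addr):
    """Option 1: True if the client address is inside an allowed range."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Not a parseable address: treat it as an ordinary visitor.
        return False
    return any(addr in net for net in ALLOWED_NETWORKS)


def ua_is_allowed(user_agent):
    """Option 2: True if the User-Agent contains an allowed token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in ALLOWED_UA_TOKENS)


def pick_robots_file(remote_addr, user_agent, by="ip"):
    """Return the name of the robots.txt variant to serve for a request."""
    if by == "ip":
        allowed = ip_is_allowed(remote_addr)
    else:
        allowed = ua_is_allowed(user_agent)
    return BOT_FILE if allowed else PUBLIC_FILE
```

The UA check is the easier one to maintain but anyone can send a crawler's User-Agent string; the IP check is harder to fool but the ranges need to be kept up to date.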

--
John Bokma http://johnbokma.com/

Posted by Joe Fox on February 21, 2008, 8:23 pm

>
>> Can we simply agree to disagree and save discussion of *why* for
>> another time and go into some details about *how*?
>
> Welcome to Usenet. Remember people try to help you in *their* spare
> time, for *free*.

Yes, that's right, they do, and I have always appreciated the help and
input I find on UseNet and other sources. That's why I didn't see the
need for a big deal about why. I didn't see a need to waste people's
time with *why*.

Now if I were trying to convince folks to do something like this on their
servers & sites (which I wouldn't... not my business), that would be
another matter entirely and I'd have to come with a truckload of *why*
and it'd better be bloody convincing at that.


> That being said: there are two ways that might do what you want:
>
> 1 IP address based: you have to find out the IP address ranges of
> each bot you want to allow.
> 2 UserAgent string based: you have to find out each UA string for
> each bot you want to allow.
>
> In .htaccess you can redirect internally using either 1 or 2 to the
> right robots.txt.

Thank you very much for a useful answer.

Sorry if I've come off like an ass. Real life is intruding. Not a good
excuse I realize but just when you think you've got enough to deal
with....
