blocking robots.txt from non-robots

Newsgroup: alt.internet.search-engines
Subject Author Date
blocking robots.txt from non-robots Joe Fox 02-20-2008
Posted by John Bokma on February 21, 2008, 10:03 pm


[..]

>> That being said: there are two ways that might do what you want:
>>
>> 1 IP address based: you have to find out the IP address ranges of
>> each bot you want to allow.
>> 2 UserAgent string based: you have to find out each UA string for
>> each bot you want to allow.
>>
>> In .htaccess you can redirect internally using either 1 or 2 to the
>> right robots.txt.
>
> Thank you very much for a useful answer.
>
> Sorry if I've come off like an ass.

Thanks, no problem.

Like I said, a lot of people on Usenet think they have an X problem, while
the real one is Y, so people often assume this is the case.

I also still can't see why you want to do this, but like you wrote, it's
your server:

method 1: if you miss out spiders, you might lose traffic.
          hard to test (it can be done, with 2 computers + router)
method 2: if you miss out spiders, you might lose traffic.
          easy to test: you can either write a Perl program
          that changes the UA for each request, or check
          manually with Firefox + UA switcher add-on


Untested:

RewriteCond %{HTTP_USER_AGENT} =UA1 [OR]
RewriteCond %{HTTP_USER_AGENT} =UA2 [OR]
RewriteCond %{HTTP_USER_AGENT} =UA3 [OR]
RewriteRule ^robots.txt$ real-robots.txt [L]

with UA1..UAn the *exact* UA plain string, e.g.
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

See: http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html
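
For method 1 the same idea can be keyed on the client address instead of the UA. A minimal sketch, equally untested, assuming mod_rewrite's %{REMOTE_ADDR} variable and ranges that fall on octet boundaries, so that a prefix regex is enough. The two ranges shown are documentation-only placeholders (192.0.2.0/24 and 198.51.100.0/24), not real crawler addresses; the real ranges have to be looked up for each bot.

```apache
RewriteEngine On

# Placeholder ranges only (reserved for documentation), NOT real bot addresses.
# Each condition matches the start of the dotted client IP address.
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\. [OR]
RewriteCond %{REMOTE_ADDR} ^198\.51\.100\.

# Matching clients are internally rewritten to the real file;
# everyone else falls through to the ordinary robots.txt.
RewriteRule ^robots\.txt$ real-robots.txt [L]
```

Ranges that do not end on an octet boundary need a more careful pattern than a simple prefix.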

--
John Bokma http://johnbokma.com/

Posted by Don on February 21, 2008, 10:13 pm

>
>
> [..]
>
>>> That being said: there are two ways that might do what you want:
>>>
>>> 1 IP address based: you have to find out the IP address ranges of
>>> each bot you want to allow.
>>> 2 UserAgent string based: you have to find out each UA string for
>>> each bot you want to allow.
>>>
>>> In .htaccess you can redirect internally using either 1 or 2 to the
>>> right robots.txt.
>>
>> Thank you very much for a useful answer.
>>
>> Sorry if I've come off like an ass.
>
> Thanks, no problem.
>
> Like I said, a lot of people on Usenet think they have an X problem,
> while the real one is Y, so people often assume this is the case.
>
> I also still can't see why you want to do this, but like you wrote,
> it's your server:
>
> method 1: if you miss out spiders, you might lose traffic.
>           hard to test (it can be done, with 2 computers + router)
> method 2: if you miss out spiders, you might lose traffic.
>           easy to test: you can either write a Perl program
>           that changes the UA for each request, or check
>           manually with Firefox + UA switcher add-on
>
>
> Untested:
>
> RewriteCond %{HTTP_USER_AGENT} =UA1 [OR]
> RewriteCond %{HTTP_USER_AGENT} =UA2 [OR]
> RewriteCond %{HTTP_USER_AGENT} =UA3 [OR]
> RewriteRule ^robots.txt$ real-robots.txt [L]
>
> with UA1..UAn the *exact* UA plain string, e.g.
> Mozilla/5.0 (compatible; Googlebot/2.1;
> +http://www.google.com/bot.html)
>
> See: http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html
>

John,
Just a heads up (not critique).
The last "[OR]" is invalid.

Posted by John Bokma on February 22, 2008, 1:10 am


[..]
>> RewriteCond %{HTTP_USER_AGENT} =UA1 [OR]
>> RewriteCond %{HTTP_USER_AGENT} =UA2 [OR]
>> RewriteCond %{HTTP_USER_AGENT} =UA3 [OR]
>> RewriteRule ^robots.txt$ real-robots.txt [L]
>>
>> with UA1..UAn the *exact* UA plain string, e.g.
>> Mozilla/5.0 (compatible; Googlebot/2.1;
>> +http://www.google.com/bot.html)
>>
>> See: http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html
>>
>
> John,
> Just a heads up (not critique).
> The last "[OR]" is invalid.

:-) yup, 100% right, so much for being lazy and copying one line twice.
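
With the [OR] dropped from the last condition, the rules would look roughly like this. Still a sketch and still untested: it assumes the test string is %{HTTP_USER_AGENT} (the mod_rewrite variable for the User-Agent header), that mod_rewrite directives are allowed in .htaccess, and that UA2 and UA3 stand in for real UA strings. A UA string that contains spaces has to be quoted as a whole, including the leading "=".

```apache
RewriteEngine On

# "=" means an exact string comparison, not a regular expression.
RewriteCond %{HTTP_USER_AGENT} "=Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" [OR]
# UA2 and UA3 are placeholders for further exact UA strings.
RewriteCond %{HTTP_USER_AGENT} =UA2 [OR]
# No [OR] on the last condition: nothing follows it to be OR-ed with.
RewriteCond %{HTTP_USER_AGENT} =UA3

# Listed bots are internally rewritten to the real file;
# every other client keeps getting the ordinary robots.txt.
RewriteRule ^robots\.txt$ real-robots.txt [L]
```

Note that real-robots.txt stays reachable for anyone who guesses the file name, unless it is blocked separately.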

--
John Bokma http://johnbokma.com/

Posted by Big Bill on February 22, 2008, 12:18 am

>
>>
>>> Can we simply agree to disagree and save discussion of *why* for
>>> another time and go into some details about *how*?
>>
>> Welcome to Usenet. Remember people try to help you in *their* spare
>> time, for *free*.
>
>Yes, that's right, they do, and I have always appreciated the help and
>input I find on Usenet and other sources. That's why I didn't see the
>need for a big deal about why. I didn't see a need to waste people's
>time with *why*.

Because when people come on here asking how to do weird stuff, it's
usually because they're asking the wrong question in the first place.
As you are yourself, I think. I don't think what you suggest is
practical, as you'd need to know technical info about your visitors
that I don't imagine you'd be able to get.

BB
--

http://www.kruse.co.uk/
http://www.fat-odin.com/
http://www.here-be-posters.co.uk/

Posted by Don on February 21, 2008, 8:55 pm

>
>> Not really, or is it possible that they could also get my .htaccess?
>> I didn't think that was possible. If they ask for a robots.txt and
>> get one that's got nothing more than a pointer to a sitemap that will
>> satisfy 'em.
>
> Let's assume for argument's sake that those people *want* to see your
> robots.txt. If you feed Google something different than them, they
> will notice as soon as they check Google, because if you disallow
> Google from some directories, while the robots.txt they see says
> allow, they will wonder why all pages in some directory don't show up
> in Google, but are available on your site.
>
>

You give the majority of the general public too much credit ;)
Comparing a website's robots.txt to Google results!

