A fast alternative to HTTP::head?

I have tens of thousands of links. Each one redirects visitors to
another URL (the real URL). I want to know the real URLs behind those
redirecting URLs.

I just wrote a small piece of code to do it:

    require_once 'HTTP.php';  // PEAR HTTP package

    $result = HTTP::head($link);
    if (PEAR::isError($result)) {
        echo "Error: " . $result->getMessage() . "<br>";
    } else {
        // for a redirect, the target URL is in the Location header
        echo "\t" . $result['Location'] . '<br>';
    }

It works, but it is simply too slow to get the real (destination) URLs
for tens of thousands of redirecting URLs.

Is there a fast way to do that?


Re: A fast alternative to HTTP::head?


PHP doesn't seem like your best bet unless you run the code multiple
times simultaneously to simulate multiple threads. What OS are you
using? Divide the workload into 20 parts and run 20 copies of the code,
or about 9 if you're on XP SP2 (SP2 caps concurrent half-open outbound
connections at 10).
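
Something like this rough sketch, say (urls.txt, resolve.php and the
chunk/result file names are all made up for illustration; the
background-launch syntax assumes a Unix shell):

    <?php
    // split-and-run.php -- divide urls.txt into $n chunks and launch
    // $n background workers (a hypothetical resolve.php does the work).
    $n = 20;   // try ~9 on XP SP2, per the connection limit above

    $urls   = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $chunks = array_chunk($urls, (int)max(1, ceil(count($urls) / $n)));

    foreach ($chunks as $i => $chunk) {
        file_put_contents("chunk$i.txt", implode("\n", $chunk));
        // redirecting output and appending & lets exec() return immediately
        exec("php resolve.php chunk$i.txt > result$i.txt 2>&1 &");
    }
    ?>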

Re: A fast alternative to HTTP::head?

Ming wrote:


Well, one slow server will delay the whole thing. You might want to speed
it up by using concurrency: i.e. you have a queue of tens of thousands of
URLs which need "handling", and several "handlers" which each run a loop
requesting a URL to resolve, resolving it and then storing the result.
You'll also need one thread to be a "queue manager" and one to be a
"result storer".

Overall, as DNS and HTTP can be quite a slow business, I'd recommend about
12 handlers, one queue manager and one storer. The queue manager and
storer can be a SQL database server if you like!
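
To make that concrete, here is a rough sketch of one handler written in
PHP with cURL, using the database as both queue manager and result
storer; the urls table (id, url, resolved, status), the credentials and
the claim-then-update pattern are all illustrative assumptions, not a
finished design:

    <?php
    // handler.php -- run several copies of this; each claims URLs from
    // an assumed table urls(id, url, resolved, status) and resolves them.
    $db = new PDO('mysql:host=localhost;dbname=links', 'user', 'pass');

    while (true) {
        // claim one unprocessed URL atomically so handlers don't collide
        $db->beginTransaction();
        $row = $db->query("SELECT id, url FROM urls
                           WHERE status = 'new' LIMIT 1 FOR UPDATE")
                  ->fetch(PDO::FETCH_ASSOC);
        if (!$row) { $db->commit(); break; }   // queue drained
        $db->exec("UPDATE urls SET status = 'taken' WHERE id = {$row['id']}");
        $db->commit();

        // resolve the redirect chain with a HEAD request via cURL
        $ch = curl_init($row['url']);
        curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD, no body
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // chase redirects
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);           // cap slow servers
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_exec($ch);
        $final = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
        curl_close($ch);

        // the "result storer" role, also handled by the database
        $stmt = $db->prepare("UPDATE urls SET resolved = ?, status = 'done'
                              WHERE id = ?");
        $stmt->execute(array($final, $row['id']));
    }
    ?>

The per-request timeout matters more than anything else here: without
it, one dead server stalls a handler indefinitely.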

Now, technically PHP is capable of doing this, but some other languages,
like Perl and C, are a bit better suited to writing multi-threaded
applications.
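
That said, PHP's curl_multi functions can run many requests concurrently
inside a single process, which avoids threads entirely. A minimal sketch
(resolve_batch is my own name, and the batch size of 12 just mirrors the
handler count above):

    <?php
    // resolve a batch of URLs concurrently with curl_multi
    function resolve_batch(array $urls) {
        $mh = curl_multi_init();
        $handles = array();
        foreach ($urls as $url) {
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD request
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // chase redirects
            curl_setopt($ch, CURLOPT_TIMEOUT, 10);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_multi_add_handle($mh, $ch);
            $handles[$url] = $ch;
        }

        // drive all transfers until every handle has finished
        do {
            curl_multi_exec($mh, $running);
            curl_multi_select($mh);   // wait for network activity
        } while ($running > 0);

        $resolved = array();
        foreach ($handles as $url => $ch) {
            $resolved[$url] = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
        curl_multi_close($mh);
        return $resolved;
    }

    // e.g. work through the big list twelve at a time:
    // foreach (array_chunk($all_urls, 12) as $batch) { resolve_batch($batch); }
    ?>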

Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux 2.6.12-12mdksmp, up 3 days, 11:56.]
