
Recursive download from the web

Posted by Ilya Zakharevich on February 4, 2005, 1:12 am
For the purpose of recursive download of documents, I need to extract
and categorize the embedded "links" of the document. Currently, I do
this with a quick-and-dirty homegrown scheme, but I would prefer to do
it in a more robust way.

LinkExtor covers about a couple of dozen different types of links,
but it makes no distinction between the very different natures of
these links:

a) Links which are just "abstract URLs", and are never supposed to be
retrieved (base URL and many of its flavors);

b) Links which are "natural parts" of the current document: the content
of these links is supposed to change the appearance of the
document in some browsers (style sheets, images, frames);

c) Complementary links which may improve the browsing experience,
but are not essential for the representation of the document (favicon;
maybe more?);

d) Links which are supposed to be made "hot" in the document (the user
is supposed to have a simple way to go to these links): <a>,
"prev"/"next"/"up"/"content"/"index" links, etc.

e) Other categories?

Is there a way to map a link to one of these categories without trying
to understand all of the HTML 4.0 standard?

Thanks,
Ilya
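
A minimal sketch of the kind of mapping asked about above, assuming a small hand-written table keyed on (tag, attribute) plus the rel keywords of <link>. It uses Python's standard html.parser rather than LinkExtor, only to illustrate the idea. The category names and both tables are hypothetical and deliberately incomplete; they do not cover the full HTML 4.0 attribute list.

```python
from html.parser import HTMLParser

# Hypothetical category names mirroring items (a)-(d) of the post.
ABSTRACT, PART, COMPLEMENTARY, HOT, OTHER = (
    "abstract", "part", "complementary", "hot", "other")

# Assumed (tag, attribute) -> category table; deliberately incomplete.
TAG_RULES = {
    ("base", "href"): ABSTRACT,
    ("img", "src"): PART,
    ("frame", "src"): PART,
    ("iframe", "src"): PART,
    ("script", "src"): PART,
    ("a", "href"): HOT,
    ("area", "href"): HOT,
}

# <link> is classified by its rel keywords, not by the tag alone.
REL_RULES = {
    "stylesheet": PART,
    "icon": COMPLEMENTARY,
    "prev": HOT, "next": HOT, "up": HOT, "contents": HOT, "index": HOT,
}


class LinkClassifier(HTMLParser):
    """Collect (category, tag, url) triples while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # tag and attribute names arrive lowercased
        if tag == "link" and attrs.get("href"):
            rels = (attrs.get("rel") or "").lower().split()
            cats = [REL_RULES[r] for r in rels if r in REL_RULES]
            # First recognized rel keyword wins; unknown rels fall to OTHER.
            self.links.append((cats[0] if cats else OTHER, tag, attrs["href"]))
            return
        for (rule_tag, attr), category in TAG_RULES.items():
            if rule_tag == tag and attrs.get(attr):
                self.links.append((category, tag, attrs[attr]))


def classify(html):
    """Return the classified links of an HTML string, in document order."""
    parser = LinkClassifier()
    parser.feed(html)
    parser.close()
    return parser.links
```

A recursive downloader built on such a table could always fetch the "part" links, follow the "hot" links only up to its depth limit, and skip the "abstract" ones. Anything that lands in "other" is a hint that the table needs another row.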


Posted by John Bokma on February 4, 2005, 6:54 pm
Ilya Zakharevich wrote:

> For the purpose of recursive download of documents, I need to extract
> and categorize the embedded "links" of the document. Currently, I do
> this with a quick-and-dirty homegrown scheme, but I would prefer to do
> it in a more robust way.
>
> LinkExtor covers about a couple of dozen different types of links,
> but it makes no distinction between the very different natures of
> these links:
>
> a) Links which are just "abstract URLs", and are never supposed to be
> retrieved (base URL and many of its flavors);
>
> b) Links which are "natural parts" of the current document: the content
> of these links is supposed to change the appearance of the
> document in some browsers (style sheets, images, frames);
>
> c) Complementary links which may improve the browsing experience,
> but are not essential for the representation of the document (favicon;
> maybe more?);
>
> d) Links which are supposed to be made "hot" in the document (the user
> is supposed to have a simple way to go to these links): <a>,
> "prev"/"next"/"up"/"content"/"index" links, etc.
>
> e) Other categories?
>
> Is there a way to map a link to one of these categories without trying
> to understand all of the HTML 4.0 standard?

One word: wget

Have a look at it, you can give a list of extensions to include,
exclude.

--
John Small Perl scripts: http://johnbokma.com/perl/
Perl programmer available: http://castleamber.com/
Happy Customers: http://castleamber.com/testimonials.html
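
For reference, the include/exclude lists mentioned above are wget's -A (--accept) and -R (--reject) options, used together with -r. A minimal sketch of the same idea as a filter function, assuming plain extension matching on the URL path only; the function name and parameters are hypothetical, and this does not reproduce wget's actual matching rules.

```python
import posixpath
from urllib.parse import urlsplit


def wanted(url, accept=(), reject=()):
    """Return True if the URL's path extension passes the accept/reject lists.

    An empty accept list means "accept everything not rejected".
    Extensions are compared without the leading dot, case-insensitively.
    """
    path = urlsplit(url).path  # the query string is ignored on purpose
    ext = posixpath.splitext(path)[1].lstrip(".").lower()
    if ext in reject:
        return False
    return not accept or ext in accept
```

Note that a filter like this only sees the URL. It cannot tell what a dynamically generated resource actually contains, since that is known only from the server's response.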



Posted by Ilya Zakharevich on February 4, 2005, 10:25 pm
[A complimentary Cc of this posting was sent to John Bokma], who wrote:
> > For the purpose of recursive download of documents, I need to extract
> > and categorize the embedded "links" of the document. Currently, I do
> > this with a quick-and-dirty homegrown scheme, but I would prefer to do
> > it in a more robust way.
> >
> > LinkExtor covers about a couple of dozen different types of links,
> > but it makes no distinction between the very different natures of
> > these links:

> > Is there a way to map a link to one of these categories without trying
> > to understand all of the HTML 4.0 standard?
>
> One word: wget
>
> Have a look at it, you can give a list of extensions to include,
> exclude.

Do you mean wget source, or wget itself? If the latter, then in my
experience, it can handle about 10% of (my) downloads. This is why
I'm forced to have a replacement in Perl...

Hope this helps,
Ilya


Posted by John Bokma on February 4, 2005, 10:40 pm
Ilya Zakharevich wrote:

> Do you mean wget source, or wget itself?

The latter, the non-Perl binary.

> If the latter, then in my
> experience, it can handle about 10% of (my) downloads. This is why
> I'm forced to have a replacement in Perl...

What doesn't wget handle?

The only thing I don't like about wget is that it doesn't download in
parallel (or I missed that option).

--
John Small Perl scripts: http://johnbokma.com/perl/
Perl programmer available: http://castleamber.com/
Happy Customers: http://castleamber.com/testimonials.html
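
A minimal sketch of what fetching in parallel could look like in a homegrown downloader, assuming a thread pool and a caller-supplied fetch function. The names fetch_all, fetch and workers are hypothetical, and this is not a description of anything wget does.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, workers=4):
    """Call fetch(url) for every URL on a thread pool.

    Returns a dict {url: result}. `urls` must be a list (it is iterated
    twice). An exception raised by any fetch call propagates to the caller.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(fetch, urls)))
```

In real use, fetch would wrap an HTTP client call, for example one built on urllib.request.urlopen. A polite crawler would also keep the number of workers low and limit concurrent requests per host.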



Posted by Ilya Zakharevich on February 5, 2005, 9:05 pm
[A complimentary Cc of this posting was sent to John Bokma], who wrote:
> Ilya Zakharevich wrote:
>
> > Do you mean wget source, or wget itself?
>
> The latter, the non-Perl binary.
>
> > If the latter, then in my
> > experience, it can handle about 10% of (my) downloads. This is why
> > I'm forced to have a replacement in Perl...
>
> What doesn't wget handle?

Getting resources mirrored without aborting early in the process...
Or, if it does not abort, it creates a directory tree which is not
copyable to CD; or, if it is copyable to CD, the file names have
"wrong" extensions, so that one cannot tell that a file is an OGG
without running 'file' over it; something like:

xxx/search.asp?boo+far

There may be other problems - it "works" so rarely that I use it very
rarely - now that I have a Perl solution which works...

Hope this helps,
Ilya
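
A minimal sketch of one way to deal with the file-name problem described above: derive the local name from the URL, replace awkward characters, and append an extension chosen from the Content-Type header. The function name, the tiny type table and the "safe" character set are all assumptions made for illustration. They are not taken from the Perl solution mentioned in the post, and they are not checked against any particular CD file system's rules.

```python
import re
from urllib.parse import urlsplit

# Assumed, deliberately tiny Content-Type -> extension table.
EXT_BY_TYPE = {
    "application/ogg": ".ogg",
    "audio/ogg": ".ogg",
    "text/html": ".html",
    "image/png": ".png",
}


def _safe(segment):
    """Keep only a conservative character set; never return '.' or '..'."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", segment)
    return cleaned.lstrip(".") or "_"


def local_name(url, content_type):
    """Map a URL plus its Content-Type to a relative local file name."""
    parts = urlsplit(url)
    segments = [_safe(s) for s in parts.path.split("/") if s]
    if not segments or parts.path.endswith("/"):
        segments.append("index")  # directory-like URL
    if parts.query:
        # Fold the query string into the file name instead of dropping it.
        segments[-1] += "_" + _safe(parts.query)
    name = "/".join(segments)
    mime = content_type.split(";")[0].strip().lower()
    ext = EXT_BY_TYPE.get(mime, "")
    if ext and not name.lower().endswith(ext):
        name += ext
    return name
```

With this scheme, the example above fetched as application/ogg would be stored as xxx/search.asp_boo_far.ogg, so the type is visible without running 'file' over it. Distinct URLs can still collide after sanitizing, so a real mirror would need a collision check as well.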

