Posted by Ilya Zakharevich on February 4, 2005, 1:12 am
For the purpose of recursive download of documents, I need to extract
and categorize the embedded "links" of the document. Currently, I do
it with a quick-and-dirty homegrown scheme, but I would prefer to do it
in a more robust way.
LinkExtor covers about two dozen different types of links, but it
makes no distinction between the very different natures of these
links:
a) Links which are just "abstract URLs", and not supposed to be ever
retrieved (base URL and many of its flavors);
b) Links which are "natural parts" of the current document: content
of these links is supposed to change the appearance of the
document in some browsers (style sheets, images, frames);
c) Complementary links which may add to a positive browsing experience,
but are not essential for representation of the document (favicon;
maybe more?);
d) Links which are supposed to be made "hot" in the document (user
is supposed to have a simple way to go to these links): <a>,
"prev"/"next"/"up"/"content"/"index" links, etc.
e) Other categories?
Is there a way to map a link to one of these categories without trying
to understand all of the HTML 4.0 standard?
Thanks,
Ilya
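
The categories (a) to (d) above can be sketched as a lookup keyed on the tag and attribute a link came from, with <link> decided by its rel value. This is only an illustration in Python using the standard html.parser module, not LinkExtor itself; the table contents, the category names, and the names LinkCategorizer and categorize_links are assumptions made for the example, and the table is far from a complete reading of HTML 4.

```python
from html.parser import HTMLParser

# Hypothetical table: which (tag, attribute) pairs fall in which category.
# The assignments are illustrative assumptions, not a full HTML 4 mapping.
CATEGORY_BY_TAG_ATTR = {
    ("base", "href"): "abstract",   # (a) never retrieved
    ("img", "src"): "part",         # (b) part of the document
    ("frame", "src"): "part",
    ("iframe", "src"): "part",
    ("script", "src"): "part",
    ("a", "href"): "hot",           # (d) navigable links
    ("area", "href"): "hot",
}

# <link href=...> is categorized by its rel value (also an assumption).
CATEGORY_BY_LINK_REL = {
    "stylesheet": "part",
    "icon": "complementary",        # (c) favicon and similar
    "prev": "hot", "next": "hot", "up": "hot",
    "content": "hot", "contents": "hot", "index": "hot",
}


class LinkCategorizer(HTMLParser):
    """Collects (category, tag, attribute, url) tuples from start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag == "link":
            attr_map = dict(attrs)
            url = attr_map.get("href")
            if url:
                rels = (attr_map.get("rel") or "").lower().split()
                category = next(
                    (CATEGORY_BY_LINK_REL[r] for r in rels
                     if r in CATEGORY_BY_LINK_REL),
                    "other",        # (e) anything the table does not know
                )
                self.links.append((category, tag, "href", url))
            return
        for name, value in attrs:
            category = CATEGORY_BY_TAG_ATTR.get((tag, name))
            if category and value:
                self.links.append((category, tag, name, value))


def categorize_links(html_text):
    """Return the categorized links found in html_text, in document order."""
    parser = LinkCategorizer()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

A recursive downloader built on such a table would fetch "part" links unconditionally, follow "hot" links subject to its depth and host rules, and never fetch "abstract" ones. Attributes missing from the table are silently skipped here, which is the main thing a robust version would have to change.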
Posted by John Bokma on February 4, 2005, 6:54 pm
Ilya Zakharevich wrote:
> For the purpose of recursive download of documents, I need to extract
> and categorize the embedded "links" of the document. Currently, I do
> it with a quick-and-dirty homegrown scheme, but I would prefer to do it
> in a more robust way.
>
> LinkExtor covers about two dozen different types of links, but it
> makes no distinction between the very different natures of these
> links:
>
> a) Links which are just "abstract URLs", and not supposed to be ever
> retrieved (base URL and many of its flavors);
>
> b) Links which are "natural parts" of the current document: content
> of these links is supposed to change the appearance of the
> document in some browsers (style sheets, images, frames);
>
> c) Complementary links which may add to a positive browsing experience,
> but are not essential for representation of the document (favicon;
> maybe more?);
>
> d) Links which are supposed to be made "hot" in the document (user
> is supposed to have a simple way to go to these links): <a>,
> "prev"/"next"/"up"/"content"/"index" links, etc.
>
> e) Other categories?
>
> Is there a way to map a link to one of these categories without trying
> to understand all of the HTML 4.0 standard?
One word: wget
Have a look at it; you can give it a list of extensions to include or
exclude.
--
John Small Perl scripts: http://johnbokma.com/perl/
Perl programmer available: http://castleamber.com/
Happy Customers: http://castleamber.com/testimonials.html
Posted by Ilya Zakharevich on February 4, 2005, 10:25 pm
[A complimentary Cc of this posting was sent to John Bokma]
> > For the purpose of recursive download of documents, I need to extract
> > and categorize the embedded "links" of the document. Currently, I do
> > it with a quick-and-dirty homegrown scheme, but I would prefer to do it
> > in a more robust way.
> >
> > LinkExtor covers about two dozen different types of links, but it
> > makes no distinction between the very different natures of these
> > links:
> > Is there a way to map a link to one of these categories without trying
> > to understand all of the HTML 4.0 standard?
>
> One word: wget
>
> Have a look at it; you can give it a list of extensions to include or
> exclude.
Do you mean wget source, or wget itself? If the latter, then in my
experience, it can handle about 10% of (my) downloads. This is why
I'm forced to have a replacement in Perl...
Hope this helps,
Ilya
Posted by John Bokma on February 4, 2005, 10:40 pm
Ilya Zakharevich wrote:
> Do you mean wget source, or wget itself?
The latter, the non-Perl binary.
> If the latter, then in my
> experience, it can handle about 10% of (my) downloads. This is why
> I'm forced to have a replacement in Perl...
What doesn't wget handle?
The only thing I don't like about wget is that it doesn't download in
parallel (or I missed that option).
--
John Small Perl scripts: http://johnbokma.com/perl/
Perl programmer available: http://castleamber.com/
Happy Customers: http://castleamber.com/testimonials.html
Posted by Ilya Zakharevich on February 5, 2005, 9:05 pm
[A complimentary Cc of this posting was sent to John Bokma]
> Ilya Zakharevich wrote:
>
> > Do you mean wget source, or wget itself?
>
> The latter, the non-Perl binary.
>
> > If the latter, then in my
> > experience, it can handle about 10% of (my) downloads. This is why
> > I'm forced to have a replacement in Perl...
>
> What doesn't wget handle?
Getting resources mirrored without aborting early in the process...
Or, if it does not abort, it creates a directory tree which is not
copyable to CD; or, if it is copyable to CD, the file names have
"wrong" extensions, so that one cannot tell that a file is an OGG
without running 'file' over it; something like:
    xxx/search.asp?boo+far
There may be other problems; it "works" so rarely that I very rarely
use it, now that I have a Perl solution which works...
Hope this helps,
Ilya
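
The file-naming complaint above (names like xxx/search.asp?boo+far that neither copy cleanly nor reveal the file type) can be illustrated with a small sketch. It is written in Python rather than Perl and is not the Perl solution mentioned in the post; the function local_name, the extension table, and the 64-character limit are assumptions for the example. It also assumes the copy problem comes from characters such as "?" and from overlong names, which the post does not spell out.

```python
import hashlib
import re
from urllib.parse import urlsplit

# Hypothetical, deliberately small table; extend it for the types you mirror.
EXTENSION_BY_CONTENT_TYPE = {
    "application/ogg": ".ogg",
    "audio/ogg": ".ogg",
    "text/html": ".html",
    "image/png": ".png",
    "image/jpeg": ".jpg",
}

# Anything outside this conservative set is replaced by "_".
UNSAFE_CHARS = re.compile(r"[^A-Za-z0-9._-]")


def local_name(url, content_type, max_len=64):
    """Map a URL and its Content-Type header to a relative local path."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s not in ("", ".", "..")]
    # The last path segment names the file unless the path ends in "/".
    if segments and not parts.path.endswith("/"):
        leaf = segments.pop()
    else:
        leaf = "index"
    if parts.query:
        leaf += "_" + parts.query        # keep the query so names stay distinct
    leaf = UNSAFE_CHARS.sub("_", leaf)

    # Take the extension from the server's media type, not from the URL.
    media_type = content_type.split(";", 1)[0].strip().lower()
    ext = EXTENSION_BY_CONTENT_TYPE.get(media_type, "")
    if ext and not leaf.lower().endswith(ext):
        leaf += ext

    # Shorten overlong names, adding a URL digest to keep them unique.
    if len(leaf) > max_len:
        digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
        leaf = leaf[: max_len - len(ext) - 9] + "_" + digest + ext

    dirs = [UNSAFE_CHARS.sub("_", s) for s in segments]
    return "/".join(dirs + [leaf])
```

With this sketch, http://example.org/xxx/search.asp?boo+far served as application/ogg maps to xxx/search.asp_boo_far.ogg. Two different URLs can still collide after character replacement, so a real mirror would need a collision check, and links inside saved pages would have to be rewritten to the new names.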