[RFC] URI::URL::Detail

I am hoping to add a module to CPAN and would like to get some
feedback/comments/ideas. The functionality of this module is
detailed below. I am not really sure what I should call it. So far I
have the following options in mind:


The module is intended to be used as part of a web crawler although I
have found myself using parts of it elsewhere.

The basic functionality of the proposed module will include:

Given the HTML of a page:
      Find all anchor elements, split into "this domain" links and
      "other domain" links.
      Find the title, description and other such metadata.
Split an anchor tag into the URL, the alt text and the anchor text.
Given a potentially relative URL and the current URL, return the
absolute URL.
Given a potentially redirecting URL, return the final destination URL.
Break a URL up into protocol, domain and URI.
"Clean" a URL so it can be used as a string (in, say, regular
expressions or MySQL insert statements).
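
To make the "relative to absolute" and "this domain / other domain" items above concrete, here is a minimal sketch built on the URI module from CPAN. The helper name classify_links and the example.com URLs are made up for illustration and are not part of the proposed module. Note that URI->new_abs needs an absolute base URL, so the current URL has to include its scheme.

```perl
use strict;
use warnings;
use URI;    # CPAN module, not part of core Perl

# Hypothetical helper: resolve each href against the page URL and sort the
# results into same-host and other-host links.
sub classify_links {
    my ($base_url, @hrefs) = @_;
    my $base = URI->new($base_url);
    my (@same, @other);
    for my $href (@hrefs) {
        my $abs = URI->new_abs($href, $base);
        # Skip schemes without a host component (mailto:, javascript:, ...).
        next unless $abs->can('host') && defined $abs->host;
        if (lc $abs->host eq lc $base->host) {
            push @same, $abs->as_string;
        }
        else {
            push @other, $abs->as_string;
        }
    }
    return (\@same, \@other);
}

my ($same, $other) = classify_links(
    'http://www.example.com/dir/page.html',
    'about.html', '../index.html',
    'http://www.example.org/', 'mailto:me@example.com',
);
print "this domain:  $_\n" for @$same;
print "other domain: $_\n" for @$other;
```

Comparing whole host names is the simplest possible rule; whether www.example.com and example.com should count as the same domain is a design decision the sketch leaves open.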

I intend to make these functions available both independently and
together in an Object Oriented structure.

The OO part would look something like this:

my $b = foo::bar->new( {

   ## new() will croak if CURRENT_URL is not provided.
   CURRENT_URL              => 'www.site_i_am_crawling.com/page_i_am_crawling.html',
   FIND_CONTAINED_URLS      => 1 , ## Default 1
   BREAK_CONTAINED_URLS     => 1 , ## Default 1
   ABSOLUTE_CONTAINED_URLS  => 1 , ## Default 1
   CLEAN_URLS               => 1 , ## Default 1

   ## Optional, will be extracted if this is not provided.
   CURRENT_URL_HTML         => "long string here",

   'USER-AGENT'             => '' , ## Key must be quoted because of the hyphen.
   TIMEOUT                  => 5  ,

   DEBUG                    => 0

} );

    ## Can reset object parameters here.
    ## All processing will be performed only when this function is called.

## Object oriented interface
my @array_of_urls = $b->get_contained_urls();

## Functional interface, used independently of the object
my @array_of_urls = get_contained_urls( URL => '', HTML => '' );

my $all_results = $b->get_all_results();

The following is a list of existing CPAN modules that are similar to
the one proposed here.

WWW::Spider         - Far too advanced to be used in this context.

Similar to "Find absolute"

Get the html ( and find elements )
    URI::Title::HTML    -  No POD, gets titles only.
    HTML::HeadParser    -  Parses only the HEAD.
    HTML::TreeBuilder   -  Overkill?

Clean string (for MySQL and regex)
    CGI::Untaint - Indirect use.
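
For comparison, core Perl already covers the regex half of "cleaning": quotemeta escapes every regex metacharacter. The sketch below uses a made-up URL and only illustrates the idea; it is not the proposed module's cleaning function. For the SQL half, the comment at the end assumes a DBI handle $dbh and a hypothetical table named urls.

```perl
use strict;
use warnings;

my $url  = 'http://www.example.com/search.cgi?q=a+b&page=2';
my $text = "see http://www.example.com/search.cgi?q=a+b&page=2 for details";

# quotemeta (or \Q...\E inside a pattern) escapes regex metacharacters,
# so the dots, "?" and "+" in the URL match only themselves.
my $escaped = quotemeta $url;
print "escaped pattern matches\n" if $text =~ /$escaped/;

# Without escaping, "?" and "+" act as quantifiers and the same text
# does not match.
print "unescaped pattern fails\n" unless $text =~ /$url/;

# For SQL, bound placeholders avoid manual escaping altogether. This
# assumes a DBI handle $dbh and a hypothetical table "urls":
#   $dbh->do('INSERT INTO urls (url) VALUES (?)', undef, $url);
```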

Break contained URLs
    URI - There are several ways to achieve this, including a simple
regex. This functionality is included here for completeness.
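
As an illustration of the "simple regex" route, the sketch below uses the generic URI-splitting regular expression given in RFC 3986, Appendix B, with non-capturing groups so that only the five components are returned. The helper name split_url and the example URL are made up for illustration. The URI module's scheme, host and path methods give similar results without a hand-written pattern.

```perl
use strict;
use warnings;

# Hypothetical helper built on the regex from RFC 3986, Appendix B.
# Components that are absent from the URL come back as undef.
sub split_url {
    my ($url) = @_;
    my ($scheme, $authority, $path, $query, $fragment) =
        $url =~ m{^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?}s;
    return {
        scheme    => $scheme,
        authority => $authority,
        path      => $path,
        query     => $query,
        fragment  => $fragment,
    };
}

my $parts = split_url('http://www.example.com:8080/dir/page.html?id=7#top');
printf "%-9s %s\n", $_, $parts->{$_} // '(none)'
    for qw(scheme authority path query fragment);
```

The authority component still contains the port (and any user info); separating those out is one reason to prefer URI over a hand-rolled pattern.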
