HTML Parsing issues - Part II

I'm trying to have the following script parse

<table class="item_description">
                                <td>Acer Aspire AS5610-2089 Notebook, Intel Pentium Dual Core
T2080, 1.6 GHz, 1024GB, 160GB, DVD+/-R DL/DVD+RW Drive, 15.4" TFT,
WebCam, 56K Modem, Wireless, NIC, Vista Home Premium, Refurbished with
90 Day Warranty</td>

From the following url

Basically, I want the product description between the tags. I tried to
modify the following script

more input

$ more

use strict;
use warnings;

use HTML::TokeParser;
use LWP::Simple;
use LWP::UserAgent;
use HTML::LinkExtor;

#for privoxy
my $browser = LWP::UserAgent->new;
$browser->proxy( ['http', 'https' ], "http://localhost:8118");

my ($input_file) = @ARGV;
die "No input file specified\n" unless defined $input_file;

open my $INPUT, '<', $input_file
    or die "Cannot open '$input_file': $!";

ID:
while ( my $id = <$INPUT> ) {
    chomp $id;

    my $url = make_url( $id );
    my $html = get($url);

    my $get_links = new HTML::LinkExtor;

    my @links = $get_links->links;
    foreach (@links) {
        # $_ contains [type, [name, value], ...]
        #print "Type: ", shift @$_, "\n";
#Start to parse the images. I think using a pop() vs a shift is
        shift @$_;
        while (my ($name, $value) = splice(@$_, 0, 2)) {
            if($value =~ / {
            print "  $name -> $value\n";
            }
        }
    }

    unless ( defined $html ) {
        warn "Error downloading from '$url'\n";
        next ID;
    }

    # Pass a reference: a plain scalar would be treated as a file name.
    my $parser = HTML::TokeParser->new( \$html );

    TABLE:
    while ( my $token = $parser->get_tag('table') ) {
        if ( lc $token->[1]{class} eq 'item_description' ) {
            my $td = $parser->get_tag('tr');
            last TABLE unless $td;
            my $cell = $parser->get_text('/tr');
            my %data;
            while ( $cell =~ /\s*([^:]+?):\s+(\d+)\s+/g ) {
                $data{$1} = $2;
            }
            use Data::Dumper;
            print Dumper \%data;
        }
    }
}

sub make_url {
    sprintf q{ }, $_[0];
}


$ ./ input
  src ->

But it doesn't work. Am I using the wrong tags, is the regular
expression wrong, or both?

Re: HTML Parsing issues - Part II

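     use LWP::Simple;
     use HTML::TokeParser;

     my $html = get ' ';
     # new() takes a reference to the document; a plain string is a file name
     my $p = HTML::TokeParser->new( \$html );

     # Skip ahead to the table whose class attribute is item_description
     while ( my $table = $p->get_tag('table') ) {
         last if $table->[1]{class} and
           $table->[1]{class} eq 'item_description';
     }
     print $p->get_trimmed_text('/td');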

Gunnar Hjalmarsson
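
For anyone who wants to try the approach above without fetching a page, here is a minimal sketch that runs against an inline string. The markup is a shortened copy of the snippet quoted in the question, the first table and the item_description() helper are made up for illustration, and the closing </table> tag is an assumption about the real page.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Shortened inline copy of the markup from the question (assumption:
# the real page closes the table; the first table is invented here).
my $html = <<'HTML';
<table class="other"><td>ignore me</td></table>
<table class="item_description">
    <td>Acer Aspire AS5610-2089 Notebook,
Refurbished with 90 Day Warranty</td>
</table>
HTML

# Hypothetical helper: return the text of the first <td> inside the
# first table whose class attribute is 'item_description'.
sub item_description {
    my ($doc) = @_;

    # A reference tells HTML::TokeParser to parse the string itself
    # rather than open a file of that name.
    my $p = HTML::TokeParser->new( \$doc ) or die "Cannot parse document";

    while ( my $table = $p->get_tag('table') ) {
        my $class = $table->[1]{class};
        next unless defined $class and $class eq 'item_description';

        $p->get_tag('td') or return;
        # Text up to </td>, with runs of whitespace collapsed.
        return $p->get_trimmed_text('/td');
    }
    return;
}

print item_description($html), "\n";
```

Checking the class with defined first avoids an "uninitialized value" warning for tables that have no class attribute at all.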

Re: HTML Parsing issues - Part II

How did you know to use $table->[1] and not, say, $table->[0]?
Is there something in the documentation that I missed?

Re: HTML Parsing issues - Part II

I played with Data::Dumper to figure it out. Don't know if it can be
derived from the docs.

Gunnar Hjalmarsson
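
The Data::Dumper experiment mentioned above can be sketched roughly as follows. The HTML::TokeParser documentation does describe what get_tag returns for a start tag: an array reference of the form [$tag, $attr, $attrseq, $text], so element 0 is the tag name and element 1 is a hash reference of attributes. The sample markup and the id="t1" attribute are invented for this example.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;    # stable key order in the dump

# Invented sample markup; only the class name comes from the thread.
my $html = '<table class="item_description" id="t1"><td>text</td></table>';
my $p    = HTML::TokeParser->new( \$html );

# For a start tag, get_tag returns [$tag, $attr, $attrseq, $text].
my $table = $p->get_tag('table');
print Dumper($table);

print "tag:   $table->[0]\n";           # the tag name
print "class: $table->[1]{class}\n";    # one entry of the attribute hash
```

The dump makes it visible that $table->[0] is just the string 'table', which is why the class has to be looked up through $table->[1].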
