Extracting a table from a webpage


I hope this is the right group; I don't usually post, but I'm really stuck.

I would like to scrape all the values from the table

But I'm having difficulty getting HTML::TableExtract to achieve this; I
keep getting null values back.

The other thing is that I want to get all the pages; as you can see from
that page, there are roughly 3800 lines in the table.

I have already tried to manipulate my HTTP POSTs with the Firefox
plugin Tamper Data (great extension, comes highly recommended!), but
the script that serves that page is well written and guards against
this. So I looked at the HTTP traffic triggered by the "next" button
at the bottom, and found that it produces an absolutely massive
string that I can't even begin to understand. I also think it uses
some sort of validation process based on the field names
(e.g. "__EVENTVALIDATION").

Any advice, even if it's just on the scraping, would be gratefully received.

Kind regards and thanks in advance


Re: Extracting a table from a webpage

On Mon, 28 Apr 2008 13:50:57 -0700, googlinggoogler@hotmail.com wrote:


It's difficult to analyze your problem without seeing the code you are
using. HTML::TableExtract shouldn't have a problem getting that table
out. I happened to have an old table extracting script lying around,
which I've modified for your case:

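use warnings;
use strict;
use HTML::TableExtract;
use LWP::Simple;

my $isafilename = "isa.html";

# Download the page once and keep a local copy for later runs.
if (!-f $isafilename) {
    my $isaurl = "url goes here";
    my $isadata = get($isaurl);
    die "Could not fetch $isaurl\n" unless defined $isadata;
    open my $isafile, ">", $isafilename or die $!;
    print $isafile $isadata;
    close $isafile or die $!;
}

# Parse the saved file and report every table found in it.
my $te = HTML::TableExtract->new();
$te->parse_file($isafilename);
foreach my $ts ($te->tables) {
    my @rows = $ts->rows;
    print "Table found at ", join(',', $ts->coords), " with ";
    print scalar(@rows), " rows\n";
}
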
This worked correctly for me & found four tables in the page.


Hmm, I manually changed the tab= string in the URL, to "tab=2" and
"tab=3" etc. and got the subsequent tables correctly, so it doesn't seem
to me that they are trying to hide the data.
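
The manual tab-by-tab approach above could be scripted along these lines. This is only a rough sketch, assuming the page really is selected by a "tab" parameter in the URL as observed. The helper names page_url and table_rows are made up for illustration, the page URL and the last tab number have to be supplied on the command line, and the script prints rows from every table on each page because which table holds the data may differ.

```perl
use strict;
use warnings;

# Return a copy of $url with its tab= parameter set to $tab.
# An existing tab=N is replaced; otherwise the parameter is appended.
sub page_url {
    my ($url, $tab) = @_;
    return $url if $url =~ s/([?&]tab=)\d+/$1$tab/;
    return $url . ($url =~ /\?/ ? "&" : "?") . "tab=$tab";
}

# Fetch one page and return the rows of every table on it
# (an empty list if the download fails).
sub table_rows {
    my ($url) = @_;
    require LWP::Simple;
    require HTML::TableExtract;
    my $html = LWP::Simple::get($url);
    return unless defined $html;
    my $te = HTML::TableExtract->new();
    $te->parse($html);
    $te->eof;
    return map { $_->rows } $te->tables;
}

# Usage: perl pages.pl URL [LAST_TAB]
my ($start_url, $last_tab) = @ARGV;
if (defined $start_url) {
    $last_tab = 1 unless defined $last_tab;
    for my $tab (1 .. $last_tab) {
        for my $row (table_rows(page_url($start_url, $tab))) {
            # Empty cells come back undefined, so print them as "".
            print join("\t", map { defined $_ ? $_ : "" } @$row), "\n";
        }
    }
}
else {
    warn "usage: $0 URL [LAST_TAB]\n";
}
```

If the site turns out to need the POST data after all, this will not help, but for plain GET requests with a changing tab= value it should be enough.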

Re: Extracting a table from a webpage

googlinggoogler@hotmail.com wrote:

I decided to play a little with HTML::TableExtract, and this worked fine:

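     use HTML::TableExtract;

     # Select the table by its column headers
     my $te = HTML::TableExtract->new( headers => [
       qw(Fund\sName Risk Std\sDev YTD 1\sYr 3\sYr\nAnlsd 5\sYr 10\sYr)
     ], );
     # Parse a saved copy of the page (file name as in the earlier script)
     $te->parse_file("isa.html");
     printf "%-42s%-13s%7s%7s%7s%7s%7s%7s\n", @$_
       for ($te->tables)[0]->rows;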

Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
