Extracting table in html page


Hello All,

I have an HTML file from which I am trying to extract a table. The problem
I am facing is that there are a lot of tables in the page, and the table I am
looking to extract appears after a particular string, say $some_text. I
know how to search for the string in the HTML page, but
what I want to do is capture the table that immediately follows the
string.

Any suggestions on how to do this?


Re: Extracting table in html page


The most reliable way would be to use the HTML::Parser module to parse
the HTML file, register appropriate handlers for the table elements
(<table>, <tr>, <td>) and one for text elements, look for your string,
and process the next table encountered in a callback (handler
subroutines are called as callbacks by the parsing method).
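
A minimal sketch of the handler approach described above, assuming the cells carry explicit closing tags and that the search string sits in ordinary text rather than inside a tag. The helper name table_after_text and its return shape (a reference to an array of rows) are made up for illustration; only the HTML::Parser calls are the module's own.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: collect the rows of the first table that opens
# after $some_text has been seen in a text node.
sub table_after_text {
    my ($html, $some_text) = @_;
    my ($seen, $depth, $done, $cell, @rows) = (0, 0, 0, undef);

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag) = @_;
            return if $done || !$seen;
            if    ($tag eq 'table')               { $depth++ }
            elsif ($depth == 1 && $tag eq 'tr')   { push @rows, [] }
            elsif ($depth == 1 && $tag =~ /^t[dh]$/) { $cell = '' }
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            return if $done;
            if (!$seen) {
                # Still looking for the marker string.
                $seen = 1 if index($text, $some_text) >= 0;
                return;
            }
            # Only keep text that belongs to a cell of the outer table.
            $cell .= $text if defined $cell && $depth == 1;
        }, 'dtext' ],
        end_h => [ sub {
            my ($tag) = @_;
            return if $done || !$seen || $depth == 0;
            if ($depth == 1 && $tag =~ /^t[dh]$/ && defined $cell) {
                $cell =~ s/^\s+|\s+$//g;
                push @rows, [] unless @rows;
                push @{ $rows[-1] }, $cell;
                $cell = undef;
            }
            elsif ($tag eq 'table') {
                # Leaving the outermost table ends the capture.
                $done = 1 if --$depth == 0;
            }
        }, 'tagname' ],
    );
    $p->unbroken_text(1);   # deliver each text run in one piece
    $p->parse($html);
    $p->eof;
    return \@rows;
}
```

Text inside nested tables is skipped here rather than merged into the outer cell; whether that is the right behaviour depends on the page.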

Another way would be to use a module to extract tables from HTML. There
are at least two on CPAN: HTML::TableExtract and HTML::TableParser. The
problem with using these is finding the table after the specified text. Is
there some other way of identifying the table?

The quick and dirty way is to use a regular expression (untested):

if( $html =~ m{ \Q$some_text\E .*? <table\b[^>]*> (.*?) </table> }isx ) {
  # table contents in $1
}

However, this will not always work. It fails on nested tables, for
example, which are common in some HTML: the non-greedy (.*?) stops at
the first </table>, which belongs to the inner table. If you are in a
hurry it might work for you, but it is always better to use a real
parser for HTML.
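
To make the nested-table failure concrete, here is a small sketch using only core Perl. The sample HTML and the value of $some_text are invented for the demonstration.

```perl
use strict;
use warnings;

# Made-up input: an outer table that contains an inner table.
my $some_text = 'Results';
my $html = '<p>Results</p>'
         . '<table><tr><td>outer'
         . '<table><tr><td>inner</td></tr></table>'
         . 'tail</td></tr></table>';

# Quick-and-dirty match: capture stops at the first </table>.
my ($captured) =
    $html =~ m{ \Q$some_text\E .*? <table\b[^>]*> (.*?) </table> }isx;

# Count table tags inside the capture to show it is unbalanced.
my $opens  = () = $captured =~ /<table\b/gi;
my $closes = () = $captured =~ m{</table>}gi;

print "captured: $captured\n";
print "table tags inside capture: $opens open, $closes close\n";
```

The capture ends inside the inner table, so the rest of the outer table (the "tail" text here) is lost and the captured fragment has an unclosed <table>.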

Jim Gibson

Re: Extracting table in html page


It's ALWAYS trivial to parse a markup language's markup,
i.e. parse out tags (open|close), attributes, and content.
Creating an element tree (document) from HTML is another
process altogether. XHTML/XML, not so bad; SGML, er...

I always laugh when people say a 'real parser for HTML' because they
don't know what they're saying; they are just parroting phrases from
so-called gods and passing them along.
As if a SAX parser does nothing more than a realtime parse on a stream,
i.e. a markup parse. Easily done by regular expressions.

Oh, and before anybody starts that "regular language" crap, they better
be able to explain what the "can't" part means!


Re: Extracting table in html page


Or HTML::TreeBuilder;

use HTML::TreeBuilder;
use LWP::UserAgent;
# "..." marks details left unspecified in this sketch
my $url = 'http://www.example.com/...';
my $browser = LWP::UserAgent->new;
my $response = $browser->request (HTTP::Request->new(GET => $url));
if ($response->is_success) {
  my $tree = HTML::TreeBuilder->new;
  # build the element tree from the fetched page
  $tree->parse_content ($response->decoded_content);
  # search for the text with look_down (there are other ways)
  my $text = $tree->look_down (...);
  # then for your table
  my $table = $tree->look_down ('_tag', 'table', ...);
  # free the tree when you are done with it
  $tree->delete;
}



Re: Extracting table in html page

The best way may be:
split on $some_text and throw away the first part.
my ($junk, $interest_html) = split (/\Q$some_text\E/, $html, 2);

On $interest_html, use HTML::TreeBuilder to parse the tables.
Grab the first table and you are done.
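
A rough sketch of those two steps put together, assuming $some_text appears in ordinary text so that the remainder still parses sensibly. The helper name first_table_after and the row-of-cells return value are hypothetical; the HTML::TreeBuilder and HTML::Element calls are the modules' own.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: split on the marker, parse what follows,
# and return the rows of the first table found there.
sub first_table_after {
    my ($html, $some_text) = @_;

    # Limit of 2 keeps everything after the first occurrence intact.
    my (undef, $rest) = split /\Q$some_text\E/, $html, 2;
    return unless defined $rest;    # marker not found

    my $tree  = HTML::TreeBuilder->new_from_content($rest);
    my $table = $tree->look_down(_tag => 'table');

    my @rows;
    if ($table) {
        for my $tr ($table->look_down(_tag => 'tr')) {
            # Collect header and data cells in document order.
            push @rows,
                [ map { $_->as_text } $tr->look_down(_tag => qr/^t[dh]$/) ];
        }
    }
    $tree->delete;                  # release the tree's memory
    return \@rows;
}
```

Note that look_down descends into nested tables too, so rows of an inner table would show up in the result; a page with nesting would need an extra filter.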

Let me know if you find it difficult to use HTML::TreeBuilder.

--sopan shewale
