
Problem with body text extraction with HTML::Parser

Posted by Perl_user on December 13, 2005, 3:28 pm


Hi,

I have been using HTML::Parser to extract the textual data from an HTML
document.

I am using the following code:

use HTML::Parser;

my $p = HTML::Parser->new(api_version => 3,
    start_h => [\&a_start_handler, "self,tagname"],
    report_tags => [qw(title h1 h2 h3 h4 h5 h6)],
);
$p->parse_file($file || die) || die $!;

# Called for each reported start tag: start accumulating text and
# swap in the handlers used while inside the element.
sub a_start_handler
{
    my($self, $tag) = @_;
    $self->handler(text => [], '@{dtext}' );
    $self->handler(start => \&text);
    $self->handler(end => \&a_end_handler, "self,tagname,text");
}

sub text
{
    my($self, $tag) = @_;
    my $text = @{$self->handler("text")};

}

# Called at the end tag: join the accumulated text and restore the handlers.
sub a_end_handler
{
    my($self, $tag) = @_;
    my $text = join("", @{$self->handler("text")});
    $self->handler("text", undef);
    $self->handler("start", \&a_start_handler);
    $self->handler("end", undef);
}

which reports the title and headers from the page. This works, but I have a
problem getting the body text (separately), as it isn't contained inside
HTML tags that I can report.


Web-site...
------------
<html>
<head>
<title>This is the title of the webpage.</title>
</head>
<body>
<h1>First Type Header</h1>
<h2>Second Type Header</h2>
This is the main body of the text. It will be concidered as the article.
Blah Blah blah
</body>
</html>
------------

Any ideas appreciated


Output with reported tags [title h1 h2 h3 h4 h5 h6]
--
title
This is the title of the webpage.
h1
First type
h2
Second Type Header
h3
Third Type Header
--

Output with reported tag <body>
--
This is the title of the webpage. It is a mess.
First type header
Second Type Header
Third Type Header
This is the main body of the text. It will be concidered as the article.
--

with regards,
Perlusr




Posted by James E Keenan on January 1, 2006, 1:18 am


Perl_user wrote:
> Hi,
>
> I have been using HTML::Parser to extract the textual data from an HTML
> document
>
> I am using the following code:
>
> my $p = HTML::Parser->new(api_version => 3,
> start_h => [\&a_start_handler, "self,tagname"],
> report_tags => [qw(title h1 h2 h3 h4 h5 h6)],
> );
> $p->parse_file($file || die) || die $!;

Is this the entirety of your script? What comes next?


Posted by James E Keenan on January 1, 2006, 2:21 pm


Perl_user wrote:
> Hi,
>
> I have been using HTML::Parser to extract the textual data from an HTML
> document
>
> I am using the following code:
>
[snip]

>
> which reports the title and headers from the page. This works, but I have a
> problem getting the body text (separately), as it isn't contained inside
> HTML tags that I can report.
>
>
The code shown seems largely based on one of the examples provided in
the CPAN HTML::Parser documentation
(http://search.cpan.org/src/GAAS/HTML-Parser-3.48/eg/hanchors). If you
look at one of the other sample scripts in the same location
(http://search.cpan.org/src/GAAS/HTML-Parser-3.48/eg/htext), you should
be able to work up a solution.

Jim Keenan
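
For readers following the htext pointer: one possible way to get the body text separately is to drop report_tags, handle every start, end and text event, and track whether the parser is currently inside a title/heading element. The sketch below is only an illustration under that assumption, not the code from either eg/ script. The helper name split_headings_and_body and the variable names are hypothetical; the HTML::Parser calls used (new with api_version 3, the start_h/end_h/text_h options, the tagname and dtext argspecs, parse and eof) are from the module's documented version 3 API.

```perl
use strict;
use warnings;
use HTML::Parser;

# Tags whose text should be reported separately (same list as in the question).
my %is_heading = map { $_ => 1 } qw(title h1 h2 h3 h4 h5 h6);

# Hypothetical helper: takes an HTML string and returns a reference to a
# list of [tag, text] pairs for the headings, plus the remaining body text.
sub split_headings_and_body {
    my ($html) = @_;
    my (@headings, @body);
    my $current;        # heading tag we are currently inside, if any
    my $in_body = 0;    # true between <body> and </body>

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag) = @_;
            $in_body = 1 if $tag eq 'body';
            if ($is_heading{$tag}) {
                $current = $tag;
                push @headings, [ $tag, '' ];
            }
        }, 'tagname' ],
        end_h => [ sub {
            my ($tag) = @_;
            $in_body = 0 if $tag eq 'body';
            undef $current if defined $current && $tag eq $current;
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            if (defined $current) {
                # Text may arrive in several chunks, so append.
                $headings[-1][1] .= $text;
            }
            elsif ($in_body) {
                push @body, $text;
            }
        }, 'dtext' ],
    );
    $p->parse($html);
    $p->eof;

    # Collapse runs of whitespace and trim both ends of the body text.
    my $body = join '', @body;
    $body =~ s/\s+/ /g;
    $body =~ s/^\s+|\s+$//g;
    return (\@headings, $body);
}

# Example usage with a page shaped like the one in the question.
my ($headings, $body) = split_headings_and_body(<<'HTML');
<html>
<head>
<title>This is the title of the webpage.</title>
</head>
<body>
<h1>First Type Header</h1>
<h2>Second Type Header</h2>
This is the main body of the text.
Blah Blah blah
</body>
</html>
HTML

print "$_->[0]\n$_->[1]\n" for @$headings;
print "body\n$body\n";
```

The text handler decides where each chunk goes, so text inside a heading never leaks into the body and vice versa. Nested markup inside a heading (for example a link inside an h1) is not treated specially here; its text is still appended to the enclosing heading.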

