Posted by Perl_user on December 13, 2005, 3:28 pm
Hi,
I have been using HTML::Parser to extract the textual data from an HTML
document.
I am using the following code:
my $p = HTML::Parser->new(api_version => 3,
                          start_h     => [\&a_start_handler, "self,tagname"],
                          report_tags => [qw(title h1 h2 h3 h4 h5 h6)],
                         );
$p->parse_file($file || die) || die $!;

sub a_start_handler
{
    my($self, $tag) = @_;
    $self->handler(text => [], '@{dtext}' );
    $self->handler(start => \&text);
    $self->handler(end => \&a_end_handler, "self,tagname,text");
}

sub text
{
    my($self, $tag) = @_;
    my $text=@;
}

sub a_end_handler
{
    my($self, $tag) = @_;
    my $text = join("", @{$self->handler("text")});
    $self->handler("text", undef);
    $self->handler("start", \&a_start_handler);
    $self->handler("end", undef);
}
which reports the title and headers from the page. This works, but I have a
problem getting the body text separately, as it isn't contained inside
HTML tags that I can report.
Web-site...
------------
<html>
<head>
<title>This is the title of the webpage.</title>
</head>
<body>
<h1>First Type Header</h1>
<h2>Second Type Header</h2>
This is the main body of the text. It will be concidered as the article.
Blah Blah blah
</body>
</html>
------------
Any ideas appreciated
Output with reported tags [title h1 h2 h3 h4 h5 h6]
--
title
This is the title of the webpage.
h1
First type
h2
Second Type Header
h3
Third Type Header
--
Output with reported tag <body>
--
This is the title of the webpage. It is a mess.
First type header
Second Type Header
Third Type Header
This is the main body of the text. It will be concidered as the article.
--
with regards,
Perlusr
Posted by James E Keenan on January 1, 2006, 1:18 am
Perl_user wrote:
> Hi,
>
> I have been using HTML::Parser to extract the textual data from an HTML
> document
>
> I am using the following code:
>
> my $p = HTML::Parser->new(api_version => 3,
> start_h => [\&a_start_handler, "self,tagname"],
> report_tags => [qw(title h1 h2 h3 h4 h5 h6)],
> );
> $p->parse_file($file || die) || die $!;
Is this the entirety of your script? What comes next?
Posted by James E Keenan on January 1, 2006, 2:21 pm
Perl_user wrote:
> Hi,
>
> I have been using HTML::Parser to extract the textual data from an HTML
> document
>
> I am using the following code:
>
[snip]
>
> which reports the title and headers from the page. This works, but I have a
> problem getting the body text separately, as it isn't contained inside
> HTML tags that I can report.
>
>
The code shown seems largely based on one of the examples provided in
the CPAN HTML::Parser documentation
(http://search.cpan.org/src/GAAS/HTML-Parser-3.48/eg/hanchors). If you
look at one of the other sample scripts in the same location
(http://search.cpan.org/src/GAAS/HTML-Parser-3.48/eg/htext), you should
be able to work up a solution.
Jim Keenan
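For readers following the htext pointer, here is a minimal sketch of that style of solution. It is not code from this thread: it counts open elements in a hash and keeps only text seen inside body but outside title, headings, script and style. The names %inside, %skip and $body_text, the choice of which tags to skip, and the inline $html sample are all assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser 3.00 ();

# Hypothetical sample input, modelled on the page in the question.
my $html = <<'HTML';
<html>
<head>
<title>This is the title of the webpage.</title>
</head>
<body>
<h1>First Type Header</h1>
<h2>Second Type Header</h2>
This is the main body of the text.
Blah Blah blah
</body>
</html>
HTML

# Elements whose text should not count as body text (an assumption).
my %skip = map { $_ => 1 } qw(title h1 h2 h3 h4 h5 h6 script style);

my %inside;            # tag name => number of currently open elements
my $body_text = '';    # accumulated text outside the skipped elements

# Called for start tags with '+1' and for end tags with '-1'.
sub tag
{
    my($tagname, $delta) = @_;
    $inside{$tagname} += $delta;
}

# Called for every text chunk; keep it only when inside <body>
# and not inside any of the skipped elements.
sub text
{
    my($dtext) = @_;
    return unless $inside{body};
    return if grep { $inside{$_} } keys %skip;
    $body_text .= $dtext;
}

my $p = HTML::Parser->new(api_version => 3,
                          handlers    => [start => [\&tag,  "tagname, '+1'"],
                                          end   => [\&tag,  "tagname, '-1'"],
                                          text  => [\&text, "dtext"],
                                         ],
                         );
$p->parse($html);
$p->eof;

# Collapse runs of whitespace and trim both ends.
$body_text =~ s/\s+/ /g;
$body_text =~ s/^ | $//g;

print "$body_text\n";
```

To read from a file instead, $p->parse_file($file) can replace the parse/eof pair. The title and headings can still be collected with a separate pass, or by appending to a second buffer in the text handler when one of the skipped tags is open.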