|
Posted by Mark on November 15, 2005, 5:05 pm
Hello.
I am using the HTML::Parser module to parse a list of bookmarks
exported from the Firefox browser. Firefox exports bookmarks to an
HTML file containing nested definition lists.
I have discovered that when the parser encounters a bookmark
whose name ends in a closing parenthesis, the closing parenthesis
is stripped. (Bookmark names are coded as definition terms, using
the <dt> tag.)
A sample of the code being parsed looks like this:
<DT><A HREF="http://www.google.com" ADD_DATE="1101144594"
ID="rdf:#$.GjDP">Google (search engine)</A>
The decoded text passed to the handler by HTML::Parser
would be "Google (search engine".
Any ideas whether this is a bug in HTML::Parser, or should I
take another look at my code?
Thanks
-Mark
|
|
Posted by Bart Lateur on November 16, 2005, 8:37 am
Mark wrote:
>ID="rdf:#$.GjDP">Google (search engine)</A>
>
>The decoded text passed to the handler by HTML::Parser
>would be "Google (search engine".
I've tried it with HTML::TokeParser::Simple, which is built on top of
HTML::Parser, and it comes out well:
$html = << '--';
<DT><A HREF="http://www.google.com" ADD_DATE="1101144594"
ID="rdf:#$.GjDP">Google (search engine)</A>
--
use HTML::TokeParser::Simple;
my $p = HTML::TokeParser::Simple->new( $html );
while ( my $token = $p->get_token ) {
print $token->as_is;
}
This prints:
<DT>
<A HREF="http://www.google.com" ADD_DATE="1101144594"
ID="rdf:#$.GjDP">
Google (search engine)
</A>
>Any ideas whether this is a bug in HTML::Parser, or should I
>take another look at my code?
My guess is that you are only getting part of the text, and you have to be
patient, because there is no guarantee at all that all of the text will
come out in one chunk. So probably the next time the text handler gets
called, the rest will come out... or at least part of it.
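A minimal sketch of one way to cope with that, assuming the goal is to collect the complete text of each link: buffer whatever the text handler receives between the start and end of an <a> element, and only use the result when the end tag arrives. The names @pieces, @titles and $buffer are made up for illustration, the sample HTML is a shortened version of the line from the original post, and it is fed to the parser in two pieces so that the text is likely to arrive in more than one call.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Shortened sample from the thread, split in two so that the link
# text is likely to reach the text handler in more than one chunk.
my @pieces = (
    '<DT><A HREF="http://www.google.com">Google (search engine',
    ')</A>',
);

my @titles;    # completed link texts
my $buffer;    # undef while we are outside an <a> element

my $p = HTML::Parser->new(
    api_version => 3,
    # Start collecting when an <a> opens (tagname is lower-cased).
    start_h => [ sub { $buffer = '' if $_[0] eq 'a' }, 'tagname' ],
    # Append every chunk of decoded text seen while inside the <a>.
    text_h  => [ sub { $buffer .= $_[0] if defined $buffer }, 'dtext' ],
    # The text is only known to be complete once </a> is seen.
    end_h   => [ sub {
        return unless $_[0] eq 'a' && defined $buffer;
        push @titles, $buffer;
        undef $buffer;
    }, 'tagname' ],
);

$p->parse($_) for @pieces;
$p->eof;    # flush anything the parser is still holding back

print "$_\n" for @titles;
```

However many times the text handler ends up being called, the pieces are joined before they are used, so the closing parenthesis is not lost.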
--
Bart.
|
|
Posted by Mark on November 17, 2005, 9:03 am
>> Mark wrote:
>>
>> ID="rdf:#$.GjDP">Google (search engine)</A>
>>
>> The decoded text passed to the handler by HTML::Parser
>> would be "Google (search engine".
>
> I've tried it with HTML::TokeParser::Simple, which is built on top of
> HTML::Parser, and it comes out well:
>
Ok, I've replicated your example using HTML::TokeParser::Simple.
But I would sure hate to scrap the hours I just spent learning
HTML::Parser, and re-write with TokeParser. After all, TokeParser
was supposedly written to save people from having to learn
HTML::Parser!
Can anyone here identify the problem with HTML::Parser, or
perhaps my (mis)use of this module? If TokeParser is based on
HTML::Parser, then it seems odd that it does not encounter
the same problem (unless it works around it somehow).
Thanks
-Mark
|
|
Posted by Mark on November 17, 2005, 9:57 am
On the other hand, the following test works fine.
So I guess I need to take a closer look at my code.
use strict;
use HTML::Parser ();
my $txt = << 'EOTEXT';
<a href="http://www.microsoft.com">Microsoft (link)</a>
EOTEXT
my $p = HTML::Parser->new(api_version => 3);
$p->handler(text => \&text_handler, "dtext");
$p->parse($txt);
sub text_handler
{print shift, "\n";}
|
|
Posted by Bart Lateur on November 17, 2005, 7:48 pm
Mark wrote:
>On the other hand, the following test works fine.
>So I guess I need to take a closer look at my code.
>
>
>use strict;
>use HTML::Parser ();
>
>my $txt = << 'EOTEXT';
><a href="http://www.microsoft.com">Microsoft (link)</a>
>EOTEXT
>
>my $p = HTML::Parser->new(api_version => 3);
>$p->handler(text => \&text_handler, "dtext");
>$p->parse($txt);
>
>sub text_handler
>{print shift, "\n";}
Are you sure the original problem doesn't produce:
Google (search engine
)
?
That is, is the text handler being called twice, with the closing
parenthesis only arriving in the second call?
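A quick way to check is to print every chunk the text handler receives inside brackets, so that a split becomes visible. The sketch below does that with a hypothetical helper, chunks_for, and the same shortened sample fed in two pieces (both are assumptions for illustration, not your actual code). It also runs the test a second time with HTML::Parser's unbroken_text option switched on, which asks the parser to buffer adjacent text and should then hand it over in one go.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Return the list of text chunks the handler receives when the
# given pieces of HTML are fed to the parser one after another.
sub chunks_for {
    my ($unbroken, @pieces) = @_;
    my @chunks;
    my $p = HTML::Parser->new(api_version => 3);
    $p->unbroken_text($unbroken ? 1 : 0);
    $p->handler(text => sub { push @chunks, shift }, 'dtext');
    $p->parse($_) for @pieces;
    $p->eof;    # flush any text still buffered by the parser
    return @chunks;
}

# Shortened sample from the thread, deliberately split mid-text.
my @pieces = (
    '<DT><A HREF="http://www.google.com">Google (search engine',
    ')</A>',
);

for my $unbroken (0, 1) {
    my @chunks = chunks_for($unbroken, @pieces);
    printf "unbroken_text=%d: %d call(s): %s\n",
        $unbroken, scalar @chunks,
        join ' ', map { sprintf '[%s]', $_ } @chunks;
}
```

If the first line of output shows more than one bracketed chunk, the handler is simply being called several times, and the code that consumes the text has to join the chunks or turn unbroken_text on.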
--
Bart.
|