Possible bug in HTML::Parser

Posted by Mark on November 15, 2005, 5:05 pm


Hello.

I am using the HTML::Parser module to parse a list of bookmarks
exported from the Firefox browser. Firefox exports bookmarks to an
HTML file containing nested definition lists.

I have discovered that when the parser encounters a bookmark
whose name ends in a closing parenthesis, the closing parenthesis
is stripped. (Bookmark names are coded as definition terms, using
the <dt> tag.)

A sample of the code being parsed looks like this:

<DT><A HREF="http://www.google.com" ADD_DATE="1101144594"
ID="rdf:#$.GjDP">Google (search engine)</A>

The decoded text passed to the handler by HTML::Parser
would be "Google (search engine".

Any ideas whether this is a bug in HTML::Parser, or should I
take another look at my code?

Thanks
-Mark




Posted by Bart Lateur on November 16, 2005, 8:37 am


Mark wrote:

><DT><A HREF="http://www.google.com" ADD_DATE="1101144594"
>ID="rdf:#$.GjDP">Google (search engine)</A>
>
>The decoded text passed to the handler by HTML::Parser
>would be "Google (search engine".

I've tried it with HTML::TokeParser::Simple, which is built on top of
HTML::Parser, and it comes out well:

        $html = << '--';
        <DT><A HREF="http://www.google.com" ADD_DATE="1101144594"
        ID="rdf:#$.GjDP">Google (search engine)</A>
        --
        use HTML::TokeParser::Simple;
        my $p = HTML::TokeParser::Simple->new( $html );

        while ( my $token = $p->get_token ) {
         print $token->as_is;
        }

This prints:

        <DT>
        <A HREF="http://www.google.com" ADD_DATE="1101144594"
        ID="rdf:#$.GjDP">
        Google (search engine)
        </A>

>Any ideas whether this is a bug in HTML::Parser, or should I
>take another look at my code?

My guess is that you are only getting part of the text. There is no
guarantee at all that all of the text will come out in one chunk, so
the rest will probably come out the next time the text handler gets
called... or at least, part of it.
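
A minimal sketch of one way to cope with that, assuming the handler only cares about the text inside <a>...</a>: append every piece to a buffer and only use the result once the end tag arrives. The variable names (@names, $current) and the two parse() calls that split the input in the middle of a word are made up for illustration; they are not taken from Mark's script.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @names;      # completed bookmark names
my $current;    # undef outside <a>...</a>, otherwise the text collected so far

my $p = HTML::Parser->new(api_version => 3);

# Start collecting when an <a> opens (tag names arrive lowercased by default).
$p->handler(start => sub { $current = '' if $_[0] eq 'a' }, 'tagname');

# The text handler may fire several times per element, so append each piece.
$p->handler(text => sub { $current .= $_[0] if defined $current }, 'dtext');

# Only at </a> is the name known to be complete.
$p->handler(end => sub {
    return unless $_[0] eq 'a' && defined $current;
    push @names, $current;
    undef $current;
}, 'tagname');

# Hypothetical input, fed in two pieces to imitate chunked reading.
$p->parse('<DT><A HREF="http://www.google.com">Google (search eng');
$p->parse("ine)</A>\n");
$p->eof;    # flush anything the parser is still holding back

print "$_\n" for @names;
```

How the parser splits the text between calls is up to the parser, so the code above does not rely on any particular split, only on the pieces arriving in order.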

--
        Bart.


Posted by Mark on November 17, 2005, 9:03 am


>> Mark wrote:
>>
>> <DT><A HREF="http://www.google.com" ADD_DATE="1101144594"
>> ID="rdf:#$.GjDP">Google (search engine)</A>
>>
>> The decoded text passed to the handler by HTML::Parser
>> would be "Google (search engine".
>
> I've tried it with HTML::TokeParser::Simple, which is built on top of
> HTML::Parser, and it comes out well:
>

Ok, I've replicated your example using HTML::TokeParser::Simple.
But I would sure hate to scrap the hours I just spent learning
HTML::Parser, and re-write with TokeParser. After all, TokeParser
was supposedly written to save people from having to learn
HTML::Parser!

Can anyone here identify the problem with HTML::Parser, or
perhaps my (mis)use of this module? If TokeParser is based on
HTML::Parser, then it seems odd that it does not encounter
the same problem (unless it works around it somehow).

Thanks
-Mark






Posted by Mark on November 17, 2005, 9:57 am


On the other hand, the following test works fine.
So I guess I need to take a closer look at my code.


use strict;
use HTML::Parser ();

my $txt = << 'EOTEXT';
<a href="http://www.microsoft.com">Microsoft (link)</a>
EOTEXT

my $p = HTML::Parser->new(api_version => 3);
$p->handler(text => \&text_handler, "dtext");
$p->parse($txt);

sub text_handler
{print shift, "\n";}




Posted by Bart Lateur on November 17, 2005, 7:48 pm


Mark wrote:

>On the other hand, the following test works fine.
>So I guess I need to take a closer look at my code.
>
>
>use strict;
>use HTML::Parser ();
>
>my $txt = << 'EOTEXT';
><a href="http://www.microsoft.com">Microsoft (link)</a>
>EOTEXT
>
>my $p = HTML::Parser->new(api_version => 3);
>$p->handler(text => \&text_handler, "dtext");
>$p->parse($txt);
>
>sub text_handler
>{print shift, "\n";}

Are you sure the original problem doesn't produce:

        Google (search engine
        )

?
That is, with the text handler being called twice?
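
A quick way to check is to print every piece in brackets, so a split shows up as two bracketed lines. The sketch below does that, and also tries the parser's unbroken_text option, which asks HTML::Parser to hold text back until the next non-text event and report it in one piece. The helper name anchor_text_pieces and the input split across two parse() calls are assumptions for the sake of the demonstration, not Mark's actual code.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Returns the pieces of text the handler received inside <a>...</a>.
sub anchor_text_pieces {
    my ($unbroken) = @_;
    my (@pieces, $in_anchor);

    my $p = HTML::Parser->new(api_version => 3);
    $p->unbroken_text($unbroken);    # true: report each text run in one piece
    $p->handler(start => sub { $in_anchor = 1 if $_[0] eq 'a' }, 'tagname');
    $p->handler(end   => sub { $in_anchor = 0 if $_[0] eq 'a' }, 'tagname');
    $p->handler(text  => sub { push @pieces, $_[0] if $in_anchor }, 'dtext');

    # Hypothetical input, deliberately split in the middle of the name.
    $p->parse('<a href="http://www.google.com">Google (search eng');
    $p->parse("ine)</a>\n");
    $p->eof;

    return @pieces;
}

my @default  = anchor_text_pieces(0);
my @unbroken = anchor_text_pieces(1);

print "default:  [$_]\n" for @default;
print "unbroken: [$_]\n" for @unbroken;
```

If the default run shows more than one bracketed piece while the unbroken run shows a single one, the parser is fine and the handler just has to expect several calls per text run.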

--
        Bart.

