
HTML::FormatText problem

comp.lang.perl.modules
Posted by Emmett on May 6, 2006, 4:09 pm


Hi,

I have a curious problem with HTML::FormatText and I wonder if anybody
can help me.

I have a bunch of patent documents in a local directory, from which I am
extracting the title, abstract, etc. for each patent to insert into a
MySQL database. The core lines of the script where I am having problems
are:

use HTML::FormatText;
use HTML::Parse;    # exports parse_htmlfile(), if not already loaded
.....
my $plain_page =
    HTML::FormatText->new->format(parse_htmlfile($local_patent_file));
# ...do regex stuff with $plain_page...

This works fine except, it seems, when the patent document contains
the string "##STR1##", which the patent documents use to represent a
complex formula. That seems to kill HTML::FormatText: $plain_page ends
up undefined.
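
One way to narrow this down is to run the parse step and the format step separately and check each result, so it is clear which one yields the undefined value. The following is only a minimal sketch under stated assumptions: the helper name html_to_text, the sample HTML and the margin values are made up for illustration, and it parses an in-memory string with HTML::TreeBuilder->new_from_content rather than a file so that it is self-contained (HTML::TreeBuilder->new_from_file($local_patent_file) would be the file-based equivalent).

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Hypothetical helper: returns the plain text, or dies naming the step that failed.
sub html_to_text {
    my ($html) = @_;

    # Step 1: parse. new_from_content() builds a tree from an in-memory string.
    my $tree = HTML::TreeBuilder->new_from_content($html);
    defined $tree or die "parse step returned undef\n";

    # Step 2: format. The margins are arbitrary illustrative values.
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
    my $text = $formatter->format($tree);

    $tree->delete;    # release the parse tree explicitly
    defined $text or die "format step returned undef\n";

    return $text;
}

# Made-up sample input containing the marker from the post.
my $sample = '<html><body><p>The compound ##STR1## is claimed.</p></body></html>';
my $plain_page = html_to_text($sample);
print $plain_page;
```

If a small sample like this formats correctly, feeding one of the failing patent files through the same two steps should show whether the parse or the format call is the one that comes back empty.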

Obviously '#' starts a comment in Perl, but I'd be surprised if it
affected HTML::FormatText in such a simple way. Maybe ##X## has some
special meaning; I honestly don't know.
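
On the comment question, '#' only starts a comment in Perl source code; inside string data it is an ordinary character. It also matches literally in a regex unless the /x modifier is in effect. A minimal sketch with a made-up sample string (the variable names are illustrative only):

```perl
use strict;
use warnings;

# Made-up sample text; '#' inside a quoted string is ordinary data.
my $text = 'The compound ##STR1## is claimed.';    # only this trailing part is a comment

# Count the '#' characters present in the string.
my $hash_count = () = $text =~ /#/g;
print "hash characters in data: $hash_count\n";    # prints 4

# Under /x an unescaped '#' would start a regex comment,
# so escaping it keeps the pattern safe either way.
my ($label) = $text =~ /\#\#(STR\d+)\#\#/;
print "formula placeholder: $label\n";             # prints STR1
```

So the marker on its own should not be read as a comment once it is data in a variable; the escaped form is mainly worth using in any later regex that runs with /x.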

If anybody has any suggestions, opinions, work-arounds or alternative
approaches, I'd be very grateful.

Thanks

Emmett

