Detecting non-printing characters(?)

Each day I use the LWP module to retrieve web pages from a single site and
then parse out useful information. This all works, except that occasionally
a page does not parse as expected. I examined the errant page but could see
no visual difference. I also compared its HTML source line by line with that
of a normal page and found no visible difference at all.

Is it possible that there are non-printing characters present in the errant
pages that are causing my parser script to fail?
If so how can I detect and remove them?

Thanx for any assistance! Cheers, Peter

Re: Detecting non-printing characters(?)

Removing non-printing characters is a simple matter of
    $page =~ s/[^[:print:]]+//g;

though you should be aware that this also strips newlines and tabs, and that
non-ASCII pages may cause problems. You may need to decode the page first and
then use a Unicode property class such as \P{Print} instead.
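
As a rough illustration of the non-ASCII issue, the sketch below runs the same kind of substitution on raw UTF-8 bytes and on the decoded string. The sample string and the variable names are made up for the example, and it assumes the page really is UTF-8; a real page may declare a different charset.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: "café", a stray control character (0x01), then "!",
# given as raw UTF-8 bytes, the way an undecoded response body would arrive.
my $bytes = "caf\xC3\xA9\x01!";

# On undecoded bytes, every byte above 0x7F counts as non-printing, so the
# accented letter is destroyed together with the control character.
(my $stripped_bytes = $bytes) =~ s/[^[:print:]]+//g;

# After decoding, \P{Print} is applied to characters rather than bytes,
# so only the control character is removed.
my $chars = decode('UTF-8', $bytes);
(my $stripped_chars = $chars) =~ s/\P{Print}+//g;

binmode STDOUT, ':encoding(UTF-8)';
print "bytes: $stripped_bytes\n";
print "chars: $stripped_chars\n";
```

Note that characters such as the zero-width space are classed as format characters rather than control characters, so \P{Print} leaves them in place; they would need a separate class such as \p{Cf}.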

If you have further questions, you will need to post the code you are
using for parsing, an example that works and an example that doesn't. If
you make the effort to cut down both code and examples to the minimum
required to demonstrate the failure, you are more likely to get a useful
response. (You are also more likely to figure out the problem on your
own, which is better for everyone.)


Re: Detecting non-printing characters(?)

Did you try a diff between the working and the errant page?
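
If a diff tool is not handy, the comparison can be roughly sketched in Perl itself. The helper name first_difference and the two sample strings are made up for illustration; real pages would be read from files or from the LWP response.

```perl
use strict;
use warnings;

# Return the offset of the first differing character,
# or -1 if the two strings are identical.
sub first_difference {
    my ($left, $right) = @_;
    my $shorter = length($left) < length($right) ? length($left) : length($right);
    for my $i (0 .. $shorter - 1) {
        return $i if substr($left, $i, 1) ne substr($right, $i, 1);
    }
    # No difference in the common prefix: either equal, or one string is longer.
    return length($left) == length($right) ? -1 : $shorter;
}

# Hypothetical samples: the second line ends in CRLF instead of LF.
my $good = "<td>42</td>\n";
my $bad  = "<td>42</td>\r\n";

my $offset = first_difference($good, $bad);
printf "first difference at offset %d: 0x%x vs 0x%x\n",
    $offset, ord substr($good, $offset, 1), ord substr($bad, $offset, 1)
    if $offset >= 0 && $offset < length($good) && $offset < length($bad);
```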

Don't know, maybe. However, IMO it's more likely that either the page is
not correct HTML (did you check with an HTML validator?) and therefore
the parser chokes, or there is an error in your parser.

Perl's regular expressions support the POSIX [:print:] character class,
which you can negate to match non-printing characters.
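
For detection rather than removal, something along these lines may help. It is only a sketch: the function name find_non_printing and the sample line are invented for the example, and ordinary line endings and tabs are deliberately left out of the report.

```perl
use strict;
use warnings;

# Return [offset, code point] pairs for every character that is outside
# the POSIX printable set, ignoring CR, LF and tab.
sub find_non_printing {
    my ($text) = @_;
    my @hits;
    while ($text =~ /([^[:print:]\r\n\t])/g) {
        # pos() points just past the matched character.
        push @hits, [ pos($text) - 1, ord $1 ];
    }
    return @hits;
}

# Hypothetical sample containing a vertical tab (0x0B) and a DEL (0x7F).
my $sample = "total\x0B= 42\x7F\n";

for my $hit (find_non_printing($sample)) {
    printf "offset %d: 0x%02x\n", @$hit;
}
```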


Re: Detecting non-printing characters(?)

On Wed, 19 Aug 2009 23:14:33 GMT, "Peter Jamieson"

What kind of error? Even pages with the wrong encoding should parse. I mean,
it's not die'ing, is it?

How do you parse it: do you write it to a file and then pass the file handle
to the parser, or do you pass the buffer directly? Is the buffer raw bytes,
or has it been promoted to UTF-8 with embedded wide characters?

Have you examined the buffer with something like this?
    for (map { ord $_ } split //, $line) {
        printf("%x ", $_);
    }
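
A variation on the same idea, sketched here with a made-up helper name (show_hidden), leaves printable ASCII alone and escapes everything else, so the odd characters stand out against the surrounding text:

```perl
use strict;
use warnings;

# Replace every character outside printable ASCII (0x20-0x7E) with a
# \x{..} escape so that it becomes visible when the line is printed.
sub show_hidden {
    my ($line) = @_;
    (my $visible = $line) =~ s/([^\x20-\x7E])/sprintf('\\x{%x}', ord $1)/ge;
    return $visible;
}

# Hypothetical sample line with a CRLF line ending.
print show_hidden("<td>42</td>\r\n"), "\n";   # prints <td>42</td>\x{d}\x{a}
```

Because it escapes all non-ASCII characters too, this is meant for inspection only, not for cleaning the page.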

How does an errant page equal a normal page? Are they supposed to
be the same all the time?

