Perl: Win-32 vs. linux

Is there any reason the following would work on a Linux installation of
Perl, but not with ActivePerl 5.8 on a Win32 system?  The tr///
operation successfully removes UTF-8 encoded &nbsp; characters from the
string on Linux, but not on Win32, even after verifying that all required
modules are installed.  Any thoughts would be greatly appreciated!


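use strict;
use warnings;
use LWP::Simple;
use Encode;

my $URL =
" ";

# fetch the page; get() returns undef on failure
my $content = get($URL);
# decode the fetched content as UTF-8 into a Perl character string
my $decoded = decode("utf-8", $content);
# replace every no-break space (U+00A0) with an ordinary space
$decoded =~ tr/\xA0/ /;

print $decoded;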

Re: Perl: Win-32 vs. linux

Maqo wrote:


My best guess: decode maps to the internal encoding used by Perl, and I
guess on Windows this is Win-something, and on Linux ISO-something.

Why not remove the UTF-8 encoded non-breaking spaces before decoding?
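
A minimal sketch of that idea, assuming the fetched content is still raw UTF-8 octets at that point: U+00A0 is encoded in UTF-8 as the two bytes 0xC2 0xA0, so the pair can be replaced at the byte level before decode() is called. The sample $octets string is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: "a", NO-BREAK SPACE (UTF-8 bytes C2 A0), "b"
my $octets = "a\xC2\xA0b";

# Work on the raw bytes: swap the two-byte sequence for a plain space
(my $cleaned = $octets) =~ s/\xC2\xA0/ /g;

# Decode afterwards; no U+00A0 is left to translate
my $decoded = decode("UTF-8", $cleaned);

print $decoded eq "a b" ? "replaced\n" : "not replaced\n";
```

This only helps if the input really is undecoded UTF-8; on an already decoded string the byte pair will not be there to match.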

John

Re: Perl: Win-32 vs. linux


two thoughts and a question:

1) does adding:

use utf8;

make any difference?

2) Print $decoded to a file before the tr/// step to see if it has a
&nbsp; string, a U+00A0 character, or just a mess. This will help in
finding where in the code the problem is.
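
One way to do the inspection in point 2, sketched under some assumptions: instead of writing the raw string to a file, show every character outside printable ASCII as its code point, so the output looks the same on any console. The helper show_codepoints and the three sample strings are hypothetical stand-ins for $decoded.

```perl
use strict;
use warnings;

# Return a copy of the text in which every character outside
# printable ASCII is shown as <U+XXXX>
sub show_codepoints {
    my ($text) = @_;
    return join '', map {
        my $cp = ord;
        ($cp >= 0x20 && $cp < 0x7F) ? $_ : sprintf('<U+%04X>', $cp);
    } split //, $text;
}

# Hypothetical samples standing in for $decoded
print show_codepoints("a\x{A0}b"), "\n";     # a real no-break space
print show_codepoints("a&nbsp;b"), "\n";     # an entity reference
print show_codepoints("a\xC2\xA0b"), "\n";   # undecoded UTF-8 octets
```

A single <U+00A0> means tr/// has a character to work on; a literal &nbsp; or the pair <U+00C2><U+00A0> means the problem lies before the tr/// step.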


Where in the script is the &nbsp; string in the original HTML being parsed
into a U+00A0? Does get() do this automatically? I can't see anything in
the documentation that mentions this, and I can't see why it would happen on
a Linux system and not on Windows.


Re: Perl: Win-32 vs. linux

On Thu, 2 Jun 2005, John Bokma wrote:


I don't know why you think it's appropriate to "guess" this.  The Perl
documentation is pretty clear about how characters are stored
internally (perldoc perlunicode), and if there *was* a difference, one
would expect to find it in the appropriate platform-specific perl
documentation.
The only thing that comes to mind is if the code calls Win32 *system*
functions, it may be necessary to run it with "wide system calls"
enabled.  But that doesn't appear to be happening here.

If this was my problem, I'd be inclined to prepare a small test
document which I /knew/ contained these actual characters (as opposed
to containing &nbsp; character entity references, I mean), rather than
relying on some massive web document from elsewhere; and print out in
detail what's going on internally.  But that's only for diagnostic
purposes: Perl's unicode implementation works best when you just use
it, not mess around with internals.
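
A small test along those lines might look like the sketch below. It assumes nothing about the real page: the test "document" is a hypothetical four-byte string with one known no-break space, decoded and then inspected before and after tr///.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical test document: known UTF-8 octets, with one real
# NO-BREAK SPACE (bytes C2 A0) between "x" and "y"
my $octets = "x\xC2\xA0y";

my $decoded = decode("UTF-8", $octets);

printf "octets:  %d bytes\n", length($octets);
printf "decoded: %d characters, code points: %s\n",
    length($decoded),
    join(' ', map { sprintf 'U+%04X', ord } split //, $decoded);

# tr/// returns the number of characters it translated
my $count = ($decoded =~ tr/\xA0/ /);
print "tr replaced $count character(s)\n";
```

Running the same file under both Perl installations and comparing the printed lines would show whether the two really behave differently on identical input.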

(I really can't be bothered to wade through the whole mess of HTML and
javascript contained at the cited URL to get further with this.)


It worries me that the questioner writes:

| The tr/// operation successfully removes UTF-8 encoded &nbsp;
| characters from the string in linux, but not Win-32

I see that the source contains quite a number of &nbsp; character
entity references.  So the question is, are we really talking about
no-break space *characters*, or are we talking about their character
entity references?
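
The distinction matters for the code, as this sketch tries to show. The sample string is hypothetical and contains both forms; tr/// touches only the literal character, while the entity reference is six ordinary ASCII characters and needs its own substitution.

```perl
use strict;
use warnings;

# Hypothetical decoded text containing both forms
my $text = "one&nbsp;two\x{A0}three";

# tr/// only sees the literal character U+00A0 ...
(my $chars_only = $text) =~ tr/\xA0/ /;

# ... while the entity reference has to be handled separately
(my $both = $chars_only) =~ s/&nbsp;/ /g;

print "$chars_only\n";   # one&nbsp;two three
print "$both\n";         # one two three
```

For real HTML, decode_entities from HTML::Entities is probably a better fit than a hand-written substitution, since it turns &nbsp; into the U+00A0 character along with all the other named entities.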

If we're really talking about *characters*, then note the "Caveat" in
the documentation for decode() in Encode:

 When you run $string = decode("utf8", $octets), then $string may not
 be equal to $octets. Though they both contain the same data, the utf8
 flag for $string is on unless $octets entirely consists of ASCII data
 (or EBCDIC on EBCDIC machines). See The UTF-8 flag below.

There's too much fiddling with internals going on here, IMHO.  Perl's
unicode implementation usually works best when you just use it.  The
web document in question is sent as utf-8 from its server, by the way.
