UTF-8 in regexp with 5.8.1
Posted by Wes Groleau on April 11, 2005, 12:22 am
I have a file containing thousands of Spanish words, encoded (AFAIK)
in UTF-8. I also have a perl script in UTF-8, which says (hope this posts intact):
#!/usr/bin/perl -w -CSD
# NOTE: The extra space in the bang line is mandatory (bug in perl 5.8)
print if ( /ñ/ )
What is in the regexp is supposed to be "small n with tilde"
and I verified with od -xc that it is hex C3 B1 as is every
place in the file where that letter appears.
The script is intended to find all words containing that
letter. But it finds nothing. After wading through gallons
of text (man encoding, man utf8, man perlunicode, etc.),
I still had no reason to think it was wrong. But I added
use encoding "utf8";
and ran it again, getting only:
Ze-Admins-Computer:~/Desktop wgroleau$ char-find palabras.utf8
Malformed UTF-8 character (unexpected non-continuation byte 0x00,
immediately after start byte 0xde) at
/Volumes/Parents/wgroleau/bin/char-find line 12.
?!? According to 'od -xc' the script does NOT contain any
byte that is 0xde. In fact, the ONLY bytes in the script that
are not ASCII are the bytes for the "enye" which are on line
twelve, but neither of them is a DE and NO bytes are 00.
Have I found a bug in perl or is my ignorance just getting
the best of me?
Oh, yeah, I also tried a few things with 'binmode' that didn't work.
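For reference, the od -xc check described above can also be done from inside Perl. This is only a minimal sketch: the word "año" is a made-up sample rather than a line from the real file, and it is spelled with \x escapes so the snippet itself stays pure ASCII.

```perl
use strict;
use warnings;

# Hypothetical sample word ("a", n-with-tilde, "o") written as raw bytes.
# \xC3\xB1 is the two-byte UTF-8 encoding of U+00F1.
my $word = "a\xC3\xB1o";

# Dump each byte as two hex digits, much like od would.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $word;

print "$hex\n";
```

If the letter is stored as UTF-8, the dump should show the C3 B1 pair wherever it occurs.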
Re: UTF-8 in regexp with 5.8.1
On Sun, 10 Apr 2005, Wes Groleau wrote:
> I have a file containing thousands of Spanish words, encoded (AFAIK)
> in UTF-8.
Well, your whole report stands or falls by that "AFAIK", so it might
be useful to have a test case, including data, which we could run for
ourselves (preferably on a web page, to exclude any possibility of
lossage in usenet postings) to help pin down your problem.
> I also have a perl script in UTF-8,
Noted, although I don't see any compelling reason to code the script
itself in utf-8. Sure, you /can/ do, but it seems to me to be a
potential additional complication that one could do well to avoid.
> #!/usr/bin/perl -w -CSD
> # NOTE: The extra space in the bang line is mandatory (bug in perl 5.8)
Do you have a cite on that? My knowledge of this area is admittedly
somewhat limited, but I hadn't met this before.
> What is in the regexp is supposed to be "small n with tilde"
> and I verified with od -xc that it is hex C3 B1 as is every
> place in the file where that letter appears.
Sounds good. That even seems to have worked in your usenet posting,
as far as I can see.
> use encoding "utf8";
> and ran it again, getting only:
> Ze-Admins-Computer:~/Desktop wgroleau$ char-find palabras.utf8
> Malformed UTF-8 character (unexpected non-continuation byte 0x00,
> immediately after start byte 0xde) at
> /Volumes/Parents/wgroleau/bin/char-find line 12.
> According to 'od -xc' the script does NOT contain any
> byte that is 0xde. In fact, the ONLY bytes in the script that
> are not ASCII are the bytes for the "enye" which are on line
> twelve, but neither of them is a DE and NO bytes are 00.
I've successfully processed utf-8 and utf-16 data without the use of
the -C flag(s), by using explicit binmode() on the relevant files.
If you could at least get one working variant of your script, you
could then at least move forward from there.
Sorry, this is a bit inconclusive, as yet.
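The explicit binmode() approach mentioned above might look roughly like the following. It is a sketch under assumptions, not the code actually used: the temporary file and its three words are made-up stand-ins for palabras.utf8, and the pattern names the letter as \x{F1} so that the script itself contains only ASCII.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for the word list: three words as raw UTF-8 bytes.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out);                          # write the bytes untouched
print {$out} "a\xC3\xB1o\n", "casa\n", "ni\xC3\xB1o\n";
close($out) or die "close: $!";

open(my $in, '<', $path) or die "open $path: $!";
binmode($in, ':encoding(UTF-8)');       # decode bytes into characters on read

my @hits;
while (my $word = <$in>) {
    chomp $word;
    # \x{F1} is U+00F1 by code point, so no non-ASCII byte is needed here.
    push @hits, $word if $word =~ /\x{F1}/;
}
close($in);

binmode(STDOUT, ':encoding(UTF-8)');    # encode characters back to bytes on output
print "$_\n" for @hits;
```

Once the input is decoded, each matching word holds the letter as a single character, so the two words containing it are found and "casa" is skipped.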
Re: UTF-8 in regexp with 5.8.1
Alan J. Flavell wrote:
Well, I told my editor to save it as UTF-8, and I think it works.
(When I save web pages that way, and specify UTF-8 in a META tag,
Spanish, French, Polish, and Japanese characters are correctly
rendered by most browsers.)
Well, in this case, I am trying to regexp a non-ASCII character.
Since I am an easily-distracted (A.D.D.) type, and I work with
several different character sets, I am attempting to standardize
on UTF-8 rather than constantly be debugging places where I forgot
to make a switch. :-)
Oh, I reported that a while back. If I take the space out
on Mac OS X, I get frequent segment violations. If I remove
the space on NetBSD/Alpha, I get consistent nasty-grams about
the wrong method of invoking the debugger.
I tried a couple of things with binmode that also didn't work,
but I don't remember exactly what happened.
Well, a post in another thread made me try removing the
"use utf8" and it worked. So, I really think this is a bug:
- A regexp containing a non-ASCII character in
correct UTF-8 encoding works.
- Add "use utf8" and it silently stops working.
- Add 'use encoding "utf8"' and you get chewed out
for having invalid UTF-8, in a message that bitches
about the presence of bytes that don't exist.
I'll send it in .....
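One possible reading of the first two points, offered as a guess and not a diagnosis: with "use utf8" the literal in the pattern is one character (U+00F1), while input that has not been decoded still holds the two bytes C3 B1, so the two cannot match. The sketch below imitates both situations with Encode::decode. The variable names and the sample word are invented for illustration, and it does not touch the 'use encoding' error at all.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $raw_word = "a\xC3\xB1o";                # undecoded input: 4 bytes
my $raw_pat  = "\xC3\xB1";                  # the literal without "use utf8": 2 bytes
my $char_pat = decode('UTF-8', $raw_pat);   # the literal under "use utf8": 1 character

# Bytes against bytes: matches, as in the script without "use utf8".
my $bytes_vs_bytes = ($raw_word =~ /$raw_pat/) ? 1 : 0;

# Undecoded bytes against a character pattern: no match.
my $bytes_vs_char = ($raw_word =~ /$char_pat/) ? 1 : 0;

# Decoded characters against a character pattern: matches again.
my $chars_vs_char = (decode('UTF-8', $raw_word) =~ /$char_pat/) ? 1 : 0;

print "bytes vs bytes: $bytes_vs_bytes\n";
print "bytes vs char:  $bytes_vs_char\n";
print "chars vs char:  $chars_vs_char\n";
```

If that is what is going on, the pattern and the data have to be on the same side of the decode step: either both raw bytes or both decoded characters.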
He that is good for making excuses, is seldom good for anything else.
-- Benjamin Franklin