Unicode: is decode-process-encode a "good" approach?

Thanks to Alan and Shawn for their replies to my last posting. I read a lot
of docs before and after, and still do, but it's all very confusing.

Finally I found an approach that actually works for me, and I wanted
to ask you if this makes sense and *might* even keep working, or if it
is just asking for trouble.

I read parameters delivered by the web browser (the HTML header is always
UTF-8 !!), and want to sort and lowercase them and print them out again.
I don't set STDIN and STDOUT to ":utf8", because this does not work with

my $input=$cgi->param('myfield');
utf8::downgrade($input);     # otherwise sort will not sort according to
                              # my LC_COLLATE setting, and I need a
                              # localized sort (mainly German data)

my $value=do_a_lot($input);  # do some dataprocessing including sorting

utf8::upgrade($value);       # otherwise the lc() in the next line would
                              # not lower chars like german umlauts
utf8::downgrade($value);     # to make sort work again

$value=do_a_lot_more($value); # do some more dataprocessing and sorting

print $value;

So is it ok to get the data somehow "raw" from the webinterface, then
decode it, process it and encode it again to print it out or is this a
rather stupid approach?

Is it normal that I need to decode values delivered by a web page that
has the UTF-8 charset in its header?
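
For illustration, the decode-process-encode flow being asked about can be sketched with the Encode module as below. This is only a minimal sketch: `lowercase_field` is a hypothetical helper name, the "processing" is reduced to a single lc(), and error handling for malformed input is left out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes the raw UTF-8 bytes of one form field and
# returns the lowercased value as UTF-8 bytes again.
sub lowercase_field {
    my ($raw_bytes) = @_;

    # 1. decode: UTF-8 bytes -> Perl character string
    my $text = decode('UTF-8', $raw_bytes);

    # 2. process: lc() works on characters here, so umlauts are lowered too
    my $lower = lc $text;

    # 3. encode: characters -> UTF-8 bytes, ready to print
    return encode('UTF-8', $lower);
}

# "\xC3\x84" is the UTF-8 byte sequence for U+00C4 (capital A with diaeresis).
my $out = lowercase_field("\xC3\x84PFEL");
```

The point of the sketch is that all processing happens on the decoded character string, and bytes only exist at the edges.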

Is it OK to clear the UTF-8 flag to make sorting work in a locale-aware way
and set the flag again to make lc() work?  Or does this just show that
there is something wrong in my script?
If I used Unicode::Collate I would not need this fiddling with UTF-8, but
it is very slow (because it loads the big allkeys.txt file) and might
cause trouble in multithreaded applications (as I read somewhere).
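
For comparison, a minimal Unicode::Collate sketch is shown below. It assumes the default collation table (no German-specific tailoring) and made-up sample words. The collator is built once and reused, since construction is where the loading cost sits.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Build the collator once; creating it is the expensive part.
my $collator = Unicode::Collate->new;

# Sample data (an assumption for this sketch), written with escapes:
# \x{C4} is capital A with diaeresis, \x{D6} is capital O with diaeresis.
my @words  = ('Zebra', "\x{C4}pfel", 'apfel', "\x{D6}sterreich", 'Ofen');

# sort() compares by collation keys, so the umlauts sort next to their
# base letters instead of after "Z".
my @sorted = $collator->sort(@words);
```

The strings passed to the collator must already be decoded character strings, not UTF-8 bytes.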

I did not provide a full script, because this posting is long enough as
it is. Hope this is OK.

I also tried to replace the utf8::encode/decode with Encode::from_to but
failed so far, because I actually don't know from what to what I want to
convert. One side is UTF-8, but what is the other side?

Thanks a lot,


Re: Unicode: is decode-process-encode a "good" approach?

I will say, as I often have: I would recommend using :encoding(utf8)
rather than :utf8, as you can then handle malformed UTF-8 properly.
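
A minimal sketch of an encoding layer on a filehandle follows. Two things here are assumptions of the sketch: it uses the strict spelling :encoding(UTF-8) rather than :encoding(utf8), and it uses in-memory filehandles as stand-ins for a real file or STDIN/STDOUT.

```perl
use strict;
use warnings;

# Reading: the layer decodes UTF-8 bytes into characters for us.
# An in-memory filehandle stands in for a real input handle here.
my $bytes = "Gr\xC3\xBC\xC3\x9Fe\n";    # the German word "Gruesse" with umlaut and sharp s, as UTF-8 bytes
open my $in, '<:encoding(UTF-8)', \$bytes or die "open: $!";
my $line = <$in>;
close $in;
chomp $line;                            # a 5-character string now

# Writing: the layer encodes characters back to UTF-8 bytes.
my $buf = '';
open my $out, '>:encoding(UTF-8)', \$buf or die "open: $!";
print {$out} "\x{FC}";                  # U+00FC, small u with diaeresis
close $out;                             # $buf now holds the two bytes C3 BC
```

For an already open handle such as STDOUT, the same layer can be applied with binmode(STDOUT, ':encoding(UTF-8)').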

I would use Encode::decode here, as you'll get better error handling.
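
As a minimal sketch of what that error handling can look like: passing Encode::FB_CROAK as the third argument makes decode() die on malformed input, which can be caught with eval. The helper name `decode_or_undef` is hypothetical.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the decoded character string, or undef
# if the bytes are not well-formed UTF-8.
sub decode_or_undef {
    my ($bytes) = @_;    # a copy, since decode() with a CHECK value may modify its argument
    my $text = eval { Encode::decode('UTF-8', $bytes, Encode::FB_CROAK) };
    return $text;        # undef when decode() died
}

my $good = decode_or_undef("Gr\xC3\xBC\xC3\x9Fe");   # valid UTF-8 bytes
my $bad  = decode_or_undef("Gr\xFC\xDFe");           # Latin-1 bytes, not valid UTF-8
```

Without a CHECK argument, decode() substitutes replacement characters for malformed input instead of dying, so the failure would pass silently.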

If you haven't specified that the FH is utf8, then you'll have to decode it
by hand.

Hmmmmmmm..... I think this is a bad idea. What if you have chars outside
ISO8859-1? I would strongly recommend using Encode::encode to convert it
to ISO8859-1 explicitly, and be prepared to handle errors.
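
A minimal sketch of such an explicit conversion, with a hypothetical helper name `to_latin1_or_undef`; Encode::FB_CROAK makes encode() die when a character has no ISO-8859-1 equivalent:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns ISO-8859-1 bytes, or undef if the string
# contains a character that ISO-8859-1 cannot represent.
sub to_latin1_or_undef {
    my ($text) = @_;     # a copy, since encode() with a CHECK value may modify its argument
    return eval { Encode::encode('iso-8859-1', $text, Encode::FB_CROAK) };
}

# U+00FC and U+00DF (u with diaeresis, sharp s) both exist in ISO-8859-1.
my $ok  = to_latin1_or_undef("Gr\x{FC}\x{DF}e");

# U+4E2D is a CJK character and has no ISO-8859-1 equivalent.
my $bad = to_latin1_or_undef("Gr\x{FC}\x{DF}e \x{4E2D}");
```

What to do in the undef case (skip the value, fall back to another sort, report an error) is a design decision the sketch leaves open.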

If you read perlunicode it tells you that Unicode and locales currently
don't play nicely together; I'd probably recommend doing something like

use Encode;
my $iso = Encode::encode 'iso8859-1' => $utf8;
{
    use locale;    # do the locale-dependent work (sort, lc, ...) on $iso here
}
$utf8 = Encode::decode 'iso8859-1' => $iso;

so that you don't try and use unicode data when locales are switched on.


               We do not stop playing because we grow old;
                  we grow old because we stop playing.

Re: Unicode: is decode-process-encode a "good" approach?

Thanks. I got around all these problems now by finding an appropriate
locale for my needs: "de_AT.UTF-8". I get the input from a
non-utf8 filehandle, decode it, and then everything works smoothly,
including sorting, lowercasing and pattern matching (see below). Then I
encode and print out to a non-utf8 filehandle again.

perlunicode states that this is discouraged, but it also explains a bit what
can happen, and in the end I don't have much of a choice but to use
Unicode and locales.
The data I need to process can definitely include many different
languages and charsets, and the handling (especially collation) should
definitely follow German rules (German text that can include words from
any other language, including Chinese and Hindi and other things I never
heard of). And it should be fast ....

Your idea above looks very smart and I'll definitely give it a very
close look.  Currently all my locale stuff works (almost all: see my
other new posting, where there is one construct that makes $s=~/$s/i fail !!).

Thanks a lot,

