Posted by Ben Morrow on February 26, 2008, 6:40 am
>
> I suggest you also look into, and play around with, the functions
> is_utf8(), _utf8_on(), _utf8_off() and from_to(). This will give you a
> good overall picture of how Perl does UTF-8, but you probably won't need
> these here.
No, don't touch any of the functions in the utf8:: namespace. They are
part of the internals of perl's Unicode implementation, and shouldn't be
used by ordinary Perl code; especially _utf8_on and _utf8_off. Use the functions
in Encode:: instead. Note also that from_to gives you a non-character
string as output: if you want to manipulate the characters from Perl,
you should use decode, and then encode to your desired output encoding
when you've done.
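A minimal sketch of the difference, assuming the sample CP1250 bytes from this thread (the variable names are only for illustration): from_to rewrites a byte string in place, while decode returns a character string that uc, lc and length treat as text.

```perl
use strict;
use warnings;
use Encode qw(decode encode from_to);

# Bytes in CP1250 (the sample string from this thread).
my $bytes = "\xec\x9a\xe8\xf8\x9e\xfd\xe1\xed\xe9";

# from_to converts in place, bytes to bytes; the result is still a byte string.
my $converted = $bytes;
from_to($converted, 'cp1250', 'UTF-8');

# decode gives a character string that uc/lc/length treat as text.
my $chars = decode('cp1250', $bytes);

printf "byte string length:      %d\n", length $converted;
printf "character string length: %d\n", length $chars;

# Encode only when the text leaves the program.
my $out = encode('UTF-8', uc $chars);
```

The two lengths differ because each of these Czech letters takes two bytes in UTF-8 but is a single character after decode.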
> > #!/usr/bin/perl
> > use strict;
> > use utf8;
> > use Unicode::Lite;
use Encode qw/encode decode/;
> > # string in cp1250 codepage
> > my $wintxt="\xec\x9a\xe8\xf8\x9e\xfd\xe1\xed\xe9";
> >
> > # convert to utf8
> > my $utftxt=convert('CP1250','UTF8',$wintxt);
my $txt = decode CP1250 => $wintxt;
> > print "<br>Win: $wintxt, ",uc($wintxt),"<br>utf8:$utftxt, ",uc($utftxt),"\n";
You can't 'uc' $wintxt: it is a binary string, not a character string,
so the operation is fairly meaningless (the fact perl will actually do
something fairly sensible is not a reason to rely on that). If you
*really* want mixed CP1250/UTF8 output, you need to do something like
    my $uctxt = uc $txt;
    printf "<br>Win: %s, %s<br>utf8: %s, %s\n",
        encode(CP1250 => $txt),
        encode(CP1250 => $uctxt),
        encode(utf8 => $txt),
        encode(utf8 => $uctxt);
but more likely you want to pick one encoding and stick to it. In that
case you can get perl to do the encoding for you with an :encoding
PerlIO layer (see PerlIO::encoding).
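As a rough sketch of the layer approach: the in-memory filehandles and variable names below are assumptions made to keep the example self-contained; a file name in place of \$input or \$output should behave the same way.

```perl
use strict;
use warnings;

# CP1250 bytes standing in for the contents of an input file.
my $input  = "\xec\x9a\xe8\xf8\x9e\xfd\xe1\xed\xe9\n";
my $output = '';

# The :encoding layers decode on read and encode on write.
open my $in,  '<:encoding(cp1250)', \$input  or die "open input: $!";
open my $out, '>:encoding(UTF-8)',  \$output or die "open output: $!";

while (my $line = <$in>) {
    print {$out} uc $line;   # the layer has already decoded $line
}
close $in;
close $out or die "close output: $!";

# $output now holds UTF-8 encoded bytes.
printf "%d bytes written\n", length $output;
```

With the layers in place, no explicit decode or encode call is needed anywhere in the loop.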
Ben
Posted by Petr Vileta on February 26, 2008, 8:50 am
Ben Morrow wrote:
>>
>>> use strict;
>>> use utf8;
>>> use Unicode::Lite;
>
> use Encode qw/encode decode/;
>
>>> # string in cp1250 codepage
>>> my $wintxt="\xec\x9a\xe8\xf8\x9e\xfd\xe1\xed\xe9";
>>>
>>> # convert to utf8
>>> my $utftxt=convert('CP1250','UTF8',$wintxt);
>
> my $txt = decode CP1250 => $wintxt;
>
>>> print "<br>Win: $wintxt, ",uc($wintxt),"<br>utf8:$utftxt,
>>> ",uc($utftxt),"\n";
>
> You can't 'uc' $wintxt: it is a binary string, not a character string,
> so the operation is fairly meaningless (the fact perl will actually do
> something fairly sensible is not a reason to rely on that). If you
> *really* want mixed CP1250/UTF8 output, you need to do something like
>
Thank you for the response.
My data source is in cp1250 so I *really* need to use this codepage. I want to
read data (in cp1250), convert it *as is* to utf8 and then use uc() or lc().
Do you mean this code will work properly?
use strict;
use utf8;
use Encode qw/encode decode/;
# string in cp1250 codepage
my $wintxt="\xec\x9a\xe8\xf8\x9e\xfd\xe1\xed\xe9";
# convert to utf8
my $utftxt = Encode::decode( 'CP1250', $wintxt );
print "<br>utf8:$utftxt, uppercase utf8: ", uc($utftxt), "\n";
--
Petr Vileta, Czech republic
(My server rejects all messages from Yahoo and Hotmail. Send me your
mail from another non-spammer site please.)
Please reply to <petr AT practisoft DOT cz>
Posted by Peter J. Holzer on February 26, 2008, 7:11 pm
> Ben Morrow wrote:
>>>> # string in cp1250 codepage
>>>> my $wintxt="\xec\x9a\xe8\xf8\x9e\xfd\xe1\xed\xe9";
>>>>
>>>> # convert to utf8
>>>> my $utftxt=convert('CP1250','UTF8',$wintxt);
>>
>> my $txt = decode CP1250 => $wintxt;
>>
>>>> print "<br>Win: $wintxt, ",uc($wintxt),"<br>utf8:$utftxt,
>>>> ",uc($utftxt),"\n";
>>
>> You can't 'uc' $wintxt: it is a binary string, not a character string,
>> so the operation is fairly meaningless (the fact perl will actually do
>> something fairly sensible is not a reason to rely on that). If you
>> *really* want mixed CP1250/UTF8 output, you need to do something like
>>
> Thank you for the response.
> My data source is in cp1250 so I *really* need to use this codepage. I want to
> read data (in cp1250),
What "data source" is this? If it's a file, you can do the
conversion in an I/O layer. If it's a database, the DBD driver may be
able to do the conversion.
> convert it *as is* to utf8 and then use uc() or lc(). Do you mean this
> code will work properly?
>
> use strict;
> use utf8;
> use Encode qw/encode decode/;
>
> # string in cp1250 codepage
> my $wintxt="\xec\x9a\xe8\xf8\x9e\xfd\xe1\xed\xe9";
> # convert to utf8
> my $utftxt = Encode::decode( 'CP1250', $wintxt );
Decode converts to a "character string". This happens to be represented
in UTF-8 internally (which is unfortunately visible in many function
names), but don't think of it that way.
> print "<br>utf8:$utftxt, uppercase utf8: ", uc($utftxt), "\n";
For the output, you need to encode the strings. (You can only send bytes
over a socket, not characters). You can do this explicitly:
print encode('utf-8', "<br>utf8:$utftxt, uppercase utf8: " . uc($utftxt) .
"\n");
or let the perl I/O system do it implicitly:
binmode STDOUT, ":utf8";
Needless to say I find the latter much simpler and less error-prone.
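A small sketch comparing the two approaches. The in-memory handle is an assumption that stands in for STDOUT so the bytes can be compared, and the sketch uses :encoding(UTF-8), a stricter relative of the :utf8 layer named above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $text = decode('cp1250', "\xec\x9a\xe8\xf8\x9e\xfd\xe1\xed\xe9");
my $line = "<br>utf8: $text, uppercase utf8: " . uc($text) . "\n";

# Explicit: encode by hand, then print the bytes.
my $explicit = encode('UTF-8', $line);

# Implicit: let a layer encode on output (an in-memory handle stands in
# for STDOUT here so the result can be compared).
my $implicit = '';
open my $fh, '>', \$implicit or die "open: $!";
binmode $fh, ':encoding(UTF-8)';
print {$fh} $line;
close $fh or die "close: $!";

print $explicit eq $implicit ? "same bytes\n" : "different bytes\n";
```

The layer version only has to be set up once per filehandle, which is why it tends to be less error-prone than encoding each string by hand.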
hp
Posted by Ben Morrow on February 26, 2008, 8:52 pm
> Ben Morrow wrote:
> >
> > my $txt = decode CP1250 => $wintxt;
> >
> >>> print "<br>Win: $wintxt, ",uc($wintxt),"<br>utf8:$utftxt,
> >>> ",uc($utftxt),"\n";
> >
> > You can't 'uc' $wintxt: it is a binary string, not a character string,
> > so the operation is fairly meaningless (the fact perl will actually do
> > something fairly sensible is not a reason to rely on that). If you
> > *really* want mixed CP1250/UTF8 output, you need to do something like
>
> My data source is in cp1250 so I *really* need to use this codepage.
Your data source (input) and your output do not need to be in the same
encoding.
> I want to read data (in cp1250), convert it *as is* to utf8 and then
> use uc() or lc().
No, that is meaningless. uc and lc work on characters, not on
UTF8-encoded byte strings. So you want to
1. read in binary data (make sure you use binmode on your
filehandles, btw),
2. convert that binary data into characters, using the CP1250
encoding (Encode::decode),
3. uc or lc those characters,
4. convert them back into binary data, using any encoding of your
choice (Encode::encode),
5. write that data out to a filehandle (again, make sure you use
binmode).
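The five steps above can be sketched roughly as follows. The in-memory handles and variable names are assumptions that stand in for real files, and UTF-8 is just one possible choice of output encoding.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-ins for real files: in-memory handles over byte buffers.
my $source = "\xec\x9a\xe8\xf8\x9e\xfd\xe1\xed\xe9";
my $sink   = '';
open my $in,  '<', \$source or die "open source: $!";
open my $out, '>', \$sink   or die "open sink: $!";

binmode $in;                              # 1. read raw bytes
my $bytes = do { local $/; <$in> };
my $chars = decode('cp1250', $bytes);     # 2. bytes -> characters
my $upper = uc $chars;                    # 3. work on characters
my $encoded = encode('UTF-8', $upper);    # 4. characters -> bytes
binmode $out;                             # 5. write raw bytes
print {$out} $encoded;
close $out or die "close sink: $!";
```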
> Do you mean this code will work properly?
>
> use strict;
> use utf8;
> use Encode qw/encode decode/;
>
> # string in cp1250 codepage
> my $wintxt="\xec\x9a\xe8\xf8\x9e\xfd\xe1\xed\xe9";
> # convert to utf8
> my $utftxt = Encode::decode( 'CP1250', $wintxt );
This far is good.
> print "<br>utf8:$utftxt, uppercase utf8: ", uc($utftxt), "\n";
No, you've missed a step: read the code I posted again. You can't just
print character data to a filehandle: you'll get 'Wide character in
print' warnings, and you'll get output in perl's internal data format,
which is an incomprehensible mixture of ISO8859-1 and UTF8. You have to
convert the characters back into bytes, using any encoding of your
choice.
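A sketch of what that looks like in practice. The helper raw_print is hypothetical; it prints to an in-memory handle with no encoding layer and returns the bytes that arrive together with any warning raised.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Print a character string to a handle with no encoding layer and
# return the bytes that actually arrive, plus any warning raised.
sub raw_print {
    my ($chars) = @_;
    my ($buf, $warning) = ('', '');
    local $SIG{__WARN__} = sub { $warning = shift };
    open my $fh, '>', \$buf or die "open: $!";
    print {$fh} $chars;
    close $fh;
    return ($buf, $warning);
}

my ($low_bytes)            = raw_print(decode('cp1250', "\xe1"));  # U+00E1
my ($high_bytes, $warning) = raw_print(decode('cp1250', "\xec"));  # U+011B

printf "U+00E1 -> %d byte(s)\n", length $low_bytes;
printf "U+011B -> %d byte(s)\n", length $high_bytes;
print  "warning: $warning" if $warning;
```

A string whose characters all fit below U+0100 goes out as single ISO8859-1 bytes, while a string containing anything higher goes out as UTF-8 along with the warning. So whether unencoded output happens to look right depends on the data.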
Ben
Posted by Petr Vileta on February 27, 2008, 8:25 pm
Ben Morrow wrote:
>> My data source is in cp1250 so I *really* need to use this codepage.
>
> Your data source (input) and your output do not need to be in the same
> encoding.
>
Yes, and they must NOT be ;-) Maybe my English is too poor. I wanted to say that I must
use cp1250 because my source data is encoded in this codepage, but I want the output
to be in utf8.
> 1. read in binary data (make sure you use binmode on your
> filehandles, btw),
I'm not sure whether I must do this every time. My data comes from $ARGV, from
param() (CGI module), from a disk file or from the LWP module.
> 2. convert that binary data into characters, using the CP1250
> encoding (Encode::decode),
Sure
> 3. uc or lc those characters,
This is what I was asking for ;-)
> 4. convert them back into binary data, using any encoding of your
> choice (Encode::encode),
> 5. write that data out to a filehandle (again, make sure you use
> binmode).
Must I really do that? My output goes to a browser; in other words, my script is a CGI
script on a web server (Linux/Apache).
I tested output without Encode::encode and it works as I need.
>> my $utftxt = Encode::decode( 'CP1250', $wintxt );
>
> This far is good.
>
>> print "<br>utf8:$utftxt, uppercase utf8: ", uc($utftxt), "\n";
>
> No, you've missed a step: read the code I posted again. You can't just
> print character data to a filehandle: you'll get 'Wide character in
> print' warnings, and you'll get output in perl's internal data format,
> which is an incomprehensible mixture of ISO8859-1 and UTF8. You have
> to convert the characters back into bytes, using any encoding of your
> choice.
Hmm, curious, but this works for me. IMHO all characters in the range \x20 - \x7f
are at the same positions in utf8, right? And all national characters
\x80 - \xff were converted to utf8 by Encode::decode( 'CP1250', $wintxt ).
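The assumption about the \x20 - \x7f range can be checked with a short sketch like this one (the variable names are only for illustration). Note that it encodes explicitly, rather than relying on what print does with a decoded string.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Every byte from \x20 to \x7f, treated as CP1250 input.
my $ascii = join '', map { chr } 0x20 .. 0x7f;

# Decode from CP1250, then encode as UTF-8.
my $utf8 = encode('UTF-8', decode('cp1250', $ascii));

print $utf8 eq $ascii ? "ASCII range unchanged\n" : "ASCII range differs\n";

# A byte above \x7f grows to two bytes in UTF-8.
my $accented = encode('UTF-8', decode('cp1250', "\xe1"));
printf "\\xe1 -> %d bytes\n", length($accented);
```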
--
Petr Vileta, Czech republic
(My server rejects all messages from Yahoo and Hotmail. Send me your
mail from another non-spammer site please.)
Please reply to <petr AT practisoft DOT cz>