Posted by maria on February 29, 2008, 9:53 pm
On Fri, 29 Feb 2008 10:50:10 +0000, RedGrittyBrick
>Petr Vileta wrote:
>> maria wrote:
>>> I am using a CGI program to read XML files and extract their various
>>> items. Somehow, my program converts the apostrophe "’" to ...
>>> "\â\?\T". How do I program my CGI program to convert "’" to
>>> an apostrophe, "'"? Is there a little CGI code that will convert
>>> all these different strings (including dagger, ellipsis,
>>> euro symbol, double quote, etc.) to their ASCII equivalents?
>>> Thank you very much.
>>>
>> You can use s/// for this.
>>
>> my $xml = 'some maria’s text';
>> $xml =~ s/’/'/sg;
>> print $xml;
>>
>
>I know you're answering the question as asked, but I think there's more
>to the problem than that.
>
>Maria's problem is expressed a bit vaguely but let's assume that her XML
>file does indeed contain the eight ASCII-character sequence ’
>Something in her Perl program is clearly converting this to a single
>character \u2019 using UTF-8 encoding. She is then causing this to be
>misrepresented to the browser as Windows Latin-1.
>
>Presumably the translation of the numeric entity reference to a Unicode
>character is done by an XML module Maria is using to read the file. If
>so, you'd have to stop that conversion happening or amend your
>substitution statement accordingly.
>
>Also I'm not sure what "ASCII equivalent" you'd propose for Maria's
>dagger, ellipsis and Euro symbols :-)
>
>s/†/+/sg; ?
>s/…/.../sg; ?
>s/€/EUR/sg; ?
>
>Suppose her XML might contain asterism, per-mille and other punctuation?
>The list of substitutions would get rather cumbersome. Maria might
>struggle to think up or locate ASCII equivalents for some of these. The
>results may be rather ugly and confusing to readers.
>
>Why bother when XML, Perl, HTML, HTTP and web-browsers can all handle
>UTF-8 characters properly?
>
>Personally I'd not make ASCII substitutions. IE7 and Firefox 2 can render
>Unicode general punctuation symbols, whether encoded as numeric entities
>or as UTF-8 characters. I guess the same is true of most mainstream
>browsers.
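
To illustrate the point quoted above about UTF-8 being misrepresented as Windows Latin-1, here is a minimal Perl sketch. It assumes the core Encode module is available; the helper names misread_as_cp1252 and utf8_response are hypothetical, made up for this example, and are not from any XML or CGI module. The first helper reproduces the kind of garbling described (a-circumflex followed by two stray characters), and the second shows one possible way to label the output as UTF-8 so the browser decodes it correctly.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: what a reader sees when UTF-8 bytes are
# mistaken for Windows-1252 (the "Windows Latin-1" mentioned above).
sub misread_as_cp1252 {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);   # U+2019 becomes bytes E2 80 99
    return decode('cp1252', $bytes);      # each byte is read as its own character
}

# Hypothetical helper: build a CGI response (as bytes) whose header
# tells the browser that the body is UTF-8.
sub utf8_response {
    my ($text) = @_;
    my $header = "Content-Type: text/html; charset=utf-8\r\n\r\n";
    return $header . encode('UTF-8', $text);
}

my $apostrophe = "\x{2019}";              # RIGHT SINGLE QUOTATION MARK

# Demonstration only: print the garbled form to a UTF-8 terminal.
binmode STDOUT, ':encoding(UTF-8)';
print "garbled: ", misread_as_cp1252("maria${apostrophe}s text"), "\n";
```

In a real CGI script the response from utf8_response is already encoded, so it would be printed to a raw STDOUT (plain binmode STDOUT with no encoding layer) to avoid encoding it twice.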
I also forgot to tell you that the MagPie program actually has a
PHP setting, "define('MAGPIE_OUTPUT_ENCODING', 'UTF-8');", that
resolves several of the common problems with the conversion of
special characters.
Thanks!
maria