
Converting "’" to an Apostrophe?

comp.lang.perl.misc
Converting "’" to an Apostrophe? maria 02-27-2008
Posted by maria on February 29, 2008, 6:26 pm
On Fri, 29 Feb 2008 04:29:00 +0100, Gunnar Hjalmarsson wrote:

>Petr Vileta wrote:
>> Tad J McClellan wrote:
>>> The s///s modifier makes dot in your pattern match a newline.
>>>
>>> It is a no-op when your pattern does not have a dot in it.
>>
>> I have seen it many times, but my book says something a bit different.
>>
>> <cite>
>> Larry Wall, Tom Christiansen & Randal L. Schwartz
>> Programming in Perl (Czech translation, 1997)
>>
>> Chapter 2: Basic program parts, page 74
>>
>> Modifiers:
>> ....
>> s Work with string as with single line.
>> </cite>
>>
>> There is not a word anywhere about the dot in a regexp.
>
>That's because the book explains it just as ambiguously as "perldoc perlop"
>does.
>
>AFAIK, there is one place in the perldoc, and one place only, where the
>/s modifier (and the /m modifier) are properly explained: "perldoc perlre":
>
> s Treat string as single line. That is, change "." to match any
> character whatsoever, even a newline, which normally it would
> not match.
>
>> I wrote many s/// for strings
>> containing \n where there weren't any dots in the pattern, and on some
>> servers this worked without the /s modifier but on others it did not.
>
>Really? Please show us an example!

Thank you, Gunnar, RedGrittyBrick, Petr and A. Sinan Unur.
I appreciate your help.
I have changed my approach. I am using Magpie and

$rss = fetch_rss($url);

instead. It works fine.

maria
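
The /s point quoted above (it only changes what "." matches) can be checked with a short script. This is a minimal sketch, not taken from anyone's post; the two-line test string and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical test string containing a newline.
my $text = "first line\nsecond line";

# No dot in the pattern: the literal \n matches with or without /s.
(my $with_s    = $text) =~ s/line\nsecond/X/s;
(my $without_s = $text) =~ s/line\nsecond/X/;
print "no dot in pattern: same result\n" if $with_s eq $without_s;

# A dot in the pattern: "." matches the newline only under /s.
my $dot_plain = ($text =~ /line.second/)  ? 1 : 0;
my $dot_s     = ($text =~ /line.second/s) ? 1 : 0;
print "dot without /s: $dot_plain, dot with /s: $dot_s\n";
```

If a substitution without a dot really behaves differently with and without /s on some server, the cause is probably somewhere else (for example \r\n line endings in the data).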

Posted by RedGrittyBrick on February 29, 2008, 5:50 am
Petr Vileta wrote:
> maria wrote:
>> I am using a CGI program to read XML files and extract their various
>> items. Somehow, my program converts the apostrophe "&#x2019;" to ...
>> "â€™". How do I program my CGI program to convert "&#x2019;" to
>> an apostrophe, "'"? Is there a little CGI code that will convert
>> all these different strings (including dagger, ellipsis,
>> euro symbol, double quote, etc.) to their ASCII equivalents?
>> Thank you very much.
>>
> You can use s/// for this.
>
> my $xml = 'some maria&#x2019;s text';
> $xml =~ s/&#x2019;/'/sg;
> print $xml;
>

I know you're answering the question as asked, but I think there's more
to the problem than that.

Maria's problem is expressed a bit vaguely, but let's assume that her XML
file does indeed contain the eight-character ASCII sequence &#x2019;.
Something in her Perl program is clearly converting this to a single
character \u2019 using UTF-8 encoding. She is then causing this to be
misrepresented to the browser as Windows Latin-1.

Presumably the translation of the numeric entity reference to a Unicode
character is done by an XML module Maria is using to read the file. If
so, you'd have to stop that conversion happening or amend your
substitution statement accordingly.

Also I'm not sure what "ASCII equivalent" you'd propose for Maria's
dagger, ellipsis and Euro symbols :-)

s/&#x2020;/+/sg; ?
s/&#x2026;/.../sg; ?
s/&#x20AC;/EUR/sg; ?

Suppose her XML might contain asterism, per-mille and other punctuation?
The list of substitutions would get rather cumbersome. Maria might
struggle to think up or locate ASCII equivalents for some of these. The
results may be rather ugly and confusing to readers.
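
For anyone who does want to go down that road, a lookup table keeps the list in one place. This is only a sketch: the %ascii_for hash and the entities_to_ascii name are hypothetical, the replacements are the guesses from above, and it assumes the text still contains the raw hex entities.

```perl
use strict;
use warnings;

# Hypothetical mapping from hex code point to an ASCII stand-in.
my %ascii_for = (
    '2019' => "'",      # right single quotation mark
    '2020' => '+',      # dagger
    '2026' => '...',    # horizontal ellipsis
    '20AC' => 'EUR',    # euro sign
);

sub entities_to_ascii {
    my ($text) = @_;
    # Replace hex entities found in the table; leave all others untouched.
    $text =~ s{&#x([0-9A-Fa-f]+);}
              { exists $ascii_for{uc $1} ? $ascii_for{uc $1} : "&#x$1;" }ge;
    return $text;
}

print entities_to_ascii('maria&#x2019;s price: 5 &#x20AC;&#x2026; &#x2030;'), "\n";
```

The per-mille entity at the end has no table entry, so it passes through unchanged, which is exactly the gap described above.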

Why bother when XML, Perl, HTML, HTTP and web-browsers can all handle
UTF-8 characters properly?

Personally I'd not make ASCII substitutions; IE7 and Firefox 2 can render
Unicode general punctuation symbols, whether encoded as numeric entities
or as UTF-8 characters. I guess the same is true of most mainstream
browsers.
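
The misrepresentation described above can be reproduced in a few lines. This is a sketch under the same assumption (the entity is decoded to U+2019 and then written out as UTF-8); it uses the core Encode module, with "cp1252" standing in for what a browser does when it guesses Windows Latin-1.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Decode the numeric entity into the single character U+2019.
my $text = 'maria&#x2019;s';
$text =~ s/&#x([0-9A-Fa-f]+);/chr(hex($1))/ge;

# Encoded as UTF-8, U+2019 becomes the three bytes E2 80 99.
my $bytes = encode('UTF-8', $text);

# Reading those bytes as Windows-1252 yields three separate characters.
my $misread = decode('cp1252', $bytes);

# Print code points rather than the characters, so the output is plain ASCII.
my $seen = join ' ', map { sprintf('U+%04X', ord $_) } split //, substr($misread, 5, 3);
print "$seen\n";    # U+00E2 U+20AC U+2122
```

Those three code points are â, € and ™, which is the usual sign of UTF-8 output being read as Windows-1252. Sending the page with "Content-Type: text/html; charset=utf-8" is one way to tell the browser what the bytes really are.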

Posted by maria on February 29, 2008, 9:53 pm
On Fri, 29 Feb 2008 10:50:10 +0000, RedGrittyBrick wrote:

>Petr Vileta wrote:
>> maria wrote:
>>> I am using a CGI program to read XML files and extract their various
>>> items. Somehow, my program converts the apostrophe "&#x2019;" to ...
>>> "â€™". How do I program my CGI program to convert "&#x2019;" to
>>> an apostrophe, "'"? Is there a little CGI code that will convert
>>> all these different strings (including dagger, ellipsis,
>>> euro symbol, double quote, etc.) to their ASCII equivalents?
>>> Thank you very much.
>>>
>> You can use s/// for this.
>>
>> my $xml = 'some maria&#x2019;s text';
>> $xml =~ s/&#x2019;/'/sg;
>> print $xml;
>>
>
>I know you're answering the question as asked, but I think there's more
>to the problem than that.
>
>Maria's problem is expressed a bit vaguely, but let's assume that her XML
>file does indeed contain the eight-character ASCII sequence &#x2019;.
>Something in her Perl program is clearly converting this to a single
>character \u2019 using UTF-8 encoding. She is then causing this to be
>misrepresented to the browser as Windows Latin-1.
>
>Presumably the translation of the numeric entity reference to a Unicode
>character is done by an XML module Maria is using to read the file. If
>so, you'd have to stop that conversion happening or amend your
>substitution statement accordingly.
>
>Also I'm not sure what "ASCII equivalent" you'd propose for Maria's
>dagger, ellipsis and Euro symbols :-)
>
>s/&#x2020;/+/sg; ?
>s/&#x2026;/.../sg; ?
>s/&#x20AC;/EUR/sg; ?
>
>Suppose her XML might contain asterism, per-mille and other punctuation?
>The list of substitutions would get rather cumbersome. Maria might
>struggle to think up or locate ASCII equivalents for some of these. The
>results may be rather ugly and confusing to readers.
>
>Why bother when XML, Perl, HTML, HTTP and web-browsers can all handle
>UTF-8 characters properly?
>
>Personally I'd not make ASCII substitutions; IE7 and Firefox 2 can render
>Unicode general punctuation symbols, whether encoded as numeric entities
>or as UTF-8 characters. I guess the same is true of most mainstream
>browsers.

I also forgot to tell you that Magpie actually has a PHP setting,
"define('MAGPIE_OUTPUT_ENCODING', 'UTF-8');", that will
resolve several of the common problems involving the conversion of
special characters, etc.
Thanks!

maria
