
polymorphic regex -- encoding issue

Posted by Dale on October 18, 2007, 4:28 am
Consider the following:

my $html_string = get "http://stock.narod.ru/fibo.htm";
my $russian_page = decode("cp1251", $html_string);
while ($russian_page =~ m/(Фибоначчи)\s+\b(\w+)/g) {
    print "$1 $2\n";
}

I get a CP1251-encoded page from a Russian site and search for words
that might follow the word Фибоначчи (Fibonacci). But isn't this bit
of code inefficient? I start right off by decoding the whole page,
where I really only need to have decoded those portions of the page
that match. So wouldn't it be better to encode the regex in CP1251 to
do the matching, and then convert any matched strings to the encoding
I want before printing them out? Something like the following:

$russian_page = get "http://stock.narod.ru/fibo.htm";
my $search_word = encode("cp1251", "Фибоначчи");
while ($russian_page =~ m/($search_word)\s+(\w+)/g) {
    print decode("cp1251", "$1 $2\n");
}

This doesn't obviously fail, but it doesn't give the expected result
either. Presumably, the problem is that I've only encoded part of my
regex in CP1251. So the question is: Is there a way to change the
encoding of a regular expression?
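
One possible explanation, offered as a sketch rather than a confirmed diagnosis: on an undecoded byte string, \w normally matches only ASCII word characters (unless "use locale" or Unicode semantics are in effect), so it never matches the CP1251 bytes of a Russian word. The example below assumes that is the cause. It matches at the byte level by spelling out the CP1251 Cyrillic letter bytes (0xC0-0xFF, plus 0xA8 and 0xB8 for Ё and ё) and decodes only the matches. The sample text in $page_bytes and the names $word_bytes, $letter and @pairs are hypothetical stand-ins, used so the example runs without fetching the page.

```perl
use strict;
use warnings;
use utf8;                          # string literals in this file are UTF-8 text
use Encode qw(encode decode);

# Hypothetical stand-in for the fetched page: raw CP1251 bytes, no network needed.
my $page_bytes = encode("cp1251",
    "Числа Фибоначчи растут быстро. Фибоначчи жил в Пизе.");

# Encode the search word into the page's encoding. quotemeta escapes every
# byte that is not an ASCII word character, so the bytes are matched literally.
my $word_bytes = quotemeta encode("cp1251", "Фибоначчи");

# Byte-level replacement for \w: ASCII word characters plus the CP1251
# Cyrillic letters (0xC0-0xFF, and 0xA8 / 0xB8 for Ё / ё).
my $letter = qr/[A-Za-z0-9_\xC0-\xFF\xA8\xB8]/;

# Match on the raw bytes and decode only the matched portions.
my @pairs;
while ($page_bytes =~ m/($word_bytes)\s+($letter+)/g) {
    push @pairs, decode("cp1251", "$1 $2");
}

binmode(STDOUT, ":utf8");
print "$_\n" for @pairs;
```

Whether this is actually cheaper than decoding the whole page first is not something the sketch demonstrates; it only shows that the match itself can be done on the encoded bytes once \w is replaced by an explicit byte class.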

A couple details:

Perl version:
5.8.8

Pragmas and modules used:
LWP::Simple
utf8;
Encode;
binmode(STDOUT, ":utf8");

