FAQ 6.23 How can I match strings with multibyte characters?

This is an excerpt from the latest version perlfaq6.pod, which
comes with the standard Perl distribution. These postings aim to
reduce the number of repeated questions as well as allow the community
to review and update the answers. The latest version of the complete
perlfaq is at http://faq.perl.org .


6.23: How can I match strings with multibyte characters?

    Starting from Perl 5.6 Perl has had some level of multibyte character
    support. Perl 5.8 or later is recommended. Supported multibyte character
    repertoires include Unicode, and legacy encodings through the Encode
    module. See perluniintro, perlunicode, and Encode.
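
As a rough illustration of what that support looks like in practice, here is a minimal sketch that assumes the input arrives as UTF-8 encoded bytes. The helper name char_count is made up for this example; decode comes from the Encode module mentioned above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Return the number of characters (not bytes) in a UTF-8 encoded byte string.
# char_count is a hypothetical helper name used only for this illustration.
sub char_count {
    my ($bytes) = @_;
    my $text  = decode('UTF-8', $bytes);   # bytes -> Perl character string
    my @chars = $text =~ /(.)/gs;          # one list element per character
    return scalar @chars;
}

# "\xC3\xA0" is the two-byte UTF-8 encoding of U+00E0 (a with grave accent),
# "\xC3\xA9" is the two-byte UTF-8 encoding of U+00E9 (e with acute accent).
my $bytes = "\xC3\xA0 v\xC3\xA9r";
print length($bytes), " bytes, ", char_count($bytes), " characters\n";
```

Once the string is decoded, the regex engine works on characters, so /(.)/ matches one whole character no matter how many bytes encoded it.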

    If you are stuck with older Perls, you can do Unicode with the
    "Unicode::String" module, and character conversions using the
    "Unicode::Map8" and "Unicode::Map" modules. If you are using Japanese
    encodings, you might try using the jperl 5.005_03.

    Finally, the following set of approaches was offered by Jeffrey Friedl,
    whose article in issue #5 of The Perl Journal talks about this very
    matter.

    Let's suppose you have some weird Martian encoding where pairs of ASCII
    uppercase letters encode single Martian letters (i.e. the two bytes "CV"
    make a single Martian letter, as do the two bytes "SG", "VS", "XX",
    etc.). Other bytes represent single characters, just like ASCII.

    So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the
    nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.

    Now, say you want to search for the single character "/GX/". Perl
    doesn't know about Martian, so it'll find the two bytes "GX" in the "I
    am CVSGXX!" string, even though that character isn't there: it just
    looks like it is because "SG" is next to "XX", but there's no real "GX".
    This is a big problem.
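
To make the problem concrete, here is a small sketch. The helper name martian_chars is invented for this illustration; it splits the bytes into Martian characters by treating each uppercase pair as one character and any other byte as itself.

```perl
use strict;
use warnings;

# Split a "Martian" byte string into characters: an uppercase pair is one
# character, any other byte stands for itself.
# martian_chars is a hypothetical helper name, not part of the FAQ.
sub martian_chars {
    my ($bytes) = @_;
    return $bytes =~ /([A-Z][A-Z]|.)/gs;   # call in list context
}

my $martian = "I am CVSGXX!";
my @chars   = martian_chars($martian);

print length($martian), " bytes, ", scalar(@chars), " characters\n";

# The byte-level match succeeds although no GX character exists.
print "naive /GX/ matched (false positive)\n" if $martian =~ /GX/;

# The character-level check does not find GX.
print "character GX is really present\n" if grep { $_ eq 'GX' } @chars;
```

The byte-level match fires because the last byte of "SG" sits next to the first byte of "XX", while the character list contains no GX element.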

    Here are a few ways, all painful, to deal with it:

            # Make sure adjacent "martian" bytes are no longer adjacent.
            $martian =~ s/([A-Z][A-Z])/ $1 /g;

            print "found GX!\n" if $martian =~ /GX/;

    Or like this:

            @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
            # above is conceptually similar to:     @chars = $text =~ m/(.)/g;
            foreach $char (@chars) {
                    print "found GX!\n", last if $char eq 'GX';
            }

    Or like this:

            while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {  # \G probably unneeded
                    print "found GX!\n", last if $1 eq 'GX';
            }

    Here's another, slightly less painful, way to do it from Benjamin
    Goldberg, who uses a zero-width negative look-behind assertion.

            print "found GX!\n" if  $martian =~ m/
                    (?<![A-Z])
                    (?:[A-Z][A-Z])*?
                    GX
                    /x;

    This succeeds if the "martian" character GX is in the string, and fails
    otherwise. If you don't like using (?<!), a zero-width negative
    look-behind assertion, you can replace (?<![A-Z]) with (?:^|[^A-Z]).

    It does have the drawback of putting the wrong thing in $-[0] and $+[0],
    but this usually can be worked around.
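
One possible workaround for the offset problem, offered here as a sketch rather than as what the FAQ authors had in mind, is to capture the target character and read $-[1] instead of $-[0]. The helper name martian_index is hypothetical.

```perl
use strict;
use warnings;

# Return the byte offset of Martian character $char in $text, or -1.
# Same pattern as the look-behind approach, with the target captured so
# that $-[1] gives its true offset. martian_index is a made-up name.
sub martian_index {
    my ($text, $char) = @_;
    if ($text =~ m/ (?<![A-Z]) (?:[A-Z][A-Z])*? (\Q$char\E) /x) {
        return $-[1];   # start of the captured character, not of the whole match
    }
    return -1;
}

print martian_index("I am CVSGXX!", "GX"), "\n";   # no real GX: prints -1
print martian_index("I am CVGXSG!", "GX"), "\n";   # real GX: prints 7
```

In the second string the overall match starts at offset 5, where the run of uppercase pairs begins, which is why $-[0] alone would point at "CV" rather than at "GX".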


The perlfaq-workers, a group of volunteers, maintain the perlfaq. They
are not necessarily experts in every domain where Perl might show up,
so please include as much information as possible and relevant in any
corrections. The perlfaq-workers also don't have access to every
operating system or platform, so please include relevant details for
corrections to examples that do not work on particular platforms.
Working code is greatly appreciated.

If you'd like to help maintain the perlfaq, see the details in

Re: FAQ 6.23 How can I match strings with multibyte characters?

PerlFAQ Server wrote:
[...]


I am writing a piece of code that is neither for the USA (or any other
English-speaking country) nor for Mars, so I use neither ASCII nor a
Martian encoding. Since this software should work on Earth, it just
needs to handle the characters humans use, by which I mean Unicode
characters.

In practice, I would like to search for a sequence of Unicode characters
(independently of technical details such as the number of bytes a
character is encoded in) from a Perl script.
The searched sequences take a form such as
«à vér», with or without tilde/accent, and with a variable number
of spaces.

For instance, the following sequences should match:
-à vér
-à ver
-a ver
-a vér
-à     vér

To match those patterns I wrote both regular expressions in the Perl language:

This seems to work. However, when the script is obfuscated with the
Stunnix Perl obfuscator, the original Perl code is translated into the
following Perl code, which does not seem to work:

So, here is my question: is it possible to search for Unicode characters
from a Perl script obfuscated by the Stunnix tool?

Re: FAQ 6.23 How can I match strings with multibyte characters?


Looks like the Stunnix Perl obfuscator doesn't convert the \N notation
correctly. This should read:


(plus the missing parenthesis, of course). Report the bug to stunnix.
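
Independently of the obfuscator, here is a hedged sketch of one way to match the sequences from the question without putting any non-ASCII character or \N name in the source. It is not the poster's original code. It assumes the input is already a Perl character string, uses numeric \x{...} escapes for the test data, and relies on Unicode::Normalize to strip accents before matching. The helper name matches_a_ver is made up, and whether a given obfuscator preserves these escapes is not something this sketch can vouch for.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Return 1 if $text contains "a", one or more whitespace characters, then
# "ver", ignoring accents. matches_a_ver is a hypothetical helper name.
sub matches_a_ver {
    my ($text) = @_;
    my $plain = NFD($text);      # decompose: "\x{E0}" becomes "a" + U+0300
    $plain =~ s/\p{Mn}+//g;      # drop the combining marks
    return $plain =~ /a\s+ver/ ? 1 : 0;
}

# \x{E0} is a with grave accent, \x{E9} is e with acute accent.
for my $s ("\x{E0} v\x{E9}r", "\x{E0} ver", "a ver", "a v\x{E9}r",
           "\x{E0}     v\x{E9}r") {
    print matches_a_ver($s) ? "match\n" : "no match\n";
}
```

Normalizing first keeps the pattern itself plain ASCII, at the cost of also accepting other accents on the same base letters.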


Re: FAQ 6.23 How can I match strings with multibyte characters?



If you think you should use Stunnix, check out my book Mastering Perl
where I show everyone how to defeat it.

Re: FAQ 6.23 How can I match strings with multibyte characters?

brian d foy wrote:
[...]

Thanks for this information.

The "Don't be evil" company (NASDAQ: GOOG, LSE: GGEA) let me read some
pages of this book («Mastering Perl»).

From a technical point of view, this book confirms what I thought and
what you claim: the Stunnix Perl obfuscator has some technical
However, it might remain useful, as it looks like obfuscated symbol
names are not technically reversible.

I never said this tool should or might be used. I just said I have to
work around its unpredictable bugs and limitations.

From a legal point of view, I assume the program remains protected, as
the legislation of some Western countries and some licenses might forbid
reverse engineering of software one does not own.

Re: FAQ 6.23 How can I match strings with multibyte characters?



Well, you don't get the original names back, but there are ways to make
the names readable, which I also showed, as I recall.

The point is to not waste your time. For Stunnix to be useful, your
code has to be hard to follow to start with. If you're using good
programming techniques with short scopes, minimally-tasked subroutines,
and so on, changing variable names is just a speed bump. If you don't
want to use good practices, well, do whatever you like then. :)
