FAQ 6.22: How can I match strings with multibyte characters?


This message is one of several periodic postings to comp.lang.perl.misc
intended to make it easier for perl programmers to find answers to
common questions. The core of this message represents an excerpt
from the documentation provided with Perl.


6.22: How can I match strings with multibyte characters?

    Starting with Perl 5.6, Perl has had some level of multibyte character
    support. Perl 5.8 or later is recommended. Supported multibyte character
    repertoires include Unicode, and legacy encodings through the Encode
    module. See perluniintro, perlunicode, and Encode.
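    As a minimal sketch of the Encode route (assuming Perl 5.8 or later
    and UTF-8 input; the helper name matches_chars is hypothetical and
    not part of any module), decoding the octets first lets a pattern
    see characters rather than bytes:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode UTF-8 octets into a Perl character string,
# then test a pattern against the characters rather than the raw bytes.
sub matches_chars {
    my ($octets, $pattern) = @_;
    my $text = decode('UTF-8', $octets);
    return $text =~ $pattern ? 1 : 0;
}

# "\xE3\x81\x82" is U+3042 (HIRAGANA LETTER A) encoded as three UTF-8 octets.
my $octets = "\xE3\x81\x82";

# Against the raw octets, "exactly one character" fails (there are three).
print "raw octets: ", ($octets =~ /\A.\z/s ? "one" : "not one"), "\n";

# After decoding, the same pattern sees a single character.
print "decoded:    ", (matches_chars($octets, qr/\A.\z/s) ? "one" : "not one"), "\n";
```

    The same idea should carry over to legacy encodings by passing a
    different encoding name to decode().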

    If you are stuck with older Perls, you can do Unicode with the
    "Unicode::String" module, and character conversions using the
    "Unicode::Map8" and "Unicode::Map" modules. If you are using Japanese
    encodings, you might try using jperl 5.005_03.

    Finally, the following set of approaches was offered by Jeffrey Friedl,
    whose article in issue #5 of The Perl Journal talks about this very
    matter.

    Let's suppose you have some weird Martian encoding where pairs of ASCII
    uppercase letters encode single Martian letters (i.e. the two bytes "CV"
    make a single Martian letter, as do the two bytes "SG", "VS", "XX",
    etc.). Other bytes represent single characters, just like ASCII.

    So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the
    nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.
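    The byte and character counts can be checked with a short sketch.
    The helper name martian_chars is hypothetical; it simply treats two
    adjacent uppercase bytes as one character and any other byte as a
    character of its own:

```perl
use strict;
use warnings;

# Hypothetical helper: split a Martian byte string into its characters.
# A pair of uppercase ASCII letters is one character; failing that,
# any single byte (including a lone uppercase letter) is one character.
sub martian_chars {
    my ($bytes) = @_;
    my @chars = $bytes =~ m/([A-Z][A-Z]|.)/gs;
    return @chars;
}

my $martian = "I am CVSGXX!";
my @chars   = martian_chars($martian);

printf "%d bytes, %d characters\n", length($martian), scalar @chars;
print join('|', @chars), "\n";
```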

    Now, say you want to search for the single character "GX" using the
    pattern /GX/. Perl doesn't know about Martian, so it'll find the two
    bytes "GX" in the "I am CVSGXX!" string, even though that character
    isn't there: it just looks like it is because "SG" is next to "XX",
    but there's no real "GX". This is a big problem.
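    To see the false positive concretely, here is a small sketch (the
    helper name naive_offset is hypothetical) that reports where a
    plain byte-level match lands:

```perl
use strict;
use warnings;

# Hypothetical helper: byte offset of a naive /GX/ match, or -1 if none.
sub naive_offset {
    my ($bytes) = @_;
    return $bytes =~ /GX/ ? $-[0] : -1;
}

my $martian = "I am CVSGXX!";
my $offset  = naive_offset($martian);

# The match straddles two Martian characters: the "G" of "SG"
# and the first "X" of "XX".
print "naive /GX/ matched at byte offset $offset\n" if $offset >= 0;
```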

    Here are a few ways, all painful, to deal with it:

       $martian =~ s/([A-Z][A-Z])/ $1 /g; # Make sure adjacent ``martian''
                                          # bytes are no longer adjacent.
       print "found GX!\n" if $martian =~ /GX/;

    Or like this:

       @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
       # above is conceptually similar to:     @chars = $text =~ m/(.)/g;
       foreach $char (@chars) {
           # "print ..., last" would leave the loop before printing
           if ($char eq 'GX') { print "found GX!\n"; last }
       }

    Or like this:

       while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {  # \G probably unneeded
           if ($1 eq 'GX') { print "found GX!\n"; last }
       }

    Here's another, slightly less painful, way to do it from Benjamin
    Goldberg, who uses a zero-width negative look-behind assertion.

            print "found GX!\n" if  $martian =~ m/
                (?<![A-Z]) (?:[A-Z][A-Z])*? GX /x;

    This succeeds if the "martian" character GX is in the string, and fails
    otherwise. If you don't like using (?<!), a zero-width negative
    look-behind assertion, you can replace (?<![A-Z]) with (?:^|[^A-Z]).

    It does have the drawback of putting the wrong thing in $-[0] and $+[0],
    but this usually can be worked around.
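    One possible workaround for the offset problem, sketched here under
    the assumption that you only need the position of the character
    itself, is to wrap GX in a capturing group and read $-[1] and $+[1]
    instead. The helper name find_gx is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: byte offset of the Martian character GX, or -1.
sub find_gx {
    my ($martian) = @_;
    if ($martian =~ m/
            (?<![A-Z])          # not preceded by an uppercase byte
            (?:[A-Z][A-Z])*?    # skip whole Martian characters
            (GX)                # capture the character being sought
        /x) {
        # $-[0] is where the run of uppercase bytes starts;
        # $-[1] is where the captured GX starts.
        return $-[1];
    }
    return -1;
}

print find_gx("I am CVSGXX!"), "\n";   # no Martian GX here
print find_gx("I am CVGXSG!"), "\n";   # GX follows CV
```

    In the second string the overall match begins at the "C" of "CV",
    which is why $-[0] alone would report the wrong position.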


Documents such as this have been called "Answers to Frequently
Asked Questions" or FAQ for short.  They represent an important
part of the Usenet tradition.  They serve to reduce the volume of
redundant traffic on a news group by providing quality answers to
questions that keep coming up.

If you are somehow irritated by seeing these postings you are free
to ignore them or add the sender to your killfile.  If you find
errors or other problems with these postings please send corrections
or comments to the posting email address or to the maintainers as
directed in the perlfaq manual page.

Note that the FAQ text posted by this server may have been modified
from that distributed in the stable Perl release.  It may have been
edited to reflect the additions, changes and corrections provided
by respondents, reviewers, and critics to previous postings of
these FAQs. The complete text of these FAQs is available on request.

The perlfaq manual page contains the following copyright notice.


    Copyright (c) 1997-2002 Tom Christiansen and Nathan
    Torkington, and other contributors as noted. All rights
    reserved.

This posting is provided in the hope that it will be useful but
does not represent a commitment or contract of any kind on the part
of the contributors, authors or their agents.
