Click here to get back home

Matching multiple subexpressions in a regular expression

 HomeNewsGroups | Search | About
 comp.lang.perl.misc    Post an article   get this group's latest topics as an RSS feed add this group's latest topics to your My MSN content add this group's latest topics to your My Yahoo content
Subject Author Date
Matching multiple subexpressions in a regular expression ShaunJ 03-11-2008
Posted by ShaunJ on March 11, 2008, 7:49 pm
Please log in for more thread options
If more than one subepxression of a regex could have matched the
string, is it possible to find all the subexpressions that could have
matched? Example...

my $re = qr/([AC]CT)|(AC[CT])/;
'CCT' =~ m/$re/;
print join ',', @-; print "\n";
'ACC' =~ m/$re/;
print join ',', @-; print "\n";
'ACT' =~ m/$re/;
print join ',', @-; print "\n";

Output:
0,0
0,,0
0,0

This shows that for...
CCT: the first subexpression matched
ACC: the second subexpression matched
ACT: the first subexpression matched

However, ACT matched both subexpressions! The ideal result for ACT
would be...
0,0,0
showing that both subexpressions matched. Is this possible without
having to split each subexpression into its own regular expression? My
understanding -- please correct me if I'm wrong -- is that one big
regular expression will run faster than 100 little ones, since the one
big regular expression can be compiled into a single large finite-
state-machine that is more efficient than running 100 little FSM.

Thanks!
Shaun

Posted by John W. Krahn on March 11, 2008, 9:13 pm
Please log in for more thread options
ShaunJ wrote:
> If more than one subepxression of a regex could have matched the
> string, is it possible to find all the subexpressions that could have
> matched? Example...
>
> my $re = qr/([AC]CT)|(AC[CT])/;
> 'CCT' =~ m/$re/;
> print join ',', @-; print "\n";
> 'ACC' =~ m/$re/;
> print join ',', @-; print "\n";
> 'ACT' =~ m/$re/;
> print join ',', @-; print "\n";
>
> Output:
> 0,0
> 0,,0
> 0,0
>
> This shows that for...
> CCT: the first subexpression matched
> ACC: the second subexpression matched
> ACT: the first subexpression matched
>
> However, ACT matched both subexpressions! The ideal result for ACT
> would be...
> 0,0,0
> showing that both subexpressions matched.

Perl's regular expressions can't do that. They always stop after a
successful match so either ([AC]CT) would match or (AC[CT]) would match
but never both.

> Is this possible without
> having to split each subexpression into its own regular expression? My
> understanding -- please correct me if I'm wrong -- is that one big
> regular expression will run faster than 100 little ones,

Your assumption is incorrect.

> since the one
> big regular expression can be compiled into a single large finite-
> state-machine that is more efficient than running 100 little FSM.

That question is answered in perlfaq6:

perldoc -q "How do I efficiently match many regular expressions at once"



John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall

Posted by ShaunJ on March 12, 2008, 4:25 pm
Please log in for more thread options
...
> > since the one
> > big regular expression can be compiled into a single large finite-
> > state-machine that is more efficient than running 100 little FSM.
>
> That question is answered in perlfaq6:
>
> perldoc -q "How do I efficiently match many regular expressions at once"

Hi John,

If I structure my program as in the example, using many small regex
instead of one big regex, Perl 5.8.6 runs out of memory and dies:
vm_allocate failed, Out of memory! I have 400'000 regex of exactly 27
characters each, and the input string is one line 100 kB long. The
machine has 2 GB of memory and free disk space, which should be
enough, so I presume the code is somehow leeking memory. It's only a
dozen or so lines long, so I've posted my code below. Can you see an
obvious leak?

Thanks,
Shaun

my @restrings = <REFILE>;
my @re = map { qr/$_/x } @restrings;
while (<>) {
        my $i = 0;
        foreach my $re (@re) {
                $i++;
                pos = 0;
                while (m/$re/g) {
                        print $i, "\t",
                                $LAST_MATCH_START[0] + 1, "\t",
                                $&, "\n";
                        pos = $LAST_MATCH_START[0] + 1;
                }
        }
}

Posted by Frank Seitz on March 12, 2008, 10:40 pm
Please log in for more thread options
ShaunJ wrote:

> If I structure my program as in the example, using many small regex
> instead of one big regex, Perl 5.8.6 runs out of memory and dies:
> vm_allocate failed, Out of memory! I have 400'000 regex of exactly 27
> characters each, and the input string is one line 100 kB long.
[...]
> my @restrings = <REFILE>;
> my @re = map { qr/$_/x } @restrings;

400'000 precompiled regexes are quite a lot!
Why don't you read and create them in chunks of, say, 1000?

Frank
--
Dipl.-Inform. Frank Seitz; http://www.fseitz.de/
Anwendungen für Ihr Internet und Intranet
Tel: 04103/180301; Fax: -02; Industriestr. 31, 22880 Wedel

Posted by ShaunJ on March 13, 2008, 12:57 pm
Please log in for more thread options
> ShaunJ wrote:
> > If I structure my program as in the example, using many small regex
> > instead of one big regex, Perl 5.8.6 runs out of memory and dies:
> > vm_allocate failed, Out of memory! I have 400'000 regex of exactly 27
> > characters each, and the input string is one line 100 kB long.
> [...]
> > my @restrings = <REFILE>;
> > my @re = map { qr/$_/x } @restrings;
>
> 400'000 precompiled regexes are quite a lot!
> Why don't you read and create them in chunks of, say, 1000?

Hi Frank,

400'000 * 27 = 12 MB. I wouldn't say it's that large. It seems
reasonable that the compiled regex would fit in less than one GB of
memory, which is my only requirement.

In any case, the following 9-line code snippet burns through 100MB of
memory a second using Perl 5.8.6! Certainly a memory leak. The only
explanation I can think of is if the m/$re/g expression were
recompiling the regex every time and the previously compiled regex
weren't being discarded.

Thanks,
Shaun

my @restrings = <REFILE>;
my @re = map { qr/$_/x } @restrings;
while (<>) {
        foreach my $re (@re) {
                while (m/$re/g) {
                        print $LAST_MATCH_START[0], "\n";
                }
        }
}

Similar ThreadsPosted
matching HTML expression across multiple lines May 10, 2006, 4:31 pm
regular expression, matching sub item January 30, 2006, 6:04 pm
matching a complicated url in a regular expression January 13, 2007, 12:32 pm
matching chunks of data with a regular expression August 26, 2004, 7:02 am
Regular Expression check for non matching string September 22, 2005, 8:27 am
Variable interpolation and m in regular expression matching January 22, 2008, 8:42 am
Assigning a value to pos to control regular expression matching May 13, 2008, 5:15 pm
Regular expression for matching words containing underscore _ character December 12, 2007, 10:27 am
Help with Pattern matching. Matching multiple lines from while reading from a file. May 2, 2007, 11:37 pm
matching multiple ip addresses May 4, 2005, 11:58 am

Our other projects:

Art Dolls, Fairies and Mermaids - Sunnyfaces.net

Roy's Linux, Programming and Search Engines messages

1-Script XML SitemapXML Sitemap