Regular expression 'c' modifier



Recently I saw a script of mine get crushed in a comparison with the
C version. The two main causes were using bigint, which I don't
really need ('use integer' could do the job), and using a normal /g
in a regex instead of the /gc modifier. I think that the
documentation of the c modifier is not very clear about its
importance. Here is a comparison:

#!/usr/bin/perl -W

use strict;
use Benchmark qw(cmpthese);

my $string = "aabc" x 8192;
my $i;

cmpthese(-3, {
         g  => sub { while ($string =~ /(a+)/g)  { $i = $1; } },
         gc => sub { while ($string =~ /(a+)/gc) { $i = $1; } },
});


          Rate        g       gc
g       379/s       --    -100%
gc 12105006/s 3192107%       --

Re: Regular expression 'c' modifier


Yes, matching 0 times in a string of length 2 is much faster than
matching 8192 times in a string of length 32768.

It is always suspicious if you get such a huge speedup and it is a good
idea to check that the new code is really equivalent to the old one.
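One way to run that check is to count how often each loop body executes on repeated calls. This is only a sketch: the helper names count_g and count_gc are made up for illustration, and the string is the one from the benchmark above.

```perl
use strict;
use warnings;

my $string = "aabc" x 8192;

# Hypothetical helpers: each returns how many times its loop matched.
sub count_g  { my $n = 0; $n++ while $string =~ /(a+)/g;  return $n; }
sub count_gc { my $n = 0; $n++ while $string =~ /(a+)/gc; return $n; }

my (@g, @gc);
push @g,  count_g()  for 1 .. 3;
pos($string) = 0;                 # start the /gc runs from the beginning
push @gc, count_gc() for 1 .. 3;

print "g:  @g\n";                 # g:  8192 8192 8192
print "gc: @gc\n";                # gc: 8192 0 0
```

If the two lines differ, the two subs are not doing the same work and the timings cannot be compared.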


Peter J. Holzer

Re: Regular expression 'c' modifier

* Peter J. Holzer wrote in comp.lang.perl.misc:

To elaborate on that, the pos() of a string is a property of the string,
and ordinarily the position is reset when a match fails. With 'c' the
position is not reset, so after the first full run of the loop
`substr $string, pos $string` is just 'bc', which does not match /(a+)/.
So regardless of how many times `cmpthese` calls the `gc` version, the
loop body is executed only during the first call unless something resets
the string position.
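
The behaviour of pos() described here can be seen on a single short string. The following is a minimal sketch; the variable names are chosen only for this illustration.

```perl
use strict;
use warnings;

my $s = "aabc";

# Plain /g: the final failed match resets pos() to undef.
1 while $s =~ /a+/g;
my $pos_g = pos($s);

# /gc: the final failed match leaves pos() where the last match ended.
1 while $s =~ /a+/gc;
my $pos_gc = pos($s);              # 2, just past "aa"
my $rest   = substr $s, pos $s;    # "bc", which /a+/ cannot match

# Resetting pos() by hand lets the /gc loop start over.
pos($s) = 0;
my $again = 0;
$again++ while $s =~ /a+/gc;

printf "pos after /g:  %s\n", defined $pos_g ? $pos_g : 'undef';
printf "pos after /gc: %d, rest: '%s', matches after reset: %d\n",
       $pos_gc, $rest, $again;
```

So a /gc loop that is meant to run more than once over the same string needs an explicit `pos($s) = 0` (or a fresh copy of the string) before each run.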

Björn Höhrmann

Re: Regular expression 'c' modifier

On 24/11/13 22:19, Bjoern Hoehrmann wrote:

My fault. Taking a longer string and doing only one pass, Time::HiRes
says that /gc is only slightly better than /g. The number of matches
is now the same for both, which was not the case with a sub inside
cmpthese.
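
For comparison, the cmpthese version can also be made fair by resetting the position at the start of every call. This is a sketch under assumptions: the `pos($string) = 0` line, the %matches counter and the fixed count of 100 calls are additions for illustration, not part of the original benchmark.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $string = "aabc" x 8192;
my $i;
my %matches = (g => 0, gc => 0);

my %tests = (
    g  => sub {
        while ($string =~ /(a+)/g)  { $i = $1; $matches{g}++; }
    },
    gc => sub {
        pos($string) = 0;    # assumption: restart the scan on every call
        while ($string =~ /(a+)/gc) { $i = $1; $matches{gc}++; }
    },
);

cmpthese(100, \%tests);      # fixed number of calls for both subs
print "matches: g=$matches{g} gc=$matches{gc}\n";
```

With the reset in place both subs should report the same number of matches, so the rates that cmpthese prints refer to the same amount of work.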

Fortunately I don't have to change anything in the original code, as
the results are as expected with or without the regex. Anyway, I now
want to know the difference between a regex loop like

while ($string =~ /(\d+)/gc) {
     $i = $1;
     #...
}

and the index/substr equivalent when $string contains digits with
only a single \n character between numbers. I have to look at the m
and s modifiers, too.


Re: Regular expression 'c' modifier

On 24/11/13 22:46, gamo wrote:

I get a somewhat better result with index/substr/length:

Time /gc = 3.466519 s.
Time ind = 3.106548 s.
Counters: 8388608, 8388608

with this code:

#!/usr/bin/perl -W

use strict;

my $string = "1123\n" x (8192 * 1024);
my $i;
my $n = chr(ord("\n"));
my ($c1, $c2);

use Time::HiRes qw(gettimeofday tv_interval);

my $t0 = [gettimeofday];
while ($string =~ /(\d+)/gc){
     $i = $1;
     $c1++ if ($i == 1123);
}
my $t1 = [gettimeofday];

my $j;
my $k=0;
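while ($k<length($string)){
     $j = index($string,$n,$k+1);      # position of the next newline
     $i = substr($string, $k, $j-$k);  # the number in front of it
     $k += length($i)+1;               # skip past the number and the newline
     $c2++ if ($i == 1123);
}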
my $t2 = [gettimeofday];

print "Time /gc = ", tv_interval($t0,$t1), " s.\n";
print "Time ind = ", tv_interval($t1,$t2), " s.\n";
print "Counters: $c1, $c2\n";


But it's rather awkward to use; it's not simple.

Best regards
