Click here to get back home

cockroach race: grep for characters in any order

 HomeNewsGroups | Search | About
 comp.lang.perl.misc    Post an article   get this group's latest topics as an RSS feed add this group's latest topics to your My MSN content add this group's latest topics to your My Yahoo content
Subject Author Date
cockroach race: grep for characters in any order Ben Bullock 06-19-2008
Posted by Ben Bullock on June 19, 2008, 8:10 am
Please log in for more thread options
Following on from the discussion about finding all of a set of
characters in a string, here is a "cockroach race" I've made to see
which solution is faster. I ran this on Perl 5.10, so you might get
different results with some other version.

#!/usr/local/bin/perl
use warnings;
use strict;
use List::MoreUtils qw/all/;

sub bullock
{
my ($s, @chars) = @_;
my $anychar = join '', @chars;
my $matchany = join '.*',("[$anychar]") x @chars;
if ($s =~ /$matchany/) {
        my $copy = $s;
        for my $c (@chars) {
         return 0 unless $copy =~ s/$c//g;
        }
        return 1;
}
return 0;
}

sub lalli
{
my ($s, @chars) = @_;
return all { $s =~ /$_/ } @chars;
}

sub jackman
{
my ($s, @chars) = @_;
my @re;
for (@chars) {
        my $re = $_;
        for my $c (@chars) {
         next if $c eq $_;
         $re .= "(?=.*$c)";
        }
        push @re, $re;
}
my $re = join '|', @re;
return $s =~ /$re/;
}

sub j_index
{
my ($s, @chars) = @_;
my $matched = 1;
for my $char (@chars) {
        return 0 if index($s, $char) == -1;
}
return 1;
}

sub j_b
{

my ($s, @chars) = @_;
my $re;
for my $c (@chars) {
        $re .= "(?=.*$c)";
}
return $s =~ /$re/;
}

sub j_b2
{

my ($s, @chars) = @_;
my $re = '^';
for my $c (@chars) {
        $re .= "(?=.*$c)";
}
return $s =~ /$re/;
}

use Benchmark qw( cmpthese );

sub comparethem
{
my ($text, $chars, $ret) = @_;
my %chars = map {$_ => 1} split ('', $chars);
my @chars = sort keys %chars;
my $count = 100000;
cmpthese $count,
{
'bullock' => sub {bullock($text,@chars) == $ret or die '1'},
'lalli' => sub {lalli($text,@chars) == $ret or die '2'},
'jackman' => sub {jackman($text,@chars) == $ret or die '3'},
'j_index' => sub {j_index($text,@chars) == $ret or die '4'},
'j_b' => sub {j_b($text,@chars) == $ret or die '5'},
'j_b2' => sub {j_b($text,@chars) == $ret or die '6'},
};
}

comparethem ("naninuneno", 'aiueo', 1);
comparethem ("naninuneni", 'aiueo', 0);
comparethem ("naninuneni", 'abcdefghijklmnopqrstuvwxyz', 0);
comparethem ('abcdefghijklmnopqrstuvwxyz'x100, 'aeiou', 1);

__END__

The clear winner on all four tests is Glenn Jackman's version
using "index":

Rate bullock jackman lalli j_b j_b2 j_index
bullock 20000/s -- -17% -41% -74% -74% -84%
jackman 24213/s 21% -- -29% -68% -68% -80%
lalli 34014/s 70% 40% -- -55% -55% -72%
j_b 75758/s 279% 213% 123% -- -0% -38%
j_b2 75758/s 279% 213% 123% 0% -- -38%
j_index 121951/s 510% 404% 259% 61% 61% --
Rate jackman bullock lalli j_b j_b2 j_index
jackman 21053/s -- -10% -37% -72% -72% -85%
bullock 23419/s 11% -- -30% -69% -69% -83%
lalli 33670/s 60% 44% -- -56% -56% -76%
j_b 75758/s 260% 223% 125% -- -0% -45%
j_b2 75758/s 260% 223% 125% 0% -- -45%
j_index 138889/s 560% 493% 313% 83% 83% --
Rate jackman j_b2 j_b lalli bullock j_index
jackman 1386/s -- -93% -94% -95% -95% -97%
j_b2 20921/s 1409% -- -4% -18% -25% -52%
j_b 21786/s 1472% 4% -- -15% -22% -50%
lalli 25641/s 1750% 23% 18% -- -8% -42%
bullock 27855/s 1910% 33% 28% 9% -- -36%
j_index 43860/s 3064% 110% 101% 71% 57% --
Rate bullock jackman lalli j_b2 j_b j_index
bullock 3557/s -- -78% -86% -89% -89% -96%
jackman 16234/s 356% -- -35% -50% -50% -83%
lalli 25063/s 605% 54% -- -23% -24% -74%
j_b2 32680/s 819% 101% 30% -- -0% -66%
j_b 32787/s 822% 102% 31% 0% -- -66%
j_index 95238/s 2577% 487% 280% 191% 190% --

However, second place is not so clear. Despite what Bart Lateur
thought, there is no difference between the performance of the
anchored and unanchored regexes (j_b and j_b2 above). The solution I
posed initially fails badly on a long input string. The regex solution
posted by Glenn Jackman fails particularly badly on a long list of
characters, because of the O(n^2) size of the regex. However, the
thing I initially posted does miles better with a long input string,
coming a close second.



Posted by Ben Bullock on June 19, 2008, 8:54 am
Please log in for more thread options
On Thu, 19 Jun 2008 12:10:34 +0000, Ben Bullock wrote:

> 'j_index' => sub {j_index($text,@chars) == $ret or die '4'},
> 'j_b' => sub {j_b($text,@chars) == $ret or die '5'},
> 'j_b2' => sub {j_b($text,@chars) == $ret or die '6'},
^
2

Ooops!


> However, second place is not so clear. Despite what Bart Lateur
> thought, there is no difference between the performance of the
> anchored and unanchored regexes (j_b and j_b2 above).

As Bart Lateur thought, adding the anchor ^ increased the speed quite a
bit, especially with the negative matches (the second and third ones):

Rate j_b j_b2 j_index
j_b 72993/s -- -7% -42%
j_b2 78740/s 8% -- -37%
j_index 125000/s 71% 59% --
Rate j_b j_b2 j_index
j_b 70423/s -- -11% -49%
j_b2 79365/s 13% -- -43%
j_index 138889/s 97% 75% --
Rate j_b j_b2 j_index
j_b 20704/s -- -18% -54%
j_b2 25126/s 21% -- -45%
j_index 45455/s 120% 81% --
Rate j_b j_b2 j_index
j_b 31348/s -- -3% -62%
j_b2 32468/s 4% -- -61%
j_index 83333/s 166% 157% --


Posted by Paul Lalli on June 19, 2008, 9:54 am
Please log in for more thread options
> Following on from the discussion about finding all of a set of
> characters in a string, here is a "cockroach race" I've made to see
> which solution is faster. I ran this on Perl 5.10, so you might get
> different results with some other version.

[...]

> However, second place is not so clear. Despite what Bart Lateur
> thought, there is no difference between the performance of the
> anchored and unanchored regexes (j_b and j_b2 above). The solution I
> posed initially fails badly on a long input string. The regex solution
> posted by Glenn Jackman fails particularly badly on a long list of
> characters, because of the O(n^2) size of the regex. However, the
> thing I initially posted does miles better with a long input string,
> coming a close second.

Added lalli_idx to use index() rather than pattern matches in my
suggestion (and corrected the j_b/j_b2 typo):

sub lalli_idx
{
my ($s, @chars) =3D @_;
return all { index($s, $_) !=3D -1 } @chars;
}

Running on 5.8.4 for solaris:


Rate bullock lalli jackman lalli_idx j_b2
j_b j_index
bullock 14599/s -- -15% -23% -76% -78%
-78% -85%
lalli 17212/s 18% -- -9% -71% -74%
-74% -82%
jackman 19011/s 30% 10% -- -68% -71%
-72% -81%
lalli_idx 60241/s 313% 250% 217% -- -8%
-10% -39%
j_b2 65789/s 351% 282% 246% 9% --
-2% -33%
j_b 67114/s 360% 290% 253% 11% 2%
-- -32%
j_index 98039/s 572% 470% 416% 63% 49%
46% --
Rate jackman bullock lalli j_b lalli_idx
j_b2 j_index
jackman 17452/s -- -2% -13% -67% -72%
-74% -85%
bullock 17857/s 2% -- -11% -66% -71%
-73% -84%
lalli 20040/s 15% 12% -- -62% -67%
-70% -83%
j_b 52910/s 203% 196% 164% -- -14%
-21% -54%
lalli_idx 61350/s 252% 244% 206% 16% --
-9% -47%
j_b2 67114/s 285% 276% 235% 27% 9%
-- -42%
j_index 114943/s 559% 544% 474% 117% 87%
71% --
Rate jackman j_b lalli j_b2 bullock
lalli_idx j_index
jackman 1200/s -- -94% -94% -95% -96%
-97% -98%
j_b 20877/s 1640% -- -2% -6% -37%
-39% -57%
lalli 21322/s 1677% 2% -- -4% -36%
-38% -56%
j_b2 22321/s 1760% 7% 5% -- -33%
-35% -54%
bullock 33223/s 2669% 59% 56% 49% --
-4% -31%
lalli_idx 34483/s 2774% 65% 62% 54% 4%
-- -29%
j_index 48309/s 3926% 131% 127% 116% 45%
40% --
Rate bullock jackman j_b2 j_b lalli
lalli_idx j_index
bullock 2818/s -- -75% -83% -83% -83%
-95% -97%
jackman 11086/s 293% -- -31% -32% -33%
-79% -87%
j_b2 16129/s 472% 45% -- -1% -3%
-70% -81%
j_b 16234/s 476% 46% 1% -- -2%
-70% -81%
lalli 16556/s 488% 49% 3% 2% --
-69% -80%
lalli_idx 53763/s 1808% 385% 233% 231% 225%
-- -37%
j_index 84746/s 2908% 664% 425% 422% 412%
58% --



Similar ThreadsPosted
[regex] grep for chars in any order June 18, 2008, 3:06 am
perlipc doc / another race (?) June 27, 2005, 1:33 pm
Minimizing chance of race conditions November 15, 2004, 9:17 pm
Adding a delimiter inbetween number characters and letter characters January 12, 2005, 9:25 am
What does 'grep -M' do? April 14, 2007, 8:42 pm
Grep Help November 16, 2007, 12:56 am
Grep -v option November 4, 2004, 3:05 pm
using grep and diff May 10, 2005, 10:13 am
Grep in parts of an XML June 23, 2005, 12:48 am
Need help with grep function April 28, 2006, 11:13 am

Our other projects:

Art Dolls, Fairies and Mermaids - Sunnyfaces.net

Roy's Linux, Programming and Search Engines messages

1-Script XML SitemapXML Sitemap