# fields separation - Page 2

## Re: fields separation

You never provided a complete description of the problem this code was
supposed to solve, and the actual problem kept changing its spots in line
with whatever was necessary to make your latest 'silly hack' shine.

Here's a hint why your latest still sucks, except when making suitable
assumptions about what the input data will be:

```perl
use Benchmark qw(cmpthese);

my $t = 'A' x 16;
my @a = ($t) x 512;
push(@a, 'bb');

cmpthese(-3, {
    half => sub {
        my $x;

        for (@a / 2 .. $#a) {
            $x .= $a[$_];
            $x =~ /(.*?)bb/ and last;
        }
    },

    quarter => sub {
        my $x;

        for (3 * @a / 4 .. $#a) {
            $x .= $a[$_];
            $x =~ /(.*?)bb/ and last;
        }
    },

    all => sub {
        my $x;

        for (0 .. $#a) {
            $x .= $a[$_];
            $x =~ /(.*?)bb/ and last;
        }
    },
});
```

You'll have to interpret that yourself this time.

## Re: fields separation

[...]

[...]

},

[...]

},

[...]

With apologies to everyone to whom this is self-evident but since there
are probably people to whom it isn't:

```perl
while (read $fh, my $rawdata, $buffsize)
{
    [...]
    my @fields = "$leftbehind$rawdata" =~ /(.*?)$sep/g;

    if (@fields)
    {
        $leftbehind = $'; #'
        # call the Callback function
    }
    else
    {
        $leftbehind .= $rawdata;
    }
}
```

This piece of 'applied ingenuity' has one glaring problem, namely, it
will scan the remainder again on the next iteration. This implies that
it will perform extremely badly on input which doesn't conform to the
assumption it has been 'designed' around, namely, that the average field
size is much smaller than the input block and that the regex will thus
spend most of its time finding fields instead of scanning data
already scanned during the last iteration. This effect is also
cumulative, e.g., for a token whose size is n * $buffsize, the total
amount of work will be proportional to

    $buffsize + 2 * $buffsize + 3 * $buffsize + ... + n * $buffsize
    = $buffsize * (1 + 2 + 3 + ... + n)
    = $buffsize * ((n * (n + 1)) / 2)
    = $buffsize * ((n**2 + n) / 2)
    = ($buffsize / 2) * (n**2 + n)

i.e., it will be proportional to the square of the number of blocks.
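The growth of the rescanning cost can be made concrete with a small sketch (Python here for brevity; the function names are illustrative, not from the original post):

```python
# Illustrative sketch (names are mine): count the characters the match
# examines under the two strategies.

def rescanning_work(n_blocks: int, buffsize: int) -> int:
    """Work done when the unmatched remainder is rescanned from the
    start on every read, as in the loop quoted above."""
    scanned = 0
    leftbehind = 0              # length of the remainder so far
    for _ in range(n_blocks):
        leftbehind += buffsize  # append the freshly read block
        scanned += leftbehind   # the match rescans the whole buffer
    return scanned

def incremental_work(n_blocks: int, buffsize: int) -> int:
    """Work done when only the newly read data is scanned."""
    return n_blocks * buffsize

# A token spanning n blocks costs buffsize * n * (n + 1) / 2 with
# rescanning, versus buffsize * n without:
print(rescanning_work(4, 10), incremental_work(4, 10))  # 100 40
```

The first function reproduces the arithmetic series above; the second shows the linear baseline it should be compared against.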

A Perl one-liner à la

    perl -e '$x = 61; print("A" x $x, "<*>"), $x *= 3 for (0 .. $ARGV[0])'

(the first parameter should be the number of tokens to generate)

could be used to create input conforming to George's
pseudo-specification which will effectively perform an algorithmic
complexity attack on this code (the constants have been chosen such that
the token size will always be relatively prime to the block size to
prevent the algorithm from 'catching up').

NB: This can obviously be avoided by reading the entire input into
memory and scanning it only once.
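It can also be avoided while keeping the streaming approach: remember how far the buffer has already been searched and resume just before the newly read data, so a separator straddling a block boundary is still found but old data is never rescanned. A hedged sketch (Python, all names invented, not the original poster's code):

```python
# Streaming splitter that tracks how far the buffer was already
# searched, so each character is examined a bounded number of times.

from io import BytesIO

def stream_fields(fh, sep, buffsize=4096):
    buf = b""
    searched = 0  # bytes of buf already searched for sep
    while True:
        raw = fh.read(buffsize)
        if not raw:
            break
        buf += raw
        # Resume just before the new data so a separator straddling
        # the block boundary is still found.
        pos = max(searched - (len(sep) - 1), 0)
        while (i := buf.find(sep, pos)) >= 0:
            yield buf[:i]               # a complete field
            buf = buf[i + len(sep):]
            pos = 0
        searched = len(buf)
    if buf:
        yield buf                       # trailing data, no separator

fh = BytesIO(b"aa<*>bbbb<*>" + b"C" * 100 + b"<*>d")
print([len(f) for f in stream_fields(fh, b"<*>", buffsize=7)])
# field lengths: [2, 4, 100, 1]
```

For a token spanning n blocks this does work proportional to n * $buffsize rather than the quadratic cost derived above, at the price of keeping the whole unfinished token in memory, which any correct solution has to do anyway.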