I want the fastest way to grab the fields from a huge semicolon-delimited file.
The fields may or may not be quoted. Currently I use the following:

use strict;
use warnings;

my $regex_split   = qr/^(.*?);(.*?);(.*?);(.*?)$/o;
my $regex_dequote = qr/^"([^"\\]++|\\.)*+"$/o;

while (<DATA>) {
    my @col = $_ =~ $regex_split or die;
    s/$regex_dequote/$1/ foreach @col;
    print "@col\n";
}


Re: fields


Obviously there are additional restrictions for your data, otherwise
your solution below would yield wrong results.


So, obviously there are always 4 fields. You forgot to mention that in
your specification.
And the data itself never contains a semicolon. You forgot to mention
that, too.


This looks overly complicated but maybe I am just missing the forest for
the trees. Wouldn't a simple    
    my $regex_dequote = qr/^"(.*)"$/o;
work just as well?
Remember: REs are expensive, so keep them as simple as possible.


Given those additional restrictions a simple
    my @col = split /;/ , $_;
will probably be faster because the RE is so much simpler.
If you insist on testing for exactly 4 fields then you can just check
the number of elements in @col.
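
A minimal sketch of that suggestion, assuming the stated restrictions hold (exactly 4 fields, no semicolon inside the data). The sample line is hypothetical, since the real data is not shown in the thread:

```perl
use strict;
use warnings;

# Hypothetical sample line; the real input is not shown in the thread.
my $line = qq{"alpha";beta;"gamma";delta\n};

chomp $line;

# A limit of -1 keeps trailing empty fields, so the count stays reliable.
my @col = split /;/, $line, -1;
die "expected 4 fields, got " . scalar(@col) . "\n" unless @col == 4;

# Strip one pair of surrounding double quotes where present.
s/^"(.*)"$/$1/ for @col;

print "@col\n";
```

Whether this beats the single-regex version is something only a benchmark on the real data can settle.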

However, in any case: I doubt that splitting the file into its lines and
individual fields is actually the bottleneck. Reading the file from an
external device like a HD is probably much slower than such simple text
processing.
And I would always value correctness above speed and therefore use one
of the tried and time-tested Text::CSV modules.
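
For illustration, a sketch of what that could look like with Text::CSV from CPAN, assuming the module is installed. The sample line is hypothetical; it includes a quoted semicolon to show a case the plain split would get wrong:

```perl
use strict;
use warnings;
use Text::CSV;

# sep_char switches the separator from the default comma to a semicolon.
my $csv = Text::CSV->new({ binary => 1, sep_char => ';' })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag;

# Hypothetical sample line; the third field contains a quoted semicolon.
my $line = q{"alpha";beta;"semi;colon";delta};

$csv->parse($line) or die "parse failed: " . $csv->error_diag;
my @col = $csv->fields;

print join('|', @col), "\n";
```

For a real file, reading records with `$csv->getline($fh)` instead of parsing line by line also copes with quoted fields that contain newlines.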



Re: fields

On Monday, 19 January 2015 at 4:15:31 AM UTC+2, Jürgen Exner wrote:

1) Actually there are a lot more fields; this is only the core info.
2) My regex always gives correct results (but it is slow).
3) In my measurements, split is (much) slower than the regex.
4) Your regex gives wrong results, e.g.

   my ($var) = '"hello"er"' =~ /^"(.*)"$/;
   print $var;
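
A small sketch that puts the two dequoting patterns side by side. The strict pattern here is an assumed reading of the original one, with the backslash escapes written out and the whole field content captured; `dequote` is a hypothetical helper name:

```perl
use strict;
use warnings;

# The simple pattern suggested in the reply.
my $greedy = qr/^"(.*)"$/;

# Assumed reading of the original pattern: no bare quote inside the field,
# backslash escapes allowed, possessive quantifiers (needs Perl 5.10+).
my $strict = qr/^"((?:[^"\\]++|\\.)*+)"$/;

# Hypothetical helper: returns the captured content, or undef on no match.
sub dequote {
    my ($re, $field) = @_;
    my ($inner) = $field =~ $re;
    return $inner;
}

for my $field ('"hello"', '"hello"er"', 'plain') {
    my $g = dequote($greedy, $field);
    my $s = dequote($strict, $field);
    printf "%-12s greedy=%-12s strict=%s\n",
        $field,
        defined $g ? $g : '(no match)',
        defined $s ? $s : '(no match)';
}
```

The greedy pattern strips the outer quotes from `"hello"er"`, while the strict one refuses to match it, so the field would be left untouched by the substitution.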


Re: fields


Didn't we already have this last time? An algorithm which is
more tuned to the actual data in question will be faster than one
designed to do well for more general cases. E.g., for the information
below, I doubt that 'a solution' (in Perl) can be much faster than
print <<T

It won't work for any other input, but "there ain't no such thing as a
free lunch".



The /o is pointless here as the qr// is evaluated only once,
anyway. Further, as already determined in the past, even when actually
interpolating something into the regex, qr// isn't particularly fast. Using
it is absolutely ridiculous for static regexes:

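use Benchmark qw(cmpthese);

my $a = 'ab' x 15;
my $re = qr/bab$/;

# The cmpthese() call and its count were lost from the post; -3 (run each
# case for at least 3 CPU seconds) is an assumed value.
cmpthese(-3, {
       qr => sub {
           $a =~ /$re/;
       },

       re => sub {
           $a =~ /bab$/;
       },
});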
