
Counting lines in big number of files - in parallel.

comp.lang.perl.misc
Posted by hadzio on June 3, 2008, 5:14 am
Hi,

I have the following issue. I have a directory with 25000 text files
in it (about 1-10000 lines in each file). I have a perl script that
generates some reports for me and this script needs to count the
number of lines in each of these 25000 files (for each file I need a
number of lines in it). And it is not so difficult, I iterate over the
directory and count the number of lines using "wc -l" as follows:

                open (WCCOUNT, "cat $file_to_read | wc -l |");
                $file_number_of_lines = <WCCOUNT>;
                chomp($file_number_of_lines);
                close(WCCOUNT);

But the above sequential counting is very slow (2-3 hours). My server
is quite powerful (72 CPUs and fast filesystems) so I would like to
run the counting in parallel (e.g. counting in 72 files at the same
time). So the questions are:
1) Is it possible to run the above command (cat ... | wc -l) in
background & (the same way as in shell) and receive the returned
results when it is finished?
2) Is it possible to implement 1) without threads?
3) The above code I may write using system() instead of open(), but
the same issue is: how to do it in parallel.

Or maybe someone has a better idea how to count the number of
lines in each of 25000 files? Maybe someone can recommend some other
solution. Thank you in advance.

Regards
Pawel

Posted by RedGrittyBrick on June 3, 2008, 7:18 am
hadzio@gmail.com wrote:
> Hi,
>
> I have the following issue. I have a directory with 25000 text files
> in it (about 1-10000 lines in each file). I have a perl script that
> generates some reports for me and this script needs to count the
> number of lines in each of these 25000 files (for each file I need a
> number of lines in it). And it is not so difficult, I iterate over the
> directory and count the number of lines using "wc -l" as follows:
>
>                 open (WCCOUNT, "cat $file_to_read | wc -l |");
>                 $file_number_of_lines = <WCCOUNT>;
>                 chomp($file_number_of_lines);
>                 close(WCCOUNT);
>

1) Useless use of cat
"cat $file_to_read | wc -l |" starts two processes (plus shell etc)
I'd use "wc -l $file_to_read"

2) I suspect you are invoking this 25000 times. It would be much more
efficient to invoke it once, like this:

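my %filelines;
open (my $fh, '-|', 'wc -l *.txt')
    or die "can't open wc because $!";
while (<$fh>) {
    chomp;
    # wc -l prints the count first, then the file name
    my ($lines, $filename) = split ' ', $_, 2;
    next if $filename eq 'total';   # skip wc's summary line
    $filelines{$filename} = $lines;
    # or do something else with $filename & $lines
    # to avoid iterating over a hash later
}
close $fh;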

Untested - caveat emptor.

> But the above sequential counting is very slow (2-3 hours). My server
> is quite powerful (72 CPUs and fast filesystems) so I would like to
> run the counting in parallel (e.g. counting in 72 files at the same
> time). So the questions are:
> 1) Is it possible to run the above command (cat ... | wc -l) in
> background & (the same way as in shell) and receive the returned
> results when it is finished?

Yes


> 2) Is it possible to implement 1) without threads?

Yes, you might use processes. In either case, use a limited size pool
(e.g. the 72 you suggested) and queueing. 25000 threads or 25000
processes would be silly.
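
A minimal sketch of such a pool using only fork and pipe from core Perl, with no threads and no CPAN modules. It assumes the file list is split into one batch per worker up front rather than queued dynamically, and that file names contain no newlines. The names count_lines and parallel_line_counts are hypothetical, not taken from any module.

```perl
use strict;
use warnings;
use POSIX ();

# Count lines in one file in pure Perl (hypothetical helper).
sub count_lines {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "can't open '$path': $!";
    my $n = 0;
    $n++ while <$fh>;
    close $fh;
    return $n;
}

# Deal @$files out into at most $workers batches, fork one child per
# batch, and read "count<TAB>name" records back over one pipe per child.
sub parallel_line_counts {
    my ($workers, $files) = @_;
    my @batches;
    my $i = 0;
    push @{ $batches[ $i++ % $workers ] }, $_ for @$files;

    my @readers;
    for my $batch (@batches) {
        pipe(my $r, my $w) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                  # child: count and report
            close $r;
            print {$w} count_lines($_), "\t", $_, "\n" for @$batch;
            close $w;                     # flushes the buffered output
            POSIX::_exit(0);              # skip END blocks and parent buffers
        }
        close $w;                         # parent keeps only the read end
        push @readers, $r;
    }

    # Pipes are drained one after another; a child whose pipe fills up
    # simply waits until the parent gets to it.
    my %count;
    for my $r (@readers) {
        while (my $line = <$r>) {
            chomp $line;
            my ($n, $name) = split /\t/, $line, 2;
            $count{$name} = $n;
        }
        close $r;
    }
    1 while wait() != -1;                 # reap all children
    return \%count;
}
```

With the numbers from the question this would be called as parallel_line_counts(72, [glob '*.txt']). Whether it beats a single wc run depends on the disks, not on the number of CPUs.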


> 3) The above code I may write using system() instead of open(), but
> the same issue is: how to do it in parallel.

There are CPAN modules for this.

>
> Or maybe someone has a better idea how to count the number of
> lines in each of 25000 files? Maybe someone can recommend some other
> solution. Thank you in advance.

I'd let wc do them all at once and see if that is fast enough. It will
certainly be faster than invoking cat and wc 25000 times.

If you are checking for file content changes I'd use stat instead to
check mtime, at least as a first step.
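
A minimal sketch of that first step, assuming the script keeps a hash of the mtimes it saw on the previous run. The name changed_files and the %seen cache are hypothetical. Only files whose mtime changed would then need recounting.

```perl
use strict;
use warnings;

# Return the paths whose mtime differs from the one remembered in
# %$seen, and update %$seen as a side effect.
sub changed_files {
    my ($seen, @paths) = @_;
    my @changed;
    for my $path (@paths) {
        my $mtime = (stat $path)[9];      # element 9 of stat() is mtime
        next unless defined $mtime;       # file vanished or is unreadable
        if (!exists $seen->{$path} || $seen->{$path} != $mtime) {
            push @changed, $path;
            $seen->{$path} = $mtime;
        }
    }
    return @changed;
}
```

mtime has one-second resolution on many filesystems, so a file rewritten twice within the same second would be missed. That is why this is only a first step.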

--
RGB

Posted by hadzio on June 3, 2008, 7:45 am
Hi,

Thank you for these remarks:

> 1) Useless use of cat
> "cat $file_to_read | wc -l |" starts two processes (plus shell etc)
> I'd use "wc -l $file_to_read"

My command returns a value in a format easier to process ;)

> 2) I suspect you are invoking this 25000 times. It would be much more
> efficient to invoke it once, like this:
>
> my %filelines;
> open (my $fh, '-|', 'wc -l *.txt')

Takes almost the same time. Invoking the command is not an issue compared
to the time spent counting lines.

> I'd let wc do them all at once and see if that is fast enough. It will
> certainly be faster than invoking cat and wc 25000 times.

No remarkable difference.

Regards
Pawel

Posted by A. Sinan Unur on June 3, 2008, 9:18 am
hadzio@gmail.com wrote in news:75c7fc7a-c0c9-4b39-afb1-5a0e42149446
@x35g2000hsb.googlegroups.com:

> Hi,
>
> Thank you for these remarks:
>
>> 1) Useless use of cat
>> "cat $file_to_read | wc -l |" starts two processes (plus shell etc)
>> I'd use "wc -l $file_to_read"
>
> My command returns a value in a format easier to process ;)

perldoc perlvar

HANDLE->input_line_number(EXPR)
$INPUT_LINE_NUMBER
$NR
$. Current line number for the last filehandle accessed.
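
A minimal sketch of counting lines with $. instead of shelling out, so the result is a bare number with nothing to parse. The name count_lines_dot is hypothetical.

```perl
use strict;
use warnings;

# Read the file to the end and return $. for its handle.
sub count_lines_dot {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "can't open '$path': $!";
    local $_;              # the read loop below assigns to $_
    1 while <$fh>;         # read and discard every line
    my $count = $.;        # save it before close, which resets $.
    close $fh;
    return $count;
}
```

One difference from wc -l: $. also counts a final line that lacks a trailing newline, while wc -l counts newline characters only.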

>> 2) I suspect you are invoking this 25000 times. It would be much more
>> efficient to invoke it once, like this:

I suspect running through each file and recording the final line number for
that file will be much faster. On the other hand, the files have to be read,
making this I/O bound. The number of CPUs you have is pretty much
irrelevant; what matters is the number of different physical hard drives
over which the files are spread.

If you try with a few line counters running in parallel, they may get in
each other's way because of contention for the same physical hard drive.

So, let's say, on average 5,000 lines per file, 80 characters per line and
25,000 files. That's roughly 10GB of data that have to be read for this
processing to be done.

You know, wc, at least on my system, can process multiple files at a time.

For 10,000 files with 1 - 10,000 lines of 80 characters each:

timethis wc -l file*.txt > linecounts.txt

TimeThis : Command Line : wc -l file*.txt

TimeThis : Start Time : Tue Jun 03 08:41:53 2008

TimeThis : End Time : Tue Jun 03 08:45:57 2008

TimeThis : Elapsed Time : 00:04:04.437


That was 4 GB in 4 minutes.
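
The arithmetic behind these figures, as a small sketch. It assumes an average of 5,000 lines per file, 80 bytes per line, decimal gigabytes, and, hypothetically, that the original poster's disks deliver about the same throughput as this test machine, which nothing in the thread establishes. The helper names are made up for illustration. Under those assumptions the test ran at roughly 16 MB/s, and the full 25,000-file job would take about 10 minutes.

```perl
use strict;
use warnings;

# Estimated bytes to read: files * average lines * bytes per line.
sub total_bytes {
    my ($files, $avg_lines, $bytes_per_line) = @_;
    return $files * $avg_lines * $bytes_per_line;
}

# Seconds needed to read $bytes at the throughput of a measured run.
sub projected_seconds {
    my ($bytes, $measured_bytes, $measured_seconds) = @_;
    return $bytes * $measured_seconds / $measured_bytes;
}

my $benchmark = total_bytes(10_000, 5_000, 80);   # the 10,000-file test
my $full_job  = total_bytes(25_000, 5_000, 80);   # the 25,000-file job
my $elapsed   = 4 * 60 + 4.437;                   # 00:04:04.437 from TimeThis

printf "throughput: %.1f MB/s\n", $benchmark / $elapsed / 1e6;
printf "projected:  %.1f minutes\n",
    projected_seconds($full_job, $benchmark, $elapsed) / 60;
```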

Can we speed that up?

My guess is no. At least not by much.

So it took 2-3 hours huh? Using 72 CPUs huh? Maybe you should have first
read the man page for wc.

Bummer.

Sinan

--
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/

Posted by nolo contendere on June 3, 2008, 9:33 am
On Jun 3, 7:45 am, had...@gmail.com wrote:
> Hi,
>
> Thank you for these remarks:
>
> > 1) Useless use of cat
> > "cat $file_to_read | wc -l |" starts two processes (plus shell etc)
> > I'd use "wc -l $file_to_read"
>
> My command returns a value in a format easier to process ;)
>
> > 2) I suspect you are invoking this 25000 times. It would be much more
> > efficient to invoke it once, like this:
>
> >     my %filelines;
> >     open (my $fh, '-|', 'wc -l *.txt')
>
> Takes almost the same time. Invoking the command is not an issue compared
> to the time spent counting lines.
>
> > I'd let wc do them all at once and see if that is fast enough. It will
> > certainly be faster than invoking cat and wc 25000 times.
>
> No remarkable difference.
>

Try something like this. It does what you ask, but I'm not sure if it does
what you want (i.e. I don't know if this will be faster than Sinan's
solution). You can test and let us know :-). I'm sure you can figure
out how to sum the numbers. Here I just print them.


#!/usr/bin/perl

use strict; use warnings;
use Parallel::ForkManager;

$|++;

# should be I/O bound, so num_cpus doesn't matter so much
# can tune this number
my $max_procs = 72;

my $pm = Parallel::ForkManager->new( $max_procs );

chomp( my $somedir = `pwd` );
opendir DIR, $somedir or die "can't opendir '$somedir': $!";
while ( my $f = readdir DIR ) {
    next if $f =~ m/^\.\.?$/;
    next if -d $f;
    $pm->start and next;   # parent: move on to the next file
    print `wc -l $f`;      # child: count the lines and print the result
    $pm->finish;           # child exits here
}
closedir DIR;
$pm->wait_all_children;


