Posted by RedGrittyBrick on June 3, 2008, 7:18 am
hadzio@gmail.com wrote:
> Hi,
>
> I have the following issue. I have a directory with 25000 text files
> in it (about 1-10000 lines in each file). I have a perl script that
> generates some reports for me and this script needs to count the
> number of lines in each of these 25000 files (for each file I need a
> number of lines in it). And it is not so difficult, I iterate over the
> directory and count the number of lines using "wc -l" as follows:
>
> open (WCCOUNT, "cat $file_to_read | wc -l |");
> $file_number_of_lines = <WCCOUNT>;
> chomp($file_number_of_lines);
> close(WCCOUNT);
>
1) Useless use of cat
"cat $file_to_read | wc -l |" starts two processes (plus a shell etc.).
I'd use "wc -l $file_to_read |" instead (note that wc then prints the
filename after the count).
2) I suspect you are invoking this 25000 times. It would be much more
efficient to invoke it once, like this:
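    my %filelines;
    # one wc process for all files; the shell expands *.txt
    open(my $fh, '-|', 'wc -l *.txt')
        or die "can't open wc because $!";
    while (<$fh>) {
        chomp;
        # wc prints the count first, then the filename
        my ($lines, $filename) = split ' ', $_, 2;
        next if $filename eq 'total';   # summary line printed by wc
        $filelines{$filename} = $lines;
        # or do something else with $filename & $lines
        # to avoid iterating over a hash later
    }
    close $fh;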
Untested - caveat emptor.
> But the above sequential counting is very slow (2-3 hours). My server
> is quite powerful (72 CPU and fast filesystems) so I would like to
> run the counting in parallel (eg. counting in 72 files at the same
> time). So the questions are:
> 1) Is it possible to run the above command (cat ... | wc -l) in
> background & (the same way as in shell) and receive the returned
> results when it is finished.
Yes
> 2) Is it possible to implement 1) without threads?
Yes, you might use processes. In either case, use a limited size pool
(e.g. the 72 you suggested) and queueing. 25000 threads or 25000
processes would be silly.
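
A rough sketch of the limited-pool idea using only fork and pipes, in case it helps. The names parallel_line_counts and count_lines are made up for illustration, it counts lines in Perl rather than calling wc, and it assumes filenames contain no newlines. Each worker gets a fixed share of the files up front rather than pulling from a real queue, which is simpler but less even if file sizes vary a lot. Treat it as a starting point, not a finished tool.

```perl
use strict;
use warnings;
use POSIX ();

# Count lines in one file; returns undef if it cannot be opened.
# Unlike wc -l, a final line with no trailing newline is counted too.
sub count_lines {
    my ($path) = @_;
    open(my $in, '<', $path) or return;
    my $n = 0;
    $n++ while <$in>;
    close $in;
    return $n;
}

# Hypothetical helper: count lines of @$files using at most $workers
# child processes (assumes $workers >= 1). Returns a hashref
# filename => line count.
sub parallel_line_counts {
    my ($files, $workers) = @_;
    my %filelines;
    return \%filelines unless @$files;
    $workers = @$files if $workers > @$files;

    # Deal the files round-robin into one bucket per worker.
    my @buckets;
    push @{ $buckets[ $_ % $workers ] }, $files->[$_] for 0 .. $#$files;

    my @kids;
    for my $bucket (@buckets) {
        pipe(my $reader, my $writer) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                 # child: count and report
            close $reader;
            for my $file (@$bucket) {
                my $n = count_lines($file);
                print {$writer} "$n\t$file\n" if defined $n;
            }
            close $writer;               # flushes buffered results
            POSIX::_exit(0);             # skip the parent's END blocks
        }
        close $writer;                   # parent keeps only the read end
        push @kids, [ $pid, $reader ];
    }

    # Collect "count<TAB>filename" lines from each child, then reap it.
    for my $kid (@kids) {
        my ($pid, $reader) = @$kid;
        while (my $line = <$reader>) {
            chomp $line;
            my ($n, $file) = split /\t/, $line, 2;
            $filelines{$file} = $n;
        }
        close $reader;
        waitpid($pid, 0);
    }
    return \%filelines;
}
```

Something like parallel_line_counts([glob '*.txt'], 72) would then give you the whole hash in one call. Files that cannot be opened are simply missing from the result.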
> 3) The above code I may write using system() instead of open(), but
> the same issue is: how to do it in parallel.
There are CPAN modules for this, e.g. Parallel::ForkManager.
>
> Or maybe someone has any other better idea how to count number of
> lines in each of 25000 files? Maybe someone may recommend me some other
> solution. Thank you in advance.
I'd let wc do them all at once and see if that is fast enough. It will
certainly be faster than invoking cat and wc 25000 times.
If you are checking for file content changes I'd use stat instead to
check mtime, at least as a first step.
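
For the mtime check, something along these lines might do. changed_files and the %seen hash are names I made up; you would have to persist %seen between runs yourself (not shown), and mtime alone can miss a change made within the same second.

```perl
use strict;
use warnings;

# Hypothetical helper: return the files whose mtime differs from the
# value remembered in %$seen, updating %$seen as it goes.
sub changed_files {
    my ($files, $seen) = @_;
    my @changed;
    for my $file (@$files) {
        my $mtime = (stat($file))[9];   # element 9 of stat() is mtime
        next unless defined $mtime;     # skip files that cannot be stat'ed
        if (!exists $seen->{$file} || $seen->{$file} != $mtime) {
            push @changed, $file;
            $seen->{$file} = $mtime;
        }
    }
    return @changed;
}
```

Only the files it returns would then need their lines recounted.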
--
RGB