Posted by xhoster on June 3, 2008, 12:00 pm
hadzio@gmail.com wrote:
> Hi,
>
> I have the following issue. I have a directory with 25000 text files
> in it (about 1-10000 lines in each file).
How big are the files? Every character needs to be inspected to see if it
is a newline, so it depends on the number of characters, not the number
of lines.
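To illustrate the point, here is a minimal sketch that counts lines by counting newline bytes directly. The function name count_newlines is made up for this example; it only assumes the standard tr and wc utilities.

```shell
# count_newlines FILE (hypothetical helper): a line count is just the
# number of newline bytes, so every byte of the file has to be examined.
count_newlines() {
    # tr -cd '\n' deletes every byte except newline; wc -c counts what is left.
    # The final tr strips the padding some wc implementations add.
    tr -cd '\n' < "$1" | wc -c | tr -d ' '
}
```

For a file that ends in a newline this should give the same number as `wc -l < file`, and the work done grows with the file's size in bytes rather than with the number of lines.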
> I have a perl script that
> generates some reports for me and this script needs to count the
> number of lines in each of these 25000 files (for each file I need a
> number of lines in it). And it is not so difficult, I iterate over the
> directory and count the number of lines using "wc -l" as follows:
>
> open (WCCOUNT, "cat $file_to_read | wc -l |");
> $file_number_of_lines = <WCCOUNT>;
> chomp($file_number_of_lines);
> close(WCCOUNT);
>
> But the above sequential counting is very slow (2-3 hours). My server
> is quite powerful (72 CPU and fast filesystems)
Unless your line lengths are > 1000 or so, this seems quite slow. Maybe
your file system isn't as fast as you think it is. Can you verify the time
it would take to read all of this data independent of both Perl and wc
(e.g. catting it all to /dev/null)? I think that should be sorted out
before trying to go for parallelization. If you have IO problems,
parallelization probably wouldn't help.
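One way to get that baseline is sketched below. The function name read_all and the directory path are placeholders, and it assumes bash (so that `time` can wrap a shell function) and a find that supports -maxdepth.

```shell
# read_all DIR (hypothetical helper): read every regular file directly
# under DIR and throw the data away, so only the raw read cost is measured.
read_all() {
    # -exec cat {} + batches many files per cat call, which avoids
    # starting one process per file.
    find "$1" -maxdepth 1 -type f -exec cat {} + > /dev/null
}

# Usage (the path is a placeholder for the real directory):
#   time read_all /path/to/files
```

A second run may be much faster than the first if the files are still in the OS cache, so the first (cold) run is probably the more relevant number to compare against the 2-3 hours.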
Also, what kind of CPUs do you have?
Xho
--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.