
Counting lines in big number of files - in parallel.

Subject Author Date
Counting lines in big number of files - in parallel. hadzio 06-03-2008
Posted by xhoster on June 3, 2008, 12:00 pm
hadzio@gmail.com wrote:
> Hi,
>
> I have the following issue. I have a directory with 25000 text files
> in it (about 1-10000 lines in each file).

How big are the files? Every character needs to be inspected to see if it
is a newline, so it depends on the number of characters, not the number
of lines.
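To illustrate why the byte count matters, here is a minimal sketch of a pure-Perl counter that reads each file in fixed-size blocks and counts newline characters with tr///. The helper name count_lines and the 64 KiB block size are assumptions made for this example, not something from the original script. Like "wc -l", it counts newline characters, so a final line without a trailing newline is not counted.

```perl
use strict;
use warnings;

# Count newline characters in one file by reading fixed-size blocks.
# Hypothetical helper; the 64 KiB block size is an arbitrary choice.
sub count_lines {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my ($count, $buf) = (0, '');
    while (1) {
        my $n = sysread($fh, $buf, 65536);
        die "Read error on $path: $!" unless defined $n;
        last if $n == 0;                 # end of file
        $count += ($buf =~ tr/\n//);     # tr returns the number of matches
    }
    close($fh);
    return $count;
}

# Print "count<TAB>name" for every file named on the command line.
printf "%d\t%s\n", count_lines($_), $_ for @ARGV;
```

Every byte still passes through tr///, so the work grows with the file size rather than the line count, but no external process is started per file.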

> I have a perl script that
> generates some reports for me and this script needs to count the
> number of lines in each of these 25000 files (for each file I need a
> number of lines in it). And it is not so difficult, I iterate over the
> directory and count the number of lines using "wc -l" as follows:
>
> open (WCCOUNT, "cat $file_to_read | wc -l |");
> $file_number_of_lines = <WCCOUNT>;
> chomp($file_number_of_lines);
> close(WCCOUNT);
>
> But the above sequential counting is very slow (2-3 hours). My server
> is quite powerful (72 CPUs and fast filesystems)

Unless your lines are longer than 1000 characters or so, this seems quite
slow. Maybe your file system isn't as fast as you think it is. Can you
verify the time it would take to read all of this data independent of both
Perl and wc (e.g. catting it all to /dev/null)? I think that should be
sorted out before trying to go for parallelization. If you have IO problems,
parallelization probably wouldn't help.
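Catting everything to /dev/null is the most direct check. If a number from inside Perl is also wanted for comparison, a rough sketch along the following lines reads every byte and discards it, with no newline counting at all. The helper name read_all_bytes, the 1 MiB block size, and taking the directory as the first command-line argument are all assumptions for this example. A second run may be much faster than the first if the operating system has cached the files, so the first run is probably the more representative one.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Read every byte of the given files and discard it, returning the
# total number of bytes read. No newline counting is done here.
sub read_all_bytes {
    my (@paths) = @_;
    my ($total, $buf) = (0, '');
    for my $path (@paths) {
        open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
        while (1) {
            my $n = sysread($fh, $buf, 1 << 20);
            die "Read error on $path: $!" unless defined $n;
            last if $n == 0;             # end of file
            $total += $n;
        }
        close($fh);
    }
    return $total;
}

# Usage (hypothetical): perl read_time.pl /path/to/directory
if (my $dir = shift @ARGV) {
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @files = grep { -f } map { "$dir/$_" } readdir($dh);
    closedir($dh);

    my $start   = time;
    my $bytes   = read_all_bytes(@files);
    my $elapsed = time - $start;
    printf "%d files, %d bytes in %.2f s\n", scalar(@files), $bytes, $elapsed;
}
```

If this alone takes hours, the bottleneck is the reading, not the counting.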

Also, what kind of CPUs do you have?

Xho


Posted by Ben Bullock on June 4, 2008, 9:56 am
On Tue, 03 Jun 2008 02:14:26 -0700, hadzio wrote:

> Or maybe someone has any other better idea how to count number of
> lines in each of 25000 files?

I wonder why you need to keep counting these files again and again. Do
all the files change each time you need to look at them? If not, you
could store the results of counting in a file, and then only recount the
ones which have been modified since the results file was last updated.
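A minimal sketch of that idea follows. It is a variation rather than exactly what is described above: instead of comparing against the timestamp of the results file, it stores each file's own modification time next to its count and recounts when that time differs. The names load_cache, save_cache, cached_count, the tab-separated format, and the file name line_counts.tsv are all assumptions for this example. It also counts with Perl's line reads, so a final line without a newline is counted, unlike "wc -l", and a file rewritten within the same second as the previous count could be missed.

```perl
use strict;
use warnings;

# Load "path<TAB>mtime<TAB>count" records (hypothetical cache format).
sub load_cache {
    my ($cache_file) = @_;
    my %cache;
    return \%cache unless -e $cache_file;
    open(my $in, '<', $cache_file) or die "Cannot open $cache_file: $!";
    while (my $line = <$in>) {
        chomp $line;
        my ($path, $mtime, $count) = split /\t/, $line;
        $cache{$path} = { mtime => $mtime, count => $count };
    }
    close($in);
    return \%cache;
}

# Write the cache back in the same format.
sub save_cache {
    my ($cache_file, $cache) = @_;
    open(my $out, '>', $cache_file) or die "Cannot write $cache_file: $!";
    for my $path (sort keys %$cache) {
        print {$out} join("\t", $path, @{ $cache->{$path} }{qw(mtime count)}), "\n";
    }
    close($out) or die "Cannot close $cache_file: $!";
}

# Return the line count for $path, recounting only when its mtime changed.
sub cached_count {
    my ($cache, $path) = @_;
    my $mtime = (stat $path)[9];
    die "Cannot stat $path: $!" unless defined $mtime;
    my $entry = $cache->{$path};
    return $entry->{count} if $entry && $entry->{mtime} == $mtime;

    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    my $count = 0;
    while (my $line = <$fh>) { $count++ }
    close($fh);
    $cache->{$path} = { mtime => $mtime, count => $count };
    return $count;
}

# Usage (hypothetical): perl cached_wc.pl file1 file2 ...
if (@ARGV) {
    my $cache_file = 'line_counts.tsv';    # hypothetical file name
    my $cache      = load_cache($cache_file);
    printf "%d\t%s\n", cached_count($cache, $_), $_ for @ARGV;
    save_cache($cache_file, $cache);
}
```

On a run where nothing has changed, each file costs one stat call instead of a full read.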


