Frequency in large datasets

Subject Author Date
Frequency in large datasets Cosmic Cruizer 04-30-2008
Posted by A. Sinan Unur on May 1, 2008, 7:26 am
benkasminbullock@gmail.com (Ben Bullock) wrote:

>>

...

>>> foreach (<TEMP>) {
>>
>> Well, that is simply silly. You have a huge file yet you try to read
>> all of it into memory. Ain't gonna work.
>
> I'm not sure why it's silly as such - perhaps he didn't know that
> "foreach" would read all the file into memory.

Well, I assumed he didn't. But this is one of those things, had I found
myself doing it, after spending hours and hours trying to work out a way
of processing the file, I would have slapped my forehead and said, "now
that was just a silly thing to do". Coupled with the "ain't" I assumed
my meaning was clear. I wasn't calling the OP names, but trying to get a
message across very strongly.

>> If the number of unique lines is small relative to the number of
>> total lines, I do not see any difficulty if you get rid of the
>> boneheaded for loop.
>
> Again, why is it "boneheaded"?

Because there is no hope of anything working so long as that for loop is
there.

> The fact that foreach reads the entire file into memory isn't
> something I'd expect people to know automatically.

Maybe this helps:

From perlfaq3.pod:

<blockquote>
* How can I make my Perl program take less memory?

...

Of course, the best way to save memory is to not do anything to waste it
in the first place. Good programming practices can go a long way toward
this:

* Don't slurp!

Don't read an entire file into memory if you can process it line by
line. Or more concretely, use a loop like this:
</blockquote>

Maybe you would like to read the rest.
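
For anyone wondering why the two loop forms behave so differently: foreach evaluates its list in list context, where the readline operator returns every remaining line at once, while the condition of a while loop evaluates it in scalar context, where it returns one line per call. A minimal sketch of that difference, using a lexical filehandle rather than TEMP (both sub names are made up for illustration). Each sub reports whether the handle is already at end-of-file when the loop body sees its first line:

```perl
use strict;
use warnings;

# foreach: <$fh> is evaluated in list context, so the whole file has
# been read before the body runs for the first time.
sub at_eof_on_first_line_foreach {
    my ($fh) = @_;
    foreach my $line (<$fh>) {
        return eof($fh) ? 1 : 0;
    }
    return;
}

# while: <$fh> is evaluated in scalar context, so only one line has
# been read when the body runs for the first time.
sub at_eof_on_first_line_while {
    my ($fh) = @_;
    while (my $line = <$fh>) {
        return eof($fh) ? 1 : 0;
    }
    return;
}
```

With a three-line input, the foreach version should find the handle already exhausted, and the while version should not. On a 47 GB file, that "already exhausted" state is what never gets reached.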

So, calling the for loop boneheaded is a little stronger than "Bad
Idea", but then what is simply a bad idea with a 200 MB file (things
will still work but less efficiently) is boneheaded with a 47 GB file
(there is no chance of the program working).

There is a reason "Don't slurp!" appears with an exclamation mark and as
the first recommendation in the FAQ list answer.

Hope this helps you become more comfortable with the notion that reading
a 47 GB file is a boneheaded move. It is boneheaded if I do it, if Larry
Wall does it, if Superman does it ... you get the picture I hope.

Sinan

--
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/

Posted by nolo contendere on May 1, 2008, 11:54 am
>
>
> Hope this helps you become more comfortable with the notion that reading
> a 47 GB file is a boneheaded move. It is boneheaded if I do it, if Larry
> Wall does it, if Superman does it ... you get the picture I hope.
>

I don't think it would be boneheaded if Superman did it...I mean, he's
SUPERMAN.

Posted by Chris Mattern on May 1, 2008, 12:43 pm
>> benkasminbull...@gmail.com (Ben Bullock) wrote
>>
>>
>> Hope this helps you become more comfortable with the notion that reading
>> a 47 GB file is a boneheaded move. It is boneheaded if I do it, if Larry
>> Wall does it, if Superman does it ... you get the picture I hope.
>>
>
> I don't think it would be boneheaded if Superman did it...I mean, he's
> SUPERMAN.

Hey, Superman can do boneheaded things. It's not like he's Chuck Norris.


--
Christopher Mattern

NOTICE
Thank you for noticing this new notice
Your noticing it has been noted
And will be reported to the authorities

Posted by xhoster on April 30, 2008, 10:38 pm
> I've been able to reduce my dataset by 75%, but it still leaves me with a
> file of 47 gigs. I'm trying to find the frequency of each line using:
>
> open(TEMP, "< $tempfile") || die "cannot open file $tempfile: $!";
> foreach (<TEMP>) {
> $seen++;
> }
> close(TEMP) || die "cannot close file $tempfile: $!";

If each line shows up a million times on average, that shouldn't
be a problem. If each line shows up twice on average, then it won't
work so well with 4G of RAM. We don't know which of those is closer to your
case.

> My program keeps aborting after a few minutes because the computer runs
> out of memory. I have four gigs of ram and the total paging files is 10
> megs, but Perl does not appear to be using it.

If the program is killed due to running out of memory, then I would
say that the program *does* appear to be using the available memory. What
makes you think it isn't using it?


> How can I find the frequency of each line using such a large dataset?

I probably wouldn't use Perl, but rather the OS's utilities. For example,
on Linux:

sort big_file | uniq -c
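
The reason this scales is that once the lines are sorted, identical lines sit next to each other, so the counting step only has to remember the current line and its count rather than a hash of every distinct line. A rough Perl sketch of that counting step, assuming the input handle is already sorted (count_sorted_runs is a hypothetical name, and the sorting itself is still left to the OS utility):

```perl
use strict;
use warnings;

# Count runs of identical adjacent lines on an already-sorted handle.
# Calls $emit->($count, $line) once per distinct line. Only the current
# line and its count are kept, so memory use does not depend on how
# many distinct lines there are.
sub count_sorted_runs {
    my ($in, $emit) = @_;
    my ($prev, $count);
    while (my $line = <$in>) {
        chomp $line;
        if (defined $prev && $line eq $prev) {
            $count++;
        }
        else {
            $emit->($count, $prev) if defined $prev;
            ($prev, $count) = ($line, 1);
        }
    }
    $emit->($count, $prev) if defined $prev;    # flush the final run
    return;
}
```

Fed from a pipe, for instance, it could be called as count_sorted_runs(\*STDIN, sub { printf "%7d %s\n", @_ }) in a script that receives the output of sort on standard input.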


> I
> tried to have two output files where I kept moving the data back and forth
> each time I grabbed the next line from TEMP instead of using $seen++,
> but I did not have much success.

But in line 42.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.

Posted by Jürgen Exner on April 30, 2008, 11:44 pm
>I've been able to reduce my dataset by 75%, but it still leaves me with a
>file of 47 gigs. I'm trying to find the frequency of each line using:
>
> open(TEMP, "< $tempfile") || die "cannot open file $tempfile: $!";
> foreach (<TEMP>) {

This slurps the whole file (yes, all 47GB) into a list and then iterates
over that list. Read the file line-by-line instead:

        while (<TEMP>){

This should work unless you have a lot of different data points.
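
Put together, a line-by-line version of the counting loop might look like the sketch below. It assumes the loop body was meant to increment a hash entry keyed by the line, which the quoted code does not show, and it uses a lexical filehandle; count_lines is a made-up name. The hash still holds one entry per distinct line, so memory grows with the number of different lines rather than with the size of the file:

```perl
use strict;
use warnings;

# Read one line at a time and tally how often each distinct line occurs.
# Returns a reference to a hash mapping line text to its count.
sub count_lines {
    my ($fh) = @_;
    my %seen;
    while (my $line = <$fh>) {    # scalar context: one line per iteration
        chomp $line;
        $seen{$line}++;
    }
    return \%seen;
}

# Possible use with a real file (assumes $tempfile holds the path):
#   open(my $temp, '<', $tempfile) || die "cannot open file $tempfile: $!";
#   my $seen = count_lines($temp);
#   close($temp) || die "cannot close file $tempfile: $!";
#   print "$seen->{$_}\t$_\n" for sort keys %$seen;
```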

jue
