
Frequency in large datasets

comp.lang.perl.misc
Posted by Cosmic Cruizer on April 30, 2008, 10:15 pm
I've been able to reduce my dataset by 75%, but it still leaves me with a
file of 47 gigs. I'm trying to find the frequency of each line using:

open(TEMP, "< $tempfile") || die "cannot open file $tempfile: $!";
foreach (<TEMP>) {
    $seen{$_}++;
}
close(TEMP) || die "cannot close file $tempfile: $!";

My program keeps aborting after a few minutes because the computer runs out
of memory. I have four gigs of RAM and the total paging file size is 10 megs,
but Perl does not appear to be using it.

How can I find the frequency of each line in such a large dataset? I
tried using two output files, moving the data back and forth between them
each time I grabbed the next line from TEMP instead of using $seen{$_}++,
but I did not have much success.

Posted by Gunnar Hjalmarsson on April 30, 2008, 10:24 pm
Cosmic Cruizer wrote:
> I've been able to reduce my dataset by 75%, but it still leaves me with a
> file of 47 gigs. I'm trying to find the frequency of each line using:
>
> open(TEMP, "< $tempfile") || die "cannot open file $tempfile: $!";
> foreach (<TEMP>) {
>     $seen{$_}++;
> }
> close(TEMP) || die "cannot close file $tempfile: $!";
>
> My program keeps aborting after a few minutes because the computer runs out
> of memory.

This line:

> foreach (<TEMP>) {

reads the whole file into memory. You should read the file line by line
instead by replacing it with:

while (<TEMP>) {
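
For reference, here is a minimal sketch of the whole loop with that change applied. foreach evaluates <TEMP> in list context, so every line is read before the first iteration runs, while while reads one line per pass. The sketch assumes the counter is a hash named %seen keyed by the line (the posted snippet shows only a bare $seen). The count_lines name, the lexical filehandle, the three-argument open, and the chomp are illustrative choices, not part of the original post.

```perl
use strict;
use warnings;

# Count how often each line occurs, reading one line at a time.
# Memory use grows with the number of distinct lines, not with file size.
sub count_lines {
    my ($path) = @_;
    my %seen;
    open(my $fh, '<', $path) or die "cannot open file $path: $!";
    while (my $line = <$fh>) {    # scalar context: one line per iteration
        chomp $line;              # assumption: trailing newline is not part of the key
        $seen{$line}++;
    }
    close($fh) or die "cannot close file $path: $!";
    return \%seen;
}

# Usage (hypothetical): perl count_lines.pl data.txt
if (@ARGV) {
    my $seen = count_lines($ARGV[0]);
    print "$seen->{$_}\t$_\n" for sort keys %$seen;
}
```

The hash still needs one entry per distinct line, so this only helps when the number of distinct lines is small enough to fit in memory.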

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

Posted by xhoster on April 30, 2008, 10:39 pm
> Cosmic Cruizer wrote:
> > I've been able to reduce my dataset by 75%, but it still leaves me with
> > a file of 47 gigs. I'm trying to find the frequency of each line using:
> >
> > open(TEMP, "< $tempfile") || die "cannot open file $tempfile: $!";
> > foreach (<TEMP>) {
> >     $seen{$_}++;
> > }
> > close(TEMP) || die "cannot close file $tempfile: $!";
> >
> > My program keeps aborting after a few minutes because the computer runs
> > out of memory.
>
> This line:
>
> > foreach (<TEMP>) {
>
> reads the whole file into memory. You should read the file line by line
> instead by replacing it with:
>
> while (<TEMP>) {

Duh, I completely overlooked that.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.

Posted by Cosmic Cruizer on April 30, 2008, 11:32 pm

> Cosmic Cruizer wrote:
>> I've been able to reduce my dataset by 75%, but it still leaves me
>> with a file of 47 gigs. I'm trying to find the frequency of each line
>> using:
>>
>> open(TEMP, "< $tempfile") || die "cannot open file $tempfile: $!";
>> foreach (<TEMP>) {
>>     $seen{$_}++;
>> }
>> close(TEMP) || die "cannot close file $tempfile: $!";
>>
>> My program keeps aborting after a few minutes because the computer
>> runs out of memory.
>
> This line:
>
>> foreach (<TEMP>) {
>
> reads the whole file into memory. You should read the file line by
> line instead by replacing it with:
>
> while (<TEMP>) {
>

<sigh> As both you and Sinan pointed out... I'm using foreach. Everywhere
else I used the while statement to get me to this point. This solves the
problem.

Thank you.

Posted by Chris Mattern on May 1, 2008, 12:42 pm
>
>> Cosmic Cruizer wrote:
>>> I've been able to reduce my dataset by 75%, but it still leaves me
>>> with a file of 47 gigs. I'm trying to find the frequency of each line
>>> using:
>>>
>>> open(TEMP, "< $tempfile") || die "cannot open file $tempfile: $!";
>>> foreach (<TEMP>) {
>>>     $seen{$_}++;
>>> }
>>> close(TEMP) || die "cannot close file $tempfile: $!";
>>>
>>> My program keeps aborting after a few minutes because the computer
>>> runs out of memory.
>>
>> This line:
>>
>>> foreach (<TEMP>) {
>>
>> reads the whole file into memory. You should read the file line by
>> line instead by replacing it with:
>>
>> while (<TEMP>) {
>>
>
> <sigh> As both you and Sinan pointed out... I'm using foreach. Everywhere
> else I used the while statement to get me to this point. This solves the
> problem.
>
> Thank you.

Didn't realize your file had so many duplicates (and thus such a small
set of unique lines). If it works, that's great!
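
The point about unique lines matters because a per-line hash holds one entry for every distinct line. If the distinct lines did not fit in memory, one common alternative would be to sort the file with an external tool first and then count runs of identical adjacent lines, which keeps only one line in memory at a time. A rough sketch, assuming the input is already sorted; the count_sorted_runs name and the callback interface are hypothetical:

```perl
use strict;
use warnings;

# Count runs of identical adjacent lines on an already-sorted filehandle.
# Calls $emit->($line, $count) once per distinct line.
sub count_sorted_runs {
    my ($fh, $emit) = @_;
    my ($prev, $count);
    while (my $line = <$fh>) {
        chomp $line;
        if (defined $prev && $line eq $prev) {
            $count++;                               # same run continues
        } else {
            $emit->($prev, $count) if defined $prev;  # previous run ended
            ($prev, $count) = ($line, 1);             # start a new run
        }
    }
    $emit->($prev, $count) if defined $prev;        # flush the final run
    return;
}

# Usage (hypothetical): perl count_runs.pl sorted.txt
if (@ARGV) {
    open(my $fh, '<', $ARGV[0]) or die "cannot open file $ARGV[0]: $!";
    count_sorted_runs($fh, sub { print "$_[1]\t$_[0]\n" });
    close($fh) or die "cannot close file $ARGV[0]: $!";
}
```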


--
Christopher Mattern

NOTICE
Thank you for noticing this new notice
Your noticing it has been noted
And will be reported to the authorities
