Posted by xhoster on April 30, 2008, 10:38 pm
> I've been able to reduce my dataset by 75%, but it still leaves me with a
> file of 47 gigs. I'm trying to find the frequency of each line using:
>
> open(TEMP, "< $tempfile") || die "cannot open file $tempfile: $!";
> foreach (<TEMP>) {
>     $seen{$_}++;
> }
> close(TEMP) || die "cannot close file $tempfile: $!";
If each line shows up a million times on average, that shouldn't
be a problem. If each line shows up twice on average, then it won't
work so well with 4G of RAM. We don't know which of those is closer to your
case.
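One rough way to find out which case applies is to count distinct lines in a sample from the front of the file. This is only a sketch: `big_file` and the sample size are placeholder names, the tiny generated file stands in for the real data, and the first lines of a file are not necessarily representative of the rest.

```shell
# Hypothetical stand-in for the real 47 GB file, so the sketch runs as-is.
printf 'b\na\nb\nc\nb\na\n' > big_file

# Assumed sample size; pick something that sorts comfortably in RAM.
sample_lines=1000000

# Count the sampled lines and how many of them are distinct.
total=$(head -n "$sample_lines" big_file | wc -l | tr -d ' ')
distinct=$(head -n "$sample_lines" big_file | LC_ALL=C sort -u | wc -l | tr -d ' ')

# A low distinct/total ratio suggests an in-memory count may fit;
# a ratio near 1 suggests it will not.
echo "$total $distinct" > sample_stats.txt
cat sample_stats.txt
```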
> My program keeps aborting after a few minutes because the computer runs
> out of memory. I have four gigs of ram and the total paging files is 10
> megs, but Perl does not appear to be using it.
If the program is killed due to running out of memory, then I would
say that the program *does* appear to be using the available memory. What
makes you think it isn't using it?
> How can I find the frequency of each line using such a large dataset?
I probably wouldn't use Perl, but rather the OS's utilities. For example
on Linux:
sort big_file | uniq -c
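A slightly fuller sketch of that pipeline follows. The file names are placeholders and the small generated file stands in for the real data. sort works by external merge, spilling to temporary files on disk, so the whole dataset does not have to fit in RAM; the temp directory does need enough free space. With GNU sort, `-S` sets the in-memory buffer size and `-T` picks the temp directory; both are left as comments here since suitable values depend on the machine.

```shell
# Hypothetical stand-in for the real 47 GB file, so the sketch runs as-is.
printf 'b\na\nb\nc\nb\na\n' > big_file

# LC_ALL=C compares raw bytes, which is usually faster than locale-aware
# collation and is all that uniq needs to see identical lines as adjacent.
# GNU sort variant (assumed values): LC_ALL=C sort -S 2G -T /some/big/disk big_file
LC_ALL=C sort big_file | uniq -c > line_counts.txt

# Show the most frequent lines first.
sort -rn line_counts.txt | head
```

Each output line is a count followed by the line's text, so the result can be read back into Perl or any other tool without holding every distinct line in a hash.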
> I tried to have two output files where I kept moving the data back and
> forth each time I grabbed the next line from TEMP instead of using
> $seen{$_}++, but I did not have much success.
Bug in line 42.
Xho