File size too big for perl processing

Hi, I posted this to perl.beginners as well and will make sure
comments go to both groups.

I have a big file of 16-letter words that I am using as "bait" to
capture larger words in a raw data file.  For each word, I loop through
all of the raw data to 1) find matches and 2) associate the raw data
with that word.  I then go to the next line in the word list and repeat.

hashsequence16.txt is the 16-letter word file (203MB)
rawdata.txt is the raw data file (93MB)

I have a counter in the code to tell me how long it's taking to
process... 9500 hours or so to complete...  I definitely have time to
pursue other alternatives.

Scripting with perl is a hobby and not a vocation, so I apologize in
advance for ugly code.  Any suggestions/comments would be greatly
appreciated.



print "**fisher**";

$flatfile = "newrawdata.txt";
# 95MB in size

$datafile = "hashsequence16.txt";
# 203MB in size

my $filesize = -s "hashsequence16.txt";
# for use in processing time calculation

open(FILE, "$flatfile") || die "Can't open '$flatfile': $!\n";
open(FILE2, "$datafile") || die "Can't open '$flatfile': $!\n";
open (SEQFILE, ">fishersearch.txt") || die "Can't open '$seqparsed': $!

@preparse = <FILE>;
@hashdata = <FILE2>;


for my $list1 (@hashdata) {
# iterating through hash16 data


    if ($finish ==10 ) {
# line counter

    $marker = $marker + $finish;

    $finish =0;

    $left = $filesize - $marker;

    printf "$left\/$filesize\n";
# this prints every 17 seconds

    ($line, $freq) = split(/\t/, $list1);

    for my $rawdata (@preparse) {
# iterating through rawdata

    $rawdata=~ s/\n//;

    if ($rawdata =~ m/$line/) {
# matching hash16 word with rawdata line

        my $first_pos = index  $rawdata,$line;

        print SEQFILE "$first_pos\t$rawdata\n";
# printing to info to new file



    print SEQFILE "PROCESS\t$line\n";
# printing hash16 word and "process"


Re: File size too big for perl processing


Hmm. How many 16-letter words are in this file? I see from your code
that the file contains the word and a frequency count. Estimating at
about 25 bytes per word, that represents 8 million words.


You should have

use strict;
use warnings;

in your program. This is very important if you wish to get help from
this newsgroup.


You should be using lexically-scoped file handle variables, the
3-argument version of open, and 'or' instead of '||'.
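
For example, the first open might be written something like this (just a
sketch, reusing the poster's file name):

    my $flatfile = 'newrawdata.txt';
    open( my $raw_fh, '<', $flatfile )
        or die "Can't open '$flatfile': $!\n";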


Well at least you have enough memory to read the files into memory.
That helps. If you apply the chomp operator to these arrays, you can
save yourself some repetitive processing later:
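
Something like this (a minimal sketch using the poster's array names):

    my @preparse = <FILE>;
    my @hashdata = <FILE2>;
    chomp @preparse;    # strip the trailing newlines once, up front
    chomp @hashdata;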



When you are asking for help, it is best to leave out irrelevant
details such as periodic printing statements. It doesn't help anybody
help you.


No need for this if you chomp the arrays after reading.


You first use a regex to find if $line appears in $rawdata, then use
index to find out where it appears. Just test the return value from
index to see if the substring appears. It will be -1 if it does not.
This will give you a significant speed-up.
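
For instance, the inner test could be reduced to something like this
(a sketch using the variable names from the posted code):

    my $first_pos = index $rawdata, $line;
    if ( $first_pos > -1 ) {    # -1 means $line does not appear in $rawdata
        print SEQFILE "$first_pos\t$rawdata\n";
    }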


You only make one pass through FILE2, so you can save some memory by
processing the contents of this file one line at a time, instead of
reading it into the @hashdata array. It looks like you could also swap
the order of the for loops and only make one pass through FILE,
instead, but that may take more memory.
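
A rough sketch of the line-at-a-time version (untested, keeping the
original file handles and variable names):

    while ( my $list1 = <FILE2> ) {    # read one 16-letter-word line at a time
        chomp $list1;
        my ( $line, $freq ) = split /\t/, $list1;
        for my $rawdata (@preparse) {
            my $first_pos = index $rawdata, $line;
            print SEQFILE "$first_pos\t$rawdata\n" if $first_pos > -1;
        }
        print SEQFILE "PROCESS\t$line\n";
    }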

It is difficult to see why this program will take 9500 hours to run.
Make the above changes and try again. Without your data files or a look
at some sample data, it is difficult for anyone to really help you.

Jim Gibson

Re: File size too big for perl processing


The subject implies that you have a problem that is producing an EFBIG
error (say a file > 2GB or, even, 2^63 bytes - that would be impressive).

In fact you seem to have a slow algorithm that you expect can be
improved.  That is something very different.

              Just because I've written it doesn't mean that
                   either you or I have to believe it.
