[RFC] File::SplitStream - iterate over files >2GB when large file support unavailable


I am implementing a module and seek the community's input about its
suitability for placement on CPAN.  File::SplitStream (I am open to
better names) is designed to be used when an OS supports large files,
but the Perl interpreter does not have large file support enabled
(specifically, Red Hat Linux did this for a while).  It uses the Unix
split command to split the large file into <2GB chunks, then generates an
iterator to allow the calling routine to transparently read the file
chunks as if they were still one large file.

Below is a draft of the documentation for this module.  I searched CPAN
and did not find anything similar.  Your input and suggestions regarding
structure, functionality, and documentation improvements will be greatly
appreciated.
      File::SplitStream - iterate over multiple files as if they were one
    file.  Optionally split a large file into smaller files before
    iterating.

      use File::SplitStream;

      # split a file into parts
      my $filestream = new File::SplitStream(FILE => '/path/to/inputfile',
                                             LINES => 19000000);
      $filestream->genFileStream() || die("cannot generate filestream: $!");


      # or use a group of pre-existing files
      my @inputfiles = qw(file01.txt file02.txt file03.txt);
      my $filestream = new File::SplitStream(FILES => \@inputfiles);
      $filestream->genFileStream() || die("cannot generate filestream: $!");

      # regardless of how you set things up, you can
      # now iterate over the files as if they're one file
      while (my $line = $filestream->nextLine()->() ) {
              ...do stuff on each line of all of the files...
      }

      # you can use a function call rather than instantiating an object
      use File::SplitStream qw(genFileStream);

      my $filestream = genFileStream('/path/to/inputfile', 19000000);
      while ( my $line = $filestream->() ) {
              ...do stuff on each line of all of the files...
      }

     File::SplitStream can be used to split a large text file (or
     to use a list of pre-existing files) and iterate over the files as if
     they were a single file. This class is designed to help work with large
     files (>2GB) when large file support is unavailable. Perhaps the
     programmer does not have permission to recompile the available Perl
     interpreter, or simply does not have the time. Regardless of the
     reason, this module can help fill the gap when large file support is
     unavailable.
     In order for File::SplitStream to work properly, the Unix split and cat
     commands should be in your $PATH. The split command is used to split up
     the large file into more manageable chunks, while the cat command is
     used to buffer input of the files. The number of lines in each of your
     split files will depend on how much data is in each line. Shorter
     lines will allow you to put many more lines into a file before it
     crosses the 2GB barrier. Longer lines will require you to decrease the
     lines/file value.
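     As an illustration of the underlying mechanism (the file names and
     line counts here are made up for the demo, not taken from the
     module's code), split's -l option caps each chunk at a fixed number
     of lines, and reading the chunks back in shell-glob order
     reproduces the original file:

      # Create a small sample file and split it into 2-line chunks;
      # with a real >2GB file you would use something like -l 15000000.
      printf 'line1\nline2\nline3\nline4\nline5\n' > sample.txt
      split -l 2 sample.txt chunk.

      # split names the pieces chunk.aa, chunk.ab, ... so a shell glob
      # concatenates them back in the original order.
      cat chunk.* > reassembled.txt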

     These accessor methods can be used directly or set by passing them to
     the new() method.

      (set to split a single file apart and iterate over the pieces)
       file   the file to split apart
       lines  maximum number of lines in each file chunk

      (set to use a pre-existing set of files as a single file)
       files  reference to a list of files to iterate over

     Use the new() method to create a new File::SplitStream object. You will
     need to do this to use the module in an object-oriented way. You can
     pass options to the new() method to set the file(), lines(), and
     files() accessors.

      # new File::SplitStream with no options
      my $fss = new File::SplitStream;

      # new File::SplitStream with options to split a single file
      my $fss = new File::SplitStream(FILE => '/path/to/file',
                                      LINES => 15000000);

      # new File::SplitStream with option to use a list of pre-existing files
      my $fss = new File::SplitStream(FILES => ['/path/to/file1',
                                                '/path/to/file2'] );

     If options are passed to new(), init() is invoked by new() to set the
     appropriate object attributes given the options. Normally init() is
     invoked by new(), but can be used to (re)set your File::SplitStream
     object's attributes if you want.


      my $fss = new File::SplitStream;
      $fss->init(FILE => '/path/to/file', LINES => 15000000);

   genFileStream($filepath, $number_of_lines)
     The workhorse of File::SplitStream is the genFileStream()
     method/function. It splits the large data file (if necessary) using the
     Unix split command, then generates an iterator function to return each
     line of the split files in order, transparently opening and closing the
     split files as necessary. If you have specified a list of pre-existing
     files, the iterator will open each in the order you gave.

     In an object-oriented context, genFileStream() will take the values of
     the file() and lines() accessors (or the files() accessor in the case
     of pre-existing files) as its parameters. If you explicitly pass
     genFileStream() parameters, these will override the object's attribute
     values. In a procedural context, obviously you will have to pass these
     parameters explicitly.

     In object-oriented style, genFileStream() will assign the iterator
     function to the nextLine() accessor and return 1 to the calling
     routine; this way the calling routine does not need yet another
     variable to hold the "filestream." In procedural style, the iterator
     will be returned to the calling routine.

     When the data in all of the files have been exhausted, the iterator
     function will return undef. If there is a problem generating the
     iterator (usually a problem with the split), or a problem is
     encountered while the split files are being read, the program will
     die() with the error being written to STDERR.
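     The iterator that genFileStream() returns can be pictured as a
     closure over the list of chunk files. The following is a
     hypothetical sketch of that pattern, not the module's actual code
     (make_line_iterator is an invented name for illustration):

      use strict;
      use warnings;

      # Sketch only: a closure that walks a list of files in order,
      # returning one line per call and undef when every file is
      # exhausted, opening and closing filehandles as it goes.
      sub make_line_iterator {
          my @files = @_;
          my $fh;
          return sub {
              while (1) {
                  if (defined $fh) {
                      my $line = <$fh>;
                      return $line if defined $line;
                      close $fh;
                      undef $fh;
                  }
                  return undef unless @files;
                  my $next = shift @files;
                  open $fh, '<', $next
                      or die "cannot open $next: $!";
              }
          };
      }

     Each call to the returned sub yields the next line across all of
     the files, so the caller never sees the chunk boundaries.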


       # OO way
       use File::SplitStream;
       my $fss = new File::SplitStream(FILE => '/data/largefile.dat',
                                       LINES => 1000000);
       $fss->genFileStream() or die "cannot generate filestream: $!";
       while ( my $line = $fss->nextLine()->() ) {
         ...process the file...
       }

       # procedural
       use File::SplitStream qw(genFileStream);
       my $stream = genFileStream('/data/largefile.dat',1000000);
       while ( my $line = $stream->() ) {
             ...process the file...
       }

     None by default. You can import genFileStream() into your namespace if
     you wish to use it in procedural style.

Re: [RFC] File::SplitStream - iterate over files >2GB when large file support unavailable

AJ wrote:


This seems rather complex and involves very big temporary files.

Is there some problem with just doing...

open my $fh, '-|', 'cat', $huge_file or die "Cannot read $huge_file: $!";

Re: [RFC] File::SplitStream - iterate over files >2GB when large file support unavailable

Brian McCauley wrote:
  Yes.  In my case, it didn't work.  I received a 'File too large' error
after the input pipe passed the 2GB limit.  Thus, this solution.
Obviously, it's not very pretty, but it does work.

Re: [RFC] File::SplitStream - iterate over files >2GB when large file support unavailable


I don't understand the need for this.  It doesn't appear to implement
"seek" and "tell", only streaming.  It has been a while since I've used
a small-file perl, but I never knew there was a problem in streaming large
files in the first place.  I thought it was only seek and tell (and
truncate, and maybe other non-streaming things) which elicited the problem.



Re: [RFC] File::SplitStream - iterate over files >2GB when large file support unavailable

xhoster@gmail.com wrote:

I can tell you that seek() and tell() are not the only things that don't
work when trying to access a large file without large file support
enabled.  In my original case, merely trying to open the file in
question (~20GB size) yielded a "File too large" error immediately.
Rewriting my code to cat the file through a pipe worked until I read
past the 2GB threshold; at that point, the "File too large" error
resurfaced.  This system is using an older OS whose perl was not
compiled with large file support enabled and, if given the chance, I
would have upgraded the Perl (and the OS, for that matter).  But for
several reasons I am unable to do this.  A solution similar to this
module (though not using the same code) seemed to provide the necessary
workaround.  My thought was, if I experienced this problem, others might
too.  It may be messy, since you're having to carve up a file and double
your required disk space, but it *works*, and in a situation like that,
*working* may be exactly what you need.

I should also point out the module does not *have* to split the original
file up; it can work from a list of files that are already separate for
whatever reason (autorotated log files come to mind).  Sure, you can
just cat them, but what if their total size is >2GB?  Without large file
support, the perl interpreter will give up after it has read past the
2GB threshold.  This module will prevent that from happening.  Again,
this is a very specific set of circumstances that ideally one would
avoid.  But if you're in such a position, as I was recently, having a
module to give you a helping hand would be a very good thing.
