Need suggestions to improve code


I am working on some data crunching/migration. The data is in a PostgreSQL database.

The table I am reading from has more than 20 million rows.
Processing this data in the normal (linear) way takes more than 2 days :(

I thought of using the Parallel::ForkManager Perl module.  Please help me optimize/improve the following program.

use strict;
use warnings;
use DBI;
use Parallel::ForkManager;

my $MAX = 100;
my $pm  = Parallel::ForkManager->new($MAX);

my $dbh = DBI->connect( "DBI:Pg:dbname=postgres;host=", "postgres", "postgres" )
    or die DBI->errstr;

my $query = $dbh->prepare("select id from secrete_records");
$query->execute();

while ( my $record = $query->fetchrow_hashref() ) {

    my $pid = $pm->start and next;

    # child process: create its own database connection
    my $dbh_internal = DBI->connect( "DBI:Pg:dbname=postgres;host=", "postgres", "postgres" )
        or die DBI->errstr;
    my $query_internal = $dbh_internal->prepare(
        "select id, record_one, record_two from secrete_records where id = ?");
    $query_internal->execute( $record->{id} );

    # ... do some activity (insert, update in secrete_records_archives
    # and other relevant tables in the same database) ...

    $pm->finish;
}
$pm->wait_all_children;






I am creating multiple database connections ($dbh, $dbh_internal). I am not sure if that's required. I tried to use one connection but it throws an error related to SSL :( even though I am not using the database connection in SSL mode.

Are there any other ways to handle such a large amount of data in Perl?

thank you,


Re: Need suggestions to improve code



The speed improvement you can achieve in this way depends on what the
computer (or computers) you are using can actually do in
parallel. For compute-bound parts of the application (according to
the posted code, there aren't any), it can roughly do as many things in
parallel as there are available CPUs/cores, and for I/O-bound parts,
as many as there are available, independent 'I/O processing thingies',
i.e. disks or SSDs. Once you get past either number, more threads of
execution running the code will actually slow things down as they
block each other while competing for resources and the overhead of
switching among them increases.


Which means that, except in relatively rare situations, 100 is
certainly way too many.
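As a rough sketch, you could size the pool from the machine instead of
guessing (this assumes Linux, where the core count can be read from
/proc/cpuinfo; the fallback of 4 workers is arbitrary):

use strict;
use warnings;
use Parallel::ForkManager;

# Count cores from /proc/cpuinfo (Linux-specific assumption);
# fall back to 4 workers if the file cannot be read.
my $cpus = 4;
if ( open my $fh, '<', '/proc/cpuinfo' ) {
    $cpus = grep { /^processor\s*:/ } <$fh>;
    close $fh;
}

my $pm = Parallel::ForkManager->new($cpus);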


Is the database running on another host? If so, you should run your
code there. Otherwise, using AF_UNIX sockets to connect to the server
should be somewhat faster than using a (mock) TCP connection.
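A minimal sketch of that, assuming the server's socket directory is
/var/run/postgresql (check unix_socket_directories in postgresql.conf):
with libpq/DBD::Pg, a host value starting with '/' is taken as a socket
directory rather than a hostname.

use strict;
use warnings;
use DBI;

# Connect over a local AF_UNIX socket instead of TCP.
# /var/run/postgresql is an assumption; adjust to your installation.
my $dbh = DBI->connect(
    "DBI:Pg:dbname=postgres;host=/var/run/postgresql",
    "postgres", "postgres",
    { RaiseError => 1 },
) or die DBI->errstr;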


Another obvious idea would be: move the processing into the database
in order to avoid copying the data back and forth between your
program and the database server(s). Postgres can be programmed in a
variety of languages, Perl being among them.
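For instance, if the per-row activity boils down to copying rows into
the archive table, a single set-based statement keeps all the data on
the server. A sketch, reusing the $dbh handle from your script (the
table and column names are guesses taken from your post; adjust them to
the real schema):

# One set-based INSERT ... SELECT instead of millions of round trips.
$dbh->do(q{
    INSERT INTO secrete_records_archives (id, record_one, record_two)
    SELECT id, record_one, record_two
    FROM   secrete_records
});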

You should also consider tuning the database configuration, in
particular, the 'fsync'-related parts and the various buffer settings
whose defaults were presumably selected for some VAX running 4.3BSD in
1989 or so (this means they're ridiculously low for any even remotely
modern computer).
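Part of that is reachable from the script itself: synchronous_commit
can be relaxed per session, which is reasonable for a migration you can
simply re-run if it gets interrupted. A sketch, again using the
existing $dbh handle:

# Relax commit durability for this session only; fsync and the buffer
# settings themselves have to be changed in postgresql.conf.
$dbh->do("SET synchronous_commit TO off");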

Re: Need suggestions to improve code

On 08/24/13 01:01, SSS Develop wrote:

That is 11 records per second, which seems astonishingly slow.  Why
is it so slow?  Figure that out.  Then either fix it, or use that
knowledge to do the parallelization in an appropriate way.


$MAX of 100 seems awfully high, unless you have remarkable hardware.
"The beatings will continue until morale improves."


You are creating a new connection for every record.  While that
shouldn't take anywhere near 100 ms per connection, and so might not be
the rate-limiting step in your case, it is still going to be pretty
inefficient.  Alas, Parallel::ForkManager doesn't facilitate anything
else, so maybe ForkManager isn't the right thing to use.
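One workaround is to fork over chunks of ids instead of single records,
so each child connects once and then works through its share. A rough
sketch (the chunk size of 10_000 and the worker count of 8 are
arbitrary, and the per-row work is left as a placeholder):

use strict;
use warnings;
use DBI;
use Parallel::ForkManager;

# A handful of workers, sized to cores/disks rather than 100.
my $pm = Parallel::ForkManager->new(8);

my $dbh = DBI->connect( "DBI:Pg:dbname=postgres;host=",
    "postgres", "postgres", { RaiseError => 1 } ) or die DBI->errstr;

# Fetch only the ids up front, then hand each child a whole chunk.
my $ids        = $dbh->selectcol_arrayref("SELECT id FROM secrete_records");
my $chunk_size = 10_000;

while ( my @chunk = splice @$ids, 0, $chunk_size ) {
    $pm->start and next;    # parent: move on to the next chunk

    # child: one connection, reused for the whole chunk
    my $child_dbh = DBI->connect( "DBI:Pg:dbname=postgres;host=",
        "postgres", "postgres", { RaiseError => 1 } ) or die DBI->errstr;
    my $sth = $child_dbh->prepare(
        "SELECT id, record_one, record_two FROM secrete_records WHERE id = ?");

    for my $id (@chunk) {
        $sth->execute($id);
        my $row = $sth->fetchrow_hashref;
        # ... archive/update work for this row goes here ...
    }

    $child_dbh->disconnect;
    $pm->finish;
}
$pm->wait_all_children;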


As posted, the line that builds $query_internal is a syntax error: the
prepare call is missing ($dbh_internal->prepare(...)).

Also, execute does not accept a hash reference; pass the bind values as a list.


2 million is a fairly small number, not a large number.  Figure out  
*why* it is so slow, then speed it up.
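If the answer turns out to be one-row-at-a-time commits, batching the
writes into larger transactions usually helps a lot. A sketch along the
lines of the posted script (the INSERT columns and the batch size of
1000 are placeholders):

use strict;
use warnings;
use DBI;

# AutoCommit => 0: commit in batches instead of once per row.  Each
# commit forces a WAL flush, so per-row commits can dominate the runtime.
my $dbh = DBI->connect( "DBI:Pg:dbname=postgres;host=",
    "postgres", "postgres", { RaiseError => 1, AutoCommit => 0 } )
    or die DBI->errstr;

my $query = $dbh->prepare(
    "SELECT id, record_one, record_two FROM secrete_records");
$query->execute;

# Prepared once, executed per row, committed every 1000 rows.
my $insert = $dbh->prepare(
    "INSERT INTO secrete_records_archives (id, record_one, record_two)
     VALUES (?, ?, ?)");

my $n = 0;
while ( my $row = $query->fetchrow_hashref ) {
    $insert->execute( @{$row}{qw(id record_one record_two)} );
    $dbh->commit unless ++$n % 1000;
}
$dbh->commit;    # final partial batch
$dbh->disconnect;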

