Very Sluggish Code



I was hoping someone could help me shed some light on this.

I have a Perl script, whose contents I will post below. It seems to get
'stuck' and I have to do an actual kill of the process.

I ran it using the Perl debugger (perl -d <script>) and it seems to slow
right down when reading back the array.

However, it does work: it does populate the database from the XML files as
it is supposed to, so that much I know.

So, I would much appreciate any assistance you can offer.


#use strict;
use DBI;
use XML::XPath;
use XML::XPath::XMLParser;

# Set the input dir where all of the XML files live
my $input_dir = ".";

# Begin reading the directory
opendir(DIR, $input_dir) || die "sorry shit the bed $input_dir: $!";

# Read them into an array
my @files_in_dir = grep { /xml/ } readdir(DIR);
closedir DIR;

# connect to database and create parser object
my $dbh = DBI->connect ("DBI:mysql:can_us_ratecenters",
                           "xxxxx", "xxxxxxx",
                           { RaiseError => 1, PrintError => 0});

# clear the database - new DIDs coming in
$dbh->do ('TRUNCATE TABLE rc_city_town');

#Now the fun begins - read each file and put it in the database
foreach my $f (@files_in_dir) {
        open IN, "<$f";

        my $xp = XML::XPath->new (filename => "./$f");
        my $nodelist = $xp->find ("//row");
        foreach my $row ($nodelist->get_nodelist ()) {
                $dbh->do (
                    "INSERT IGNORE INTO rc_city_town (state_prov, city_town, did_number)
VALUES (?,?,?)",
                    undef,    # no statement attributes; bind values follow
                    $row->find ("state")->string_value (),
                    $row->find ("ratecenter")->string_value (),
                    $row->find ("number")->string_value (),
                );
        }
        close IN;
}

$dbh->disconnect ();


Re: Very Sluggish Code

> It seems to get 'stuck' and I have to do an actual kill of the
> process. I ran it using the Perl debugger and it seems to slow right
> down when reading back the array.

Is the process using a huge amount of memory at that point? Is the
system swapping?

> #use strict;

Why is this commented out? Where is 'use warnings'?

> opendir(DIR, $input_dir) || die "sorry shit the bed $input_dir: $!";

Keep your filehandles (and dirhandles) in real variables:

    opendir(my $DIR, $input_dir) || die ...;

> my @files_in_dir = grep { /xml/ } readdir(DIR);

Is this the line which goes slowly? Are there a huge number of files in
that directory (whether they match or not)? Is the directory mounted
over the network? What filesystem are you using?

This line reads the entire directory into a list in memory, then
iterates over the list matching the pattern and builds another list of
the results. You can make it more efficient by reading the entries one
at a time; replace the for loop below with

    while (my $f = readdir $DIR) {
        $f =~ /xml/ or next;

        # as before
    }

> open IN, "<$f";

Use a variable for the filehandle, as above. Also use 3-arg open, and
check the return value.

    open my $IN, "<", $f or die ...;

> open IN, "<$f";
>
> my $xp = XML::XPath->new (filename => "./$f");

You open the file and then do nothing with the filehandle.
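If you really want the handle, XML::XPath can take it directly via its
ioref option; a minimal sketch (untested against your data):

    open my $IN, "<", $f or die "cannot open $f: $!";
    my $xp = XML::XPath->new (ioref => $IN);

Otherwise just drop the open()/close() pair and keep the filename form.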

> $dbh->do (
>     "INSERT IGNORE INTO rc_city_town (state_prov, city_town, did_number)
> VALUES (?,?,?)",
> [...]

This will in principle go faster if you use ->prepare and ->execute
rather than ->do. I don't actually know whether DBD::mysql supports
server-side prepared statements, so it's possible this will make no
difference in practice.
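Roughly, that would look like this here: prepare the statement once,
before the file loop, then just bind and execute per row.

    my $sth = $dbh->prepare (
        "INSERT IGNORE INTO rc_city_town (state_prov, city_town, did_number)
         VALUES (?,?,?)");

    foreach my $row ($nodelist->get_nodelist ()) {
        $sth->execute (
            $row->find ("state")->string_value (),
            $row->find ("ratecenter")->string_value (),
            $row->find ("number")->string_value (),
        );
    }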


3-arg open - was Re: Very Sluggish Code


> Use a variable for the filehandle, as above. Also use 3-arg open, and
> check the return value.
>
>     open my $IN, "<", $f or die ...;

Hi Ben, that's the second time I have seen you advocate 3-arg open. I
think I now understand using a variable for the filehandle but I fail
to see the difference between "<foo" and "<", "foo".

Dave Saville

Re: 3-arg open - was Re: Very Sluggish Code

> I think I now understand using a variable for the filehandle but I fail
> to see the difference between "<foo" and "<", "foo".

In that case there is none. In general, though, 3-arg is safer; suppose
you write

    open my $FOO, "<$foo" or die ...;

and $foo happens to contain " foo". In this case 2-arg open will swallow
the space and open the wrong file.
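A quick demonstration (hypothetical filenames):

    my $foo = " foo";
    open my $FH1, "<$foo" or die;      # 2-arg: whitespace stripped, opens "foo"
    open my $FH2, "<", $foo or die;    # 3-arg: opens the file literally named " foo"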


Re: 3-arg open - was Re: Very Sluggish Code


> In that case there is none. In general, though, 3-arg is safer [...]

Logically, the open mode and the pathname to open are two different
things, and packing them into a single string argument which either the
compiler or the runtime environment then needs to take apart again by
parsing was not a good idea: the 'open mode' characters Perl uses are
perfectly valid, although in practice somewhat 'rare', filename
characters as well.

Somewhat contrived example:

sub open_for_reading($)
{
    my $fh;

    open($fh, '<'.$_[0]) // die("open: $_[0]: $!");
    return $fh;
}

my $fh0;

$fh0 = open_for_reading('&STDIN');

This won't try to open the file named &STDIN for reading but will dup
the STDIN file handle instead. Given a situation like this one, it is
also somewhat stupid to first execute code to concatenate name and
mode and then execute code to take them apart again.
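With 3-arg open the mode is a separate argument, so nothing is reparsed,
and if you want the dup you have to ask for it explicitly; a sketch:

    open(my $fh, '<', '&STDIN') or die;     # opens a file literally named &STDIN
    open(my $dup, '<&', \*STDIN) or die;    # dups STDIN, explicitly requested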

Re: 3-arg open - was Re: Very Sluggish Code


> Logically, the open mode and the pathname to open are two different
> things, and packing them into a single string argument [...]

Additional remark: very likely, this idea came from the UNIX (Bourne)
shell, which does the same thing (and mostly uses the same 'open mode'
characters). But while the shell provides a very useful programming
language, being able to use it without Extreme Pain[tm] relies on users
of the system exercising 'wise restraint' when creating named objects in
the file system, i.e., deliberately avoiding names the parser of the
shell might trip over. If this cannot be guaranteed, e.g., because these
users either have really no idea how this parser works or might even be
trying to trick scripts into performing unintended operations with the
help of suitable filename arguments, every filesystem name needs to be
quoted before use, and this 'quoting' is something people either don't
bother to do or get wrong (and even when they get it right, it is
nothing but code working around the limitations of some parser). Where
it is easily possible to avoid the issue altogether, as with 3-arg open
in Perl, it is much better to do so, at least for software that is
supposed to be generally useful/usable.

Re: Very Sluggish Code

> Is the process using a huge amount of memory at that point? Is the
> system swapping?

Good questions.

And questions like these (also: Is the code CPU-bound or disk-bound) can
usually be answered without looking at the code, and can guide you in
the right direction.

But when optimizing some specific code, the most important question is:

Where is it spending most of its time?

You can use a profiler for this (Devel::NYTProf is quite good), but for
a first estimate a few well-placed print statements logging the elapsed
time since program start are often sufficient (and they distort
run-times much less than a profiler).
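For instance (a sketch; $^T is the epoch time at which perl started the
script):

    use Time::HiRes 'time';

    # prints elapsed wall-clock seconds at a checkpoint
    printf STDERR "%8.3fs  about to parse %s\n", time() - $^T, $f;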

> This will in principle go faster if you use ->prepare and ->execute
> rather than ->do.

I don't know about mysql (recent versions do support server-side
prepared statements, and DBD::mysql supports them, but I haven't run any
benchmarks), but IME (mostly with Oracle) using prepare/execute is
noticeably but not spectacularly faster. This is not surprising:
preparing an insert is a simple and fast operation (unless you insert
into a view joining a dozen tables), so there is not much time to be
saved. Mostly you save the round-trip time to the database server.

Using array inserts OTOH can result in a massive speedup on databases
which support it. I can't find my old benchmark results (for Oracle) at
the moment, but I think I've seen a speedup of about 2 orders of
magnitude for some workloads. Unfortunately, using multi-row inserts on
MySQL ("insert into ... values(...) values(...) values(...) ...") didn't
gain much last time I tried it.
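For reference, the DBI interface for array inserts looks roughly like
this (a sketch; the @states etc. arrays are hypothetical, and DBI falls
back to a plain loop if the driver has no native support):

    my $sth = $dbh->prepare (
        "INSERT IGNORE INTO rc_city_town (state_prov, city_town, did_number)
         VALUES (?,?,?)");

    $sth->bind_param_array (1, \@states);
    $sth->bind_param_array (2, \@ratecenters);
    $sth->bind_param_array (3, \@numbers);
    $sth->execute_array ({ ArrayTupleStatus => \my @status });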


-- 
Peter J. Holzer | Sysadmin WSR
"Deprecating human carelessness and ignorance has no successful track
record."  -- Bill Cole
