# let's talk a little more about directories

Ok, let's talk a little more about creating directories (or any other
trie-structure entries).
There are three basic methods:

1)    The simple: just create them from top to bottom. This gives us
N disk accesses, always. Silly, but do not underestimate it because it does
not consume CPU.

2)    The smarter: search from bottom to top for the first existing node and
start creating from this point and below; this gives us N only in the
worst case.

3)    The wisest approach: define an array of all the upper nodes,
then perform a binary search on this array against a node-existence
check. This gives us log(N) accesses in the worst case.

So what is the best solution? Unfortunately, there is none…

If your directories “usually” exist, the 2nd method will be faster,
because it will finish within the very first iterations.
If your directories “usually” do not exist, the 3rd method will be
faster, because it will dig faster into the bottomless abyss.

Because the 2nd method is almost trivial, I will write the code only for the
3rd, binary-search method.
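Since the code for the 2nd (bottom-up) method is omitted here, the idea can be sketched in Python for comparison (the function name and details are mine, not from the thread):

```python
import os

def mkdir_bottom_up(path):
    # Method 2: walk upward until an existing ancestor is found,
    # then create the missing directories downward from that point.
    missing = []
    while path and not os.path.isdir(path):
        missing.append(path)
        path = os.path.dirname(path)
    for p in reversed(missing):
        os.mkdir(p)
```

If the whole path already exists, the while loop stops at the very first check, which is exactly why this method wins when directories "usually" exist.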

```perl
mkdir_btree('d1/d2/d3/d4/d5') or die "$^E\n";

sub mkdir_btree
{
    my @array = split /\//, $_[0];
    return 1 if -1 == $#array;
    splice @array, 0, 2, "/$array[1]" if '' eq $array[0];
    my ($begin, $end) = (0, scalar @array);

    # binary search for the deepest existing path prefix
    while ($begin < $end)
    {
        my $cut = int(($begin + $end) / 2);
        -d join('/', @array[0 .. $cut]) ? ($begin = $cut + 1) : ($end = $cut)
    }

    # create everything below the deepest existing prefix
    return 1 unless $begin < @array;
    mkdir(join '/', @array[0 .. $_]) || return 0 for $begin .. $#array;
    1
}
```

Also, here is a small benchmark if you want to test it (it's better to run it
multiple times).

```perl
#!/usr/bin/perl
# Dir creation benchmark
use strict;
use warnings;
use Time::HiRes;
use File::Path;

my $how_many      = 1000;
my $how_deep      = 40;
my $root_test_dir = '/tmp/test_this';

# create the dirs array
print "Creating the test dirs array\n";
my @Test_dirs;
my $start_time = [];
for (1 .. $how_many) {
    my @R = $root_test_dir;
    for (1 .. $how_deep) {
        push @R, sprintf "%s%02d", ('a' .. 'z')[int(rand 26)], int(rand 100);
    }
    push @Test_dirs, join('/', @R);
}

foreach my $code (qw/

    Mkdir_recursive
    mkdir_btree
    File_Path_module

/) {
    system("/bin/rm -rf $root_test_dir") if -d $root_test_dir;
    $start_time = [ Time::HiRes::gettimeofday ];
    print "Testing $code ... ";
    my $sub = __PACKAGE__->can($code) or die "unknown sub $code";
    foreach (@Test_dirs) { $sub->($_) or die "$!" }
    print "finished at ", Time::HiRes::tv_interval($start_time), " sec\n";
}

sub File_Path_module
{
    my $err;
    File::Path::mkpath( $_[0], { 'error' => \$err } );
    @$err ? 0 : 1
}

sub Mkdir_recursive
{
    return 1 if $_[0] eq '' || -d $_[0];
    Mkdir_recursive( $_[0] =~ /^(.*?)\/[^\/]+$/ ) || return undef;
    mkdir($_[0]) || return undef
}

sub mkdir_btree
{
    my @array = split /\//, $_[0];
    return 1 if -1 == $#array;
    splice @array, 0, 2, "/$array[1]" if '' eq $array[0];
    my ($begin, $end) = (0, scalar @array);

    while ($begin < $end)
    {
        my $cut = int(($begin + $end) / 2);
        -d join('/', @array[0 .. $cut]) ? ($begin = $cut + 1) : ($end = $cut)
    }

    return 1 unless $begin < @array;
    mkdir(join '/', @array[0 .. $_]) || return 0 for $begin .. $#array;
    1
}

END {
    print "Clearing test dir: $root_test_dir\n";
    system("/bin/rm -rf $root_test_dir") if -d $root_test_dir;
}
```

## Re: let's talk a little more about directories

"George Mpouras"
writes:

As I already tried to tell you last time: Method 1 is tuned for the case
where most directories have to be created and method 2 for the case
where most directories already exist. Both stat (the system call
behind the -X) and mkdir need to determine information about a
particular directory entry, stat because that's what it is supposed to
do and mkdir because it must not do anything if this directory entry
already exists. This means creating all directories in the path
/a/b/c/d from scratch will first perform four stat calls,

stat("/a/b/c/d");
stat("/a/b/c");
stat("/a/b");
stat("/a");

which will fail with ENOENT, followed by four mkdir calls,

mkdir("/a");
mkdir("/a/b");
mkdir("/a/b/c");
mkdir("/a/b/c/d");

for the 'backward' method while the other will just do the 'mkdirs'.

I don't know how this is on Windows, but a valid UNIX(*) pathname may use
any number of slashes to separate two directories,

[rw@sapphire]~ $perl -e "chdir('//////////////'); system('pwd');"
/
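The point above can be checked quickly; a tiny Python illustration (the split-on-runs-of-slashes trick is the same one suggested for the Perl code elsewhere in the thread):

```python
import re

# Runs of slashes act as a single separator on UNIX, so splitting on
# one or more slashes recovers the same component list either way.
parts = [p for p in re.split(r"/+", "//usr///local//bin") if p]
assert parts == ["usr", "local", "bin"]
```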

## Re: let's talk a little more about directories

-----------------------

I'm afraid that's not compatible with VMS

## Re: let's talk a little more about directories

On 27/7/2013 23:26, gamo@telecable.es wrote:

this one maybe?   split /\/+/, $_[0];
what is VMS?

## Re: let's talk a little more about directories

this one maybe?   split /\/+/, $_[0];
what is VMS?
-----------------------------

I don't think it works. As far as I (badly) remember, directories in VMS are
not a special file; e.g., the command to change directory is SET DEF=
and cd was not available until the apparition of the POSIX-compatible
interface.
And why is it important to be compatible with VMS/OpenVMS? Because it has
some very interesting security features. The last time I saw it working with
an X interface was in a hydroelectric power station, controlling the
opening and closing of water gates.
VMS has other remarkable utilities such as the EDT editor; hence I use the
JED editor emulating EDT, with Perl syntax highlighting, of course.

## Re: let's talk a little more about directories

Quoth gamo@telecable.es:

Just use File::Spec (or one of the interfaces to it). That's what it's
there for.

(VMS has complicated path syntax rules. It supports Unix-style paths as
well as traditional VMS paths, which look like
'device:[dir.dir]file.ext;version'. AIUI it also supports quoting in the
path spec, so it's possible to have (for instance) a directory name with
a dot in it.)

Ben

## Re: let's talk a little more about directories

On 27/7/2013 23:26, gamo@telecable.es wrote:

Are you working on one of these?
http://en.wikipedia.org/wiki/VAX
Unbelievably cool!

## 'portable' filesystem manipulation (was: let's talk a little more about directories)

gamo@telecable.es writes:

Well, so what? Does VMS run on anything but VAX (fossilized) and
Alpha (dead)? Looking in the other direction: a sensible
implementation for VMS also won't be compatible with anything except
'similar systems' (that would be VMS). An idea I recently had
while looking at the File::Path code: given that there really isn't
such a thing as 'distributable pre-compiled Perl code'[*], the
characteristics of the host filesystem are known at compile time, and
this means selecting an implementation which works for a particular
kind of filesystem / operating system should really be done at
compile time and not at runtime with the help of 'generic' switching
code which is, in effect, the least common denominator.

[*] I had a short conversation with the present 'Perl compiler code'
because my boss really didn't like the fact that we were distributing
Perl source code to customers. I asked it "Can you support this?" and

## Re: 'portable' filesystem manipulation (was: let's talk a little more about directories)

Modern versions run on IA64, and the perl port is actively maintained.
See e.g.
http://www.nntp.perl.org/group/perl.daily-build.reports/2013/07/msg146975.html

Ben

## Re: 'portable' filesystem manipulation

That would be something like this

http://www8.hp.com/us/en/products/integrity-servers/product-detail.html?oid=4311905#!tab=features

Should I ever get stinkin' filthy rich (hardly a reason to fear that
this might actually happen), I'll buy one of these and see if I can
turn it into a Linux workstation. Until then, I don't expect any code
written by me to run on such hardware ...

## Re: 'portable' filesystem manipulation

[...]

perl seems a little schizophrenic in this respect, since it both
includes File::Spec, which provides 'portable pathname manipulation' in
an IMHO sensible way[*] by delegating all actual work to a
system-specific implementation module, and File::Basename (used by
File::Path), which employs if - elsif - else cascades to select the
code to execute for a particular system at runtime.

[*] Mostly. It still relies on a hardcoded 'system list' in order to
determine which module to load, uses @ISA for forwarding calls to
'implementation modules' and loads these via require. Ideally, it
should select a module based on 'system name' only (possibly with an
override facility) and the implementation modules should put their
public routines on @EXPORT. The 'entry point module' could then just
use the implementation module in order to put everything in place with

The File::Spec::Functions module provides something similar to this
but it uses

```perl
foreach my $meth (@EXPORT, @EXPORT_OK) {
    my $sub = File::Spec->can($meth);
    no strict 'refs';
    *{$meth} = sub { &$sub('File::Spec', @_) };
}
```

at runtime in order to generate proxy subs. The 'double call' is
necessary to work around the IMHO totally weird decision to use
'method calls' for this in the first place (an explanation why this could
make any sense would be greatly appreciated) ...
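For readers not fluent in Perl globs, the same 'generate proxy functions over a class' pattern looks roughly like this in Python (class and function names here are illustrative only, not a real File::Spec port):

```python
# A stand-in class exposing path helpers as class methods, plus a loop
# that exports them as plain module-level functions -- the analogue of
# assigning closures to the *{$meth} globs in the Perl snippet above.
class FileSpecLike:
    @classmethod
    def catfile(cls, *parts):
        return "/".join(parts)

    @classmethod
    def canonpath(cls, path):
        return path.rstrip("/") or "/"

for _name in ("catfile", "canonpath"):
    globals()[_name] = getattr(FileSpecLike, _name)

assert catfile("a", "b") == "a/b"
assert canonpath("a/b/") == "a/b"
```

Here the bound method already carries the class, so no 'double call' is needed; that double call is essentially the weirdness being complained about.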

## Re: 'portable' filesystem manipulation

Yes, I agree, it's a bit peculiar. However, they both work, and have
been thoroughly tested on many systems over many years, so I suppose
no one thinks it's worth changing them.

What else would you do?

System names are too whimsical to use like that; they need to be
normalised to something sensible. That's all the list of 'known systems'
is doing.

The method implementation is indeed bizarre. I think the only reason it
works like that is that the module was written in the very early days of
Perl OO, before any sort of best practices had emerged, and someone
thought it would be a clever idea. (IMHO the same applies to Exporter,
which used to do the same, apparently, but has been changed.)

OTOH, does it matter? The code's there, it's tested, it works, and it's
reliable. The weirdness can usually just be ignored.

Ben

## Re: 'portable' filesystem manipulation

File::Basename consists of a system-dependent part, namely the regexes
and associated handling code dealing with the actual pathnames, and the
'pure Perl' switching logic. And I doubt that there was ever a perl
port where if - elsif - elsif - else didn't work because of inherent
'portability problems' in this Perl code. Consequently, the 'tested on
many systems' part doesn't really apply to that.

There are two lists of 'known systems', one in Basename.pm and one in
Spec.pm, and they not only differ but even treat the same system in
different ways, at least seemingly. I wouldn't be surprised if there
were some more different 'lists of known systems' (handling seemingly
identical systems in different ways) in the Perl core, and I'm certain
that there will be n more (with n possibly being large) 'different
lists of known systems' (handling ...) all over CPAN. That's an
unmaintainable mess and bound to cause problems (e.g., will someone
fixing a bug in File::Spec always remember or even know that
File::Basename needs to be changed as well, and how much additional
work is it to make a semantically equivalent change to both which is
correct within the constraints of either?).

Also, I wouldn't be surprised if people who actually need 'portable
filename manipulation' usually end up trying n different modules
in turn until they find one which works for their problem and then
forget about the issue; IOW, in the end, bugs won't ever or won't
usually get fixed.

[...]

[...]

It matters to me: I wouldn't mind using 'portable pathname
manipulation code' if Perl offered a sensible implementation. Since it
IMHO doesn't, I'll stick to using 'regexes for UNIX(*)'. That's all I
usually care for, and even if I had to support 'weird filesystems', I
would rather lift the filesystem-specific bits from one of the modules
and (in the most 'complicated' case) integrate them with some
'selection logic' which doesn't unduly 'penalize' my code.

If a library 'wants' to be used, it should offer something better than
'this may save you an hour of work' --- 'Oh my Gawd, how can I get
this done at all!' is not a universal problem.

## Re: 'portable' filesystem manipulation

The inherent problem with the sort of logic File::Basename uses is that
it becomes very hard to maintain, so there is a tendency for the
less-popular branches to suffer bitrot. The fact that, in practice, it
(mostly) works suggests this hasn't happened yet.

I don't think that can be the case in practice, at least not on the
platforms on which perl still builds. There are clearly branches in both
modules for systems like 'AmigaOS' and 'Symbian' which end up behaving
differently, but perl doesn't build on those systems any more.

I agree entirely: the whole thing is a stinking mess once you start
looking inside. I'm not saying this is a good thing, I'm just saying
that there are (reasonable) reasons it ended up that way and reasons it
hasn't been thrown out and replaced. The most important of these, of
course, is that at least File::Spec is a Toolchain Module, meaning the
consequences of breaking it are very high. This tends to lead to
somewhat excessively conservative maintenance.

I don't know of any other portable-filename modules on CPAN; do you?
(Path::Class just uses File::Spec underneath.) I would quite like to see
a reasonable alternative.

You are, I presume, writing code which won't be published, for use on
known systems? I tend to do the same in that situation. basename and
dirname are sometimes more convenient than regexes, but I find the
File::Spec interface (even F:S:Functions, which is an improvement)
generally makes things more confusing rather than less.

The situation is completely different when writing code to be published.
Code on CPAN Should Be Portable; there's no real excuse for not making
it at least as portable as File::Spec.

You can do that, but once you start trying to support 'everything
reasonable' it rapidly gets out of hand.

Ben

## Re: 'portable' filesystem manipulation

There are unresolved problems in File::Spec, e.g.,

<https://rt.cpan.org/Public/Bug/Display.html?id=75319>

--
Shmuel (Seymour J.) Metz, SysProg and JOAT  <http://patriot.net/~shmuel>


## Re: let's talk a little more about directories

"George Mpouras"
writes:

[...]

[...]

The 'log(N) worst case' claim is not true: this algorithm has a constant
overhead of log(N) stat calls in every case (this could be improved
somewhat by using mkdir in the loop). This means it is worse than
'create from the top' when all directories have to be created, and it will
become worse than 'create from the bottom' if the number m of
subdirectories which have to be created at depth n is large enough
that (m - 1) * log(n) stat calls amount to more work than what was avoided
when creating the first subdirectory at this depth. The work saved
when creating the first directory is (based on mkdir)

2n - 1 - n - log(n)

This must be larger than (m - 1) * log(n) for the binary search to be
beneficial. Putting this together,

2n - 1 - n - log(n) > (m - 1) * log(n)

this ends up as (barring arithmetic errors)

(n - 1)/log(n) > m

The /usr tree on the machine I used as example has a maximum depth of
12. Using this program,

```perl
sub log2
{
    return log($_[0]) / log(2);
}

sub threshold
{
    return ($_[0] - 1) / log2($_[0]);
}

print($_, "\t", threshold($_), "\n") for 2 .. 12;
```

yields the following threshold values:

2    1
3    1.26185950714291
4    1.5
5    1.72270623229357
6    1.93426403617271
7    2.13724312264813
8    2.33333333333333
9    2.52371901428583
10    2.70926996097583
11    2.89064826317888
12    3.06837240216243

(as soon as a directory at depth [first column] has more
subdirectories than [second column], the algorithm sucks).

I've used find /usr -type d to generate a list of directories, edited
the first line to remove a trailing slash and then used the following
program

```perl
use List::Util qw(sum);

@ds = map { chomp; $_ } <STDIN>;

for (@ds) {
    /(.*?\/)[^\/]*$/ and $path = $1;
    # print($_, "\t", $path, "\n");

    ++$dirs{$path};
}

for (keys(%dirs)) {
    $depth = tr/\///;
    push(@{$counts[$depth]}, $dirs{$_});
}

for (1 .. $#counts) {
    print($_, "\t", scalar(@{$counts[$_]}), "\t", sum(@{$counts[$_]}) / @{$counts[$_]}, "\n") if @{$counts[$_]};
}
```

to calculate the average number of subdirectories at each depth.
The result is (depth, number of times encountered, avg)

1    1    1
2    1    10
3    6    82.3333333333333
4    228    10.0833333333333
5    673    2.93610698365527
6    521    8.13435700575816
7    556    2.42446043165468
8    289    1.96193771626298
9    136    2.63970588235294
10    85    3.23529411764706
11    56    3.17857142857143
12    5    5

In other words (if the results are at least mostly correct :-), this
idea sucks dead hippopotamuses through nanotubes in the real world.

## Re: let's talk a little more about directories

On 27/7/2013 15:25, Rainer Weikusat wrote:

Do not forget that computers do not know whether the directories have to be
created. So it finds out faster whether they have to be.

This idea does not suck. When deep directories usually do not exist, it is
the fastest one.

## Re: let's talk a little more about directories

The binary search is the fastest method to access a specific record in an
unknown data structure; that is why it is often used by databases. But as
I said, there is not one absolute answer. The following factors must be
considered before deciding what to do on a specific machine:

* what is the time cost of a CPU cycle
* what is the time cost to ask for a node's existence
* what is the time cost to create a node
* what is the statistical behaviour of the application
* what is the current status of your data

So the problem is not so deterministic after all.

## Re: let's talk a little more about directories

George Mpouras wrote:
) Ok, let's talk a little more about creating directories (or any other
) trie-structure entries).
) There are three basic methods:
)
) 1)    The simple: just create them from top to bottom. This gives us
) N disk accesses, always. Silly, but do not underestimate it because it does
) not consume CPU.
)
) 2)    The smarter: search from bottom to top for the first existing node and
) start creating from this point and below; this gives us N only in the
) worst case.
)
) 3)    The wisest approach: define an array of all the upper nodes,
) then perform a binary search on this array against a node-existence
) check. This gives us log(N) accesses in the worst case.

Have you considered the possibility that it takes O(N) disk accesses
to check whether a directory (a1/a2/.../aN) exists?
The OS will have to open a1, find a2, open a2, find a3, etc.
However, if you already have a handle to directory a2,
then checking or creating a3 is a single step.

If that is the case, method 1 will always be fastest.
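The 'handle to a2' remark above is exactly what the *at-style system calls provide. A rough Python sketch, assuming a platform where os.mkdir and os.open accept dir_fd (e.g. Linux):

```python
import os
import tempfile

# Keep an open fd to the parent directory; every mkdir/open below is
# then a single step relative to that handle (mkdirat/openat under the
# hood), instead of a full path walk from the root each time.
root = tempfile.mkdtemp()
fd = os.open(root, os.O_RDONLY | os.O_DIRECTORY)
for name in ("a1", "a2", "a3"):
    os.mkdir(name, dir_fd=fd)
    nfd = os.open(name, os.O_RDONLY | os.O_DIRECTORY, dir_fd=fd)
    os.close(fd)
    fd = nfd
os.close(fd)
assert os.path.isdir(os.path.join(root, "a1", "a2", "a3"))
```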

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !

## Re: let's talk a little more about directories

The best method is to use File::Path unless you have thoroughly
documented that you need to optimize directory creation because
File::Path isn't fast enough.  Using File::Path optimizes for
programmer time and portability.
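The 'just use the portable library call' advice translates directly to other languages as well; for comparison, the Python stdlib counterpart of File::Path::make_path is a single call:

```python
import os
import tempfile

# os.makedirs creates all missing intermediate directories; with
# exist_ok=True it is also a no-op when the path is already there.
root = tempfile.mkdtemp()
target = os.path.join(root, "a", "b", "c")
os.makedirs(target, exist_ok=True)
os.makedirs(target, exist_ok=True)  # second call: no error
assert os.path.isdir(target)
```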

If you really believe that your methods are superior, you should write
them up as a patch to File::Path, so that others can thoroughly test
your code and include them in the module if they work properly.

--keith

--
kkeller-usenet@wombat.san-francisco.ca.us
(try just my userid to email me)
AOLSFAQ=http://www.therockgarden.ca/aolsfaq.txt
see X- headers for PGP signature information