# convenient module to take statistics for hashed structures? - Page 2

## Re: convenient module to take statistics for hashed structures?

Dear George,

First of all, I really appreciate your accommodating character. Going by
your example, here is what I expect. Given a threshold of 100% for ID=01,
the lookup checks H first, then G, F and E, all of which fail, so D=9
should be returned. Given a threshold of 70%, H fails (only 50% for the two
3's) but G=2 (three 2's, 75% > 70%) should be returned. The same threshold
will be used for all of the million rows, which may contain up to 100k unique
IDs. So in this case, a 70% threshold will make the analysis for ID=02 return
G=2 as well.

## Re: convenient module to take statistics for hashed structures?

# This one meets your requirements
# It can handle even more data than the previous versions

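my @col;    # column names from the header line (everything after ID)
my %data;   # per-ID statistics, filled by the block below

{
    while (<DATA>) {
        chomp;
        my @a = split /\s*\|\s*/, $_, -1;

        # the first line is the header
        if (-1 == $#col) { push @col, @a[1..$#a]; next }

        unless (1+$#col == $#a) {
            warn "Skip line number $. \"$_\" because it has "
                . (1+$#a) . " fields, while it should have " . (2+$#col) . "\n";
            next;
        }

        # $data{ID}->[0]: number of rows seen for this ID
        $data{$a[0]}->[0]++;

        # $data{ID}->[1]->[column]->[0]->{value}: how often the value occurs
        for (my $i=1; $i<=$#a; $i++) { $data{$a[0]}->[1]->[$i-1]->[0]->{$a[$i]}++ }
    }

    foreach my $id (keys %data)
    {
        foreach my $f ( @{$data{$id}->[1]} )
        {
            # group the values of this column by their integer percentage
            foreach my $v ( keys %{$f->[0]} )
            {
                push @{ $f->[1]->{int 100*( $f->[0]->{$v}/$data{$id}->[0])} }, $v
            }

            # remove unnecessary structures
            $f = $f->[1]
        }

        # remove unnecessary structures
        $data{$id} = $data{$id}->[1]
    }

    #use Data::Dumper; print Dumper(\%data);exit;
}

# query only after the block above has filled %data
$_ = query('01',75);
print "Field=$_->[0],Value=@{$_->[1]}\n";

sub query
{
    # walk the columns from right to left
    for(my $i=$#{$data{$_[0]}}; $i>=0; $i--)
    {
        foreach my $RANK (keys %{$data{$_[0]}->[$i]})
        {
            return [$col[$i], $data{$_[0]}->[$i]->{$RANK}] if $RANK >= $_[1]
        }
    }

    ['',[]]
}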

__DATA__
ID|B|C|D|E|F|G|H
01|3|7|9|3|4|2|3
01|3|7|9|3|4|2|2
01|3|7|9|5|8|6|6
01|3|7|9|3|4|2|3
02|4|7|9|3|4|2|1
02|4|7|9|3|4|2|2
02|4|7|9|3|4|2|3
02|4|7|9|3|4|2|3

## Re: convenient module to take statistics for hashed structures?

I'm sorry to report that when I set the threshold below 50, e.g. to 10,
the result is

H=2,6;

Although this result passes the threshold, I only need the most abundant
value, i.e. 3 (50% abundant), which is not reported. Also, how can the
statement

foreach my $RANK (keys %{$data{$_[0]}->[$i]})
{
return [$col[$i], $data{$_[0]}->[$i]->{$RANK}] if $RANK >= $_[1]
}

return two values? As soon as one "$RANK" fulfills the requirement, I
would expect the subroutine to return....

## Re: convenient module to take statistics for hashed structures?

# You only have to change one line for this behavior
#
#  foreach my $RANK (sort {$b <=> $a} keys %{$data{$_[0]}->[$i]})
#
# All together again is

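my @col;    # column names from the header line (everything after ID)
my %data;   # per-ID statistics, filled by the block below

{
    while (<DATA>) {
        chomp;
        my @a = split /\s*\|\s*/, $_, -1;

        # the first line is the header
        if (-1 == $#col) { push @col, @a[1..$#a]; next }

        unless (1+$#col == $#a) {
            warn "Skip line number $. \"$_\" because it has "
                . (1+$#a) . " fields, while it should have " . (2+$#col) . "\n";
            next;
        }

        # $data{ID}->[0]: number of rows seen for this ID
        $data{$a[0]}->[0]++;

        # $data{ID}->[1]->[column]->[0]->{value}: how often the value occurs
        for (my $i=1; $i<=$#a; $i++) { $data{$a[0]}->[1]->[$i-1]->[0]->{$a[$i]}++ }
    }

    foreach my $id (keys %data)
    {
        foreach my $f ( @{$data{$id}->[1]} )
        {
            # group the values of this column by their integer percentage
            foreach my $v ( keys %{$f->[0]} )
            {
                push @{ $f->[1]->{int 100*( $f->[0]->{$v}/$data{$id}->[0])} }, $v
            }

            # remove unnecessary structures
            $f = $f->[1]
        }

        # remove unnecessary structures
        $data{$id} = $data{$id}->[1]
    }

    #use Data::Dumper; print Dumper(\%data);exit;
}

# query only after the block above has filled %data
$_ = query('01',10);
print "Field=$_->[0],Value=@{$_->[1]}\n";

sub query
{
    # walk the columns from right to left
    for(my $i=$#{$data{$_[0]}}; $i>=0; $i--)
    {
        # try the highest rank first, so the most abundant value wins
        foreach my $RANK (sort {$b <=> $a} keys %{$data{$_[0]}->[$i]})
        {
            return [$col[$i], $data{$_[0]}->[$i]->{$RANK}] if $RANK >= $_[1]
        }
    }

    ['',[]]
}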

__DATA__
ID|B|C|D|E|F|G|H
01|3|7|9|3|4|2|3
01|3|7|9|3|4|2|2
01|3|7|9|5|8|6|6
01|3|7|9|3|4|2|3
02|4|7|9|3|4|2|1
02|4|7|9|3|4|2|2
02|4|7|9|3|4|2|3
02|4|7|9|3|4|2|3
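
To illustrate why the unsorted loop could report `H=2,6`, here is a minimal sketch for column H of ID=01 only. It assumes the rank buckets work as in the script above (integer percentage mapped to a list of values); `best_bucket` is a hypothetical helper written for this example, not part of the original script. The values 2 and 6 each occur once in four rows, so they share the 25% bucket, and a single `return` hands back that whole bucket.

```perl
use strict;
use warnings;

# Column H for ID=01, copied from the sample data above.
my @h = (3, 2, 6, 3);

# Count each value, then group the values by integer percentage,
# mirroring the rank buckets of the script above.
my (%count, %bucket);
$count{$_}++ for @h;
for my $value (sort { $a <=> $b } keys %count) {
    my $rank = int(100 * $count{$value} / scalar @h);
    push @{ $bucket{$rank} }, $value;
}

# Hypothetical helper: return the values of the highest rank
# that reaches the threshold, or an empty list reference.
sub best_bucket {
    my ($bucket, $threshold) = @_;
    for my $rank (sort { $b <=> $a } keys %$bucket) {
        return $bucket->{$rank} if $rank >= $threshold;
    }
    return [];
}

for my $rank (sort { $b <=> $a } keys %bucket) {
    print "rank $rank: @{ $bucket{$rank} }\n";
}

my $best = best_bucket(\%bucket, 10);
print "best at 10: @$best\n";
```

Without the descending sort, the loop may meet rank 25 before rank 50, depending on hash order, and both ranks pass a threshold of 10.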

## Re: convenient module to take statistics for hashed structures?

You are already specifying the sort order, so why specify it
in the opposite order from what you want and then reverse it
to get the order that you really wanted?

You should just specify the order that you really want in the
first place:

foreach my $field (sort {$b<=>$a} keys %col) {
                         ^^   ^^
                         ^^   ^^
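
A small sketch of this point, using made-up rank values (they are assumptions, not data from the thread): asking `sort` for descending order directly should give the same list as sorting ascending and then reversing it.

```perl
use strict;
use warnings;

my @ranks = (25, 50, 75, 100);   # illustrative values only

# Roundabout: sort ascending, then reverse the result.
my @roundabout = reverse sort { $a <=> $b } @ranks;

# Direct: swap $a and $b in the comparison to sort descending.
my @direct = sort { $b <=> $a } @ranks;

print "@roundabout\n";
print "@direct\n";
```

Both lines print the ranks from highest to lowest; the direct form simply says what is wanted.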

--
email: perl -le "print scalar reverse qq/moc.liamg0cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.

## Re: convenient module to take statistics for hashed structures?

Uncomment the line
#use Data::Dumper; print Dumper(\%data);exit;
and you will understand the underlying logic on your own.
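
For readers without the script at hand, here is a hedged sketch of what such a dump shows. The structure is hand-built from the sample rows for ID=01, assuming the layout the script keeps after its clean-up step: one hash per column, mapping a percentage to the values that reached it. The order of tied values (2 and 6 in column H) may differ in a real run.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # print hash keys in a stable order
$Data::Dumper::Indent   = 1;   # compact indentation

# Hand-built from the sample data; an assumption about the final layout.
my %data = (
    '01' => [
        { 100 => [3] },                 # B
        { 100 => [7] },                 # C
        { 100 => [9] },                 # D
        { 75 => [3], 25 => [5] },       # E
        { 75 => [4], 25 => [8] },       # F
        { 75 => [2], 25 => [6] },       # G
        { 50 => [3], 25 => [2, 6] },    # H
    ],
);

# Dump only column H (index 6) to keep the output short.
my $dump = Dumper($data{'01'}->[6]);
print $dump;
```

Reading the dump from the last column backwards mirrors what `query` does: it returns the first column that has a rank at or above the threshold.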

## Re: convenient module to take statistics for hashed structures?

The value in the %col hash indexed by the "1" key.

No, a "hash slice".

An "array slice".

See the "Slices" section in

perldoc perldata

an anonymous array that contains 2 elements, an (empty) string and
a reference to another (empty) anonymous array.
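
The terms used in this reply can be shown side by side in a short sketch. The contents of `%col` and `@a` below are made up for illustration and are not taken from the thread.

```perl
use strict;
use warnings;

my %col = (1 => 'B', 2 => 'C', 3 => 'D');   # made-up contents
my @a   = ('01', 3, 7, 9);                  # made-up row

my $element     = $col{1};         # hash element: the value under key "1"
my @hash_slice  = @col{1, 2};      # hash slice: several values at once
my @array_slice = @a[1 .. $#a];    # array slice: everything after the first element

# An anonymous array with 2 elements: an empty string and a
# reference to another, empty anonymous array.
my $empty = ['', []];

print "$element\n";
print "@hash_slice\n";
print "@array_slice\n";
print scalar(@$empty), " elements, inner array holds ",
      scalar(@{ $empty->[1] }), "\n";
```

The sigil tells them apart: `$` yields a single value, while `@` in front of a subscript yields a list.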

--