convenient module to take statistics for hashed structures?


__DATA__
ID    B    C    D    E    F    G    H
1    3    7    9    3    4    2    3
1    3    7    9    3    4    2    2
1    3    7    9    5    8    6    6
1    3    7    9    3    4    2    3
2    4    7    9    3    4    2    1
2    4    7    9    3    4    2    2
2    4    7    9    3    4    2    3
2    4    7    9    3    4    2    3

For each ID (the example above has two, 1 and 2), I want to identify the
"last common ancestor" (LCA) based on a defined threshold, with H taking
precedence over B. If the threshold is set to 100%, then the LCA of ID 1 is
D=9; if it is set to 75%, then it is F=2. For ID 2: 100%, G=2; 75%, G=2;
50%, H=3.

While a hash makes it easy to keep separate counts for different values, I
don't know whether Perl offers a simple way to get the max/min. For the
following code,

@array

For each row
for ($i=0; $i<$numcol; $i++)
$array[$i]++;

if just for B, I know the following should be written in this way:

foreach $key (keys %B) {     $Bpertpot{$key} = $B{$key}/$total;    }

for ($i=0; $i<$numcol; $i++) {
$maxcol[$i] = 0;
foreach $key (keys %Bpertpot) { if ($Bpertpot{$key} > $maxcol[$i]) {
$maxcol[$i] = $Bpertpot{$key};    }
}
}

but then I don't know how to do that when traversing an array of hashes,
i.e. how to replace %B and %Bpertpot with something that is compatible with
the array structure. In fact, I wonder whether there are already
well-established modules that handle this kind of max/min statistics
problem, which seems to come up frequently in the business sector.

Re: convenient module to take statistics for hashed structures?


Have a look at List::Util, which is a core module. It has max and first
functions that you will find useful.

Any book on Perl will explain how to create and use a hash of arrays or an
array of arrays.
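
As a rough illustration of that advice, here is a minimal sketch that keeps the column H values of the sample data in a hash of arrays and uses max and first from List::Util. The names %h_by_id and top_count are made up for this example and do not come from the thread.

```perl
use strict;
use warnings;
use List::Util qw(max first);

# Hash of arrays: each ID maps to the values seen in one column
# (here column H of the sample data).
my %h_by_id = (
    1 => [3, 2, 6, 3],
    2 => [1, 2, 3, 3],
);

# Count how often each value occurs, then return the largest count.
sub top_count {
    my ($values) = @_;
    my %count;
    $count{$_}++ for @$values;
    return max(values %count);
}

for my $id (sort keys %h_by_id) {
    my $n   = scalar @{ $h_by_id{$id} };
    my $top = top_count($h_by_id{$id});
    printf "ID %s: most frequent value covers %d%% of %d rows\n",
        $id, 100 * $top / $n, $n;
}

# first() returns the first element for which the block is true.
my $first_big = first { $_ > 2 } @{ $h_by_id{2} };
print "first value above 2 for ID 2: $first_big\n";
```

For both IDs the most frequent H value covers 50% of the rows, which is why H only qualifies at a 50% threshold.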

Re: convenient module to take statistics for hashed structures?

> set it to 75%, then it is F=2
I do not see any 2 in the F column. I have trouble understanding what you
mean and what you want.

Re: convenient module to take statistics for hashed structures?

Thanks for correcting the mistake. It is G=2 (2,2,6,2, so it fulfils the 75%
requirement) and not F=2. Always check H (the last column) first. The
majority value in H is 3, but it accounts for only 50% of the rows, so move
up one column at a time (G, F, E, ...). The same process has to be repeated
for each ID, and the number of rows per ID is not known beforehand.
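
The procedure described above can be sketched roughly as follows. This is only an illustration under two assumptions: the rows are already grouped by ID in memory, and "fulfilling" a threshold means the share of the most frequent value is greater than or equal to it. The names %rows, @names and lca are made up for the example.

```perl
use strict;
use warnings;

# Rows grouped by ID; each row holds the values of columns B..H.
my @names = qw(B C D E F G H);
my %rows = (
    1 => [ [3,7,9,3,4,2,3], [3,7,9,3,4,2,2], [3,7,9,5,8,6,6], [3,7,9,3,4,2,3] ],
    2 => [ [4,7,9,3,4,2,1], [4,7,9,3,4,2,2], [4,7,9,3,4,2,3], [4,7,9,3,4,2,3] ],
);

# Walk from the last column back to the first and return the first
# column whose most frequent value reaches the threshold (in percent).
sub lca {
    my ($id, $threshold) = @_;
    my @r = @{ $rows{$id} };
    for my $c (reverse 0 .. $#names) {
        my %count;
        $count{ $_->[$c] }++ for @r;
        # most frequent value in this column
        my ($best) = sort { $count{$b} <=> $count{$a} } keys %count;
        return "$names[$c]=$best"
            if 100 * $count{$best} / scalar(@r) >= $threshold;
    }
    return '';    # no column qualifies
}

print "ID 1, 100%: ", lca(1, 100), "\n";   # D=9
print "ID 1,  75%: ", lca(1, 75),  "\n";   # G=2
print "ID 2, 100%: ", lca(2, 100), "\n";   # G=2
print "ID 2,  75%: ", lca(2, 75),  "\n";   # G=2
print "ID 2,  50%: ", lca(2, 50),  "\n";   # H=3
```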

Re: convenient module to take statistics for hashed structures?

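#!/usr/bin/perl
#
# OK, here is your homework.
# Next time try not to cheat, because even if
# you pass the lesson, you will not learn!
#
# The hash subscripts were lost when this post was archived. The keys
# "lines" and "data" follow the author's explanation further down;
# "count" and "rank" are reconstructed names.

use strict;
use warnings;

my %col;     # column number => column name
my %data;    # $data{$id}{lines}                        rows seen for this ID
             # $data{$id}{data}{$name}{count}{$value}   rows holding $value
             # $data{$id}{data}{$name}{rank}{$percent}  values with that share

ReadData();

$_ = query(1,100);
print "id=1, thr=100% -> Field=$_->[0],Value=@{$_->[1]}\n";

$_ = query(1,75);
print "id=1, thr=75% -> Field=$_->[0],Value=@{$_->[1]}\n";

$_ = query(2,100);
print "id=2, thr=100% -> Field=$_->[0],Value=@{$_->[1]}\n";

$_ = query(2,75);
print "id=2, thr=75% -> Field=$_->[0],Value=@{$_->[1]}\n";

$_ = query(2,50);
print "id=2, thr=50% -> Field=$_->[0],Value=@{$_->[1]}\n";

$_ = query(2,25);
print "id=2, thr=25% -> Field=$_->[0],Value=@{$_->[1]}\n";

sub ReadData {
    while (<DATA>) {
        chomp;
        my @a = split /\s+/;
        # first line: store the column names with a hash slice
        unless (exists $col{1}) { @col{1..$#a} = @a[1..$#a]; next }
        ++$data{$a[0]}->{lines};
        for (my $i=1; $i<=$#a; $i++) {
            ++$data{$a[0]}->{data}->{$col{$i}}->{count}->{$a[$i]};
        }
    }
    # group the values of every field by their share (in percent) of the rows
    foreach my $id (keys %data) {
        foreach my $field (keys %{ $data{$id}->{data} }) {
            foreach my $item (keys %{ $data{$id}->{data}->{$field}->{count} }) {
                push @{ $data{$id}->{data}->{$field}->{rank}->{ 100*(
                    $data{$id}->{data}->{$field}->{count}->{$item} / $data{$id}->{lines} ) } },
                    $item;
            }
        }
    }
    #use Data::Dumper; print Dumper(\%data);exit;
}

sub query {
    my ($id,$rank)=@_;
    foreach my $field (reverse sort keys %col) {
        if ( exists $data{$id}->{data}->{$col{$field}}->{rank}->{ $rank } ) {
            return [ $col{$field},
                     $data{$id}->{data}->{$col{$field}}->{rank}->{$rank} ] }
    }
    ['',[]]
}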

__DATA__
ID   B    C    D    E    F    G    H
1    3    7    9    3    4    2    3
1    3    7    9    3    4    2    2
1    3    7    9    5    8    6    6
1    3    7    9    3    4    2    3
2    4    7    9    3    4    2    1
2    4    7    9    3    4    2    2
2    4    7    9    3    4    2    3
2    4    7    9    3    4    2    3

Re: convenient module to take statistics for hashed structures?

doing...

$col{1} #what does 1 refer to?
@col{1..$#a} #array of hash?
@a[1..$#a] #array of what?

++$data{$a[0]}->{lines}; #hash of hash? and an arbitrary name "lines" is
given?
++$data{$a[0]}->{data}->{$col{$i}}->{count}->{$a[$i]} } }   #oh, this line
is really... hard to know why arrow can be used again and again....

push @{ $data{$id}->{data}->{$field}->{rank}->{ 100*(
$data{$id}->{data}->{$field}->{count}->{$item} / $data{$id}->{lines} ) } } ,
$item}}}    #what advantage of using push here?

['',[]]    #what is this...?!

Re: convenient module to take statistics for hashed structures?

Sorry for the silly joke in the comment.

> $col{1} #what does 1 refer to?
This is just a check to see whether we are reading the first line, the one
with the column names.

> $#a
is the index of the last item of the array @a. The last element itself can
be written as
$array[ -1 + scalar @array ]
$array[-1]

> @col{1..$#a} #array of hash?
This is called hash slice; used to create a hash from an array
my @array = qw/a b c/;
my %hash  = ();
@hash{ @array } = (1, 2, 3);   # some values

> @a[1..$#a] #array of what?
Oh, some array elements
@array[2..4] -> $array[2], $array[3], $array[4]
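
To make the three constructs above concrete, here is a small sketch with a made-up header line. The variables @a, @names and %col only mirror the names used in the script; the data is illustrative.

```perl
use strict;
use warnings;

my @a = qw(ID B C D);          # a header line after split

# $#a is the index of the last element; both lines print "D".
print "$a[$#a]\n";
print "$a[-1]\n";

# Array slice: several elements at once, here everything but the ID.
my @names = @a[1 .. $#a];      # ('B', 'C', 'D')

# Hash slice: assign a list of values to a list of keys in one statement.
my %col;
@col{1 .. $#a} = @names;       # (1 => 'B', 2 => 'C', 3 => 'D')

print join(',', map { "$_=$col{$_}" } sort keys %col), "\n";   # 1=B,2=C,3=D
```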

> ++$data{$a[0]}->{lines};
Let's keep the total number of lines of every ID in a hash reference under
the key "lines".

> ++$data{$a[0]}->{data}->{$col{$i}}->{count}->{$a[$i]}
The arrows are not necessary, but I find them beautiful.
We want to keep our data isolated in a separate sub-hash under the key
"data".

> push @{ ... } , $item
Here we want to keep all the values that share the same threshold!
So if, for example, there are four different numbers, we can report all of
them back if the requested threshold is 25%.

> ['',[]]
This is the default answer if no threshold is found. It has two items: ''
and an empty array representing the (no) found values.

If you check what I've done you will find that it can be rewritten to be
almost 10 times faster, but it is good enough for a start.

Peace.

Re: convenient module to take statistics for hashed structures?

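George Mpouras wrote:
>> @col{1..$#a} #array of hash?
> This is called hash slice;

Correct.

> used to create a hash

Only my() can create a hash.

Used to add keys and values to a hash.

> from an array

from a LIST of keys and a LIST of values.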

John
--
Any intelligent fool can make things bigger and
more complex... It takes a touch of genius -
and a lot of courage to move in the opposite
direction.                   -- Albert Einstein

Re: convenient module to take statistics for hashed structures?

"John W. Krahn" wrote in message

> Correct.
>
> Only my() can create a hash.

I thought that local, our and state could also do the job.

Re: convenient module to take statistics for hashed structures?

nng> "John W. Krahn" wrote in message

>>
>>> @col{1..$#a} #array of hash?
>> This is called hash slice;

nng> Correct.

>> used to create a hash

nng> Only my() can create a hash.

nng> I thought that local, our, state, could also do the job

our doesn't create a variable. it only creates a lexical alias to the
variable of the same name in the current package.

local doesn't create a variable. it pushes the value of a variable and
allows for a new value to be put in its place.

state variables are just like my but they don't get reinitialized when
the enclosing block is entered.
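
The differences described above can be seen in a short sketch. The sub names show, with_local, counter and fresh are made up for this illustration, and state needs Perl 5.10 or later.

```perl
use strict;
use warnings;
use feature 'state';

our $level = 'global';          # lexical alias to the package variable $main::level

sub show { return $level }      # reads the package variable

sub with_local {
    local $level = 'temporary'; # saves the old value, restores it on scope exit
    return show();
}

sub counter {
    state $n = 0;               # initialised once, keeps its value between calls
    return ++$n;
}

sub fresh {
    my $n = 0;                  # a new variable on every call
    return ++$n;
}

print with_local(), "\n";       # temporary
print show(), "\n";             # global

my @c = map { counter() } 1 .. 3;
my @f = map { fresh() }   1 .. 3;
print "@c\n";                   # 1 2 3
print "@f\n";                   # 1 1 1
```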

uri

--
Uri Guttman  ------  uri@stemsystems.com  --------  http://www.sysarch.com --
-----  Perl Code Review , Architecture, Development, Training, Support ------
---------  Gourmet Hot Cocoa Mix  ----  http://bestfriendscocoa.com ---------

Re: convenient module to take statistics for hashed structures?

Alias of what, if our is the only definition?
#!/usr/bin/perl
our $var=1;
print $var;
exit 0

Re: convenient module to take statistics for hashed structures?

>>
>> our doesn't create a variable. it only creates a lexical alias to the
>> variable of the same name in the current package.

GM> Alias of what if our is only the definition ;
GM> #!/usr/bin/perl
GM> our $var=1;
GM> print $var;
GM> exit 0

the current package as i said. what is the default current package?
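
A short sketch of the point being made: without a package statement the current package is main, so our $var refers to $main::var. The package name Other is hypothetical and only there for contrast.

```perl
use strict;
use warnings;

our $var = 1;                 # no package statement yet, so this is $main::var
print "$main::var\n";         # 1

$main::var = 2;               # change the package variable directly
print "$var\n";               # 2: the same variable, seen through the alias

{
    package Other;
    our $var = 'other';       # a different variable: $Other::var
    print "$main::var $Other::var\n";   # 2 other
}
```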

uri


Re: convenient module to take statistics for hashed structures?

While the suggested solution works perfectly for the example data inside the
perl script, it does not work when the data change to:

__DATA__
Identity of query sequence      Superkingdom    Kingdom Subkingdom
Phylum  Class   Order   Family  Genus   Species group   Species
NODE_124_length_77_cov_13.792208        Bacteria        undef   undef
Proteobacteria  Gammaproteobacteria     Enterobacteriales
Enterobacteriaceae      Escherichia     undef   Escherichia coli
NODE_124_length_77_cov_13.792208        Bacteria        undef   undef
Proteobacteria  Gammaproteobacteria     Enterobacteriales
Enterobacteriaceae      Escherichia     undef   Escherichia coli

even though I changed

my @a = split /\s+/;    to   my @a = split /\t/;

Moreover, the current implementation looks up one hard-coded rank instead of
testing whether a threshold is exceeded, so if I input 10 (i.e. any value
whose abundance exceeds 10% is OK) instead of 100, no result is returned at
all. Even if I use "100", the result still differs from my expectation.

I expect the program to give the result

id=NODE_124_length_77_cov_13.792208, thr=100% -> Field=Species,Value=Escherichia coli

but it gives me:

id=NODE_124_length_77_cov_13.792208, thr=100% -> Field=Order,Value=Enterobacteriales

This statement:
push @{ $data{$id}->{data}->{$field}->{rank}->{
100*($data{$id}->{data}->{$field}->{count}->{$item} /
$data{$id}->{lines} ) } } , $item}

looks very new to me. Can anybody tell me what it is for?

Usually I use push like:

push @array, $element;

and the above code does not even have the symbol ";", yet there is no
runtime error.
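
For what it is worth, the two points in question can be shown in isolation. In this sketch %by_pct is a made-up, flattened stand-in for the nested structure, and the [value, count] pairs are the column H counts of ID 1 from the first example.

```perl
use strict;
use warnings;

my %by_pct;    # percent => [values that occur in that share of the rows]

# The first argument of push must be an array. @{ ... } dereferences the
# array reference stored under the computed hash key; Perl creates that
# reference on first use (autovivification).
for my $pair ([3 => 2], [2 => 1], [6 => 1]) {      # [value, count] out of 4 rows
    my ($value, $count) = @$pair;
    push @{ $by_pct{ 100 * $count / 4 } }, $value  # no ';' needed before '}'
}

print "$_%: @{ $by_pct{$_} }\n" for sort { $b <=> $a } keys %by_pct;
# 50%: 3
# 25%: 2 6
```

The semicolon is a statement separator in Perl, so the last statement of a block may omit it.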

Re: convenient module to take statistics for hashed structures?

Garbage in, garbage out.

Your data are completely inconsistent. You have to solve your input data
problem first.

Make sure that every line of your data contains the same number of columns,
and that the columns are always separated by exactly the same string.

Are you sure the first line will always describe your column names?

For a start, separate your fields with | only: no commas, no tabs, no
spaces.

Make sure that you have the same number of | on every line.

Change the split to my @a = split /\|/, $_, -1;

So after your data looks like the following, try again

ID|C1|C2|C3

1|a1|b1|c1

2|a2|b2|c2

3|a3|b3|c3
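
The consistency check suggested above could look roughly like this. It is a sketch with made-up input lines; @good and @bad are hypothetical names.

```perl
use strict;
use warnings;

my @lines = ('ID|C1|C2|C3', '1|a1|b1|c1', '2|a2||', '3|a3|b3');

# Limit -1 keeps trailing empty fields, so '2|a2||' still yields 4 columns.
my @header = split /\|/, shift(@lines), -1;

my (@good, @bad);
for my $line (@lines) {
    my @f = split /\|/, $line, -1;
    if (@f == @header) { push @good, \@f }      # same number of columns
    else               { push @bad,  $line }    # wrong number of columns
}

print scalar(@good), " usable rows, ", scalar(@bad), " rejected\n";
print "rejected: $_\n" for @bad;
```

Without the -1 limit, split drops trailing empty fields, and the row with two empty columns would be rejected as well.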

Re: convenient module to take statistics for hashed structures?

The problem arises from the number of fields. Since there are more than 9
fields, the keys have to be sorted numerically, in this way:

foreach my $field (reverse sort {$a <=> $b} keys %col) {

so that perl treats 10 as bigger than 9.
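
The effect is easy to reproduce with eleven column numbers. This is a minimal sketch; @keys stands in for keys %col.

```perl
use strict;
use warnings;

my @keys = (1 .. 11);

# Default sort compares strings, so "10" and "11" sort before "2".
my @as_strings = reverse sort @keys;
# Numeric comparison gives the intended order.
my @as_numbers = reverse sort { $a <=> $b } @keys;

print "string sort:  @as_strings\n";   # 9 8 7 6 5 4 3 2 11 10 1
print "numeric sort: @as_numbers\n";   # 11 10 9 8 7 6 5 4 3 2 1
```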

Now the remaining problem is how to handle the threshold cleverly. George's
solution uses a hash for fast access, and in fact a progressive test (100,
99, 98, ..., $threshold) could be performed. However, when the data is large
(>1 million rows with a lot of IDs), this may not be a good idea...

Re: convenient module to take statistics for hashed structures?

# The following version is much faster than the previous

use strict;
use warnings;

my @col;     # column names (without the ID column)
my %data;    # $data{$id} = [ line count,
             #   [ per column: [ {value => count}, {percent => [values]} ] ] ]

ReadData();

$_ = query('NODE_124_length_77_cov_13.792208',  100);
print "Field=$_->[0],Value=@{$_->[1]}\n";

$_ = query('NODE_124_length_77_cov_13.792208',   50);
print "Field=$_->[0],Value=@{$_->[1]}\n";

sub ReadData {
    while (<DATA>) {
        chomp;
        my @a = split /\s*\|\s*/, $_, -1;
        if (-1 == $#col) { push @col, @a[1..$#a]; next }   # header line
        $data{$a[0]}->[0]++;
        for (my $i=1; $i<=$#a; $i++) { $data{$a[0]}->[1]->[$i-1]->[0]->{$a[$i]}++ }
    }
    foreach my $id ( keys %data ) {
        foreach my $f ( @{$data{$id}->[1]} ) {
            foreach my $v ( keys %{$f->[0]} ) {
                push @{ $f->[1]->{int 100*( $f->[0]->{$v}/$data{$id}->[0])} }, $v;
            }
        }
    }
    #use Data::Dumper; print Dumper(\%data);exit;
}

sub query {
    for (my $i=$#{$data{$_[0]}->[1]}; $i>=0; $i--) {
        return [$col[$i], $data{$_[0]}->[1]->[$i]->[1]->{$_[1]}]
            if exists $data{$_[0]}->[1]->[$i]->[1]->{$_[1]};
    }
    ['',[]]
}

__DATA__
Identity of query sequence       | Superkingdom  | Kingdom | Subkingdom | Phylum         | Class                | Order             | Family             | Genus       | Species2 | group       | Species
NODE_124_length_77_cov_13.792208 | Bacteria      | undef   | undef      | Proteobacteria | Gammaproteobacteria  | Enterobacteriales | Enterobacteriaceae | Escherichia | undef    | Escherichi1 | coli
NODE_124_length_77_cov_13.792208 | Bacteria      | undef   | undef      | Proteobacteria | Gammaproteobacteria  | Enterobacteriales | Enterobacteriaceae | Escherichia | undef    | Escherichi2 | coli

Re: convenient module to take statistics for hashed structures?

wrote:

[...]

[...]

When posting code to a newsgroup, please ensure that proper indentation
and line wraps are preserved. This is very hard to read because of the
missing indentation and doesn't work as posted because of the extra
newlines.

hp

Re: convenient module to take statistics for hashed structures?

ms outlook express is messing up things

[Attachment: nice_problem2.pl (uuencoded)]

Re: convenient module to take statistics for hashed structures?

Yes, your latest implementation works very fast, even for a million records!

I want to change your implementation from "discrete" checking to a
"continuous" one. The logic is to first sort the rank keys (expected range:
(0-100]) numerically, then test whether the largest key (e.g. 100, 75 etc.)
is larger than the specified threshold. My problem is that I don't know how
to refer to the keys under

$data{$_[0]}->[1]->[$i]->[1]

Writing something like "foreach my $field (sort keys
%data{$_[0]}->[1]->[$i]->[1])" (thanks to McClellan for teaching me how to
use sort appropriately here) does not work. Moreover, there is no need for
"foreach" here: if the largest key cannot surpass the threshold, neither can
the smaller ones. So how can I avoid "foreach" here?
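
One possible way to do this, sketched under assumptions: a hash reference is dereferenced with %{ ... } (or %$ref), and max from List::Util picks the largest key without sorting or looping. Here $rank stands in for $data{$_[0]}->[1]->[$i]->[1], and best_at_least is a made-up helper that uses "greater than or equal" as the test.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Stand-in for one column's rank hash: percent => [values]
my $rank = { 50 => [3], 25 => [2, 6] };

# Return the largest percentage key and its values if that key reaches
# the threshold; return an empty list otherwise.
sub best_at_least {
    my ($rank, $threshold) = @_;
    my $top = max(keys %$rank);      # same as keys %{ $rank }
    return unless defined $top && $top >= $threshold;
    return ($top, $rank->{$top});
}

if (my ($pct, $values) = best_at_least($rank, 40)) {
    print "$pct%: @$values\n";       # 50%: 3
}
print "nothing at 60%\n" unless best_at_least($rank, 60);
```

max compares numerically, so no sort block is needed here.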

Re: convenient module to take statistics for hashed structures?

For the data

ID|B|C|D|E|F|G|H
01|3|7|9|3|4|2|3
01|3|7|9|3|4|2|2
01|3|7|9|5|8|6|6
01|3|7|9|3|4|2|3
02|4|7|9|3|4|2|1
02|4|7|9|3|4|2|2
02|4|7|9|3|4|2|3
02|4|7|9|3|4|2|3

What would be your input, and what output do you expect?
An example makes things clearer.