# seeking advice on problem difficulty

#### Do you have a question? Post it now! No Registration Necessary.  Now with pictures!

•  Subject
• Author
• Posted on
Dear all,

I've been working on this problem for 4 days and still cannot come out a
good solution and would appreciate if you could comment on the problem.

Given a table containing cells delimited by tab like this (the 1st line
being the header and below I use delimiter space for clarity):

ID F1 F2 F3 F4 F5
1 SuperC3 C1 subC4 dummy hi
1 SuperC3 C1 subC3 dumdum hell
2 SuperC3 C1 subC3 hello hello
3 SuperC3 C2 subC7 hel hel
...
1000 SuperC1 C8 subC10 hi hi

and I have another table that group the ID's together, e.g.
Group 1:1,2, 16, 200
Group 2:99, 136, 555
...
Group 15: 123, 124, 999

The two tables above can contain non-overlapping entries though most of the
time the entries do overlap. So the task is to make use of the 2nd table and
to look up the first table for evaluation. By checking group 1, we know that
ID's 1,2, 16 and 200 are in the same group. And what to evaluate? Say, if
ID's 16 and 200 also contain subC3 in F3, then we can assign this group 1
members to be subC3. If the consistency does not exceed certain threshold
(e.g. 0.7), we then look up F2 and check the statistics again and may
conclude C1, and finally for F1 (SuperC3). If F1 consistency also cannot
pass the threshold, "inconsistent" is returned. The main problem here is: I
have been overwhelmed by hash of hash of hash !

First, I use a hash to bin the groups with the ID's, e.g.
\$vhash = 1;
and memorize which \$ID belongs to which \$group
\$group = \$groupID;

Then I use another hash to bin the other table, e.g. (\$cells[] refers to the
parsing result while reading the table in a file)

\$chash = 1;
\$chash = 1;
\$chash = 1;

And the complicated thing appears here....

for \$cid (keys %chash) {
if (exists \$group) {
for \$f1 (keys % {
for \$gid (keys %(\$vhash}) {
....

Okay, I know it's difficult to read... and that's why I'm seeking advice
here as the complicated stuff only just starts there... Is there any
convenient data structure that helps better manage this kind of
"cross/circulate-referencing"-like problem?

## Re: seeking advice on problem difficulty

[ please see original for the indeed gory details ]

Provided I understood the problem correctly, a possible solution could
look like this (this code has had very little testing): First, you
define your groups by associating array references containing the group
with the 'group ID' with the help of a hash:

\$grp = [1, 2];

Then, you create a hash mapping the column name to the column value
for each ID and put these hashes into an id hash associated with the
ID:

\$id = { F1 => 'SuperC1', F2 => 'C1',  F3 => 'subC4' };
\$id = { F1 => 'SuperC1', F2 => 'C1',  F3 => 'subC3' };

Provided this has been done, a 'consistency' routine can be defined as

sub consistency(\$\$)
{
my (\$grp, \$col) = @_;
my %seen;

\$seen = 1
for (map { \$id } @});

return 1.0 / keys(%seen);
}

This takes a group ID and a column name as argument and returns the
'consistency' of this column for this group. Then, an array needs to
be created which names the columns in the order they are supposed to
be checked in:

@order  = qw(F3 F2 F1);

Now, a 'decide' routine can be defined like this:

sub decide(\$)
{
my \$grp = \$_[0];

consistency(\$grp, \$_) >= THRESHOLD and return \$_
for (@order);

return undef;
}

This takes a group ID as argument and returns either the name of the
first column (checked in the order given by @order) whose consistency
is >= the THRESHOLD or undef (::= inconsistent). As a complete script:

------------------
#!/usr/bin/perl

use constant THRESHOLD =>    0.7;

my (%grp, %id, @order, \$res);

@order  = qw(F3 F2 F1);

\$grp = [1, 2];

\$id = { F1 => 'SuperC1', F2 => 'C1',  F3 => 'subC4' };
\$id = { F1 => 'SuperC1', F2 => 'C1',  F3 => 'subC3' };

sub consistency(\$\$)
{
my (\$grp, \$col) = @_;
my %seen;

\$seen = 1
for (map { \$id } @});

return 1.0 / keys(%seen);
}

sub decide(\$)
{
my \$grp = \$_[0];

consistency(\$grp, \$_) >= THRESHOLD and return \$_
for (@order);

return undef;
}

\$res = decide(1);
\$res //= 'inconsistent';

print("\$res\n");
--------------------

NB: This might not be a very sensible algorithm (ie, "I don't know
this").

## Re: seeking advice on problem difficulty

[ please see original for the indeed gory details ]

Provided I understood the problem correctly, a possible solution could
look like this (this code has had very little testing): First, you
define your groups by associating array references containing the group
members with the 'group ID' with the help of a hash:

\$grp = [1, 2];

Then, you create a hash mapping the column name to the column value
for each ID and put these hashes into an id hash associated with the
ID:

\$id = { F1 => 'SuperC1', F2 => 'C1',  F3 => 'subC4' };
\$id = { F1 => 'SuperC1', F2 => 'C1',  F3 => 'subC3' };

Provided this has been done, a 'consistency' routine can be defined as

sub consistency(\$\$)
{
my (\$grp, \$col) = @_;
my %seen;

\$seen = 1
for (map { \$id } @});

return 1.0 / keys(%seen);
}

This takes a group ID and a column name as argument and returns the
'consistency' of this column for this group. Then, an array needs to
be created which names the columns in the order they are supposed to
be checked in:

@order  = qw(F3 F2 F1);

Now, a 'decide' routine can be defined like this:

sub decide(\$)
{
my \$grp = \$_[0];

consistency(\$grp, \$_) >= THRESHOLD and return \$_
for (@order);

return undef;
}

This takes a group ID as argument and returns either the name of the
first column (checked in the order given by @order) whose consistency
is >= the THRESHOLD or undef (::= inconsistent). As a complete script:

------------------
#!/usr/bin/perl

use constant THRESHOLD =>    0.7;

my (%grp, %id, @order, \$res);

@order  = qw(F3 F2 F1);

\$grp = [1, 2];

\$id = { F1 => 'SuperC1', F2 => 'C1',  F3 => 'subC4' };
\$id = { F1 => 'SuperC1', F2 => 'C1',  F3 => 'subC3' };

sub consistency(\$\$)
{
my (\$grp, \$col) = @_;
my %seen;

\$seen = 1
for (map { \$id } @});

return 1.0 / keys(%seen);
}

sub decide(\$)
{
my \$grp = \$_[0];

consistency(\$grp, \$_) >= THRESHOLD and return \$_
for (@order);

return undef;
}

\$res = decide(1);
\$res //= 'inconsistent';

print("\$res\n");
--------------------

NB: This might not be a very sensible algorithm (ie, "I don't know
this").

## Re: seeking advice on problem difficulty

Well, the duplicate ID curse appears here... so maybe an array has to use

\$id[0] = { F1 => 'SuperC1', F2 => 'C1',  F3 => 'subC4' };
\$id[1] = { F1 => 'SuperC1', F2 => 'C1',  F3 => 'subC5' };
\$id[0] = { F1 => 'SuperC1', F2 => 'C1',  F3 => 'subC3' };

In fact, I don't quite understand what these codes are doing. what does
\$grp refer to? And as it's no longer \$id but \$id{}[], so how to

Why do you make it an array element?

the exact item (here, say, C1, in addition to F2) should also be
returned.... (Well, that's why I said the problem is complicated...)

Is this a typo? What's this?

## Re: seeking advice on problem difficulty

Using the same name twice was arguably somewhat confusing: The inner
\$grp refers to the scalar variable \$grp (defined in the function) that
holds the group ID and it is used to select the value (array of IDs)
associated with the corresponding key in the %grp hash.

Provided that all different 'column lists' associated with the same ID
can be treated as if they were associated with different IDs, you
could turn the map expression into something like

map { map { \$_-> } @}; } @}

(uncompiled).

Make what? This line defined a scalar variable named \$grp and assigned
the first argument of the sub to it: All arguments are passed in the
array @_, hence \$_[0] is the first argument.

Then, you'll need something similar to Ben's most_frequent routine.

// is similar to || except that it tests for 'definedness' and not
'trueness': It will assign the string to the variable if its value is
undef.

## Re: seeking advice on problem difficulty

"map" seems to be doing the "cross-referencing" task that I have to
accomplish. I have just learnt more complicated data structures a few months
ago and would appreciate if you could explain what's doing here. So when an
element from the \$grp is extracted, it becomes the second "\$_" by the
outside map? and then the inner map extracts another array element from \$id
and this element becomes the first "\$_"? Finally, is there a quick way to
find the intersection between the \$grp and \$id hashes? I want to learn your
elegant one-line map code but it cannot process the non-overlapping id's so
I have to process them the other way.

but this error "Search pattern not terminated at test.pl" occurs...

## Re: seeking advice on problem difficulty

ela> but this error "Search pattern not terminated at test.pl"
ela> occurs...

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
Smalltalk/Perl/Unix consulting, Technical writing, Comedy, etc. etc.
See http://methodsandmessages.posterous.com/ for Smalltalk discussion

## Sorting hash %seen

I managed to take frequency by:

\$seen++ for (map { \$id } @});

but when I wanna sort it by:

my @sorted_keys = sort { \$seen <=> \$seen}

the error "Use of uninitialized value in hash element at test.pl" appears,
so how to sort the values then?

## Re: Sorting hash %seen

I've not tried it (and, TBH, I have trouble getting my head round it,
map is still new to me) but that doesn't look right to me.

AIUI, the \$_ in \$seen++ takes whatever is in \$_ before you do the
'map', the \$_ in the map function is, I believe, local to the map
function and therefore not available outside the function.

Justin.

--
Justin C, by the sea.

## Re: Sorting hash %seen

No, it takes on the values returned from the map.

Rewrite it, and it may become more clear:

foreach (map { \$id } @}) {
\$seen++
}

\$_ in the 1st line is local to the map.

\$_ in the 2nd line is local to the foreach.

--
email: perl -le "print scalar reverse qq/moc.liamg0cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.

## Re: Sorting hash %seen

Provided someone who reads the code is familiar with the semantics of
loop executing code contained in a block but not familiar with the
semantics of the equivalent statement modifiers, this someone will
understand the 'code with the block' and won't understand the 'code
with statement modifier'. Neither one nor the other has any meaning
someone completely without knowledge of the defined semantics of each
construct can understand, and both constructs necessarily have
'clearly' defined semantics because otherwise, a computer couldn't
execute them, IOW, this is not a problem of one variant vs the other
BUT of availability of or lack of knowledge on part of the person
trying to understand the code.

## Re: Sorting hash %seen

Yes. But the for/ foreach loop which processes the 'output' of the
map expression successively binds \$_ to each of the values returned by
that.

## Re: Sorting hash %seen

You have left off the list of things to be sorted.

The uninitialized value comes from the list of things to be sorted.

Since you have not shown where this list comes from, we cannot

--
email: perl -le "print scalar reverse qq/moc.liamg0cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.

## Re: Sorting hash %seen

adding %seen solves the problem though warning "Use of uninitialized value
in numeric comparison (<=>) at test.pl" still exists.

Due to time limitation, I shall focus on working out a draft solution first
and thank you very much for being patient in teaching me netiquette and a
lot of other stuff these years.

## Re: Sorting hash %seen

You don't want to sort %seen, as that has no meaning. You want to sort
the keys of %seen:

my @sorted_keys = sort { \$seen <=> \$seen} keys %seen;

--
Jim Gibson

## Choice of data structure

While I'm revising the codes, I find that just because I overrely on hash
and that complicates my problem. What made you make a decision on using
array for "group" while hash for "id"?

## Re: Choice of data structure

'Groups' didn't have named columns.

## Re: seeking advice on problem difficulty

You have a duplicate ID.  If this is not a typo, please ignore the rest
of the post -- I've assumed that IDs are unique.

I would not use hashes for this data.  It is naturally an array of
arrays.  Because you need to scan a column at selected row positions,
I'd make it an array of columns:

# after skipping the first line...
my @column = ();
while (<>) {
my \$c = 0;
push(@, \$_) foreach split;
}

Now you can use a slice to extract the data for a particular column.
Having got your group list as an array:

my @group = map {\$_ - 1} split /,\s*/, \$group_string;

(the -1 is to adjust for zero based indexing) you can write

@[@group]

to get the items from the first column whose frequencies you are
interested in.

<snip code>

Functions are crucial to managing complexity.  I'd want a function
'most_frequent' that can take an array of values and find the frequency
of the most common value among them.  It could return both that value
and the frequency.  Something like:

sub most_frequent
{
my (\$most_freq, %count) = ('', '' => 0);
for my \$item (@_) {
\$most_freq = \$item if ++\$count > \$count;
}
return (\$most_freq, \$count/@_);
}

This is passed a slice for a given column and group:

most_frequent(@[@group])

With that function you simply need to find the first \$col whose most
frequent item meets your threshold.  'First' may not be correct, since
in your example you consider F3 before F1.  Maybe you need the item from
any column that has the greatest frequency?  Maybe there is a specified
order in which the columns should be tested?  Whatever the answer, it
should be easy with the above function.

<snip>
--
Ben.

## Re: seeking advice on problem difficulty

[...]

There is little reason to empty an empty array by assigning an empty
list to it.

[...]

A more sensible initialization would be

my (\$most_freq, %count);

\$most_freq = shift;
\$count = 1;

and then use (@_ + 1) for the division.

## Re: seeking advice on problem difficulty

I guess the above line is doing some sort of initialization of zero's, but
why is ", " used? Making both \$most_freq and %count separated by comma to be
zero?

How does this @_ correspond to @[@group] below?