How to cluster data

Do you have a question? Post it now! No Registration Necessary.  Now with pictures!

Threaded View

Dear all, excuse me for this maybe "too simple" question..

I am trying to find a software/algorythm that can

"cluster" simple data on an excel sheet

              Variable a   Variable b     Variable c
Case 1        1                   0              0
Case 2        0                   1              1
Case 3        1                   0              0
Case 4        1                   1              0
Case 5        0                   1              1

The systems recognizes that there are 3 possible clusters:

the first with cases that has Variable a as true,
the second has Variables b and c
the third is "all the rest"

              Variable a   Variable b     Variable c
Case 1        1                   0              0
Case 3        1                   0              0

Case 2        0                   1              1
Case 5        0                   1              1

Case 4        1                   1              0

Thank you in advance

Re: How to cluster data

Quoted text here. Click to load it

This is a PHP newsgroup - so algorithm design for MS Excel is way off

Or do you mean that you want to create a spreadsheet using PHP, and
you want the output formatted in a particular way? (thats 2 seperate
problems - which one can't you solve?)


Re: How to cluster data

Luca wrote:
Quoted text here. Click to load it

You really don't give enough detail to give a full answer. The reason
being that depending on what you want this for, there are better or
worse ways of doing this. For example, if you are just regarding the
Excel spreadsheet as a one-time source data and you wont be referring to
it again, and if you'll be using this data on an ongoing basis, then for
all my data sorting needs, I stick things in SQL and then everything
becomes simple (to me ;). Will the data always be 0 or 1? If it is, then
I can do some nice things with binary arithmetic to group the rows. But
anyway, answering exactly the question that you asked, here is one
algorithm that will do what you want (I'll do it in PHP since this is
the PHP group):

$source_data = array(
array('identifier' => 'Case 1', 'a' => 1, 'b' => 0, 'c' => 0),
array('identifier' => 'Case 2', 'a' => 0, 'b' => 1, 'c' => 1),
array('identifier' => 'Case 3', 'a' => 1, 'b' => 0, 'c' => 0),
array('identifier' => 'Case 4', 'a' => 1, 'b' => 1, 'c' => 0),
array('identifier' => 'Case 5', 'a' => 0, 'b' => 1, 'c' => 1));

$clustered_data = array();

foreach($source_data as $row)
         $key = $row['a'] . $row['b'] . $row['c'];
         $clustered_data[$key][] = $row;


Now this may well not be perfect for you. I don't know the wider context
of what you're trying. However, it will give you your clusters.
Hopefully you can amend this to get the format that you want, e.g. you
might want a more useful key for the array. You probably don't want to
store the whole duplicate row, instead just amending it to store the
'identifier' part so your $clustered_data just lists the key and the
'identifier' field. I'm not going to bother getting into these details
without knowing more about your specific context.

I'll mention two likely problems however, depending on what you're
actually trying to do in practice. Firstly, you would also have a
problem if you had other than just 1s and 0s in your table because of
the risk of key collisions: e.g. 11 in variable a and 0 in variable b
looks the same as 1 in variable a and 10 in variable b when you stick
them together as strings. There are easy ways around that now you have
the algorithm.  THe other likely scenario is where you have a variable
number of columns in each scenario. In this case, you'll probably need
to have a pre-scan of the data first to work out the number of columns
and generate your key in a more dynamic manner.

Hope all this helps. Please post back to comment.

Taliesin Nuin.

Re: How to cluster data

Taliesin Nuin wrote:
Quoted text here. Click to load it

By the way, it's not necessarily a "too simple" question. I had a
similar problem recently and had to work it out for myself. Like many
things, it's obvious in retrospect.


Site Timeline