#### Do you have a question? Post it now! No Registration Necessary.  Now with pictures!

•  Subject
• Author
• Posted on

I have a question about the Perl Module Algorithm::Diff.

I want to use Algorithm::Diff to calculate the diff of the two
sequences "ab" and "a1bab", the "LCS" (the "Longest Common Sequence")
is a length of 2.

Theoretically there are 3 solutions with LCS = 2:

Solution 1:
---ab
a1bab

Solution 2:
a-b--
a1bab

Solution 3:
a---b
a1bab

I understand that any of those 3 solutions could be returned by
Algorithm::Diff, but I would argue that solution 1 is "better" than
solution 2 or 3, because solution 1 changes only once between '-' and
[ab], whereas solution 2 and 3 change more than once between '-' and
[ab].

This is my Perl program:

use strict;
use warnings;
use 5.010;
use Algorithm::Diff;
use Data::Dumper;
my @old = split //, 'ab';
my @new = split //, 'a1bab';
my \$d = Algorithm::Diff::sdiff(\@old, \@new);
my \$line = Dumper(\$d);
\$line =~ s''xmsg;
say \$line;

The output is
\$VAR1=[['u','a','a'],['+','','1'],['+','','b'],['+','','a'],
['u','b','b']];

This is in fact Solution 3:
a---b
a1bab

How can I teach Algorithm::Diff to choose Solution 1 (the best of the
3 possibilities) ?

Dilbert wrote:

There is only one LCS solution - 'ab'. You're making too much of it and
you don't get to say there are 3 solutions with LCS = 2 (as if you were
assigning 2, and LCS=3 or LCS=1 were possible with the data you
provide), there is just *the* LCS solution, which, in this case, happens
to be 'ab'.

No, this is the smallest set of instructions for turning the first
sequence into the second sequence (same as diff), with what some find to
be an easier to read "side by side" format:

(unchanged, was 'a', stays 'a')
(unchanged, was 'b', stays 'b')

The same set of instructions in diff would just be:

[
['+', 1, '1'],
['+', 2, 'b'],
['+', 3, 'a'],
]

That question doesn't make a lot of sense (to me).

http://search.cpan.org/dist/Algorithm-Diff/lib/Algorithm/Diff.pm

Cheers.

Dilbert wrote:

I may have misunderstood what you were asking a second ago... let me try
again (including diff for each of your three solutions):

[
['+', 0, 'a'],
['+', 1, '1'],
['+', 2, 'b'],
]

[
['+', 1, '1'],
['+', 3, 'a'],
['+', 4, 'b'],
]

[
['+', 1, '1'],
['+', 2, 'b'],
['+', 3, 'a'],
]

Why are any of the three better?  Any way you crack this nut, the
smallest sequence of operations to turn ab into a1bab involves three
adds... so I'm confused as to why you pick one as "best"?

Cheers.

Let me try a more realistic example:

use strict;
use warnings;
use 5.010;
use Algorithm::Diff;
use Data::Dumper;
my @old = split //, 'ab';
my @new = split //, 'Show manual contribution: ab :This word can be
displayed';
my \$d = Algorithm::Diff::sdiff(\@old, \@new);
my \$line = Dumper(\$d);
\$line =~ s''xmsg;
say \$line;

The output alignment (#al1) is:

-a-----------b---------------------------
manual contribution: ab :can be displayed

But what I want is (#al2):

---------------------ab------------------
manual contribution: ab :can be displayed

Why ?

Because, in case I have more than one optimal solution (and this is
the case here), I want to consider not only the value of a character
('a' eq 'a' and 'b' eq 'b'), but also the context each character is
in.

the 'a' in @old has context: {left => 'start', right => 'b'}
the 'b' in @old has context: {left => 'a', right => 'end'}

the 'a' in @new (#al1) context: {left => 'm', right => 'n'}
the 'b' in @new (#al1) context: {left => 'i', right => 'u'}

the 'a' in @new (#al2) context: {left => ' ', right => 'b'}
the 'b' in @new (#al2) context: {left => 'a', right => ' '}

It can easily be seen that @old and @new(#al1) have no context in
common

Whereas @old and @new(#al2) have two context matches, that is
@old('a')->right eq @new(#al2)'a'->right
@old('b')->left eq @new(#al2)'a'->left

That's why I prefer @new (#al2) over @new (#al1)

That should be...

my @new = split //, 'manual contribution: ab :can be displayed';

in the last line, replace
"...@new(#al2)'a'->left..." by
"...@new(#al2)'b'->left..."

Obviously, the metric the OP wants is: assign the "identity edit"
weight eps, and any other edit weight 1+eps, with the exception that N
consecutive edits of the same type get weight N+eps, not N + N*eps.

Looks reasonable (if one convert it to a cheap algorithm to find the
best match...).

Hope this helps,
Ilya

Look at traverse_balanced as a starting point.  Basically you'd need
to write your own diff calculator based off the LCS, using whatever
method you feel is appropriate.  Once you get to the point where
you're looking for the "best" of the possible solutions, you are in
new territory since you'll have to consider the solution set.  I don't
think there's anything in Algorithm::Diff that does that sort of thing
- I believe the code simply finds the first solution that uses the
given LCS.