#### Do you have a question? Post it now! No Registration Necessary. Now with pictures!

**posted on**

April 12, 2010, 8:24 pm

I have a question about the Perl Module Algorithm::Diff.

I want to use Algorithm::Diff to calculate the diff of the two

sequences "ab" and "a1bab", the "LCS" (the "Longest Common Sequence")

is a length of 2.

Theoretically there are 3 solutions with LCS = 2:

Solution 1:

---ab

a1bab

Solution 2:

a-b--

a1bab

Solution 3:

a---b

a1bab

I understand that any of those 3 solutions could be returned by

Algorithm::Diff, but I would argue that solution 1 is "better" than

solution 2 or 3, because solution 1 changes only once between '-' and

[ab], whereas solution 2 and 3 change more than once between '-' and

[ab].

This is my Perl program:

use strict;

use warnings;

use 5.010;

use Algorithm::Diff;

use Data::Dumper;

my @old = split //, 'ab';

my @new = split //, 'a1bab';

my $d = Algorithm::Diff::sdiff(\@old, \@new);

my $line = Dumper($d);

$line =~ s''xmsg;

say $line;

The output is

$VAR1=[['u','a','a'],['+','','1'],['+','','b'],['+','','a'],

['u','b','b']];

This is in fact Solution 3:

a---b

a1bab

How can I teach Algorithm::Diff to choose Solution 1 (the best of the

3 possibilities) ?

## Re: Question about Algorithm::Diff

Dilbert wrote:

There is only one LCS solution - 'ab'. You're making too much of it and

you don't get to say there are 3 solutions with LCS = 2 (as if you were

assigning 2, and LCS=3 or LCS=1 were possible with the data you

provide), there is just

***the***LCS solution, which, in this case, happens

to be 'ab'.

No, this is the smallest set of instructions for turning the first

sequence into the second sequence (same as diff), with what some find to

be an easier to read "side by side" format:

(unchanged, was 'a', stays 'a')

(add, was nothing(''), adds '1')

(add, was nothing(''), adds 'b')

(add, was nothing(''), adds 'a')

(unchanged, was 'b', stays 'b')

The same set of instructions in diff would just be:

[

['+', 1, '1'],

['+', 2, 'b'],

['+', 3, 'a'],

]

That question doesn't make a lot of sense (to me).

http://search.cpan.org/dist/Algorithm-Diff/lib/Algorithm/Diff.pm

Cheers.

## Re: Question about Algorithm::Diff

Dilbert wrote:

I may have misunderstood what you were asking a second ago... let me try

again (including diff for each of your three solutions):

[

['+', 0, 'a'],

['+', 1, '1'],

['+', 2, 'b'],

]

[

['+', 1, '1'],

['+', 3, 'a'],

['+', 4, 'b'],

]

[

['+', 1, '1'],

['+', 2, 'b'],

['+', 3, 'a'],

]

Why are any of the three better? Any way you crack this nut, the

smallest sequence of operations to turn ab into a1bab involves three

adds... so I'm confused as to why you pick one as "best"?

Cheers.

## Re: Question about Algorithm::Diff

Let me try a more realistic example:

use strict;

use warnings;

use 5.010;

use Algorithm::Diff;

use Data::Dumper;

my @old = split //, 'ab';

my @new = split //, 'Show manual contribution: ab :This word can be

displayed';

my $d = Algorithm::Diff::sdiff(\@old, \@new);

my $line = Dumper($d);

$line =~ s''xmsg;

say $line;

The output alignment (#al1) is:

-a-----------b---------------------------

manual contribution: ab :can be displayed

But what I want is (#al2):

---------------------ab------------------

manual contribution: ab :can be displayed

Why ?

Because, in case I have more than one optimal solution (and this is

the case here), I want to consider not only the value of a character

('a' eq 'a' and 'b' eq 'b'), but also the context each character is

in.

the 'a' in @old has context: {left => 'start', right => 'b'}

the 'b' in @old has context: {left => 'a', right => 'end'}

the 'a' in @new (#al1) context: {left => 'm', right => 'n'}

the 'b' in @new (#al1) context: {left => 'i', right => 'u'}

the 'a' in @new (#al2) context: {left => ' ', right => 'b'}

the 'b' in @new (#al2) context: {left => 'a', right => ' '}

It can easily be seen that @old and @new(#al1) have no context in

common

Whereas @old and @new(#al2) have two context matches, that is

@old('a')->right eq @new(#al2)'a'->right

@old('b')->left eq @new(#al2)'a'->left

That's why I prefer @new (#al2) over @new (#al1)

## Re: Question about Algorithm::Diff

Obviously, the metric the OP wants is: assign the "identity edit"

weight eps, and any other edit weight 1+eps, with the exception that N

consecutive edits of the same type get weight N+eps, not N + N*eps.

Looks reasonable (if one convert it to a cheap algorithm to find the

best match...).

Hope this helps,

Ilya

## Re: Question about Algorithm::Diff

Look at traverse_balanced as a starting point. Basically you'd need

to write your own diff calculator based off the LCS, using whatever

method you feel is appropriate. Once you get to the point where

you're looking for the "best" of the possible solutions, you are in

new territory since you'll have to consider the solution set. I don't

think there's anything in Algorithm::Diff that does that sort of thing

- I believe the code simply finds the first solution that uses the

given LCS.

#### Site Timeline

- » FAQ 9.21 How do I use MIME to make an attachment to a mail message?
- — Next thread in » PERL Discussions

- » Net::SMTP errors
- — Previous thread in » PERL Discussions

- » s suffix question
- — Newest thread in » PERL Discussions

- » HTML 5 validation software?
- — The site's Newest Thread. Posted in » HTML Authoring Forum