How does diff/patch work? When I first started to do research on this problem I had no idea about the complexity of the math involved and the lack of C# examples around. It turns out that finding the right resources can make life easier, and in the end the Wikipedia entry and this screencast did a great job making me understand and implement a small diff algorithm in C#. In this post I’ll try to lightly explain the problems faced by diff applications and publish a simple implementation. I strongly recommend the screencast if you’re really trying to understand this.
The bulk of the problem is solved by an algorithm called the Longest Common Subsequence which, as the name says, finds the longest, therefore, most suitable start point to create the diff. The Longest Common Subsequence by itself, or its size, its not really useful to create a diff, but the inner workings of the algorithm produce a matrix which can be used to directly create the list of operations needed to go from the original to the target file.
The Longest Common Subsequence is mathematically classified as a NP-hard problem and very CPU/RAM hungry problem. This is due to the fact a matrix by the size of the product of both file sizes is needed. Therefore to compare 1KB the algorithm would need a full mega byte just for the matrix. Luckily this problem can be simplified although the bigger the data the more time and resources the algorithm will consume.
One simplification done by most diff applications is to divide the file in chucks of characters or more commonly by lines. This reduces greatly the number of items taken into account by the matrix but comparing strings is still an expensive process because each item is compared several times to the others. To make comparisons faster diff applications also use hashing as an optimization, ending up doing the same number of comparisons but comparing integers instead (which is one of the fastest operations modern PCs can do).
Binary diffs (I didn’t study this type of diff deeply yet) seems very similar in concept. Although there are no lines delimiting the contents we can simply use a predefined chuck size (for example 1024) and do a MD5 hash. Due to the nature of binary files I expect the biggest challenge to be optimizing the output and making it more compressible as possible.
I’ve not implemented all this techniques yet although I plan to. Another thing I realized is how easy, using generics, it will be to extend the algorithm to make a diff on random object types like 2 arrays. I expect this to be a very useful library once its completed, but for now I leave here the code for the Longest Common Subsequence and the code to get a simple diff from it:
using System; namespace DiffSharp { public class LCS { public static int[,] GetLongestCommonSubsequenceMatrix(string s1, string s2) { int[,] lcsMatrix = new int[s1.Length, s2.Length]; char letter1, letter2; for (int i = 0; i < s1.Length; i++) { for (int j = 0; j < s2.Length; j++) { letter1 = s1[i]; letter2 = s2[j]; if (letter1 == letter2) { if ((i == 0) || (j == 0)) lcsMatrix[i, j] = 1; else lcsMatrix[i, j] = 1 + lcsMatrix[i - 1, j - 1]; } else { if ((i == 0) && (j == 0)) lcsMatrix[i, j] = 0; else if ((i == 0) && !(j == 0)) lcsMatrix[i, j] = Math.Max(0, lcsMatrix[i, j - 1]); else if (!(i == 0) && (j == 0)) lcsMatrix[i, j] = Math.Max(lcsMatrix[i - 1, j], 0); else if (!(i == 0) && !(j == 0)) lcsMatrix[i, j] = Math.Max(lcsMatrix[i - 1, j], lcsMatrix[i, j - 1]); } } } return lcsMatrix; } public static void GetDiffTreeFromBacktrackMatrix(int[,] lcsMatrix, string s1, string s2, int i, int j) { if (i > 0 && j > 0 && s1[i - 1] == s2[j - 1]) { GetDiffTreeFromBacktrackMatrix(lcsMatrix, s1, s2, i - 1, j - 1); Console.WriteLine(" " + s1[i - 1]); } else { if (j > 0 && (i == 0 || lcsMatrix[i, j - 1] >= lcsMatrix[i - 1, j])) { GetDiffTreeFromBacktrackMatrix(lcsMatrix, s1, s2, i, j - 1); Console.WriteLine("+ " + s2[j - 1]); } else if (i > 0 && (j == 0 || lcsMatrix[i, j - 1] < lcsMatrix[i - 1, j])) { GetDiffTreeFromBacktrackMatrix(lcsMatrix, s1, s2, i - 1, j); Console.WriteLine("- " + s1[i - 1]); } } } } public class Program { static void Main(string[] args) { string s1 = "1ac"; string s2 = "abc"; LCS.GetDiffTreeFromBacktrackMatrix(LCS.GetLongestCommonSubsequenceMatrix(s1, s2), s1, s2, s1.Length, s2.Length); Console.ReadKey(); } } }
Tags: .Net, C#, Diff, LCS, Longest Common Subsequence, Mono, Sample Code
Interesting post. I wonder if it would be useful to include this code as a
feature in MonoDevelop, for example: a command in the context menu to be able to “compare” files.
I have been using this one for the past year or
so.
http://razor.occams.info/code/diff/
even though the author recently put a disclaimer on the site.
here is another
one:
http://www.menees.com/DiffDotNet.htm
I know about the 2 tools you
mention and a few more pieces of code arround. Some have licensing issues and/or are not supported anymore. So I took the challenge of doing one on my own… I mean, isn’t that algorithm sooooooo
pretty?
I think I’m in love with it…
indeed it is
I think
you could save yourself an IL instruction or two by doing this:
if (i == 0) {
if (j == 0)
lcsMatrix[i, j] = 0;
else
lcsMatrix[i, j] = Math.Max(0, lcsMatrix[i, j –
1]);
} else {
if (j == 0)
lcsMatrix[i, j] = Math.Max(lcsMatrix[i – 1, j], 0);
else
lcsMatrix[i, j] = Math.Max(lcsMatrix[i – 1, j], lcsMatrix[i, j –
1]);
}
@Scott,
I guess you’re right,
I spent much time looking at python and other language example so I was under their influente at the late hour I wrote that code
I didn’t spend much time optimizing that code yet
đ
Nice work,
Why don’t you apply diff between your algorithm and this one
http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Longest_common_subsequence.
đ
@Peter,
I used that as a
base although there are small differences đ I plan to optimize the code and insert some other optimizations I’ve been reading about on several sources including the subversion source
code.
And of course, once I’ve completed the generics implementation it will be noteworthy do add it to the wikibooks.
sthg. isn’t working correct i think.
test it with:
string s1 = “HelloWorld”;
string s2 = “HelloWOrld”;
the result is:
H
e
l
l
– o
– W
o
+ W
+ O
r
l
d
But in fact i just deleted o and added O…
Is this a normal behavior?
This code isn’t working correct it’ll crash on
string s1 = “1ac”;
string s2 = “abcd”;
The correct one is:
static int[,] FillLcs(string source, string target)
{
var lcsMatrix = new int[source.Length + 1,target.Length + 1];
for (var sourceIndex = 0; sourceIndex < source.Length; sourceIndex++)
for (var tagetIndex = 0; tagetIndex 0 && j > 0 && X[i – 1] == Y[j – 1])
{
printDiff(C, X, Y, i – 1, j – 1);
Console.WriteLine(” ” + X[i – 1]);
}
else
{
if (j > 0 && (i == 0 || C[i, j – 1] >= C[i – 1, j]))
{
printDiff(C, X, Y, i, j – 1);
Console.WriteLine(“+ ” + Y[j – 1]);
}
else if (i > 0 && (j == 0 || C[i, j – 1] < C[i – 1, j]))
{
printDiff(C, X, Y, i – 1, j);
Console.WriteLine("- " + X[i – 1]);
}
else
Console.WriteLine();
}
}