Blog

Longest Common Subsequence – Diff Algortithm in C#

How does diff/patch work? When I first started to do research on this problem I had no idea about the complexity of the math involved and the lack of C# examples around. It turns out that finding the right resources can make life easier, and in the end the Wikipedia entry and this screencast did a great job making me understand and implement a small diff algorithm in C#. In this post I’ll try to lightly explain the problems faced by diff applications and publish a simple implementation. I strongly recommend the screencast if you’re really trying to understand this.

The bulk of the problem is solved by an algorithm called the Longest Common Subsequence which, as the name says, finds the longest, therefore, most suitable start point to create the diff. The Longest Common Subsequence by itself, or its size, its not really useful to create a diff, but the inner workings of the algorithm produce a matrix which can be used to directly create the list of operations needed to go from the original to the target file.

The Longest Common Subsequence is mathematically classified as a NP-hard problem and very CPU/RAM hungry problem. This is due to the fact a matrix by the size of the product of both file sizes is needed. Therefore to compare 1KB the algorithm would need a full mega byte just for the matrix. Luckily this problem can be simplified although the bigger the data the more time and resources the algorithm will consume.

One simplification done by most diff applications is to divide the file in chucks of characters or more commonly by lines. This reduces greatly the number of items taken into account by the matrix but comparing strings is still an expensive process because each item is compared several times to the others. To make comparisons faster diff applications also use hashing as an optimization, ending up doing the same number of comparisons but comparing integers instead (which is one of the fastest operations modern PCs can do).

Binary diffs (I didn’t study this type of diff deeply yet) seems very similar in concept. Although there are no lines delimiting the contents we can simply use a predefined chuck size (for example 1024) and do a MD5 hash. Due to the nature of binary files I expect the biggest challenge to be optimizing the output and making it more compressible as possible.

I’ve not implemented all this techniques yet although I plan to. Another thing I realized is how easy, using generics, it will be to extend the algorithm to make a diff on random object types like 2 arrays. I expect this to be a very useful library once its completed, but for now I leave here the code for the Longest Common Subsequence and the code to get a simple diff from it:

using System;

namespace DiffSharp
{
	public class LCS
	{
		public static int[,] GetLongestCommonSubsequenceMatrix(string s1, string s2)
		{
			int[,] lcsMatrix = new int[s1.Length, s2.Length];
			char letter1, letter2;

			for (int i = 0; i < s1.Length; i++)
			{
				for (int j = 0; j < s2.Length; j++)
				{
					letter1 = s1[i];
					letter2 = s2[j];

					if (letter1 == letter2)
					{
						if ((i == 0) || (j == 0))
							lcsMatrix[i, j] = 1;
						else
							lcsMatrix[i, j] = 1 + lcsMatrix[i - 1, j - 1];
					}
					else
					{
						if ((i == 0) && (j == 0))
							lcsMatrix[i, j] = 0;
						else if ((i == 0) && !(j == 0))
							lcsMatrix[i, j] = Math.Max(0, lcsMatrix[i, j - 1]);
						else if (!(i == 0) && (j == 0))
							lcsMatrix[i, j] = Math.Max(lcsMatrix[i - 1, j], 0);
						else if (!(i == 0) && !(j == 0))
							lcsMatrix[i, j] = Math.Max(lcsMatrix[i - 1, j], lcsMatrix[i, j - 1]);
					}
				}
			}

			return lcsMatrix;
		}

		public static void GetDiffTreeFromBacktrackMatrix(int[,] lcsMatrix, string s1, string s2, int i, int j)
		{
			if (i > 0 && j > 0 && s1[i - 1] == s2[j - 1])
			{
				GetDiffTreeFromBacktrackMatrix(lcsMatrix, s1, s2, i - 1, j - 1);
				Console.WriteLine("  " + s1[i - 1]);
			}
			else
			{
				if (j > 0 && (i == 0 || lcsMatrix[i, j - 1] >= lcsMatrix[i - 1, j]))
				{
					GetDiffTreeFromBacktrackMatrix(lcsMatrix, s1, s2, i, j - 1);
					Console.WriteLine("+ " + s2[j - 1]);
				}
				else if (i > 0 && (j == 0 || lcsMatrix[i, j - 1] < lcsMatrix[i - 1, j]))
				{
					GetDiffTreeFromBacktrackMatrix(lcsMatrix, s1, s2, i - 1, j);
					Console.WriteLine("- " + s1[i - 1]);
				}
			}
		}
	}

public class Program
	{
		static void Main(string[] args)
		{
			string s1 = "1ac";
			string s2 = "abc";

			LCS.GetDiffTreeFromBacktrackMatrix(LCS.GetLongestCommonSubsequenceMatrix(s1, s2), s1, s2, s1.Length, s2.Length); 

			Console.ReadKey();
		}
	}
}

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit Post to StumbleUpon

Tags: , , , , , ,

10 Responses to “Longest Common Subsequence – Diff Algortithm in C#”

  1. Interesting post. I wonder if it would be useful to include this code as a
    feature in MonoDevelop, for example: a command in the context menu to be able to “compare” files.

  2. Jay Parzych says:

    I have been using this one for the past year or
    so.

    http://razor.occams.info/code/diff/

    even though the author recently put a disclaimer on the site.

    here is another
    one:

    http://www.menees.com/DiffDotNet.htm

  3. alexmipego says:

    I know about the 2 tools you
    mention and a few more pieces of code arround. Some have licensing issues and/or are not supported anymore. So I took the challenge of doing one on my own… I mean, isn’t that algorithm sooooooo
    pretty?

    I think I’m in love with it…

  4. Jay Parzych says:

    indeed it is

  5. Scott Peterson says:

    I think
    you could save yourself an IL instruction or two by doing this:

    if (i == 0) {
    if (j == 0)
    lcsMatrix[i, j] = 0;
    else
    lcsMatrix[i, j] = Math.Max(0, lcsMatrix[i, j –
    1]);
    } else {
    if (j == 0)
    lcsMatrix[i, j] = Math.Max(lcsMatrix[i – 1, j], 0);
    else
    lcsMatrix[i, j] = Math.Max(lcsMatrix[i – 1, j], lcsMatrix[i, j –
    1]);
    }

  6. alexmipego says:

    @Scott,

    I guess you’re right,
    I didn’t spend much time optimizing that code yet :) I spent much time looking at python and other language example so I was under their influente at the late hour I wrote that code
    😉

  7. Peter Chappell says:

    Nice work,

    Why don’t you apply diff between your algorithm and this one
    http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Longest_common_subsequence.

    😀

  8. alexmipego says:

    @Peter,

    I used that as a
    base although there are small differences 😉 I plan to optimize the code and insert some other optimizations I’ve been reading about on several sources including the subversion source
    code.

    And of course, once I’ve completed the generics implementation it will be noteworthy do add it to the wikibooks.

  9. matthias says:

    sthg. isn’t working correct i think.
    test it with:
    string s1 = “HelloWorld”;
    string s2 = “HelloWOrld”;
    the result is:

    H
    e
    l
    l
    – o
    – W
    o
    + W
    + O
    r
    l
    d

    But in fact i just deleted o and added O…
    Is this a normal behavior?

  10. Denis says:

    This code isn’t working correct it’ll crash on

    string s1 = “1ac”;
    string s2 = “abcd”;
    The correct one is:

    static int[,] FillLcs(string source, string target)
    {
    var lcsMatrix = new int[source.Length + 1,target.Length + 1];
    for (var sourceIndex = 0; sourceIndex < source.Length; sourceIndex++)
    for (var tagetIndex = 0; tagetIndex 0 && j > 0 && X[i – 1] == Y[j – 1])
    {
    printDiff(C, X, Y, i – 1, j – 1);
    Console.WriteLine(” ” + X[i – 1]);
    }
    else
    {
    if (j > 0 && (i == 0 || C[i, j – 1] >= C[i – 1, j]))
    {
    printDiff(C, X, Y, i, j – 1);
    Console.WriteLine(“+ ” + Y[j – 1]);
    }
    else if (i > 0 && (j == 0 || C[i, j – 1] < C[i – 1, j]))
    {
    printDiff(C, X, Y, i – 1, j);
    Console.WriteLine("- " + X[i – 1]);
    }
    else
    Console.WriteLine();
    }
    }

Leave a Reply

For spam filtering purposes, please copy the number 5319 to the field below: