
Hashing Big Files With Style (Getting A Progress Status)

I wanted to hash a lot of files, and I soon found out how painful it is not to know how long it will take when some of the files are huge. I also found that Mono has no built-in support for this, but luck smiled upon me when I found HashAlgorithm.TransformBlock and HashAlgorithm.TransformFinalBlock, which let you feed data to the hash one block at a time. So I hacked together an event-driven, async-like hashing class with some goodies:

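using System;
using System.IO;
using System.Security.Cryptography;

public class ASyncFileHashAlgorithm
{
	protected HashAlgorithm hashAlgorithm;
	protected byte[] hash;
	protected volatile bool cancel = false; // volatile: Cancel() is normally called from another thread
	protected int bufferSize = 4096;
	public delegate void FileHashingProgressHandler (object sender, FileHashingProgressArgs e);
	public event FileHashingProgressHandler FileHashingProgress;

	public ASyncFileHashAlgorithm(HashAlgorithm hashAlgorithm)
	{
		this.hashAlgorithm = hashAlgorithm;
	}

	public byte[] ComputeHash(Stream stream)
	{
		cancel = false;
		hash = null;
		int _bufferSize = bufferSize; // snapshot: changing BufferSize while computing has no effect

		byte[] readAheadBuffer, buffer;
		int readAheadBytesRead, bytesRead;
		long size, totalBytesRead = 0;

		size = stream.Length; // requires a seekable stream
		hashAlgorithm.Initialize(); // discard any state left over from a previous or cancelled run

		// Two buffers are reused in turn: one holds the block being hashed,
		// the other receives the next block.
		buffer = new byte[_bufferSize];
		readAheadBuffer = new byte[_bufferSize];
		readAheadBytesRead = stream.Read(readAheadBuffer, 0, readAheadBuffer.Length);

		totalBytesRead += readAheadBytesRead;

		do
		{
			// The block read ahead becomes the current block; the old buffer is recycled.
			byte[] recycled = buffer;
			bytesRead = readAheadBytesRead;
			buffer = readAheadBuffer;
			readAheadBuffer = recycled;

			readAheadBytesRead = stream.Read(readAheadBuffer, 0, readAheadBuffer.Length);

			totalBytesRead += readAheadBytesRead;

			// Nothing left to read ahead means the current block is the last one.
			if (readAheadBytesRead == 0)
				hashAlgorithm.TransformFinalBlock(buffer, 0, bytesRead);
			else
				hashAlgorithm.TransformBlock(buffer, 0, bytesRead, buffer, 0);

			// Copy the delegate first: the event is null when nobody has subscribed.
			FileHashingProgressHandler handler = FileHashingProgress;
			if (handler != null)
				handler(this, new FileHashingProgressArgs(totalBytesRead, size));
		} while (readAheadBytesRead != 0 && !cancel);

		if (cancel)
			return hash = null;

		return hash = hashAlgorithm.Hash;
	}

	public int BufferSize
	{
		get
		{ return bufferSize; }
		set
		{
			// A zero-length buffer would make every read return 0 bytes.
			if (value <= 0)
				throw new ArgumentOutOfRangeException("value");
			bufferSize = value;
		}
	}

	public byte[] Hash
	{
		get
		{ return hash; }
	}

	public void Cancel()
	{
		cancel = true;
	}

	public override string ToString ()
	{
		// Hash is null before the first ComputeHash call and after a cancelled run.
		if (hash == null)
			return "";

		// BitConverter yields "AB-CD-..."; drop the dashes and lower-case the result.
		return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
	}
}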
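
The class above refers to FileHashingProgressArgs, which is not shown in this post. What follows is only a minimal sketch of what it could look like, inferred from how it is constructed in ComputeHash and read in the sample handler (a constructor taking the processed and total sizes, plus ProcessedSize and TotalSize properties). The Percent property is an extra added purely for illustration.

```csharp
using System;

// Hypothetical definition: inferred from usage, not taken from the original code.
public class FileHashingProgressArgs : EventArgs
{
	private readonly long processedSize;
	private readonly long totalSize;

	public FileHashingProgressArgs(long processedSize, long totalSize)
	{
		this.processedSize = processedSize;
		this.totalSize = totalSize;
	}

	// Bytes read from the stream so far.
	public long ProcessedSize { get { return processedSize; } }

	// Total length of the stream in bytes.
	public long TotalSize { get { return totalSize; } }

	// Illustrative extra: progress on a 0..100 scale; an empty stream counts as done.
	public double Percent
	{
		get { return totalSize == 0 ? 100.0 : processedSize * 100.0 / totalSize; }
	}
}
```

With something like this in place, a progress handler could print e.Percent instead of the raw byte counts.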

This class raises an event that reports the progress of the hashing. Since it is intended to be used in a larger application, I decided to make the process cancelable and to add a handy conversion to a hex string. Here is some sample usage code (you can replace SHA1 with MD5 or another hashing algorithm):

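using System;
using System.IO;
using System.Security.Cryptography;
using System.Threading;

public class Program
{
	static ASyncFileHashAlgorithm hasher = new ASyncFileHashAlgorithm(SHA1.Create());

	public static void Main(string[] args)
	{
		Console.WriteLine("Starting...");

		hasher.FileHashingProgress += OnFileHashingProgress;

		// "bigfile" is a placeholder path. OpenRead opens it read-only,
		// and the using block closes it when hashing is done.
		using (Stream stream = File.OpenRead("bigfile"))
		{
			// Hash on a worker thread so another thread stays free to call hasher.Cancel().
			Thread t = new Thread(
				delegate() { hasher.ComputeHash(stream); }
			);
			t.Start();
			t.Join(); // this demo simply waits for the worker to finish
		}

		// hasher.Hash has the byte[] computed hash
		Console.WriteLine(hasher); // ToString() converts to a string representing the hex value of the hash
	}

	public static void OnFileHashingProgress(object sender, FileHashingProgressArgs e)
	{
		Console.WriteLine(e.ProcessedSize + " of " + e.TotalSize);
	}
}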
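
If you want to convince yourself that feeding the hash block by block gives the same digest as hashing everything in one go, a small self-contained check along these lines should do. It is only a sketch: the ChunkedHashCheck class, the HashInBlocks helper, and the random test data are made up for the example and are not part of the class above.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class ChunkedHashCheck
{
	// Hypothetical helper: hashes a stream one block at a time.
	public static byte[] HashInBlocks(HashAlgorithm algorithm, Stream stream, int bufferSize)
	{
		algorithm.Initialize();
		byte[] buffer = new byte[bufferSize];
		int bytesRead;

		// Feed every block that the stream hands back.
		while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
			algorithm.TransformBlock(buffer, 0, bytesRead, buffer, 0);

		// An empty final block is allowed and finalizes the digest.
		algorithm.TransformFinalBlock(buffer, 0, 0);
		return algorithm.Hash;
	}

	public static void Main()
	{
		// Made-up test data; the size is deliberately not a multiple of the block size.
		byte[] data = new byte[10000];
		new Random(42).NextBytes(data);

		byte[] blockwise, oneShot;
		using (SHA1 sha1 = SHA1.Create())
			blockwise = HashInBlocks(sha1, new MemoryStream(data), 4096);
		using (SHA1 sha1 = SHA1.Create())
			oneShot = sha1.ComputeHash(data);

		// Both digests should be identical.
		Console.WriteLine(BitConverter.ToString(blockwise) == BitConverter.ToString(oneShot)); // prints True
	}
}
```

Unlike the class above, this helper does not read ahead; it just finishes with an empty final block, which keeps the loop shorter at the cost of one extra call.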

Now, when I have some more time to spare, I’ll implement some fancy console (and maybe gtk#) interfaces. Hopefully I’ll manage to complete my initial goal of creating a file comparison tool.


9 Responses to “Hashing Big Files With Style (Getting A Progress Status)”

  1. […] file managers for Linux irssi, putty and unicode characters Using
    consistent data types for columns Mono: Hashing Big Files With Style (Getting A Progress Status) Zend Framework and Rapid Application Development with PHP Spammers choose […]

  2. This is awesome. I’ve been pondering something like this for a while. At my employer we primarily use Open Source software, and to be blunt, ILM (Information Life-cycle Management) support in Open Source is pretty darn terrible (bordering on non-existent). Supporting things like a data-retention policy involves a lot of hackage.

    If there were a system process that periodically checksummed files under a list of roots, recorded their MIME types, and stored the results in something like an SQLite database, it would be fantastic.

    Then we could easily (1) report to departments duplication of files within a reasonable approximation of when the duplication occurred, (2) report to departments unchanged files after a certain age, and (3) report to management data demographics [30% PowerPoint files, 15% Excel, etc.]. While this is currently possible, it isn’t easy. Things like shell scripts find fascinating ways of failing when dealing with hundreds of thousands of files with user-created names [OMG! You wouldn’t believe the filenames users come up with!].

    Ideally, a cron job that re-hashes modified files, plus a service that exposes that database through a simple web service so it can be queried from an intranet page, could automate, I’d wager, 95% of this crap.

    On the other hand, I wonder if Beagle run as a privileged user could be coerced into providing this information? I’ve searched around and not been able to find anything on that.

  3. pablo says:

    Hi,

    Great article!

    I’m very interested in hashing. Indeed I have been running some tests myself over the weekend.

    Can you publish some info about the results you get?

  4. Seth says:

    IMHO, and in the good old UNIX spirit, fortune smiled on me a lot brighter, and a lot longer ago, when I found out that someone had built exactly this in a generic fashion for anything you can pipe!

    Its name is pv and it works.

    More specifically, it probably works for you like

    pv file.big | md5sum

    or some slightly modified incantation. This has all kinds of added benefits (parallelization?)

    pv file.big | tee >(md5sum) >(shasum) >(mkisofs) >(cdrecord)

    From the pv man-page: [QUOTE]
    A more complicated example using numeric output to feed into the dialog(1) program for a full-screen progress display:

    (tar cf - . \
    | pv -n -s `du -sb . | awk '{print $1}'` \
    | gzip -9 > out.tgz) 2>&1 \
    | dialog --gauge 'Progress' 7 70

    Frequent use of this third form is not recommended as it may cause the programmer to overheat.

    [/QUOTE]

    Just thought you might like the alternative perspective :)

  5. alexmipego says:

    @pablo
    What exactly are you looking for?

  6. Alan says:

    You can make it faster still by about 10-15%.

    If you use stream.BeginRead and stream.EndRead to asynchronously read from the disk, you can be loading the next block into memory while hashing the current block. This is what I do in my own application.

  7. alexmipego says:

    Hey! Nice tip!
    Thanks!

  8. Camillo says:

    How about some fast hashing of very, very big files, from 20GB to 100GB?
    How long should the resulting hash string be?
    Would it be better to compare hashes of smaller chunks?

    Camillo

  9. alexmipego says:

    I guess that, depending on your objective, comparing chunks would be better. The longer the file, the greater the chance of a hash collision. Theoretically, the longer the hash string, the more reliable it should be, but that isn’t true for every algorithm and there’s no point in picking another hash just for that.

    Make sure you check the timestamp before and after the operation (to make sure no changes were made in between) and use chunks to make it faster.
