Thursday, February 3, 2011

Binary patch-generation in C#

Does anyone have, or know of, a binary patch generation algorithm implementation in C#?

Basically, compare two files (designated old and new), and produce a patch file that can be used to upgrade the old file to have the same contents as the new file.

The implementation would have to be relatively fast and work with huge files. It should run in O(n) time, or at worst O(n log n).

My own algorithms tend to either be lousy (fast but produce huge patches) or slow (produce small patches but have O(n^2) runtime).

Any advice, or pointers for implementation would be nice.

Specifically, the implementation will be used to keep servers in sync for various large datafiles that we have one master server for. When the master server datafiles change, we need to update several off-site servers as well.

The most naive algorithm I have made, which only works for files that can be kept in memory, is as follows:

  1. Grab the first four bytes from the old file, call this the key
  2. Add those bytes to a dictionary mapping key -> position, where position is the offset at which I grabbed those 4 bytes (0 to begin with)
  3. Skip the first of these four bytes, grab another 4 (3 overlapping, 1 new), and add them to the dictionary the same way
  4. Repeat steps 1-3 for all 4-byte blocks in the old file
  5. From the start of the new file, grab 4 bytes, and attempt to look it up in the dictionary
  6. If found, find the longest match if there are several, by comparing bytes from the two files
  7. Encode a reference to that location in the old file, and skip the matched block in the new file
  8. If not found, encode 1 byte from the new file, and skip it
  9. Repeat steps 5-8 for the rest of the new file
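The steps above can be sketched roughly as follows. This is a minimal illustration in Python rather than C#, kept short so it is easy to run. The names `build_index`, `make_patch`, and `apply_patch`, and the tuple-based patch representation, are assumptions made for the example and are not taken from the actual implementation.

```python
from collections import defaultdict

KEY_LEN = 4  # the 4-byte key size from the steps above


def build_index(old: bytes) -> dict:
    """Steps 1-4: map every overlapping 4-byte block of `old` to its positions."""
    index = defaultdict(list)
    for pos in range(len(old) - KEY_LEN + 1):
        index[old[pos:pos + KEY_LEN]].append(pos)
    return index


def match_length(old: bytes, old_pos: int, new: bytes, new_pos: int) -> int:
    """Count how many bytes agree starting at the two given positions."""
    n = 0
    while (old_pos + n < len(old) and new_pos + n < len(new)
           and old[old_pos + n] == new[new_pos + n]):
        n += 1
    return n


def make_patch(old: bytes, new: bytes) -> list:
    """Steps 5-9: emit ("copy", old_pos, length) or ("literal", one_byte) ops."""
    index = build_index(old)
    ops = []
    i = 0
    while i < len(new):
        best_pos, best_len = -1, 0
        # Step 6: among all candidate positions, keep the longest match.
        for cand in index.get(new[i:i + KEY_LEN], ()):
            length = match_length(old, cand, new, i)
            if length > best_len:
                best_pos, best_len = cand, length
        if best_len >= KEY_LEN:
            ops.append(("copy", best_pos, best_len))  # step 7
            i += best_len
        else:
            ops.append(("literal", new[i:i + 1]))     # step 8
            i += 1
    return ops


def apply_patch(old: bytes, ops: list) -> bytes:
    """Rebuild the new file from the old file plus the patch ops."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, pos, length = op
            out += old[pos:pos + length]
        else:
            out += op[1]
    return bytes(out)
```

Note that step 6 is where the running time can degrade: if one 4-byte key occurs at very many positions (long runs of zero bytes, for instance), every candidate gets compared, and the worst case becomes quadratic.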

This is somewhat like compression without windowing, so it will use a lot of memory. It is, however, fairly fast, and produces quite small patches, as long as I keep the output codes compact.

A more memory-efficient algorithm uses windowing, but produces much bigger patch files.

There are more nuances to the above algorithm that I skipped in this post, but I can post more details if necessary. I do, however, feel that I need a different algorithm altogether, so improving on the above algorithm is probably not going to get me far enough.


Edit #1: Here is a more detailed description of the above algorithm.

First, combine the two files, so that you have one big file. Remember the cut-point between the two files.

Secondly, perform the "grab 4 bytes and add their position to the dictionary" step for every position in the combined file.

Thirdly, starting from where the new file begins, run the loop that looks up each 4-byte combination and finds the longest match. Make sure we only consider positions from the old file, or from earlier in the new file than the current position. This ensures that we can reuse material from both the old and the new file during patch application.
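A rough sketch of this combined-file variant, again in Python and under stated assumptions: the function names and the op tuples are hypothetical, and instead of indexing the whole combined file up front and filtering candidates, the sketch adds new-file positions to the dictionary only after they have been passed, which enforces the same "old file or earlier in the new file" rule. A copy position here is an offset into the combined buffer (old file followed by the output produced so far).

```python
KEY_LEN = 4  # same 4-byte key size as before


def make_patch_combined(old: bytes, new: bytes) -> list:
    data = old + new   # one big buffer
    cut = len(old)     # the cut-point between the two files
    index = {}         # 4-byte key -> positions already allowed as sources

    def add(pos: int) -> None:
        if pos + KEY_LEN <= len(data):
            index.setdefault(data[pos:pos + KEY_LEN], []).append(pos)

    for pos in range(cut):  # every old-file position is a valid source
        add(pos)

    ops = []
    i = cut
    while i < len(data):
        best_pos, best_len = -1, 0
        for cand in index.get(data[i:i + KEY_LEN], ()):
            n = 0
            # cand < i always holds, so cand + n stays in range while i + n does.
            while i + n < len(data) and data[cand + n] == data[i + n]:
                n += 1
            if n > best_len:
                best_pos, best_len = cand, n
        if best_len >= KEY_LEN:
            ops.append(("copy", best_pos, best_len))
            step = best_len
        else:
            ops.append(("literal", data[i:i + 1]))
            step = 1
        for pos in range(i, i + step):  # positions just passed become sources
            add(pos)
        i += step
    return ops


def apply_patch_combined(old: bytes, ops: list) -> bytes:
    buf = bytearray(old)  # old file, with the new file growing after it
    for op in ops:
        if op[0] == "copy":
            _, pos, length = op
            for k in range(length):  # byte by byte: source may overlap the output
                buf.append(buf[pos + k])
        else:
            buf += op[1]
    return bytes(buf[len(old):])
```

The byte-by-byte copy during application matters: a match may start before the current position and run past it, so the source range can overlap bytes that the same copy is still producing.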


Edit #2: Source code to the above algorithm

You might get a warning about the certificate having some problems. I don't know how to resolve that so for the time being just accept the certificate.

The source uses lots of other types from the rest of my library so that file isn't all it takes, but that's the algorithm implementation.

  • It might be worth checking out what some of the other guys are doing in this space and not necessarily in the C# arena either.

    This is a library written in C#

    SVN also has a binary diff algorithm and I know there's an implementation in python although I couldn't find it with a quick search. They might give you some ideas on where to improve your own algorithm

    Simon Buchan : SVN uses the xdelta algorithm (from a look at the source, at least)
    From lomaxx
  • @lomaxx, I have tried to find good documentation for the algorithm used in Subversion, called xdelta, but unless you already know how the algorithm works, the documents I've found fail to tell me what I need to know.

    Or perhaps I'm just dense... :)

    I took a quick peek at the algorithm from that site you gave, and it is unfortunately not usable. A comment from the binary diff file says:

    Finding an optimal set of differences requires quadratic time relative to the input size, so it becomes unusable very quickly.

    My needs aren't optimal though, so I'm looking for a more practical solution.

    Thanks for the answer though, added a bookmark to his utilities if I ever need them.

    Edit #1: Note, I will look at his code to see if I can find some ideas, and I'll also send him an email later with questions, but I've read that book he references and though the solution is good for finding optimal solutions, it is impractical in use due to the time requirements.

    Edit #2: I'll definitely hunt down the python xdelta implementation.

  • Sorry I couldn't be more help. I would definitely keep looking at xdelta, because I have used it a number of times to produce quality diffs on 600MB+ ISO files we have generated for distributing our products, and it performs very well.

    From lomaxx
  • Yes, xdelta is good. It does, however, work on relatively small windows (100 KB if I'm not mistaken), but with a working implementation of it I could easily tweak that for our data.

    The window size was chosen for speed for subversion if I'm not mistaken, but our code can easily run a bit longer, as long as it doesn't need to take all night (which my current implementation does).

  • If this is for installation or distribution, have you considered using the Windows Installer SDK? It has the ability to patch binary files.

    http://msdn.microsoft.com/en-us/library/aa370578(VS.85).aspx

    From TimM
  • Have you seen VCDiff? It is part of a Misc library that appears to be fairly active (last release r259, April 23rd 2008). I haven't used it, but thought it was worth mentioning.

    Lasse V. Karlsen : Thanks, I'll look at VCDiff.
  • This is a rough guideline, but the following is for the rsync algorithm which can be used to create your binary patches.

    http://rsync.samba.org/tech_report/tech_report.html

    From jtalarico
