One of the important characteristics of modern software systems is their ability to be upgraded, which may be called “upgradability.” Old software is continuously being replaced by newer versions, and code reusability and modular development are major features of software design.
Accuracy
When software is upgraded from an old version to a new version, complete accuracy is vital. Every bit in the newly upgraded software in the target computer must match exactly with the new software at its media source. Otherwise, the new software may operate incorrectly or not at all.
To assure complete accuracy, conventional techniques completely replace the old software with the new software. As software programs (in particular major application suites and operating systems) grow in size and complexity, this wholesale replacement-to-update scheme becomes more time-consuming and frustrating to the customer of such software.
Aggravating matters is a trend to move the source of such updates from local, portable, high-bandwidth removable media (such as a CD-ROM) to remote, centralized, relatively low-bandwidth network servers (such as Internet web servers). While replacing 100 MB of software from a CD-ROM may take several minutes, replacing the same amount of software over a dial-up Internet connection may take several hours.
Herein, “complete accuracy” and “substantially identical” allow for minor and insubstantial differences between the new software as originally produced and the new software as it exists on a user's computer.
Conventional Delta-Patching
Typically, newer versions of software have a few additional portions, as well as some minor changes in older portions. Therefore, the brute force approach of completely replacing the old with the new is overkill. An alternative is to capture these changes into a “patch” so that one can reconstruct the newer version from the older one. Because there are differences between the old and new versions, this technique is sometimes called “diff-patching.” Herein, the differences between the old and new versions are called the “delta” (Δ), thus this diff-patching technique may be called “delta-patching” (or “Δ-patching”).
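The idea of capturing only the changes can be illustrated with a minimal byte-level sketch. It is not the technique of this document; it assumes Python's standard `difflib.SequenceMatcher` as the differencing engine, and the names `make_delta` and `apply_delta` are hypothetical.

```python
from difflib import SequenceMatcher

def make_delta(old: bytes, new: bytes):
    """Return a list of operations that rebuild `new` from `old`."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Bytes already present in the old version: refer to them by range.
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # Bytes that exist only in the new version: ship them verbatim.
            ops.append(("add", new[j1:j2]))
        # "delete": the old bytes are simply never copied.
    return ops

def apply_delta(old: bytes, ops) -> bytes:
    """Reconstruct the new version from the old version plus the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)

old = b"int a = 3; return a;"
new = b"int a = 3; int b = 4; return a;"
delta = make_delta(old, new)
rebuilt = apply_delta(old, delta)
```

Only the `add` operations carry payload, so the delta is smaller than the new version whenever the two versions share any content.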
The problem with delta-patching is accuracy. Identifying what is and is not patched is difficult. If the boundaries of such a patch are not accurately determined, the patched version will be different from the desired new version of the software.
As a result, conventional delta-patching compromises efficiency to achieve accuracy. Generally, the sub-module files, data files, library files, and groups of such files are marked if there is any change whatsoever within them. This means, for example, that if one line of source code is changed within a 100 Kb DLL (dynamic link library) file, the entire DLL file is replaced. This is done rather than replacing the fragment in the existing DLL file in part because of the difficulty in selecting the fragment that needs replacing and replacing only that with complete accuracy. However, it is mostly done because replacing the entire module is more efficient with conventional techniques. A small change in one little fragment might appear to be a change spread all over the entire program.
Although conventional delta-patching is more efficient and faster than wholesale replacement of the entire software, it still is not as efficient as possible. It would be more efficient to patch only those fragments of modules and sub-modules that are different from or non-existent in the old software version. Examples of fragments include subroutines, functions, objects, data structures, interfaces, methods, or portions of any of these.
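The cost of file-level marking can be seen in a small sketch. It assumes, for illustration only, that each version is a mapping from file name to file contents and that files are compared by SHA-256 digest; the names `files_to_replace` and `patch_size` are hypothetical.

```python
import hashlib

def files_to_replace(old_files: dict, new_files: dict) -> list:
    """Names of files that are new or differ in any byte from the old version."""
    changed = []
    for name, new_data in new_files.items():
        old_data = old_files.get(name)
        if old_data is None:
            changed.append(name)
        elif hashlib.sha256(old_data).digest() != hashlib.sha256(new_data).digest():
            changed.append(name)
    return sorted(changed)

def patch_size(new_files: dict, changed: list) -> int:
    """Bytes shipped when every marked file is replaced in full."""
    return sum(len(new_files[name]) for name in changed)

# A single changed byte in a 100,000-byte library marks the whole file.
old_files = {"core.dll": bytes(100_000), "ui.dll": bytes(50_000)}
modified = bytearray(old_files["core.dll"])
modified[500] = 1
new_files = {"core.dll": bytes(modified), "ui.dll": old_files["ui.dll"]}
changed = files_to_replace(old_files, new_files)
```

In this sketch the one-byte change leads to shipping all 100,000 bytes of the marked file, which is the inefficiency at issue.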
Invariant Fragment Detection
A prerequisite for detecting fragment deltas is the ability to detect invariance of fragments. In other words, before a program module can be patched, one needs to determine which fragments have not been changed across the two versions. With knowledge of the source code for each version, detecting such invariants and creating a patch is not very difficult.
However, detecting invariance of fragments becomes much more difficult when dealing with binary manifestations of such fragments (with no knowledge of the source code). A major difficulty is the existence of functionally unchanged code that appears different in the differing versions of a program module. Code may undergo no change in its functionality, but it may look different in the two versions due to a variety of reasons. Examples of such reasons include:
        Changes in one region of code can cause another (unchanged) region to look different
        Two small sequences of binary code may look identical even if they correspond to source code with different functionality
        Differences in the register allocation in the two builds
Change Begets Apparent Change
Often small changes in one portion of the code cause a cascade of changes in nearby and sometimes even far-off regions of code. Consider, for example, the following two source fragments:
Program P1                     Program P2
function f(int p)              function g(int p)
int a=3, b=4;                  int b=4, a=3;
if (b > p) {                   if (b > p) {
a = p;                         a = p;
return a;                      return a;
The two functions f and g, located in the two programs P1 and P2, are really the same, apart from a difference of names. Clearly, knowledge of the source code would establish that the “if (b>p)” conditional in each fragment is the same, and need not be patched. However, if their corresponding binaries are examined, the offset of b from the base of the stack would be different in these two fragments. This is because the declaration of a before b in P1 differs in form from the declaration of b before a in P2. Hence, the binaries of the two fragments will not be identical, even if everything else were the same. Of course, these differences in form are irrelevant in substance, but their resulting binaries are different nevertheless.
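The effect of declaration order on stack offsets can be shown with a toy model. It assumes, purely for illustration, that each local variable occupies one 4-byte slot and that slots are assigned in declaration order; real compilers may lay out the stack differently. The function `stack_offsets` is hypothetical.

```python
def stack_offsets(declared_names, slot_size=4):
    """Map each local variable to a hypothetical offset from the frame base,
    assigning one slot per variable in declaration order."""
    return {name: slot_size * (index + 1)
            for index, name in enumerate(declared_names)}

p1 = stack_offsets(["a", "b"])  # P1 declares: int a=3, b=4;
p2 = stack_offsets(["b", "a"])  # P2 declares: int b=4, a=3;

# The same source-level test "if (b > p)" reads b from a different slot
# in each program, so the emitted operand differs.
load_b_p1 = f"mov eax, dword ptr[ebp-{p1['b']}]"
load_b_p2 = f"mov eax, dword ptr[ebp-{p2['b']}]"
```

Under this model the two loads of b differ only in their displacement, yet a byte-for-byte comparison would report the conditional as changed.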
Now consider the following snippets:
Program P1                     Program P2
x = f(10)                      x = g(10)
Assume, for this example, that the functions f and g are defined as in the previous example. Here again, the two calls are identical, because the functions being called as well as the call arguments are identical. However, if the identity of f and g is not known, then the identity of the calls above will also not be discovered. This is an example of how local changes can cascade through potentially far-off regions of code.
Appear Identical, but Are Not
At times, two binary fragments may look identical even though they correspond to different regions in the structure of the corresponding programs. Consider the following:
Program P1                     Program P2
int a = atoi(argv[1]);         int b = atoi(argv[2]);
int b = atoi(argv[2]);         if (b < 10) return;
if (a < 10) return;            . . .
if (b < 20) return;
. . .
The conditionals “if (a<10)” in P1 and “if (b<10)” in P2 might both translate to the same binary code, even though their functionality is different (as is seen clearly by examining their source code). This happens because the offset of b on the stack in P2 may be the same as that of a on the stack of P1. The two variables are clearly different, being defined by different program arguments, as can be seen in the context above them. However, comparing their binary equivalents without reference to the source code context above can give the illusion of an identity. A representation of the binary equivalent might look something like:
mov eax, dword ptr[ebp+8h]
cmp eax, 0ah
jge L
ret
L: . . .
Register Allocation
Another problem in detecting identity of binary fragments is caused by register allocation. A change in a portion of code may cause the register allocation to change in nearby regions, even though these latter regions have not been modified. Therefore, when comparing binaries, one has to consider the possibility that what looks like a change of register operands may in fact be an identity disguised by a simple renaming of registers.
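One simple way to see through a renaming of registers is to rewrite each block into a canonical form before comparing. The sketch below is an assumption offered for illustration, not the method of this document: it renames general-purpose registers to r0, r1, and so on in order of first appearance. The register list and the function `normalize_registers` are hypothetical, and the frame pointer ebp is deliberately left untouched.

```python
import re

# General-purpose registers considered interchangeable in this sketch.
REGISTERS = ("eax", "ebx", "ecx", "edx", "esi", "edi")
_REG_PATTERN = re.compile(r"\b(" + "|".join(REGISTERS) + r")\b")

def normalize_registers(instructions):
    """Rename registers to r0, r1, ... in order of first appearance."""
    mapping = {}

    def rename(match):
        reg = match.group(1)
        if reg not in mapping:
            mapping[reg] = f"r{len(mapping)}"
        return mapping[reg]

    return [_REG_PATTERN.sub(rename, ins) for ins in instructions]

block_old = ["mov eax, dword ptr[ebp+8h]", "cmp eax, 0ah"]
block_new = ["mov ecx, dword ptr[ebp+8h]", "cmp ecx, 0ah"]

same_after_renaming = (
    normalize_registers(block_old) == normalize_registers(block_new)
)
```

The two blocks differ textually but normalize to the same sequence, while a block that genuinely uses two different registers normalizes to a different sequence.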
Described herein is a technology for generating a minimum delta between at least two program binaries. An implementation, described herein, is given a source program (S) in a binary format and a target program (T) in a binary format. It constructs control flow graphs (CFGs) of each. It matches common blocks of S's CFGs and T's CFGs. The blocks are matched based upon their content and their local neighborhoods (e.g., d-neighborhoods). In addition, blocks are matched using labels, which are based upon computed hash values. The matching is done in multiple passes where each pass improves the matching by relaxing the criteria for a match. In addition, the register renaming problem is solved so that blocks can be fairly compared.
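The multi-pass matching can be sketched as follows. This is a simplified illustration under stated assumptions, not the described implementation: each CFG is a dictionary of block contents plus a dictionary of successor lists, a d-neighborhood is taken to mean the blocks reachable in at most d successor steps, labels are SHA-256 hashes, and each pass relaxes the criterion by reducing d. The names `neighborhood`, `label`, and `match_blocks` are hypothetical.

```python
import hashlib

def neighborhood(edges, start, d):
    """Ids of blocks reachable from `start` in at most d successor steps."""
    seen = {start}
    frontier = [start]
    for _ in range(d):
        next_frontier = []
        for block in frontier:
            for succ in edges.get(block, []):
                if succ not in seen:
                    seen.add(succ)
                    next_frontier.append(succ)
        frontier = next_frontier
    seen.discard(start)
    return seen

def label(blocks, edges, block_id, d):
    """Hash of a block's content plus the contents of its d-neighborhood."""
    parts = [blocks[block_id]]
    parts += sorted(blocks[n] for n in neighborhood(edges, block_id, d))
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()

def match_blocks(s_blocks, s_edges, t_blocks, t_edges, max_d=2):
    """Return {T block id: S block id}, relaxing d from max_d down to 0."""
    matched, used_s = {}, set()
    for d in range(max_d, -1, -1):
        # Index the still-unmatched S blocks by their label at this d.
        index = {}
        for sid in s_blocks:
            if sid not in used_s:
                index.setdefault(label(s_blocks, s_edges, sid, d), []).append(sid)
        # Pair each still-unmatched T block with an S block of equal label.
        for tid in t_blocks:
            if tid in matched:
                continue
            candidates = index.get(label(t_blocks, t_edges, tid, d), [])
            if candidates:
                sid = candidates.pop(0)
                matched[tid] = sid
                used_s.add(sid)
    return matched

s_blocks = {"a": "mov;cmp", "b": "add", "c": "ret"}
s_edges = {"a": ["b"], "b": ["c"]}
t_blocks = {"x": "mov;cmp", "y": "sub", "z": "ret"}
t_edges = {"x": ["y"], "y": ["z"]}
matching = match_blocks(s_blocks, s_edges, t_blocks, t_edges)
unmatched = [tid for tid in t_blocks if tid not in matching]
```

In this example block z matches c in the strictest pass, block x matches a only once the neighborhood is ignored (d = 0), and block y, whose content changed, stays unmatched.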
This described implementation produces an intermediate output, which is the content of unmatched blocks. Such unmatched blocks are those found in T that are not found in S. It generates a set of edge edit operations for merging the unmatched blocks into S. The combination of the unmatched blocks and the edit operations is the delta. To patch S to produce a reconstructed copy of T, the delta is merged with S.
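The composition of the delta and its merge with S can be sketched as below. The representation is an assumption made for illustration: edges are a set of (from, to) pairs, the matching maps T block ids to S block ids, and new blocks keep their T ids, which are assumed not to collide with S ids. The names `make_graph_delta` and `merge_delta` are hypothetical.

```python
def make_graph_delta(s_edges, t_blocks, t_edges, matched):
    """Build a delta from the unmatched T blocks and the edge edits.

    `matched` maps T block ids to S block ids; every other T block is new.
    """
    def to_s(tid):
        # Matched blocks are named by their S id; new blocks keep their T id.
        return matched.get(tid, tid)

    new_blocks = {tid: body for tid, body in t_blocks.items()
                  if tid not in matched}
    wanted = {(to_s(u), to_s(v)) for u, v in t_edges}
    edge_ops = [("add", u, v) for u, v in sorted(wanted - s_edges)]
    edge_ops += [("remove", u, v) for u, v in sorted(s_edges - wanted)]
    return {"blocks": new_blocks, "edge_ops": edge_ops}

def merge_delta(s_blocks, s_edges, delta):
    """Merge the delta into S to reconstruct a copy of T's graph."""
    blocks = dict(s_blocks)
    blocks.update(delta["blocks"])
    edges = set(s_edges)
    for op, u, v in delta["edge_ops"]:
        if op == "add":
            edges.add((u, v))
        else:
            edges.discard((u, v))
    # S blocks that no longer appear in T are left in place, unreferenced.
    return blocks, edges

s_blocks = {"a": "mov;cmp", "b": "add", "c": "ret"}
s_edges = {("a", "b"), ("b", "c")}
t_blocks = {"x": "mov;cmp", "y": "sub", "z": "ret"}
t_edges = {("x", "y"), ("y", "z")}
matched = {"x": "a", "z": "c"}

delta = make_graph_delta(s_edges, t_blocks, t_edges, matched)
blocks, edges = merge_delta(s_blocks, s_edges, delta)
```

Here the delta carries only the content of block y and four edge edits; after the merge, the edge set routes a to y to c, which mirrors T's structure under the matching.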
This summary itself is not intended to limit the scope of this patent. For a better understanding of the present invention, please see the following detailed description and appended claims, taken in conjunction with the accompanying drawings. The scope of the present invention is pointed out in the appended claims.