For a number of software engineering applications, it would be helpful to know how two related versions of a computer program compare. In particular, if changes are made to a xe2x80x9cbaseline versionxe2x80x9d of a program, resulting in a newer or updated version, and if source code is available for both versions, the source code difference of the baseline and current versions is easy to obtain through standard textual comparison tools, such as the UNIX xe2x80x9cdiffxe2x80x9d command.
There are two major problems with this approach. First, the source code may not be available, especially for the older baseline version. Second, and more fundamentally, a source-code difference does not directly point out all the portions of a program that may have different semantics. For instance, if the type, or format, of a program variable is changed, then all the executable code, i.e., computation and logic, that mentions or references that variable will in general be different as well.
For software testing applications, it is desirable to know which code should be re-tested when a program is modified. As shown above, the source code difference is generally insufficient. While this problem can be addressed through additional source-level tools, such as dataflow slicing, that is, determining the dataflow representation for a program, a more direct approach is to compare the executable program binaries obtained by compiling the source code into machine code which incorporates any changes such as, for example, variable format changes.
Naively comparing program binaries leads to an overwhelming number of xe2x80x9cfalse positives,xe2x80x9d or insignificant differences, since, for example, adding a line of source code will tend to induce large-scale differences in the new binary, because instruction displacements, that is, explicit distances encoded in instructions, and register assignments, which define exactly which fast hardware memory locations are used, will differ throughout the program.
An embodiment of the present invention accurately finds the different and similar portions of two binaries related by small changes, and can form a mapping between, or correlating, the similar portions, such that information pertaining to the baseline binary can be applied to the current binary.
Therefore, in accordance with the present invention, a method for determining changed code in a second program binary relative to a first or baseline program binary, where the second program is a different version of the first program, includes the step of translating machine addresses of both program binaries to symbols. The first and second program binaries are disassembled using the translated symbols. Differences between the two resulting disassemblies are determined, and a list of the differences is created.
The second program can be an updated version of the first program, or more generally, the first and second programs can simply be two different versions of a program.
Preferably, a symbol table, an address range table, and/or a control flow structure are determined for each of the program binaries, and used to translate machine addresses.
Preferably, differences between the disassemblies, which correspond to differences between the program binaries, are determined by textually comparing the disassemblies, with a utility such as the xe2x80x9cdiffxe2x80x9d program provided by the UNIX operating system, or some other text comparison program.
Each disassembly contains a sequence of instructions, and each instruction occupies a line. For efficiency, each disassembly is preferably transformed into a sequence of xe2x80x9cblock-instructions,xe2x80x9d where a block-instruction contains, in a single line, all of the instructions from within a block, and where a block contains a sequence of instructions which ends in a branch. The blocked-instructions from the two versions are then compared using xe2x80x9cdiff,xe2x80x9d or a similar program or function.
The set of changed blocked-instructions thus determined can be further refined by breaking each changed blocked-instruction into its component instructions, so that each instruction occupies a line. Again using diff on the instructions within the blocks marked as changed, it is determined which instructions have changed.
Alternatively, differences between the program binaries can be determined by first determining control flow graphs of the disassemblies, and using graph-matching techniques to determine the differences between the control flow graphs.
The list of differences can be correlated to differences between the source statements, and presented to a user, for example, in printed form or on a display, or alternatively, the list can be saved in a file or passed to another process for further processing. For example, the list can be used to aid in test coverage analysis, code change analysis, or failure analysis, among other analyses.
Changes in the second version relative to the first or baseline version may result, for example, by inserting instructions into the first program, or by modifying instructions in the first program, or by deleting instructions from the first program. One change might be where a variable""s size is different in the second program binary relative to the first program binary. This could result, for example, from a change in source code, or from use of a different compiler, or even from the same compiler with different options selected. Similarly, changes in the second version relative to the first version may result from a change to a data structure""s definition.
In at least one embodiment, known attributes of the compiler(s) which created the program binaries can be used in translating symbols and disassembling binaries. An example is where a known attribute is a standard base register.
Machine addresses can be, but are not limited to, for example, register names, memory addresses including both virtual and physical addresses, and address offsets.
According to another aspect of the present invention, a method for analyzing changed code coverage of a second version of a program relative to a first version, includes marking code in the second program which is changed or different from the first program. The second program is then executed in a test environment, and code which is executed is marked as having been executed. Next, the second program is executed in a non-test environment, such as a production environment, and code which is executed in this second environment is marked accordingly. Finally, from the variously marked code, a list of changed code which have not executed in the test environment but have executed in the non-test environment is provided.
Code can be marked by various groupings, such as, for example, individual code lines, or basic blocks.
In certain applications, coverage results can be obtained on a production run of a baseline program and mapped to a program under test, to determine which portions of the program under test have not executed in the production environment.