Computer programs are typically written in a high-level language or in assembly language. The high-level language listing of a computer program provides complete information about the program and its algorithms, and is easily read and understood by other programmers. To produce an executable computer program from a high-level language that is not interpreted, the high-level language instructions are first compiled into a relocatable object file, which is a binary version of the program. Although an object file is not readable in the same sense as the high-level language program, it contains significant information that allows it to be understood and processed by other programs. For example, object files typically contain symbol definitions, types, and names for every function or global variable used in the program; these definitions, types, and names indicate whether the referent of each symbol is code or data. An object file may also contain debugging information relating the instructions and data in that file to source language constructs. It is thus relatively straightforward to process an object file and determine its components, based on the defining information that the object file provides.
In the final step of producing a distributable software program, the compiled object files of the program are linked into a binary executable program. In contrast to object files, a binary executable software program contains only a very small subset of the defining information contained in the corresponding object file(s). For example, a binary executable software program will have definitions only for the functions and global variables explicitly exported by that program, which are only a subset of the functions and global variables it contains. The defining information in a binary executable software program does not include internal branch targets and does not provide any type information. In particular, a binary executable software program does not include any mechanism that distinguishes between code and data components.
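The loss of defining information described above can be illustrated with a minimal sketch. The Symbol record, the strip_for_executable function, and the four symbol names are hypothetical and chosen only for illustration; they do not correspond to any actual object file or executable file format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Symbol:
    """Hypothetical symbol record, not a real file-format structure."""
    name: str
    kind: Optional[str]  # "code" or "data"; None when that information is absent
    exported: bool


def strip_for_executable(object_symbols: List[Symbol]) -> List[Symbol]:
    """Model the information lost at link time: keep only exported symbols
    and discard the code/data classification of those that remain."""
    return [Symbol(s.name, None, True) for s in object_symbols if s.exported]


# Hypothetical symbols such as an object file might define.
OBJECT_SYMBOLS = [
    Symbol("main", "code", True),
    Symbol("helper", "code", False),
    Symbol("lookup_table", "data", False),
    Symbol("version", "data", True),
]

EXE_SYMBOLS = strip_for_executable(OBJECT_SYMBOLS)

for sym in EXE_SYMBOLS:
    print(sym.name, sym.kind)
```

In this model, only the exported names survive in the executable, and nothing in the surviving records indicates whether a name refers to code or to data.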
Software programs are distributed in the form of binary executables because this is the format in which the program will be loaded and executed on a computer (to implement the functions defined by the program) and, in part, in order to obscure many of the details of the program. Some binary executables are more difficult to understand than others, such as those targeted for the Intel Corporation's "x86" architecture, i.e., programs written to employ machine instructions that execute on the family of processors whose designations end in "86," such as the 80386, 80486, 80586 (or PENTIUM), etc. Because x86 machine instructions are not of a fixed length, an instruction for this family of processors can potentially start on any arbitrary byte boundary, making it extremely difficult to differentiate code portions from data portions. By contrast, for reduced instruction set computer (RISC) processors, such as the Digital Equipment Corporation's ALPHA processors, the differentiation between code and data in an executable file is more straightforward. Nevertheless, the need frequently arises, for reasons of analysis, performance evaluation, security, or error checking, to examine a binary executable software program (through software means), to understand its structure, and possibly to introduce changes. The result of such changes is a modified binary executable software program that is related to the original program, in that it provides the same functions, but that is also able to provide additional functionality or operate more efficiently. Accordingly, it will be apparent that a method for determining the structure of binary executable software programs having instructions of arbitrary length (e.g., programs based on the x86 architecture) is required in order to satisfy such needs. Currently, a solution to this problem does not appear to exist in the prior art.
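The difficulty created by variable-length instructions can be shown with a small sketch. The decoder below is a toy written for this illustration and is an assumption, not part of any described method: it recognizes only three 32-bit x86 encodings (0x90 for NOP, 0xC3 for RET, and 0xB8 through 0xBF for MOV of a 32-bit immediate into a register), and the function names decode_one and linear_sweep are hypothetical.

```python
import struct

REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_one(buf, off):
    """Decode one instruction from a tiny illustrative subset of 32-bit x86.
    Returns (length, text), or None if the byte is outside the subset
    or the operand would run past the end of the buffer."""
    op = buf[off]
    if op == 0x90:
        return 1, "nop"
    if op == 0xC3:
        return 1, "ret"
    if 0xB8 <= op <= 0xBF:  # mov r32, imm32: opcode byte plus 4 immediate bytes
        if off + 5 > len(buf):
            return None
        (imm,) = struct.unpack_from("<I", buf, off + 1)
        return 5, f"mov {REGS[op - 0xB8]}, 0x{imm:08x}"
    return None


def linear_sweep(buf, start):
    """Decode consecutively from `start`, returning (offset, text) pairs."""
    out = []
    off = start
    while off < len(buf):
        decoded = decode_one(buf, off)
        if decoded is None:
            out.append((off, "(unknown byte)"))
            off += 1
            continue
        length, text = decoded
        out.append((off, text))
        off += length
    return out


# The same five bytes, decoded from two different starting offsets.
CODE = bytes([0xB8, 0x90, 0x90, 0x90, 0xC3])

for start in (0, 1):
    print(f"start at byte {start}:")
    for off, text in linear_sweep(CODE, start):
        print(f"  {off}: {text}")
```

Starting at byte 0, the five bytes decode as a single MOV instruction; starting at byte 1, the same bytes decode as three NOP instructions followed by a RET. Both readings are valid, so the bytes alone do not reveal where instructions begin or whether a given byte is code or data.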