Computer programs typically are written in high-level programming languages, and high-level programming language code typically is architecture independent. Compilation generally involves transforming the high-level architecture-independent code into low-level architecture-specific code that maintains the original meaning. For example, a computer program may be written in C and transformed into x86 assembly language. Although assembly language is somewhat less intuitive than most high-level programming languages, compilers oftentimes will generate it as human-readable and human-understandable ASCII text. Assembly language thus is sometimes called symbolic machine code.
Assembly language code generally will be transformed into a sequence of binary values, or object code, that conforms to the instruction set specification of a target central processing unit (CPU) via an assembler. In other words, an assembler will receive assembly language code as input and generate machine language code executable by a CPU. Machine language/object code is not encoded in ASCII format and thus is neither human-readable nor human-understandable.
An assembler program creates object code by, among other things, translating combinations of mnemonics and syntax for operations and addressing modes into their numerical equivalents. This representation typically includes an operation code (opcode), as well as other control bits and data. The assembler also calculates constant expressions and resolves symbolic names for memory locations and other entities, and oftentimes also will perform other tasks such as, for example, optimizations, etc.
Many programs are distributed only in machine code form. However, it sometimes might be desirable to disassemble a binary or sequence of object code. For example, it may be desirable to disassemble a binary or sequence of object code to test for vulnerabilities or potential exploits, to replay applications for forensic analysis or the like, for reverse engineering purposes, etc.
A disassembler is a computer program that translates machine language into assembly language and, thus, is in at least some respects the “inverse” of an assembler. The output of a disassembler typically is formatted for human-readability rather than for suitability as input to an assembler (and thus typically is not suitable for use in (re-)generation of machine language code).
Those skilled in the art know that it can be difficult, and sometimes impossible, to generate completely accurate disassemblies. Similarly, those skilled in the art know that it can be difficult, and sometimes impossible, to generate reassemblable disassemblies.
For example, it can be difficult to know which of plural semantic equivalents is/are used when disassembling machine code. Consider, for example, the typical many-to-one functionality of the assembler, and the fact that the same sequence of bytes may disassemble to one sequence of instructions when starting disassembly from the first byte, or may disassemble to a second unrelated sequence of instructions when starting disassembly from the second byte. This is exacerbated by the fact that, as those skilled in the art know, it is difficult to determine the offsets at which instructions start. As another example, optimizations run on the assembly code by the assembler that are reflected in the corresponding machine code may not be translated back into the “un-optimized” original assembly code. Data interpretation issues can also arise. Consider four opcodes, with instruction a being 0101, instruction b being 0011, instruction c being 01010011, and instruction d being 00110101. The sequence 0101001100110101 thus could be interpreted as abba, cd, cba, etc.
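By way of illustration only, the ambiguity in the four-opcode example above can be checked mechanically. The following Python sketch enumerates every way of splitting the bit sequence into the toy opcodes; the function name `decodings` and the table name `OPCODES` are hypothetical and are introduced here solely for this illustration.

```python
# Toy opcode table taken from the four-opcode example above (illustrative only).
OPCODES = {"a": "0101", "b": "0011", "c": "01010011", "d": "00110101"}


def decodings(bits, table=OPCODES):
    """Return every way to split `bits` into a sequence of known opcodes."""
    if not bits:
        return [""]  # exactly one way to decode nothing
    results = []
    for name, pattern in table.items():
        if bits.startswith(pattern):
            # Consume one opcode, then decode the remainder recursively.
            for rest in decodings(bits[len(pattern):], table):
                results.append(name + rest)
    return results


if __name__ == "__main__":
    print(sorted(decodings("0101001100110101")))
    # prints ['abba', 'abd', 'cba', 'cd']
```

Even this very small instruction set yields four distinct, equally well-formed decodings of the same sixteen bits, which suggests why instruction boundaries cannot be recovered from the bytes alone.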
Generally speaking, and as can be appreciated from the examples above and the experience of those skilled in the art, the generation of a reassemblable disassembly is complicated for several reasons. First, assembly and compilation processes are lossy. At the machine language level, there are no variable or function names, and variable type information can be determined only by how the data is used rather than explicit type declarations. For instance, a transfer of (for example) 32-bits of data could involve a 32-bit integer, a 32-bit floating point value, a 32-bit pointer, etc. The fact that a value could be a memory address or a symbol (value) can be particularly problematic in terms of generating a reassemblable disassembly, as it can be unclear whether a sequence is pointing at a symbol (value) or a pointer to a completely different symbol (value).
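By way of illustration only, the loss of type information noted above can be seen by reading the same four bytes under several interpretations. In the following Python sketch, the byte values, the little-endian byte order, and the function name `interpretations` are assumptions made solely for this example.

```python
import struct


def interpretations(raw):
    """Read the same 4 little-endian bytes as an integer, a float, and an address."""
    if len(raw) != 4:
        raise ValueError("expected exactly 4 bytes")
    as_uint = struct.unpack("<I", raw)[0]   # 32-bit unsigned integer
    as_float = struct.unpack("<f", raw)[0]  # 32-bit IEEE 754 floating point value
    return {
        "uint32": as_uint,
        "float32": as_float,
        "address": hex(as_uint),            # the same bits viewed as a 32-bit pointer
    }


if __name__ == "__main__":
    # Hypothetical bytes as they might appear in a data section.
    print(interpretations(bytes.fromhex("00004840")))
```

Nothing in the bytes themselves indicates which reading was intended; only the way surrounding code uses the value can provide evidence one way or the other.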
Second, assembly and compilation processes are many-to-many operations. A source program can be translated to assembly language in many ways, and machine language can be translated back to source in many different ways. Indeed, compilers and assemblers can be very language, library, architecture, and/or otherwise specific, so (for example) disassembling machine code for equivalent programs produced using different assemblers and/or compilers can yield very different results.
Because the emphasis generally is on human understandability with respect to the disassembly, and because equivalent functionality may be reproduced in at least some instances, the inability to create an exact replica of a disassembly and/or an exact replica of machine code from a disassembly is/are not necessarily problematic in all instances. Some ambiguity and incorrectness might well be tolerable in many scenarios. Indeed, inaccuracies can be corrected manually in some instances, and simply “tolerated” or “accepted” in others. Thus, the fact that there is a strong relationship between assembly language and machine language oftentimes is seen as sufficient, even though there is not a one-to-one mapping between assembly language and machine language.
Unfortunately, however, these problems can become exacerbated and unacceptable if the disassembled code needs to be accurate for some reason. For instance, ambiguities, inaccuracies, and the like, may be unacceptable in applications geared towards identifying security vulnerabilities, assessing mission-critical operations, etc. The same may hold true where disassembled code needs to be modified prior to reassembly.
Thus, it will be appreciated that it would be desirable to address the above-described and/or other issues. For instance, it will be appreciated that it would be desirable to generate more accurate disassemblies and/or reassemblable disassemblies. Certain example embodiments help in these and/or other regards.
In certain example embodiments, a method of disassembling an executable is provided. The method includes: parsing the executable, and decoding possible instructions in the executable in connection with the parsing; generating an initial fact database comprising the possible instructions; generating an enhanced fact database by executing a plurality of inference modules on the initial fact database, at least some of the inference modules being expressed in a declarative query language and including (a) a code inference module structured to compute valid instructions organized in blocks of code, (b) a symbolization module structured to disambiguate between symbols and memory addresses, and (c) a function inference module structured to identify functions; and organizing content from the enhanced fact database into a format of valid assembler code.
According to certain example embodiments, the declarative query language may be Datalog.
According to certain example embodiments, one or more of the inference modules may implement a soft heuristic in addition to hard rules for fact generation. In some instances, all hard rules and soft heuristics may be encoded into Datalog rules. In certain example embodiments, execution of a Datalog engine on the Datalog rules may result in a consistent fact universe for the initial fact database and the enhanced fact database.
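By way of illustration only, the idea of running rules over a fact database until a consistent set of facts is reached can be sketched as a naive bottom-up fixpoint. The following Python code is a stand-in for a Datalog engine and is not Datalog itself; the fact shapes (`instruction`, `entry`, `code`) and the two hard rules are hypothetical and are chosen solely for this example.

```python
def fixpoint(facts, rules):
    """Apply every rule repeatedly until no rule derives a new fact."""
    facts = set(facts)
    while True:
        derived = set()
        for rule in rules:
            derived |= set(rule(facts))
        if derived <= facts:      # nothing new: the fact set is stable
            return facts
        facts |= derived


def rule_entry(facts):
    # Roughly: code(A) :- entry(A), instruction(A, _, _).
    decoded = {f[1] for f in facts if f[0] == "instruction"}
    for f in facts:
        if f[0] == "entry" and f[1] in decoded:
            yield ("code", f[1])


def rule_fallthrough(facts):
    # Roughly: code(A + S) :- code(A), instruction(A, S, true), instruction(A + S, _, _).
    decoded = {f[1]: f for f in facts if f[0] == "instruction"}
    for f in facts:
        if f[0] == "code" and f[1] in decoded:
            _, addr, size, falls_through = decoded[f[1]]
            if falls_through and addr + size in decoded:
                yield ("code", addr + size)


if __name__ == "__main__":
    # Hypothetical initial facts: ("instruction", address, size, falls_through).
    initial = {
        ("instruction", 0, 2, True),
        ("instruction", 1, 3, True),   # overlapping candidate decoded mid-instruction
        ("instruction", 2, 1, False),
        ("entry", 0),
    }
    final = fixpoint(initial, [rule_entry, rule_fallthrough])
    print(sorted(f[1] for f in final if f[0] == "code"))
```

In this sketch, the overlapping candidate at address 1 remains in the database as a possible instruction but never is admitted as code, because no rule derives it from the entry point.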
According to certain example embodiments, new hard rules and/or new soft heuristics may be definable and suitable for use in generating facts for the initial fact database and/or enhanced fact database, independent of existing hard rules and/or soft heuristics.
According to certain example embodiments, one or more of the inference modules may implement a heuristic by: generating a problem/solution space for the issue for which evidence is to be built and/or for which a conflict is to be resolved; subjecting at least some of the members in the problem/solution space to rules that assign points to different outcomes related to the issue for which the evidence is to be built and/or for which the conflict is to be resolved; determining which one or more members of the problem/solution space has/have the most points; and admitting to the enhanced fact database the one or more members of the problem/solution space determined to have the most points. A heuristic may be implemented for code block detection in the code inference module, for example.
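By way of illustration only, the point-based approach described above can be sketched as follows. In this Python sketch, the candidate type `Block`, the two scoring rules, and the point values are hypothetical and are chosen solely for this example; they are not the rules of any particular embodiment.

```python
from collections import namedtuple

# Hypothetical candidate: a code block starting at `start` and spanning `size` bytes.
Block = namedtuple("Block", "start size")

# Hypothetical evidence: addresses known to be targeted by direct jumps.
JUMP_TARGETS = {0x1000}


def targeted_by_jump(block):
    """Award points when a direct jump lands on the start of the block."""
    return 3 if block.start in JUMP_TARGETS else 0


def starts_aligned(block):
    """Award a point when the block starts on a 4-byte boundary."""
    return 1 if block.start % 4 == 0 else 0


def resolve_by_points(candidates, rules):
    """Score each candidate with every rule and return the top-scoring one(s)."""
    if not candidates:
        return []
    points = {c: sum(rule(c) for rule in rules) for c in candidates}
    best = max(points.values())
    # Ties are all returned, mirroring "one or more members" having the most points.
    return [c for c in candidates if points[c] == best]


if __name__ == "__main__":
    # Two overlapping candidates compete for the same bytes.
    conflict = [Block(0x1000, 8), Block(0x1001, 7)]
    print(resolve_by_points(conflict, [targeted_by_jump, starts_aligned]))
```

Because each rule only contributes points, a new rule can be added to the list without modifying existing rules, which is consistent with the independence of heuristics noted above.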
According to certain example embodiments, the symbolization module may implement heuristics for determining that an array likely is present based on the presence of a plurality of evenly-spaced symbols, determining that an accessed address likely is a valid pointer based on a size of the associated access being pointer-sized, determining that a pointer candidate in what appears to be a string is less likely to be a valid pointer, and/or determining that a pointer candidate that is aligned is more likely to be a valid pointer.
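By way of illustration only, the symbolization heuristics listed above might be combined along the following lines. In this Python sketch, the 64-bit pointer size, the `PointerCandidate` fields, the point values, and the minimum array length are assumptions made solely for this example.

```python
from dataclasses import dataclass

POINTER_SIZE = 8  # assumption: 64-bit target


@dataclass(frozen=True)
class PointerCandidate:
    location: int     # address at which the candidate value is stored
    access_size: int  # size in bytes of the access that reads it (0 if unknown)
    in_string: bool   # True if the bytes appear to lie inside a string


def pointer_score(candidate):
    """Combine the evidence for and against the candidate being a valid pointer."""
    score = 0
    if candidate.access_size == POINTER_SIZE:
        score += 2    # accessed with a pointer-sized access
    if candidate.location % POINTER_SIZE == 0:
        score += 1    # stored at an aligned location
    if candidate.in_string:
        score -= 2    # appears inside what looks like a string
    return score


def looks_like_array(symbol_addresses, min_len=3):
    """Return True when the symbols are evenly spaced, suggesting an array."""
    addrs = sorted(symbol_addresses)
    if len(addrs) < min_len:
        return False
    strides = {b - a for a, b in zip(addrs, addrs[1:])}
    return len(strides) == 1 and 0 not in strides


if __name__ == "__main__":
    print(pointer_score(PointerCandidate(0x2000, 8, False)))
    print(looks_like_array([0x3000, 0x3008, 0x3010]))
```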
According to certain example embodiments, the symbolization module may include definition-use (def-use) chain analysis, value analysis, and/or data access analysis.
According to certain example embodiments, the function inference module may use symbol information and heuristics to identify a first set of functions, and may attempt to add a second set of functions by finding blocks of code that are contiguous to, but not reachable from, a complete function in the first set of functions.
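By way of illustration only, the search for a second set of functions might be sketched as follows. In this Python sketch, the `Function` record, its fields, and the function name `find_adjacent_entries` are hypothetical, and the first set of functions is assumed to have been computed already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Function:
    entry: int            # address of the first block of the function
    end: int              # address just past the last byte of the function
    reachable: frozenset  # starts of blocks reachable from `entry` via control flow


def find_adjacent_entries(functions, block_starts):
    """Return starts of blocks that directly follow a known function without
    being reachable from it; these are candidate entries for new functions."""
    known_entries = {f.entry for f in functions}
    extra = set()
    for f in functions:
        candidate = f.end  # the block contiguous to the end of this function
        if (candidate in block_starts
                and candidate not in f.reachable
                and candidate not in known_entries):
            extra.add(candidate)
    return sorted(extra)


if __name__ == "__main__":
    first_set = [Function(entry=0x100, end=0x140, reachable=frozenset({0x100, 0x120}))]
    blocks = {0x100, 0x120, 0x140, 0x160}
    print([hex(a) for a in find_adjacent_entries(first_set, blocks)])
```

In this example, the block at 0x140 directly follows the known function and is not reached by any control flow from it, so it is proposed as the entry of an additional function.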
According to certain example embodiments, the valid assembler code may be assembleable into a valid executable.
According to certain example embodiments, one or more of the inference modules may be configured to receive additional rules from a user and/or from additional programmatic analysis.
Counterpart systems, computer programs, and/or non-transitory computer readable storage media also are contemplated herein. For instance, in certain example embodiments, a system for disassembling an executable includes a non-transitory computer readable storage medium, and processing resources including at least one memory and a hardware processor, the processing resources being configured to: receive the executable; parse the executable, and decode possible instructions in the executable in connection with the parsing; generate an initial fact database comprising the possible instructions, the initial fact database being stored to the non-transitory computer readable storage medium; generate an enhanced fact database by executing a plurality of inference modules on the initial fact database, at least some of the inference modules being expressed in a declarative query language and including (a) a code inference module structured to compute valid instructions organized in blocks of code, (b) a symbolization module structured to disambiguate between symbols and memory addresses, and (c) a function inference module structured to identify functions, the enhanced fact database being stored to the non-transitory computer readable storage medium; and organize content from the enhanced fact database into a format of valid assembler code.
Similarly, in certain example embodiments, there is provided a non-transitory computer readable storage medium tangibly storing a program that, when executed by a computing system including at least one processor, is configured to disassemble an executable, by performing functionality comprising: parsing the executable, and decoding possible instructions in the executable in connection with the parsing; generating an initial fact database comprising the possible instructions; generating an enhanced fact database by executing a plurality of inference modules on the initial fact database, at least some of the inference modules being expressed in a declarative query language and including (a) a code inference module structured to compute valid instructions organized in blocks of code, (b) a symbolization module structured to disambiguate between symbols and memory addresses, and (c) a function inference module structured to identify functions; and organizing content from the enhanced fact database into a format of valid assembler code. The features described in the preceding paragraphs and those set forth in more detail below may be used with these counterparts, as well.
These aspects, features, and example embodiments may be used separately and/or applied in various combinations to achieve yet further embodiments of this invention.