A programmer can input (e.g., via a keyboard or a microphone) a set of source code (e.g., C++ formatted text, Delphi formatted text) into an integrated development environment (IDE) (e.g., Visual Studio, Borland) that includes a compiler. The programmer can then request the compiler to compile the set of source code into a binary file (e.g., an executable binary file). Subsequently, a researcher having access to the binary file may want to reverse engineer the binary file back into the set of source code for design recovery purposes (e.g., security auditing, digital rights management, driver engineering). However, the set of source code may be unavailable to the researcher due to limitations on contractual data rights, use of legacy software components, inclusion of third-party libraries, or code obfuscation. As a result, the researcher can use an analytical tool (e.g., bus analyzer, packet sniffer), a disassembler, or a decompiler in order to understand how the binary file operates or to recover the set of source code. However, these approaches are technically problematic for several reasons.
First, the analytical tool can often produce false positives, which can lead the researcher down unnecessary or undesired paths. As such, the researcher still needs to analyze any recovered set of source code and verify that it was in fact sourced from the binary file. Therefore, this approach is unreliable, time-consuming, and laborious.
Second, since the disassembler disassembles the binary file into a set of raw machine code (e.g., presented as assembly language instructions), which is relatively complicated, the researcher also needs to be skilled in understanding the set of raw machine code. This skillset is generally rare. Further, even if the researcher is skilled in understanding the set of raw machine code, the researcher may still spend an excessive amount of time or resources analyzing the set of raw machine code, especially when the binary file involves complex or dependent computation (e.g., graphics, compilers, gaming, simulation, medical software). Additionally, the disassembler generally targets a specific hardware architecture (e.g., x86, ARM), thereby making disassembly difficult if the binary file is compiled for a hardware architecture that is different from the one that the disassembler originally targeted.
Third, the decompiler rarely, if ever, produces an output that closely resembles the set of source code that was originally input by the programmer, especially when the binary file involves complex or dependent computation (e.g., graphics, compilers, gaming, simulation, medical software). Usually, the output is a mangled version of the set of source code. At best, the output can be functionally equivalent to the set of source code, but it is usually structurally different therefrom. One potential reason why the output may be structurally different from the set of source code, as originally input into the IDE, is that the compiler optimizes the set of source code for various purposes (e.g., minimize execution time, minimize memory usage, minimize power usage). For example, when the compiler compiles the set of source code for a specific computing architecture, the compiler performs various optimizations particular to that computing architecture (e.g., minimize application size on disk, increase execution speed). Some examples of particular optimizations can include loop optimization, data flow optimization, code generation optimization, or others. For example, a while loop written in the C programming language may be expanded or unrolled in order to eliminate at least some extra instructions (e.g., loop-control instructions) that may decrease an execution speed of a resulting binary file. Subsequently, if that binary file is disassembled and ultimately decompiled, then a resulting set of high-level source code more closely resembles low-level assembly language source code than the set of source code that was written by the programmer. Furthermore, the decompiler may only target a specific programming language. Therefore, the decompiler may be unable to generate high-level source code in the programming language in which the set of source code was originally written.
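The loop unrolling described above can be illustrated with a minimal sketch. The sketch is written in Python purely for readability, whereas the example above concerns a C while loop; the function names `sum_rolled` and `sum_unrolled` and the unroll factor of four are hypothetical and chosen only for illustration. The first function shows a loop as a programmer might write it, and the second function shows the kind of structurally different but functionally equivalent form that a decompiler might recover after an optimizing compiler has unrolled the loop.

```python
def sum_rolled(values):
    """Loop as a programmer might write it: one element per iteration."""
    total = 0
    i = 0
    while i < len(values):
        total += values[i]
        i += 1
    return total


def sum_unrolled(values):
    """Same computation, hand-unrolled by an assumed factor of four.

    Fewer loop-condition checks and counter updates are performed, but the
    structure no longer resembles the original loop.
    """
    total = 0
    i = 0
    n = len(values)
    limit = n - (n % 4)  # largest multiple of four not exceeding n
    while i < limit:
        total += values[i]
        total += values[i + 1]
        total += values[i + 2]
        total += values[i + 3]
        i += 4
    # Remainder loop handles the last n % 4 elements.
    while i < n:
        total += values[i]
        i += 1
    return total


if __name__ == "__main__":
    data = list(range(1, 11))
    print(sum_rolled(data), sum_unrolled(data))  # prints "55 55"
```

Both functions return the same result for any input list, yet a reader given only the second form would have difficulty inferring that the programmer originally wrote the first form.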