A computer program is understood to mean a plurality of program instructions which are to be executed in a computer system by one or more microprocessors in a particular order. So that the program instructions can be executed by the microprocessor(s), they are in a binary format which the microprocessor can process directly and which is specific to the executing microprocessor. Program instructions in this microprocessor-specific format can usually be understood by a human observer, for example a programmer, only with very great difficulty or not at all. In order to simplify the writing of a computer program, or to make it possible in the first place, programming languages known as high-level languages are usually used today. With a high-level language of this kind, the programmer formulates the program instructions to be executed by the microprocessor in a language which humans can understand, and stores them in what is known as a source text. So that the program instructions contained in the source text can be executed by the microprocessor, they must be transformed into the specific format which the microprocessor can execute. There are basically two possibilities in this regard: firstly, the entire source text can be translated completely into the microprocessor-specific format prior to execution by the microprocessor; secondly, each program instruction in the source text can first be interpreted by a further computer program, known as the interpreter, and converted into the microprocessor-specific program instructions which are necessary for executing that program instruction.
Hybrid forms are also known, such as the one implemented in the programming language Java: in this case, the source text is first translated completely into a bytecode which is not yet specific to the microprocessor, and the bytecode is subsequently interpreted in order to produce program instructions which are specific to the microprocessor. For reasons of efficiency, a source text is nowadays predominantly translated completely before the first execution of the program instructions by a microprocessor. This involves the use of what is known as a compiler.
The program instructions to be processed in a particular order by a microprocessor are usually not stored in this specific order in a computer memory system. On the contrary, one or more program instructions to be executed in direct succession are combined to form groups, said groups being connected to one another by program flow instructions, which are in the form of jump instructions or function calls, for example. This structure of a computer program is regularly also reflected in the associated source text written in a high-level language, said source text likewise being divided into functions or subprograms and having blocks of program instructions which are connected to one another by branches or jump instructions. However, there is generally no unambiguous association between blocks of program instructions in the source text and groups of program instructions in the microprocessor-specific format.
The translation of a source text which is in a high-level language into microprocessor-specific program instructions by means of a compiler does not result in a uniquely determined computer program, i.e. in a necessarily uniquely defined sequence of microprocessor-specific program instructions in binary format. This is the case firstly when the high-level-language source text is translated for execution on different microprocessors whose instruction sets are not compatible with one another. However, even if the translation is produced for an identical type of microprocessor, different translations of an identical high-level-language source text can result in computer programs which differ in binary format. One reason for this is the optimizations which the compiler performs in order to obtain a computer program which can be executed as efficiently as possible. Thus, a change in the execution order of program instructions, the inversion of jump conditions, and the merging of program instructions which are actually connected to one another by a jump instruction into one contiguous group are common optimization processes for compilers. Depending on the degree of optimization chosen for the compiler and on other ambient conditions, very different computer programs therefore arise from an identical high-level-language source text as a result of translation by means of the compiler. A problem in this case is that, when two computer programs differ in their specific sequence of microprocessor instructions and the source text is not available, it is not possible to establish whether they have actually been produced by translating the identical source text. This applies all the more to translations of an identical source text using different compilers or for different target microprocessors.
It is admittedly possible to convert computer programs, by means of reverse translation (disassembly), into an assembler source text which humans can read but which is at machine level. However, a problem in this case is that, firstly, the preceding translation of the high-level-language source text into the microprocessor-specific computer program discards important carriers of information, such as function or variable names; secondly, the assembler source texts obtained through disassembly reproduce the optimizations performed by the compiler, so that even a comparison of two assembler source texts obtained through reverse translation does not allow the identity of the original high-level-language source text to be inferred. The relative ease with which the microprocessor-specific presentation of a computer program can be altered, without changing or substantially changing the actual high-level-language source text, is exploited particularly by malware, for example computer viruses, computer hacking tools and so on, in order to make it difficult to recognize the malware in running computer systems. Since the high-level-language source text of a piece of malware is usually unknown, destructive programs can be identified in the course of computer operation only by comparing the computer program which is present in the computer memory system in its microprocessor-specific form with already known microprocessor-specific forms of destructive programs. Simply retranslating the high-level-language source text of the destructive program provides the opportunity to obtain a computer program whose binary presentation has been altered such that the computer program can no longer be recognized as harmful by current antivirus software.
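The weakness of a purely byte-level comparison can be illustrated with a minimal sketch. It assumes that known programs are recorded as SHA-256 fingerprints of their binary form, which is a simplification chosen here for illustration and not a description of any particular antivirus product. The two byte strings are assumed to be 32-bit x86 encodings of two functionally equivalent routines and serve only as illustrative data; the names `VARIANT_A`, `VARIANT_B`, `fingerprint` and `is_known` are hypothetical.

```python
import hashlib

# Assumed 32-bit x86 encodings, used purely as illustrative byte strings:
# variant A: xor eax, eax ; inc eax ; ret
# variant B: mov eax, 1 ; ret
# Both routines are intended to return the value 1.
VARIANT_A = bytes.fromhex("31c040c3")
VARIANT_B = bytes.fromhex("b801000000c3")


def fingerprint(binary: bytes) -> str:
    """Return a hex digest that identifies one exact byte sequence."""
    return hashlib.sha256(binary).hexdigest()


# Only variant A has been recorded as "already known".
KNOWN_FINGERPRINTS = {fingerprint(VARIANT_A)}


def is_known(binary: bytes) -> bool:
    """Byte-level check: true only for an exact binary match."""
    return fingerprint(binary) in KNOWN_FINGERPRINTS


print(is_known(VARIANT_A))  # True
print(is_known(VARIANT_B))  # False
```

Although both variants are meant to behave identically, the second one is not recognized, because the comparison operates on the binary presentation rather than on the program structure.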
A reliable method for recognizing matches or differences between two or more computer programs whose source text is not known is also desirable outside of the recognition of malware, for example in order to recognize inadmissible changes in a computer program, to establish differences between various versions of a computer program, or to detect inadmissible use of protected source texts.
The document by Thomas Dullien, Rolf Rolles, “Graph-based comparison of Executable Objects”, which appeared in the conference volume of the Symposium sur la Sécurité des Technologies de l'Information et des Communications 2005, Rennes, France, Jun. 2, 2005, describes a method for comparing two computer programs held in a computer memory system. The aim is to determine the degree of match or discrepancy between the two computer programs, which are not available as a high-level-language source text. The method works as follows: first of all, the two computer programs, which are in a microprocessor-specific format, are reverse translated in order to obtain a respective assembler source text. Next, each computer program is broken down into computer program sections, each of the computer program sections comprising precisely one function or precisely one subprogram of the computer program. The computer program sections obtained are connected to one another by program flow instructions in the form of function calls or subprogram calls, so that a program flow relationship is defined between the computer program sections. The program flow relationship can be presented in the form of a first directed graph, as known from mathematical graph theory, wherein the computer program sections define nodes and the program flow instructions connecting the computer program sections to one another define edges of the first directed graph. In this case, an edge connects a respective first computer program section (source node) to a second computer program section (destination node), the direction of the edge being prescribed by a program flow instruction which points from the source node to the destination node. The totality of the nodes and edges maps an abstract program flowchart for the computer program.
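The construction of the first directed graph can be sketched as follows. This is a minimal illustration and not the implementation of the cited document: the disassembly is assumed to be already available as a mapping from function names to lists of `(mnemonic, operand)` pairs, and the data in `DISASSEMBLY` as well as the function name `build_call_graph` are hypothetical.

```python
# Hypothetical, heavily simplified disassembly:
# function name -> list of (mnemonic, operand) pairs.
DISASSEMBLY = {
    "main": [("push", "ebp"), ("call", "parse"), ("call", "report"), ("ret", "")],
    "parse": [("call", "read_byte"), ("ret", "")],
    "report": [("call", "read_byte"), ("ret", "")],
    "read_byte": [("ret", "")],
}


def build_call_graph(functions):
    """Return (nodes, edges) of the directed graph of function calls.

    Each function is one node; each call instruction whose operand names
    another known function yields an edge (source node, destination node).
    """
    nodes = set(functions)
    edges = set()
    for caller, instructions in functions.items():
        for mnemonic, operand in instructions:
            if mnemonic == "call" and operand in functions:
                edges.add((caller, operand))
    return nodes, edges


nodes, edges = build_call_graph(DISASSEMBLY)
print(sorted(edges))
```

Edges are kept in a set, so repeated calls from one function to the same callee collapse into a single edge; whether multiplicity should be preserved is a design choice that the description above leaves open.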
The subdivision of the computer program into computer program sections is followed by the breakdown of each of the computer program sections into segments, wherein each of the segments is defined by directly successive instructions and wherein a program flow relationship between the segments is defined by jump instructions, for example conditional instructions or loop instructions. The program flow relationship of the segments can be presented for each of the computer program sections in the form of a second directed graph, wherein the segments define nodes and the program flow instructions connecting the segments to one another define edges of the second directed graph. The totality of the nodes and edges of the second directed graph maps an abstract program flowchart for the respective computer program section. Each node of the first directed graph can be represented by the second directed graph which corresponds to the associated computer program section in order to obtain a complete, abstract program flowchart for the computer program. The comparison between the two computer programs held in the computer memory system is now made by comparing the respective ascertained complete abstract program flowcharts, that is to say by comparing the complete first directed graphs, which each contain all the second directed graphs. An advantage in this context is that differences in the microprocessor-specific binary presentation of the computer programs which arise during the translation of the high-level-language source text, for example as a result of compiler optimizations, do not result in discrepancies, or result in only a few discrepancies, in the abstracted program flowcharts, so that functionally matching and functionally different or altered areas of the computer programs can be identified with a high level of reliability.
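The breakdown of one computer program section into segments and the derivation of the second directed graph can be sketched as follows. The sketch assumes a toy instruction list with numeric addresses and a very small, hypothetical set of jump mnemonics; it is meant only to illustrate the principle and does not reproduce the cited method in detail.

```python
# Hypothetical instruction list for one function:
# (address, mnemonic, jump target or None).
FUNCTION = [
    (0, "cmp", None),
    (1, "jz", 4),    # conditional jump: continues at 2 or jumps to 4
    (2, "add", None),
    (3, "jmp", 5),   # unconditional jump
    (4, "sub", None),
    (5, "ret", None),
]

CONDITIONAL = {"jz", "jnz"}   # assumed set of conditional jump mnemonics
UNCONDITIONAL = {"jmp"}       # assumed set of unconditional jump mnemonics


def build_cfg(instructions):
    """Split a function into segments and return (segments, edges)."""
    # A segment starts at the first instruction, at every jump target,
    # and at every instruction that directly follows a jump.
    starts = {instructions[0][0]}
    for i, (_, mnemonic, target) in enumerate(instructions):
        if mnemonic in CONDITIONAL or mnemonic in UNCONDITIONAL:
            starts.add(target)
            if i + 1 < len(instructions):
                starts.add(instructions[i + 1][0])

    # Group directly successive instructions into segments keyed by start address.
    segments = {}
    current = None
    for instruction in instructions:
        if instruction[0] in starts:
            current = instruction[0]
            segments[current] = []
        segments[current].append(instruction)

    # Derive edges from the last instruction of each segment.
    ordered = sorted(segments)
    edges = set()
    for index, start in enumerate(ordered):
        _, mnemonic, target = segments[start][-1]
        following = ordered[index + 1] if index + 1 < len(ordered) else None
        if mnemonic in UNCONDITIONAL:
            edges.add((start, target))
        elif mnemonic in CONDITIONAL:
            edges.add((start, target))
            if following is not None:
                edges.add((start, following))
        elif mnemonic != "ret" and following is not None:
            edges.add((start, following))
    return segments, edges


segments, cfg_edges = build_cfg(FUNCTION)
print(sorted(segments))   # [0, 2, 4, 5]
print(sorted(cfg_edges))  # [(0, 2), (0, 4), (2, 5), (4, 5)]
```

The resulting graph has four segments and four edges; an inverted jump condition in the binary would swap the roles of the two successors of the first segment but leave the shape of the graph unchanged.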
However, a drawback is that the complete comparison of the abstract program flowcharts is very complex and cannot be fully automated. For the purpose of automation, therefore, a simplified comparison is performed in which only the numbers of nodes and edges respectively ascertained are compared in order to establish a match or discrepancy between the computer programs. However, this method has the drawback of high susceptibility to error, since a match between the programs is established even though none is actually present if graphs which do not match one another happen to have the same number of nodes and edges.
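The susceptibility to error of a pure count comparison can be shown with two small graphs. The following sketch uses hypothetical data: a chain and a star, each with four nodes and three edges. The out-degree sequence is used here only as one simple way of demonstrating that the two graphs differ structurally; it is not part of the cited method and is itself only a necessary condition for a match.

```python
def count_signature(nodes, edges):
    """Simplified comparison key: number of nodes and number of edges."""
    return (len(nodes), len(edges))


def out_degree_sequence(nodes, edges):
    """Sorted list of out-degrees, a slightly finer structural property."""
    degree = {node: 0 for node in nodes}
    for source, _ in edges:
        degree[source] += 1
    return sorted(degree.values())


# Hypothetical graph A: a chain a -> b -> c -> d.
CHAIN_NODES = {"a", "b", "c", "d"}
CHAIN_EDGES = {("a", "b"), ("b", "c"), ("c", "d")}

# Hypothetical graph B: a star with a -> b, a -> c, a -> d.
STAR_NODES = {"a", "b", "c", "d"}
STAR_EDGES = {("a", "b"), ("a", "c"), ("a", "d")}

same_counts = count_signature(CHAIN_NODES, CHAIN_EDGES) == count_signature(STAR_NODES, STAR_EDGES)
same_degrees = out_degree_sequence(CHAIN_NODES, CHAIN_EDGES) == out_degree_sequence(STAR_NODES, STAR_EDGES)

print(same_counts)   # True: the count comparison reports a match
print(same_degrees)  # False: the graphs nevertheless differ in structure
```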
U.S. Pat. No. 7,207,038 B2 describes a method for producing flowcharts for an executable computer program. The method comprises subdividing the computer program held in a computer memory system into computer program sections which are connected to one another by function calls or jump instructions, and creating a flowchart structure on the basis of the identified computer program sections. The aim is to optimize a computer program whose high-level-language source text is not known, in terms of the efficiency of its flow, by altering the order of function calls.
The document G. R. Thomson et al., “Polymorphic Malware Detection and Identification via Context-Free Grammar Homomorphism”, Bell Labs Technical Journal 12(3), 2007, pp. 139-147 describes a method for malware detection in which a computer program suspected of being malware is broken down into sections defined by functions of the computer program code. A control flow graph is constructed for each respective section, and the sections are sorted and numbered according to the length of the longest simple path through the respective control flow graph. Afterwards, a grammatical rule describing the mutual function calls of the sections is constructed from the control flow graphs. In order to characterize the computer program, the constructed grammar rules are serialized into a single string. A drawback of the described method is that only a single serialized string of undefined length is constructed for identification of the computer program, rendering comparison of two different computer programs by comparing the resulting serialized strings impossible. Furthermore, even small modifications of the program code may lead to serious changes in the constructed serialized string.
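The sorting step mentioned above, ordering sections by the length of the longest simple path through their control flow graph, can be sketched as follows. The sketch rests on several assumptions that are not taken from the cited document: path length is counted in edges, paths start at the entry node of the section, and ties are broken by section name. The grammar construction and serialization are not reproduced. The exhaustive search is exponential in the worst case and is suitable only for small graphs such as the hypothetical ones in `CFGS`.

```python
def longest_simple_path(cfg, entry):
    """Length in edges of the longest path from entry that repeats no node."""
    def search(node, visited):
        best = 0
        for successor in cfg.get(node, ()):
            if successor not in visited:
                best = max(best, 1 + search(successor, visited | {successor}))
        return best
    return search(entry, {entry})


# Hypothetical sections: name -> (adjacency list of the CFG, entry node).
CFGS = {
    "f_loop": ({"a": ["b"], "b": ["c", "a"], "c": []}, "a"),
    "f_leaf": ({"a": []}, "a"),
    "f_branch": ({"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}, "a"),
}

# Longest simple path per section.
lengths = {name: longest_simple_path(cfg, entry) for name, (cfg, entry) in CFGS.items()}

# Sort by path length (ties broken by name, an assumption) and number the sections.
ordered = sorted(lengths, key=lambda name: (lengths[name], name))
numbering = {name: index for index, name in enumerate(ordered)}

print(lengths)    # {'f_loop': 2, 'f_leaf': 0, 'f_branch': 3}
print(numbering)  # {'f_leaf': 0, 'f_loop': 1, 'f_branch': 2}
```

A small change to one section, for example one additional segment in `f_leaf`, can change its position in the ordering and therefore the numbers assigned to sections that follow it, which is consistent with the sensitivity to small modifications noted above.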