It is well-known that in the maintenance of large computer programs it is not uncommon for multiple versions of the programs to be created. For example, multiple versions of a particular program may be created to incorporate new features or bug fixes, or to provide related versions of the same program having different feature sets. The detection and identification of these multiple versions of programs is a critical issue for system administrators, programmers, system developers and the like, who use, modify and maintain these programs.
There are a number of known techniques for detecting similarities in text files, e.g., source code files, to assist in the determination of changes contained in such files. For example, Udi Manber, Finding Similar Files in A Large File System, In Proc. 1994 Winter Usenix Technical Conference, pp. 1-10, January 1994, which is hereby incorporated by reference for all purposes, describes the so-called "Siff" tool which analyzes large numbers of files to find similar files based on common "fingerprints" corresponding to sequences of characters. Still further, Brenda S. Baker, Parameterized Pattern Matching: Algorithms and Applications, J. Comput. Syst. Sci., 52(1):28-42, February 1996, which is hereby incorporated by reference for all purposes, describes the so-called "Dup" tool which searches sets of source files for so-called "almost-matching" sections of code for software systems having source code sizes as large as millions of lines. Dup's notion of almost-matching is a parameterized match ("p-match") such that two sections of code are a p-match if they are the same textually except for possibly a systematic change of variable names. Dup is useful for a variety of applications involving the detection of similarities such as identifying undesirable duplication within a large software system, detecting plagiarism between computer systems, and analyzing the divergence of two systems having a common origin. In addition, the well-known UNIX utility "Diff" (see, e.g., Brian W. Kernighan et al., The UNIX Programming Environment, Prentice-Hall, Englewood Cliffs, N.J., 1984) uses dynamic programming to identify line-by-line changes from one file to another, and is useful for the detailed comparison of two files and for automated distribution of patches.
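By way of illustration only, the notion of a p-match described above can be sketched as a pairwise check that two code fragments are token-for-token identical up to a consistent, one-to-one renaming of identifiers. The sketch below is not the Dup tool's algorithm; the regular-expression tokenizer, the small KEYWORDS set and the function names are assumptions introduced here for the example.

```python
import re

# Assumption: a tiny keyword set for a C-like language; a real tool would use a full lexer.
KEYWORDS = {"if", "else", "while", "for", "return", "int"}
# Tokens are identifiers/keywords, integer literals, or single punctuation characters.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    """Split a code fragment into a flat list of tokens, ignoring whitespace."""
    return TOKEN.findall(code)


def is_identifier(tok):
    """True for names that may be renamed (i.e., not keywords, numbers, or punctuation)."""
    return (tok[0].isalpha() or tok[0] == "_") and tok not in KEYWORDS


def is_p_match(a, b):
    """Check whether fragments a and b match up to a systematic renaming of identifiers."""
    ta, tb = tokenize(a), tokenize(b)
    if len(ta) != len(tb):
        return False
    fwd, bwd = {}, {}  # renaming a->b and its inverse, to enforce a one-to-one mapping
    for x, y in zip(ta, tb):
        if is_identifier(x) and is_identifier(y):
            if fwd.setdefault(x, y) != y or bwd.setdefault(y, x) != x:
                return False
        elif x != y:
            return False
    return True
```

Under these assumptions, `is_p_match("x = y + x;", "a = b + a;")` holds because `x` is renamed to `a` and `y` to `b` throughout, whereas `"x = y + x;"` and `"a = b + b;"` do not match because `x` would have to map to both `a` and `b`.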
The above-described techniques, as well as others readily known in the art, detect similarities of programs using source code files embodying the program. Thus, for applications where the source code files are available, such techniques prove quite useful. However, in applications where the source code files are not available, e.g., where only binary code is at hand, these techniques are not as useful in detecting similarities between such binary programs. For example, in certain applications a single primary change to a source file, e.g., the insertion of a line of code, may result in many secondary changes to the compiled binary executable file as compared to the compiled binary executable file for the original, i.e., unaltered, source file. The secondary changes could include, e.g., changes to tables that are frequently referenced, changes to pointers, or changes in the encoding of jump instructions that jump across the newly inserted code. Therefore, two programs that have almost identical source code and functionality may have vastly different forms in their binary executable files. As such, source-based techniques which look for common sequences of bytes will not be very effective in detecting similarities in such disparate binary files.
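The effect of a primary change on jump encodings can be illustrated with a minimal sketch. The two-byte instruction set, the opcode values and the helper names below are hypothetical and chosen only for the example; they do not correspond to any real processor or to Java bytecode. Jump operands are assumed to be absolute byte addresses.

```python
# Hypothetical toy instruction set: every instruction is 2 bytes (opcode, operand).
OPCODES = {"nop": 0, "load": 1, "add": 2, "store": 3, "jmp": 4}


def assemble(program):
    """Assemble (label, op, arg) triples; jump args are labels resolved to absolute byte addresses."""
    addresses = {label: 2 * i for i, (label, _, _) in enumerate(program) if label}
    out = bytearray()
    for _, op, arg in program:
        operand = addresses[arg] if op == "jmp" else arg
        out += bytes([OPCODES[op], operand])
    return bytes(out)


def secondary_changes(old, new, insert_at, insert_len):
    """Byte offsets that still differ after the inserted bytes are removed from `new`."""
    realigned = new[:insert_at] + new[insert_at + insert_len:]
    return [i for i, (a, b) in enumerate(zip(old, realigned)) if a != b]


original = [
    (None, "load", 10),
    (None, "add", 1),
    ("loop", "store", 10),
    (None, "jmp", "loop"),
    (None, "jmp", "end"),
    ("end", "nop", 0),
]
# Primary change: one instruction inserted after the first instruction.
modified = original[:1] + [(None, "nop", 0)] + original[1:]

old_bytes = assemble(original)
new_bytes = assemble(modified)
# The inserted instruction occupies 2 bytes starting at byte offset 2.
changed = secondary_changes(old_bytes, new_bytes, insert_at=2, insert_len=2)
```

In this sketch, even after the inserted bytes are discounted, the operands of both jump instructions differ between the two binaries because their targets lie beyond the insertion point.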
In particular, the emerging use of so-called Java™ bytecodes, particularly in the form of applets, for executing programs via the World Wide Web is one area where source code similarity techniques are not useful. As is well-known, Java is a popular programming language which enables users to create applications that can be used and executed across the Internet without concerns about platform compatibility or network security. That is, Java is a platform-neutral language, which means that programs developed using Java can execute on any computer system without the need for any modifications. Such platform independence stems from the use of a special format for compiled Java programs called "bytecodes", which are a set of instructions that look similar to conventional machine code but are not specific to any one processor. Thus, Java bytecode can be read and executed by any computer system that has a Java interpreter.
This is in contrast to compilers for non-Java programming languages, e.g., the well-known C programming language, which translate source programs into machine code or processor instructions which are specific to the processor or computer system. In such non-Java systems, if one wants to use the same program on another computer system, the source program must be found and provided as input to the compiler for the different system for recompilation. Thereafter, the recompiled program can be executed on the different computer system. In contrast, to execute a Java program, a Java compiler generates Java bytecodes, which are then executed by a Java interpreter, i.e., a bytecode interpreter. Thus, placing the Java program in bytecode form enables the execution of such programs across any platform, operating system, or windowing system so long as the Java interpreter is available.
The capability of having a single binary file, i.e., a Java bytecode file, executable across multiple platforms is a key attribute which is making Java bytecode, particularly in the form of applets, a common way of executing programs across the World Wide Web (which, as is well-known, is also platform-independent). In the near term, it is projected that various types of hardware devices, e.g., stand-alone computers, network computers, information appliances, home appliances and the like, will be controlled using Java bytecode programs which will be transmitted to such hardware devices across the Internet and World Wide Web. As will be appreciated, the need to control such bytecode programs in terms of areas such as security, update management, portability, handling preferences, deletions, and so on, will prove critical as such programs are exchanged among users.
As discussed above, while known detection techniques are effective in comparing source code programs, these prior art solutions necessarily rely on having access to the original source code in order to identify similarities between programs. Further, such prior art techniques cannot effectively identify similarities in large numbers of different binary programs where large numbers of changes have occurred. An existing tool, the .RTPatch® patch-build program available from Pocket Soft, Inc., Houston, Tex. 77282, is useful in comparing two particular binary files for creating individual program patches for repairing or updating programs. This tool allows for the efficient distribution of only the changes to the program to the eventual end-users but does not appear to be directed to finding similarities in and between large numbers of binary files.
Therefore, a need exists for a technique which detects similarities between a large number of binary programs, e.g., Java bytecodes, without access to the underlying source code.