1. Field of the Invention
The present invention relates to software tools for comparing program source code files to determine the amount of similarity between the files and to pinpoint specific sections that are similar. In particular, the present invention relates to finding pairs of source code files that have been copied, in full or in part, from each other or from a common third file.
2. Discussion of the Related Art
Plagiarism detection programs and algorithms have been around for a number of years but have gotten more attention recently due to two main factors. One reason is that the Internet and search engines like Google have made source code very easy to obtain. Another reason is the growing open source movement that allows programmers all over the world to write, distribute, and share code. It follows that plagiarism detection programs have become more sophisticated in recent years. An excellent summary of available tools is given by Paul Clough in his paper, “Plagiarism in natural and programming languages: an overview of current tools and technologies.” Clough discusses tools and algorithms for finding plagiarism in generic text documents as well as in programming language source code files. The present invention only relates to tools and algorithms for finding plagiarism in programming language source code files and so the discussion will be confined to those types of tools. Following are brief descriptions of four of the most popular tools and their algorithms.
The Plague program was developed by Geoff Whale at the University of New South Wales. Plague uses an algorithm that creates what is called a structure-metric, based on matching code structures rather than matching the code itself. The idea is that two pieces of source code that have the same structures are likely to have been copied. The Plague algorithm ignores comments, variable names, function names, and other elements that can easily be globally or locally modified in an attempt to fool a plagiarism detection tool.
Plague has three phases to its detection, as illustrated in FIG. 1:
1. In the first phase 101, a sequence of tokens and structure metrics is created to form a structure profile for each source code file. In other words, each program is boiled down to basic elements that represent control structures and data structures in the program.
2. In the second phase 102, the structure profiles are compared to find similar code structures. Pairs of files with similar code structures are moved into the next stage.
3. In the final stage 103, token sequences within matching source code structures are compared using a variant of the Longest Common Subsequence (LCS) algorithm to find similarity.
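By way of illustration only, the token comparison of the final stage 103 can be sketched with the textbook LCS computation. The particular LCS variant used by Plague is not described here, so the function names, the sample token lists, and the normalization of the score to a value between 0 and 1 are assumptions made for this sketch and are not Plague's actual implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)  # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a token in a or b
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """Assumed score: 0.0 for nothing in common, 1.0 for identical sequences."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


# Hypothetical token sequences for two structurally similar functions.
tokens_a = ["if", "assign", "while", "assign", "return"]
tokens_b = ["if", "while", "assign", "assign", "return"]
score = lcs_similarity(tokens_a, tokens_b)
```

Because a subsequence must preserve order, moving a single token lowers the score even though both lists contain exactly the same tokens.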
Clough points out three problems with Plague:
1. Plague is hard to adapt to new programming languages because it is so dependent on expert knowledge of the programming language of the source code it is examining. The tokens depend on specific language statements and the structure metrics depend on specific programming language structures.
2. The output of Plague consists of two indices, H and HT, that need interpretation. While the output of each plagiarism detection program presented here relies on expert interpretation, results from Plague are particularly obscure.
3. Plague uses UNIX shell tools for processing, which makes it slow. This is not an innate problem with the algorithm, which can be ported to compiled code for faster processing.
There are other problems with Plague:
1. Plague is vulnerable to changing the order of code lines in the source code.
2. Plague throws out useful information when it discards comments, variable names, function names, and other identifiers.
The first point is a problem because code sections can be rearranged and individual lines can be reordered to fool Plague into giving lower scores or missing copied code altogether. This is one method that sophisticated plagiarists use to hide malicious code theft.
The second point is a problem because comments, variable names, function names, and other identifiers can be very useful in finding plagiarism. These identifiers can pinpoint copied code immediately. Even in many cases of intentional copying, comments are left in the copied code and can be used to find matches. Common misspellings or the use of particular words throughout the program in two sets of source code can help identify them as having the same author even if the code structures themselves do not match. As we will see, this is a common problem with these plagiarism tools.
The YAP programs (YAP, YAP2, YAP3) were developed by Michael Wise at the University of Sydney, Australia. YAP stands for “Yet Another Plague” and is an extension of Plague. All three versions of YAP use algorithms, illustrated in FIG. 2, that can generally be described in two phases as follows:
1. In the first phase 201, generate a list of tokens for each source code file.
2. In the second phase 202, compare pairs of token files.
The first phase of the algorithm is identical for all three programs. The steps of this phase, illustrated in FIG. 2, are:
1. In step 203, remove comments and string constants.
2. In step 204, translate upper-case letters to lower-case.
3. In step 205, map synonyms to a common form. In other words, substitute a basic set of programming language statements for common, nearly equivalent statements. As an example using the C language, the language keyword “strncmp” would be mapped to “strcmp”, and the language keyword “function” would be mapped to “procedure”.
4. In step 206, reorder the functions into their calling order. The first call to each function is expanded inline and tokens are substituted appropriately. Each subsequent call to the same function is simply replaced by the token FUN.
5. In step 207, remove all tokens that are not specifically programming language keywords.
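A simplified sketch of this first phase, given for illustration only, is shown below. The keyword set and synonym table are hypothetical and far smaller than any real tool would use, the comment and string patterns assume C-style syntax, and step 206 (reordering and inline expansion of functions) is omitted because it would require parsing function definitions.

```python
import re

# Hypothetical, deliberately tiny tables used only for this illustration.
KEYWORDS = {"if", "else", "while", "for", "return", "strcmp"}
SYNONYMS = {"strncmp": "strcmp"}

# One pattern for string constants, block comments, and line comments, so that
# comment markers inside strings (and quotes inside comments) are handled.
STRIP = re.compile(r'"(?:\\.|[^"\\])*"|/\*.*?\*/|//[^\n]*', re.DOTALL)


def tokenize(source):
    """Reduce C-like source text to a list of keyword tokens."""
    source = STRIP.sub(" ", source)                      # step 203
    source = source.lower()                              # step 204
    words = re.findall(r"[a-z_][a-z0-9_]*", source)
    words = [SYNONYMS.get(w, w) for w in words]          # step 205
    return [w for w in words if w in KEYWORDS]           # step 207


sample = 'if (strncmp(a, "x // y", 3) == 0) { /* while */ return Count; } // for'
tokens = tokenize(sample)
```

Note that the identifier `Count` and the words inside the comments and the string constant do not survive, which is the information loss discussed for these tools.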
The second phase 202 of the algorithm is identical for YAP and YAP2. YAP relied on the sdiff function in UNIX to compare lists of tokens for the longest common sequence of tokens. YAP2, implemented in Perl, improved performance in the second phase 202 by utilizing a more sophisticated algorithm known as Heckel's algorithm. One limitation of YAP and YAP2 that was recognized by Wise was difficulty dealing with transposed code. In other words, functions or individual statements could be rearranged to hide plagiarism. So for YAP3, the second phase uses the Running-Karp-Rabin Greedy-String-Tiling (RKR-GST) algorithm that is more immune to tokens being transposed.
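The tiling idea behind the second phase of YAP3 can be sketched as follows. This is a brute-force illustration of greedy string tiling only: the Karp-Rabin hashing that RKR-GST uses for speed is left out, and the function names, the `min_match` parameter, and the similarity formula are assumptions of this sketch rather than details of YAP3.

```python
def greedy_string_tiling(a, b, min_match=3):
    """Return tiles (start_a, start_b, length) of matching, unmarked tokens."""
    if min_match < 1:
        raise ValueError("min_match must be at least 1")
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        max_len = min_match
        matches = []
        # Find all longest runs of equal tokens not yet covered by a tile.
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > max_len:
                    max_len, matches = k, [(i, j, k)]
                elif k == max_len:
                    matches.append((i, j, k))
        if not matches:
            return tiles
        for i, j, k in matches:
            # Skip a match that overlaps a tile created earlier in this round.
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for n in range(k):
                marked_a[i + n] = marked_b[j + n] = True
            tiles.append((i, j, k))


def tiling_similarity(a, b, min_match=3):
    """Assumed score: fraction of all tokens that are covered by tiles."""
    if not a and not b:
        return 1.0
    covered = sum(k for _, _, k in greedy_string_tiling(a, b, min_match))
    return 2 * covered / (len(a) + len(b))


# Two token lists whose halves have been swapped (transposed code).
first = list("abcdefgh")
second = list("efghabcd")
swapped_tiles = greedy_string_tiling(first, second)
```

In this example the two swapped halves are each recovered as a tile, so transposing blocks does not lower the score. Runs shorter than `min_match` are never tiled, which is how a minimum match length can cause short matches to be missed.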
YAP3 is an improvement over Plague in that it does not attempt a full parse of the programming language as Plague does. This simplifies the task of modifying the tool to work with other programming languages. Also, the new algorithm is better able to find matches in transposed lines of code.
There are still problems with YAP3 that need to be noted:
1. In order to decrease the run time of the program, the RKR-GST algorithm uses hashing and only considers matches of strings of a minimal length. This opens up the algorithm to missing some matches.
2. The tokens used by YAP3 are still dependent on knowledge of the particular programming language of the files being compared.
3. Although less so than Plague, YAP3 is still vulnerable to changing the order of code lines in the source code.
4. YAP3 throws out much useful information when it discards comments, variable names, function names, and other identifiers that can be, and have been, used to find source code with common origins.
JPlag is a program, written in Java by Lutz Prechelt and Guido Malpohl of the University of Karlsruhe and Michael Philippsen of the University of Erlangen-Nuremberg, to detect plagiarism in Java, Scheme, C, or C++ source code. Like other plagiarism detection programs, JPlag works in phases as illustrated in FIG. 3:
1. There are two steps in the first phase 301. In the first step 303, whitespace, comments, and identifier names are removed. As with Plague and the YAP programs, in the second step 304, the remaining language statements are replaced by tokens.
2. As with YAP3, the method of Greedy String Tiling is used to compare tokens in different files in the second phase 302. More matching tokens correspond to a higher degree of similarity and a greater chance of plagiarism.
As can be seen from the description, JPlag is nearly identical in its algorithm to YAP3, though it uses different optimization procedures for reducing runtime. One difference is that JPlag produces HTML output with detailed plots comparing file similarities. It also allows the user to click on a file combination to bring up windows showing both files with areas of similarity highlighted. JPlag shares the limitations of YAP3 that have been listed previously.
The Measure of Software Similarity (MOSS) program was developed at the University of California at Berkeley by Alex Aiken. MOSS uses a winnowing algorithm. The MOSS algorithm can be described by these steps, as illustrated in FIG. 4:
1. In the first step 401, remove all whitespace and punctuation from each source code file and convert all characters to lower case.
2. In the second step 402, divide the remaining non-whitespace characters of each file into k-grams, which are contiguous substrings of length k, by sliding a window of size k through the file. In this way the second character of the first k-gram is the first character of the second k-gram and so on.
3. In the third step 403, hash each k-gram and select a subset of all k-grams to be the fingerprints of the document. The fingerprint includes information about the position of each selected k-gram in the document.
4. In the fourth step 404, compare file fingerprints to find similar files.
An example of the algorithm for creating these fingerprints is shown in FIG. 5. Some text to be compared is shown in part (a) 501. The 5-grams derived from the text are shown in part (b) 502. A possible sequence of hashes is shown in part (c) 503. A possible selection of hashes chosen to be the fingerprint for the text is shown in part (d) 504. The concept is that the hash function is chosen so that the probability of collisions is very small. Whenever two documents share fingerprints, it is therefore extremely likely that they share k-grams as well and thus contain plagiarized code.
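The four steps can be sketched as follows, for illustration only. The selection rule shown (keep the rightmost minimum hash in each window of consecutive hashes) is the commonly published form of winnowing and is assumed here. The use of CRC-32 as the hash, the `window` parameter, the function names, and the sample text are likewise assumptions of this sketch and are not taken from MOSS or from FIG. 5.

```python
import zlib


def normalize(text):
    """Step 401: drop whitespace and punctuation, convert to lower case."""
    return "".join(ch.lower() for ch in text if ch.isalnum())


def kgrams(text, k=5):
    """Step 402: all contiguous substrings of length k, one per start position."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def fingerprint(text, k=5, window=4):
    """Step 403: hash every k-gram, keep the minimum hash of each window."""
    hashes = [zlib.crc32(g.encode("utf-8")) for g in kgrams(normalize(text), k)]
    n_windows = max(len(hashes) - window + 1, 1) if hashes else 0
    selected = set()
    for start in range(n_windows):
        win = hashes[start:start + window]
        smallest = min(win)
        # Rightmost occurrence of the minimum within this window.
        offset = len(win) - 1 - win[::-1].index(smallest)
        selected.add((smallest, start + offset))  # (hash, position of k-gram)
    return selected


def shared_hashes(fp_a, fp_b):
    """Step 404: hash values that appear in both fingerprints."""
    return {h for h, _ in fp_a} & {h for h, _ in fp_b}


text = "A do run run run, a do run run"
grams = kgrams(normalize(text))
fp = fingerprint(text)
```

Because only one hash per window is kept, most k-grams are discarded, and any statement that normalizes to fewer than k characters (for example `i++;`) yields no k-gram and therefore no fingerprint at all.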
Of all the programs discussed here, MOSS throws out the most information. The algorithm attempts to keep enough critical information to flag similarities. The algorithm is also noted to have a very low occurrence of false positives. The problem with using this algorithm for detecting source code plagiarism is that it produces a high occurrence of false negatives. In other words, matches can be missed. The reasons for this are as follows:
1. By treating source code files like generic text files, much structural information is lost that can be used to find matches. For example, whitespace, punctuation, and uppercase characters have significant meaning in programming languages but are thrown out by MOSS.
2. Smaller k-grams increase the sensitivity of the program but also increase its execution time. MOSS trades sensitivity for speed and typically uses 5-grams. However, many programming language statements are shorter than 5 characters and can be missed.
3. Most of the k-grams are also thrown out, reducing the accuracy even further.