1. Field of the Invention
The present invention relates to software tools for comparing text files to determine the amount of similarity between the files. In particular, the present invention relates to searching the Internet to determine the frequency of usage of terms that are common between two programs, in order to determine whether the files have been copied or derived, in full or in part, from each other or from a common third file.
2. Discussion of the Related Art
Software plagiarism detection programs and algorithms have been around for a number of years but have received more attention recently due to two main factors. One reason is that the Internet and search engines like Google have made source code very easy to obtain. Another reason is the growing open source movement that allows programmers all over the world to write, distribute, and share code. Consequently, plagiarism detection programs have become more sophisticated in recent years. An excellent summary of available tools is given by Paul Clough in his paper, “Plagiarism in natural and programming languages: an overview of current tools and technologies.” Clough discusses tools and algorithms for finding plagiarism in generic text documents as well as in programming language source code files. Brief descriptions of prior art consisting of four of the most popular tools and their algorithms follow.
The prior art Plague program was developed by Geoff Whale at the University of New South Wales. Plague uses an algorithm that creates what is called a structure-metric, based on matching code structures rather than matching the code itself. The idea is that two pieces of source code that have the same structures are likely to have been copied. The Plague algorithm ignores comments, variable names, function names, and other elements that can easily be globally or locally modified in an attempt to fool a plagiarism detection tool.
Plague has three phases to its detection, as illustrated in FIG. 1:
In the first phase 101, a sequence of tokens and structure metrics are created to form a structure profile for each source code file. In other words, each program is boiled down to basic elements that represent control structures and data structures in the program.
In the second phase 102, the structure profiles are compared to find similar code structures. Pairs of files with similar code structures are moved into the next stage.
In the final stage 103, token sequences within matching source code structures are compared using a variant of the Longest Common Subsequence (LCS) algorithm to find similarity.
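The Longest Common Subsequence comparison used in this final stage can be sketched as follows. This is a minimal illustration of the standard LCS dynamic-programming algorithm, not Plague's actual variant, and the token names are hypothetical.

```python
# Sketch of the final Plague stage: comparing two token sequences with
# the classic Longest Common Subsequence (LCS) algorithm. Plague uses a
# variant of this; the token names below are illustrative only.

def lcs_length(a, b):
    """Return the length of the longest common subsequence of a and b."""
    # prev[j] holds the LCS length of the prefixes seen so far;
    # we sweep the dynamic-programming table one row at a time.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

tokens1 = ["IF", "ASSIGN", "CALL", "LOOP", "ASSIGN"]
tokens2 = ["IF", "CALL", "LOOP", "ASSIGN", "RETURN"]
shared = lcs_length(tokens1, tokens2)          # 4 tokens in common order
similarity = 2 * shared / (len(tokens1) + len(tokens2))
```

A high ratio of shared subsequence length to total token count suggests structurally similar code, which is the signal Plague's final stage looks for.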
The prior art YAP programs (YAP, YAP2, and YAP3) were developed by Michael Wise at the University of Sydney, Australia. YAP stands for “Yet Another Plague” and is an extension of Plague. All three versions of YAP use algorithms, illustrated in FIG. 2, that can generally be described in two phases as follows:
In the first phase 201, generate a list of tokens for each source code file.
In the second phase 202, compare pairs of token files.
The first phase of the algorithm is identical for all three programs. The steps of this phase, illustrated in FIG. 2, are:
In step 203 remove comments and string constants.
In step 204 translate upper-case letters to lower-case.
In step 205, map synonyms to a common form. In other words, substitute a basic set of programming language statements for common, nearly equivalent statements. As an example using the C language, the library function name “strncmp” would be mapped to “strcmp”; similarly, the keyword “function” in one language would be mapped to the equivalent keyword “procedure”.
In step 206, reorder the functions into their calling order. The first call to each function is expanded inline and tokens are substituted appropriately. Each subsequent call to the same function is simply replaced by the token FUN.
In step 207, remove all tokens that are not specifically programming language keywords.
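The normalization steps above can be sketched as follows. This is a simplified illustration of steps 203, 204, 205, and 207 for C-like source; the keyword and synonym tables are small illustrative subsets (the retained token set here is an assumption, not YAP's actual list), and the function reordering of step 206 is omitted.

```python
import re

# Simplified sketch of the YAP first-phase normalization. The TOKEN and
# SYNONYM tables are illustrative subsets; step 206 (function reordering)
# is omitted for brevity.

TOKENS = {"if", "else", "for", "while", "return", "strcmp"}  # illustrative
SYNONYMS = {"strncmp": "strcmp"}  # map near-equivalent forms to one token

def tokenize(source):
    # Step 203: remove comments and string constants.
    source = re.sub(r"/\*.*?\*/|//[^\n]*", " ", source, flags=re.DOTALL)
    source = re.sub(r'"[^"]*"', " ", source)
    # Step 204: translate upper-case letters to lower-case.
    source = source.lower()
    # Step 205: map synonyms to a common form, then
    # step 207: keep only the recognized tokens.
    words = re.findall(r"[a-z_][a-z0-9_]*", source)
    words = [SYNONYMS.get(w, w) for w in words]
    return [w for w in words if w in TOKENS]

tokens = tokenize('if (strncmp(s, "YES", 3)) return 1; /* check */')
# -> ['if', 'strcmp', 'return']
```

Note how the identifier `s`, the string constant, and the comment all vanish, and `strncmp` is folded into `strcmp`, so cosmetic edits to any of them cannot hide a match.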
The second phase 202 of the algorithm is identical for YAP and YAP2. YAP relied on the UNIX sdiff utility to compare lists of tokens for the longest common sequence of tokens. YAP2, implemented in Perl, improved performance in the second phase 202 by utilizing a more sophisticated algorithm known as Heckel's algorithm. One limitation of YAP and YAP2 that Wise recognized was difficulty dealing with transposed code; in other words, functions or individual statements could be rearranged to hide plagiarism. For YAP3, therefore, the second phase uses the Running-Karp-Rabin Greedy-String-Tiling (RKR-GST) algorithm, which is more immune to tokens being transposed.
The prior art JPlag is a program, written in Java by Lutz Prechelt and Guido Malpohl of the University of Karlsruhe and Michael Philippsen of the University of Erlangen-Nuremberg, to detect plagiarism in Java, Scheme, C, or C++ source code. Like other plagiarism detection programs, JPlag works in phases as illustrated in FIG. 3:
There are two steps in the first phase 301. In the first step 303, whitespace, comments, and identifier names are removed. As with Plague and the YAP programs, in the second step 304, the remaining language statements are replaced by tokens.
As with YAP3, the method of Greedy String Tiling is used to compare tokens in different files in the second phase 302. A larger number of matching tokens corresponds to a higher degree of similarity and a greater chance of plagiarism.
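Greedy String Tiling can be sketched as follows. This is a bare-bones illustration of the idea, without the Running-Karp-Rabin hashing speedup that YAP3 and JPlag use in practice: repeatedly find the longest common run of unmarked tokens, mark it as a tile, and stop when no run reaches the minimum match length. The token names are hypothetical.

```python
# Bare-bones sketch of Greedy String Tiling (no Karp-Rabin speedup).
# Because tiles are marked rather than consumed in order, transposed
# blocks of code still match, unlike a plain LCS comparison.

def greedy_string_tiling(a, b, min_match=2):
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []  # each tile is (start in a, start in b, length)
    while True:
        best = (0, 0, 0)  # (length, start in a, start in b)
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > best[0]:
                    best = (k, i, j)
        length, i, j = best
        if length < min_match:
            break
        for k in range(length):
            marked_a[i + k] = marked_b[j + k] = True
        tiles.append((i, j, length))
    return tiles

a = ["IF", "ASSIGN", "CALL", "RETURN", "LOOP", "ASSIGN"]
b = ["LOOP", "ASSIGN", "IF", "ASSIGN", "CALL", "RETURN"]
tiles = greedy_string_tiling(a, b)
coverage = sum(t[2] for t in tiles) / max(len(a), len(b))
```

Here the two sequences are the same two blocks in swapped order, and the tiles still cover both files completely, illustrating why tiling resists transposition.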
The prior art Measure of Software Similarity (MOSS) program was developed at the University of California at Berkeley by Alex Aiken. MOSS uses a winnowing algorithm. The MOSS algorithm can be described by these steps, as illustrated in FIG. 4:
In the first step 401, remove all whitespace and punctuation from each source code file and convert all characters to lower case.
In the second step 402, divide the remaining non-whitespace characters of each file into k-grams, which are contiguous substrings of length k, by sliding a window of size k through the file. In this way the second character of the first k-gram is the first character of the second k-gram and so on.
In the third step 403, hash each k-gram and select a subset of all k-grams to be the fingerprints of the document. The fingerprint includes information about the position of each selected k-gram in the document.
In the fourth step 404, compare file fingerprints to find similar files.
An example of the algorithm for creating these fingerprints is shown in FIG. 5. Some text to be compared 501 is shown in FIG. 5A. The 5-grams 502 derived from the text 501 are shown in FIG. 5B. A possible sequence of hashes 503 is shown in FIG. 5C. A possible selection of hashes 504 chosen to be the fingerprint for the text 501 is shown in FIG. 5D. The hash function is chosen so that the probability of collisions is very small; thus, whenever two documents share fingerprints, it is extremely likely that they share k-grams as well and thus contain copied code.
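Steps 401 through 404 can be sketched as follows. This is an illustration of the winnowing idea (keeping the minimum hash in each window of consecutive hashes), not MOSS's actual implementation; Python's built-in `hash` stands in for a proper rolling hash, and the values of k and the window size are illustrative.

```python
# Sketch of MOSS-style fingerprinting (steps 401-403) and comparison
# (step 404). The built-in hash() is a stand-in for a real rolling hash;
# k and window sizes are illustrative.

def fingerprints(text, k=5, window=4):
    # Step 401: drop whitespace and punctuation, lower-case everything.
    chars = "".join(c.lower() for c in text if c.isalnum())
    # Step 402: slide a window of size k to form the k-grams.
    kgrams = [chars[i:i + k] for i in range(len(chars) - k + 1)]
    # Step 403: hash each k-gram, then winnow: in each window of
    # consecutive hashes, keep the minimum hash and its position.
    hashes = [hash(g) & 0xFFFF for g in kgrams]
    selected = set()
    for i in range(len(hashes) - window + 1):
        w = hashes[i:i + window]
        j = i + w.index(min(w))
        selected.add((hashes[j], j))
    return selected

doc1 = fingerprints("A do run run run, a do run run")
doc2 = fingerprints("a do run run run")
# Step 404: compare fingerprint sets; shared hashes imply shared k-grams.
shared = {h for h, _ in doc1} & {h for h, _ in doc2}
```

Winnowing guarantees that any common substring at least window + k - 1 characters long contributes at least one shared fingerprint, so substantial copying cannot slip through even though most k-grams are discarded.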
The prior art CodeMatch® program (CodeSuite is a registered trademark of Software Analysis & Forensic Engineering Corporation) was developed by Robert Zeidman and is sold by Software Analysis & Forensic Engineering Corporation. CodeMatch corrects many, if not all, of the deficiencies noted in the previous programs. Initially CodeMatch divides the source code files for two different programs into lists of basic elements consisting of statements, comments, strings, and identifiers as shown in FIG. 6. A snippet of source code 601 is shown in FIG. 6A. The statement list 602 derived from the source code 601 is shown in FIG. 6B. The comment/string list 603 derived from the source code 601 is shown in FIG. 6C. The identifier list 604 derived from the source code 601 is shown in FIG. 6D.
CodeMatch then uses the method illustrated in FIG. 7 to calculate a correlation between the two sets of files. In the first step 701, the statement, comment and string, and identifier lists for the two files to be compared are created. In the second step 702, the statement lists of the two files are compared using a statement matching algorithm. In the third step 703, the comment and string lists of the two files are compared using a comment and string matching algorithm. In the fourth step 704, the identifier lists of the two files are compared using an identifier matching algorithm. In the fifth step 705, the identifier lists of the two files are compared using a partial identifier matching algorithm. In the sixth step 706, the statement lists of the two files are compared using a statement sequence matching algorithm. Although all matching algorithms produce output for the user, in the seventh step 707, the results of all matching algorithms are combined into a single correlation score.
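The combining step 707 can be sketched as follows. This is a hypothetical illustration only: it assumes each matching algorithm from steps 702 through 706 yields a normalized score in [0, 1] and folds them together with a weighted average. The score names, weights, and combining rule are invented for illustration and are not CodeMatch's actual formula.

```python
# Hypothetical sketch of step 707: combining the five matching-algorithm
# scores (steps 702-706) into one correlation value. The equal weights
# and the weighted-average rule are illustrative assumptions.

def combine(scores, weights=None):
    names = ["statement", "comment_string", "identifier",
             "partial_identifier", "statement_sequence"]
    weights = weights or {n: 1.0 for n in names}
    total = sum(weights[n] for n in names)
    return sum(weights[n] * scores[n] for n in names) / total

scores = {"statement": 0.8, "comment_string": 0.6, "identifier": 0.9,
          "partial_identifier": 0.5, "statement_sequence": 0.7}
correlation = combine(scores)  # equal weights -> the mean, 0.70
```

A single combined score lets an examiner rank file pairs by likelihood of copying before inspecting the per-algorithm output that each step also produces.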
All of these prior art methods identify possibly plagiarized computer code, but rely on subjective determinations about whether or not plagiarism actually occurred. Finding a correlation between the source code files for two different programs does not necessarily mean that plagiarism occurred. It has been determined that there are exactly six reasons for correlation between the source code for two different programs. These reasons can be summarized as follows.
Third-Party Source Code. It is possible that widely available open source code is used in both programs. Also, libraries of source code can be purchased from third-party vendors. If two different programs use the same third-party code, the programs will be correlated.
Code Generation Tools. Automatic code generation tools, such as Microsoft Visual Basic or Adobe Dreamweaver, generate software source code that looks very similar with similar and often identical elements. The structure of the code generated by these tools tends to fit into specific templates with identifiable patterns. Two different programs that were developed using the same code generation tool will be correlated.
Commonly Used Identifier Names. Certain identifier names are commonly taught in schools or commonly used by programmers in certain industries. For example, the identifier “result” is often used to hold the result of an operation. These identifiers will be found in many unrelated programs and will result in these programs being correlated.
Common Algorithms. An algorithm is a procedure or a set of instructions for accomplishing some task. In one programming language there may be an easy or well-understood way of writing a particular algorithm that most programmers use. For example there might be a way to alphabetically sort a list of names. Perhaps this algorithm is taught in most programming classes at universities or is found in a popular programming textbook. These commonly used algorithms will show up in many different programs, resulting in a high degree of correlation between the programs even though there was no direct contact between the programmers.
Common Author. It is possible that one programmer, or “author,” will create two programs that have correlation simply because that programmer tends to write code in a certain way. This is the programmer's style of coding. Thus two programs written by the same programmer can be correlated due to the style being similar, even though there was no copying and the functionality of each program differs from that of the other.
Copied Code (Authorized or Plagiarized). Code was copied from one program to another, causing the programs to be correlated. The copying may have taken place for only certain sections of the code and may include small or significant changes to the code. When each of the previous reasons for correlation has been eliminated, the reason that remains is copying. If the copying was not authorized by the original owner, then it comprises plagiarism.
A useful tool is one that can help determine which of these factors caused the correlation, and thereby whether plagiarism occurred.