1. Field of the Invention
The present invention relates to software tools for comparing program source code files to determine the amount of similarity, or “correlation,” between the files and to pinpoint specific sections that are similar. In particular, the present invention relates to improving the analysis and interpretation of the results of a source code comparison by filtering out elements that are irrelevant to the comparison.
2. Discussion of the Related Art
Programs and algorithms that determine software source code correlation have existed for a number of years but have received more attention recently for two main reasons. One reason is that the Internet and search engines like Google have made source code very easy to obtain. Another reason is the growing open source movement, which allows programmers all over the world to write, distribute, and share code. As a result, programs that determine software source code correlation have become more sophisticated in recent years. The amount of code to be compared has also grown, especially as software projects have grown larger.
Finding a correlation between different sets of source code does not necessarily imply that illicit behavior occurred. Programs can be correlated for a number of reasons, as enumerated below.
Third-Party Source Code. It is possible that widely available open source code is used in both programs. Also, libraries of source code can be purchased from third-party vendors.
Code Generation Tools. Automatic code generation tools generate software source code using similar or identical identifiers for variables, classes, methods, and properties. Also, the structure of the code generated by these tools tends to fit into a certain template with an identifiable pattern.
Commonly Used Identifier Names. Certain identifier names are commonly taught in schools or commonly used by programmers in certain industries. For example, the identifier “result” is often used to hold the result of an operation.
Common Algorithms. Certain algorithms are most easily implemented using a certain sequence of statements in a particular programming language. Commonly used algorithms, such as those for elementary functions, will often be coded in very similar ways and may have a high degree of correlation even though there was no direct contact between the authors.
Common Author. It is possible that one programmer will create two programs that have correlations simply because the programmer tends to use certain identifiers and tends to write code in a certain way.
Plagiarism. Code was copied from one program to another.
When a correlation program is run on sets of source code, the user is often looking for one specific kind of correlation. For example, a user looking for correlation due to plagiarism wants to eliminate the other five sources of correlation. The specific reasons for correlation often cannot be determined until after a correlation program has been run and the results analyzed. At that time, it would be useful to be able to filter out correlation results due to forms of correlation that are not relevant. The present invention is a tool for doing just that.
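By way of illustration only, the idea of filtering correlation results after a comparison has been run might be sketched as follows. This is a minimal sketch and not the implementation of the invention: the Match record, the filter_matches function, the file names, and the sample identifiers are all hypothetical, and the sketch assumes that a correlation program reports each matching element as a piece of text tagged with its kind.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """One correlated element reported for a pair of files (hypothetical shape)."""
    kind: str    # e.g. "identifier" or "statement"
    text: str    # the source text found in both files
    file_a: str
    file_b: str


def filter_matches(matches, excluded_texts):
    """Return only the matches whose text is not in the exclusion set."""
    excluded = set(excluded_texts)
    return [m for m in matches if m.text not in excluded]


# Hypothetical results from an earlier correlation run.
matches = [
    Match("identifier", "result", "a.c", "b.c"),
    Match("identifier", "calcOrbitDecay", "a.c", "b.c"),
    Match("statement", "i = i + 1;", "a.c", "b.c"),
]

# Elements the user has judged irrelevant, for example a commonly used
# identifier name and a statement typical of a common algorithm.
common = {"result", "i = i + 1;"}

remaining = filter_matches(matches, common)
for m in remaining:
    print(m.kind, m.text)
```

In this sketch, the exclusion set is supplied after the comparison, which mirrors the observation above that the reasons for correlation often cannot be determined until the results have been analyzed. Only the match on the uncommon identifier remains for further review.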