Plagiarism detection programs and algorithms have been around for a number of years but have received more attention recently due to two main factors. First, the Internet and search engines like Google have made source code very easy to obtain. Second, the open source movement has grown tremendously over the past several years, allowing programmers all over the world to write, distribute, and share code.
In recent years, plagiarism detection techniques have become more sophisticated. A summary of available tools is given by Paul Clough in his paper entitled “Plagiarism in natural and programming languages: an overview of current tools and technologies.” Clough discusses tools and algorithms for finding plagiarism in generic text documents as well as in programming language source code files.
A number of plagiarism detection programs are currently available, including the Plague program developed by Geoff Whale at the University of New South Wales; the YAP programs (YAP, YAP2, YAP3) developed by Michael Wise at the University of Sydney, Australia; the JPlag program, written by Lutz Prechelt and Guido Malpohl of the University of Karlsruhe and Michael Philippsen of the University of Erlangen-Nuremberg; the Measure of Software Similarity (MOSS) program developed at the University of California at Berkeley by Alex Aiken; and the CodeMatch® program developed by the inventor of the present invention, Robert Zeidman.
A deficiency of the aforementioned programs is that they require source code for both of the programs being compared. For most commercial software, source code is the proprietary and closely guarded intellectual property of a company and is not available. Nor is source code readily turned over to another party for comparison without a court order. The unavailability of a competitor's source code is therefore a problem for companies that wish to determine whether their own source code has been copied by that competitor.
Source code is typically compiled into electronic machine-readable ones and zeros (“bits” or “binary”) that comprise object code, which is given to users to run on their computers or, in the case of software libraries, linked into other programs. Some conventional software analysis tools can assist in detecting copied code by comparing object code with source code using one of several methods. For example, one method involves “decompiling” object code into the higher level language in which the source code was originally written and then comparing the two sets of source code. This method has a number of drawbacks. First, it is usually not known which programming language was originally used to develop the program, and decompiling the object code into source code of a different programming language, if it can be done at all, yields poor or in some cases unusable results. Second, even when the original programming language is known, recreating something that looks like the original code requires much more information, such as the particular compiler program, the compiler version, the compiler settings, and the libraries that were used to compile the original source code into object code. As a result, the decompiling method rarely produces usable results.
Another conventional method creates a detailed block diagram of the control flow or data flow of the executable program and compares it to a block diagram of the same kind created for the executable program that is produced by compiling the source code under comparison. A drawback of this method is that the diagrams can be extremely complex. As a result, their comparison is typically very time consuming and hard to automate.
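The cost of comparing such diagrams can be illustrated with a minimal Python sketch. It is not taken from any of the tools named above. It assumes, purely for illustration, that each control-flow diagram is a dictionary mapping a block name to a list of its successor blocks, that every block appears as a key, and that no edge is listed twice. The function names `degree_signature`, `edges`, and `same_shape` are hypothetical. The brute-force check tries every relabeling of the blocks, so its running time grows factorially with the number of blocks, which suggests why comparing realistic diagrams this way is slow.

```python
from itertools import permutations


def degree_signature(cfg):
    """Return sorted (in-degree, out-degree) pairs for all blocks.

    Equal signatures are necessary, but not sufficient, for two
    diagrams to have the same shape.
    """
    indeg = {node: 0 for node in cfg}
    for successors in cfg.values():
        for succ in successors:
            indeg[succ] += 1
    return sorted((indeg[node], len(cfg[node])) for node in cfg)


def edges(cfg):
    """Return the set of (source, destination) edges in the diagram."""
    return {(src, dst) for src, succs in cfg.items() for dst in succs}


def same_shape(cfg_a, cfg_b):
    """Brute-force test of whether two diagrams differ only in block names.

    Tries up to n! relabelings for n blocks (illustrative only).
    """
    if degree_signature(cfg_a) != degree_signature(cfg_b):
        return False  # cheap rejection before the expensive search
    nodes_a = list(cfg_a)
    edges_a = edges(cfg_a)
    target = edges(cfg_b)
    for perm in permutations(list(cfg_b)):
        mapping = dict(zip(nodes_a, perm))
        if {(mapping[s], mapping[d]) for s, d in edges_a} == target:
            return True
    return False


# Hypothetical example diagrams: an if/else, the same if/else with
# different block names, and a simple loop.
if_else = {"entry": ["then", "else"], "then": ["exit"],
           "else": ["exit"], "exit": []}
renamed = {"b0": ["b1", "b2"], "b1": ["b3"], "b2": ["b3"], "b3": []}
loop = {"entry": ["head"], "head": ["body", "exit"],
        "body": ["head"], "exit": []}

print(same_shape(if_else, renamed))
print(same_shape(if_else, loop))
```

In this sketch the two if/else diagrams match despite their different block names, while the loop diagram is rejected by the inexpensive degree check. Even four blocks can require up to 24 relabelings in the worst case, and the count grows factorially from there.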
Accordingly, it would be beneficial to have a plagiarism detection tool that can overcome the above limitations of the conventional techniques.