Plagiarism is, in general, the act of copying work authored by another, including writings or, particularly, code, and willfully failing to attribute the work to, or acknowledge, the original author. Plagiarism is easier to carry out, and easier to hide, than it has ever been before because of the increasing ubiquity of information and the diversity of information sources available through the internet. In response, several tools have been developed to detect plagiarism in writings or software code.
Extant tools or techniques for the detection of plagiarism in software code generally operate by comparing or matching suspect source code on a file-by-file basis. In some instances, a source code file may be preprocessed or converted to some intermediate form, and a matching algorithm that maps the source file to a target file may be applied thereafter. The output of such an operation generally takes the form of a number or a percentage that indicates a degree of plagiarism in the source file.
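By way of illustration only, the file-by-file approach described above may be sketched as follows. The sketch assumes Python source files, uses the standard-library tokenizer as the preprocessing step and `difflib.SequenceMatcher` as the matching algorithm, and reports a percentage. The function names `normalize` and `similarity_percent` are hypothetical and are not drawn from any particular extant tool.

```python
import difflib
import io
import keyword
import tokenize

# Token types that carry layout or comments rather than program content.
_SKIPPED = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def normalize(source: str) -> list[str]:
    """Convert source text to an intermediate form (hypothetical helper).

    Identifiers and literals are replaced by placeholders so that renaming
    a variable or rewording a comment does not change the token stream.
    """
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")
        elif tok.type == tokenize.NUMBER:
            tokens.append("NUM")
        elif tok.type == tokenize.STRING:
            tokens.append("STR")
        else:
            tokens.append(tok.string)
    return tokens


def similarity_percent(suspect: str, target: str) -> float:
    """Return a 0-100 score for one suspect file against one target file."""
    a, b = normalize(suspect), normalize(target)
    return 100.0 * difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
renamed = (
    "# sum a list\n"
    "def add_all(values):\n    acc = 0\n    for v in values:\n"
    "        acc += v\n    return acc\n"
)
print(similarity_percent(original, renamed))
```

In this sketch the two inputs differ only in identifier names and a comment, so their normalized token streams coincide and the score reaches its maximum. The single number returned for each pair of files is the kind of output metric discussed above.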
However, such an approach, absent more, may be unable to efficiently detect plagiarism that is intelligently distributed across multiple source files and obscured by exploiting the structure of the software code. For example, distributing plagiarized material across multiple files, classes, or functions in the body of the source code may successfully circumvent a plagiarism detection method that uses a percentage-based or threshold-based output metric, because the copied material in each of the compared source files is kept at a level below that flagged by the tool. One such technique may involve obscuring plagiarized source code by adapting it into object-oriented code through the adoption of one or more software design patterns into the code files.
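The circumvention described above may be illustrated with a minimal sketch. It assumes a simplified line-matching metric and a hypothetical per-file flagging threshold of 30 percent; the names `copied_percent`, `flagged`, and `reference_coverage`, and the toy file contents, are illustrative assumptions only and do not represent any particular tool or the method sought.

```python
THRESHOLD = 30.0  # hypothetical per-file flagging threshold, in percent


def _lines(text: str) -> list[str]:
    """Non-blank lines of a file, stripped of surrounding whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def copied_percent(suspect: str, reference: str) -> float:
    """Percentage of the suspect file's lines that also occur in the reference."""
    ref = set(_lines(reference))
    lines = _lines(suspect)
    if not lines:
        return 0.0
    return 100.0 * sum(1 for line in lines if line in ref) / len(lines)


def flagged(files: dict[str, str], reference: str) -> list[str]:
    """Names of suspect files whose per-file score meets the threshold."""
    return [name for name, text in files.items()
            if copied_percent(text, reference) >= THRESHOLD]


def reference_coverage(files: dict[str, str], reference: str) -> float:
    """Percentage of reference lines found anywhere in the suspect files."""
    ref = set(_lines(reference))
    seen = {line for text in files.values() for line in _lines(text)}
    return 100.0 * len(ref & seen) / len(ref)


# Toy reference file with twelve distinct lines.
reference = "\n".join(f"step_{i}()" for i in range(12))

# Case 1: the reference is copied wholesale into a single file.
monolithic = {"copy.py": reference}

# Case 2: the same twelve lines are spread over four files, each padded
# with nine unrelated lines so that only 3 of its 12 lines are copied.
distributed = {
    f"part_{k}.py": "\n".join(
        [f"own_{k}_{j}()" for j in range(9)]
        + [f"step_{i}()" for i in range(3 * k, 3 * k + 3)]
    )
    for k in range(4)
}

print(flagged(monolithic, reference))
print(flagged(distributed, reference))
print(reference_coverage(distributed, reference))
```

Under these assumptions, each distributed file scores 25 percent and none is flagged, although every line of the reference appears somewhere in the suspect code. The aggregate figure is computed here only to make the gap visible; it is not offered as a detection method.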
A method for plagiarism detection that can address such a scenario is therefore needed.