Tools and algorithms have been developed over the last several decades to assist researchers in detecting software plagiarism. Typically, these tools and algorithms compare software source code to find signs of copying. Paul Clough summarizes the available tools and algorithms in his paper “Plagiarism in Natural and Programming Languages: An Overview of Current Tools and Technologies,” which discusses tools and algorithms for finding plagiarism in generic text documents as well as in programming language source code files.
A number of source code copy detection programs are currently available, including the Plague program, developed by Geoff Whale at the University of New South Wales; the YAP programs (YAP, YAP2, YAP3), developed by Michael Wise at the University of Sydney, Australia; the JPlag program, written by Lutz Prechelt and Guido Malpohl of the University of Karlsruhe and Michael Philippsen of the University of Erlangen-Nuremberg; and the Measure of Software Similarity (MOSS) program, developed at the University of California at Berkeley by Alex Aiken.
The most commercially successful program for source code copy detection is CodeMatch®, developed by Robert Zeidman, which is incorporated in the CodeSuite® program. The CodeSuite program also includes other tools for measuring and comparing software source code, including BitMatch®, CodeCLOC™, CodeCross®, CodeDiff®, and SourceDetective®.
Markup languages are data description languages used for “marking up” text documents by providing additional information about the text. The Hypertext Markup Language (“HTML”), for example, uses tags within a text document to describe the layout of the text when it is displayed as a web page. Unlike programming languages, markup languages consist of tags that contain embedded layout information and other information. For example, tags can contain information about graphics, links, forms, form objects, comments, and scripting language statements. Because the tags contain many different types of information, it would be beneficial to have a tool that extracts the information from the markup language tags into files that can then be compared to find copying. It would also be beneficial to put the HTML code into a format that is usable by standard software source code copy detection tools.
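By way of illustration only, the kind of extraction described above might be sketched as follows in Python, using the standard library's html.parser module. The class name TagExtractor, the function extract, and the category names (links, graphics, forms, comments, scripts) are hypothetical choices made for this sketch and do not describe any existing tool; a practical tool would likely handle many more tag types.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagExtractor(HTMLParser):
    """Sort information found in HTML tags into categories.

    The category names are assumptions made for this sketch.
    """

    def __init__(self):
        super().__init__()
        self.categories = defaultdict(list)
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.categories["links"].append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.categories["graphics"].append(attrs["src"])
        elif tag in ("form", "input", "select", "textarea"):
            # Keep the raw tag text so form objects can be compared as written.
            self.categories["forms"].append(self.get_starttag_text())
        elif tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_comment(self, data):
        self.categories["comments"].append(data.strip())

    def handle_data(self, data):
        # Only text inside <script> elements is collected here.
        if self._in_script and data.strip():
            self.categories["scripts"].append(data.strip())


def extract(html_text):
    """Return a dict mapping each category name to a list of extracted items."""
    parser = TagExtractor()
    parser.feed(html_text)
    parser.close()
    return dict(parser.categories)
```

Each resulting list could then be written to its own text file, one item per line, so that the corresponding files from two web pages can be compared by a conventional text or source code comparison tool.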