Software theft has been, and continues to be, pervasive. Individuals and companies try various techniques to combat software theft, including requiring a unique software key to install software, requiring online activation of software, requiring an active online connection to use software, encrypting software, and the like. Although these techniques typically prevent casual users from installing unauthorized copies, they can typically be overcome by sophisticated users.
Another way to combat software theft is to try to identify the source of the stolen software using watermarks. This involves applying a unique watermark to each copy of the software so that when a stolen piece of software is found, the watermark in the stolen software will correspond to one of the unique watermarks in the authorized software. This requires modification of the computer code, which is undesirable. Further, this technique can be overcome by removing the watermark from the stolen software or by removing the watermark from the authorized software so that all further copies do not contain the unique watermark.
Software is typically written in a particular source code language and then converted (i.e., compiled) into compiled code prior to distribution. The conversion into compiled code is typically hardware and/or software specific. For example, a set of source code can be converted into one set of compiled code for computers running Microsoft Windows and into another set of compiled code for computers running a Linux-based operating system. In addition to allowing the execution of the code on particular hardware/software configurations, compiled code keeps the source code from being available to end users because the compiled code cannot easily be converted back to the original source code.
The conversion from source code into compiled code for a particular hardware/software configuration is performed using a compiler. A single compiler can convert a set of source code into compiled code for different hardware/software configurations, or different compilers can be used for the different hardware/software configurations. Regardless, two sets of compiled code based on the same source code will have the same general functionality. However, the actual instructions for achieving this functionality will be different for the two sets of compiled code. Accordingly, it is not possible to detect copied source code by comparing the source code to the compiled code. Similarly, detecting copied source code that has been compiled for different hardware/software configurations typically requires the source code to be compiled for each different hardware/software configuration and then compared.
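The point that one set of source code yields different instructions under different compilation settings can be illustrated, by analogy only, with the following minimal Python sketch. It is an assumption of this illustration that Python bytecode compiled at two optimization levels stands in for compiled code targeting two hardware/software configurations; the source string SRC and the helper compile_variant are hypothetical names introduced here.

```python
import types

# Hypothetical source code used only for this illustration.
SRC = "def f(x):\n    assert x > 0\n    return x * 2\n"


def compile_variant(optimize: int):
    """Compile SRC at the given optimization level.

    Returns the raw bytecode of f and the callable f itself.
    Level 0 keeps assert statements; level 1 strips them.
    """
    module_code = compile(SRC, "<example>", "exec", optimize=optimize)
    # The code object for f is stored among the module's constants.
    fn_code = next(c for c in module_code.co_consts
                   if isinstance(c, types.CodeType))
    namespace = {}
    exec(module_code, namespace)
    return fn_code.co_code, namespace["f"]


bytes_a, f_a = compile_variant(0)
bytes_b, f_b = compile_variant(1)

# Same general functionality for a valid input ...
same_behavior = f_a(3) == f_b(3)
# ... but the instruction sequences are not byte-for-byte identical.
same_instructions = bytes_a == bytes_b
```

In this sketch both variants return the same result for a valid input while their instruction bytes differ, which is why a direct byte comparison of two builds of the same source is not a reliable test for copying.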
Typical solutions for detecting copied source code are resource intensive (i.e., they require substantial processing and memory resources), and thus these solutions are typically implemented as pairwise comparisons (i.e., one set of compiled code is compared with another set of compiled code). The pairwise comparisons typically involve structural or syntactical representations of the compiled code. Such comparisons can fail to detect code that has been copied and pasted into other source code because embedding the copied code in this way results in larger structural changes to the final compiled code.
Typical solutions for comparing a set of compiled code against a number of different sets of compiled code rely on heuristics to reduce the number of candidate sets for consideration and then rely upon a pairwise comparison across the candidate sets. If the heuristics are not properly designed, the use of heuristics to reduce the number of sets of compiled code for comparison can result in omission of sets of compiled code that actually contain copied code. Further, this approach has failed to scale as the number of different sets of compiled code for comparison increases.
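The heuristic-then-pairwise approach described above, and the way a poorly designed heuristic can omit a set of compiled code that actually contains copied code, can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any particular existing tool: compiled code is modeled as raw bytes, the pairwise comparison is a simple n-gram containment score, and the prefilter is a file-size ratio. All names (ngrams, containment, size_heuristic, search, QUERY, CORPUS) and the thresholds are hypothetical.

```python
def ngrams(code: bytes, n: int = 4) -> set:
    """Return the set of all n-byte windows in the code."""
    return {code[i:i + n] for i in range(len(code) - n + 1)}


def containment(query: bytes, candidate: bytes, n: int = 4) -> float:
    """Fraction of the query's n-grams that also occur in the candidate."""
    q = ngrams(query, n)
    if not q:
        return 0.0
    return len(q & ngrams(candidate, n)) / len(q)


def size_heuristic(query: bytes, candidate: bytes, ratio: float = 2.0) -> bool:
    """Assumed prefilter: keep only candidates of roughly similar size."""
    small, large = sorted((len(query), len(candidate)))
    return small > 0 and large <= ratio * small


def search(query: bytes, corpus: dict, threshold: float = 0.8,
           use_heuristic: bool = True) -> list:
    """Prefilter the corpus, then compare pairwise against each survivor."""
    hits = []
    for name, code in corpus.items():
        if use_heuristic and not size_heuristic(query, code):
            continue  # candidate is dropped before any pairwise comparison
        if containment(query, code) >= threshold:
            hits.append(name)
    return sorted(hits)


# Toy data: a 32-byte "compiled" routine and three corpus entries.
QUERY = bytes(range(32))
PADDING = bytes([200]) * 300
CORPUS = {
    "same_size_copy": QUERY,                      # identical copy
    "embedded_copy": PADDING + QUERY + PADDING,   # copy inside a larger binary
    "unrelated": bytes(range(100, 132)),          # no shared n-grams
}

with_heuristic = search(QUERY, CORPUS)
without_heuristic = search(QUERY, CORPUS, use_heuristic=False)
```

With the size prefilter enabled, only the same-size copy is reported; the larger binary that embeds the copied routine is discarded before it is ever compared. Disabling the prefilter recovers it, but at the cost of one pairwise comparison per corpus entry, which is the scaling problem noted above.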