A. Field of the Invention
Implementations consistent with the principles of the invention relate generally to similarity detection and, more particularly, to comparing sets of documents to find similarities or differences.
B. Description of Related Art
There are a number of situations in which it may be desirable to determine whether documents, or sets of documents, are similar to one another. One particular instance of this situation occurs in software engineering. Consider the situation in which there are two bases of source program code (“codebases”) that each define a different version of the same program. Each codebase may include, for example, a number of source code files. In some situations, the files may number in the hundreds or thousands. It may be desirable to determine what the differences are between the two codebases.
One existing technique for monitoring differences between two sets of documents, such as two sets of files, is to keep an explicit history of changes in the files. That is, each time a file is modified, the changes are logged and stored. Keeping an explicit history of file differences, however, is not always possible, or may simply not have been done when the files were being modified.
Another existing technique for monitoring differences between two sets of files is to compare the files to obtain a list of differences between the files. Software for comparing files is well known. Comparing two files to obtain a list of their differences, however, generally only works well if the filenames and directory structures of the two file sets are similar, so that corresponding files can be paired, or if the number of files is small enough to examine manually.
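By way of illustration only, the following minimal sketch shows the kind of path-based file comparison described above and where it falls short. It assumes each codebase is represented as a mapping from relative file path to file contents. The function name compare_by_path and the sample paths are hypothetical and chosen for this example; the sketch uses Python's standard difflib module and is not drawn from any particular comparison tool.

```python
import difflib


def compare_by_path(old_files, new_files):
    """Diff files that share a relative path; report the rest as unmatched.

    old_files / new_files: dict mapping relative path -> file text
    (a hypothetical in-memory stand-in for two codebases).
    """
    diffs = {}
    for path in sorted(old_files.keys() & new_files.keys()):
        delta = list(
            difflib.unified_diff(
                old_files[path].splitlines(),
                new_files[path].splitlines(),
                fromfile="old/" + path,
                tofile="new/" + path,
                lineterm="",
            )
        )
        if delta:  # unified_diff yields nothing for identical inputs
            diffs[path] = delta
    # Files present on only one side cannot be paired by path alone.
    only_old = sorted(old_files.keys() - new_files.keys())
    only_new = sorted(new_files.keys() - old_files.keys())
    return diffs, only_old, only_new


old = {"src/a.c": "int x;\n", "src/util.c": "void f(){}\n"}
new = {"src/a.c": "int y;\n", "lib/helpers.c": "void f(){}\n"}
diffs, only_old, only_new = compare_by_path(old, new)
```

In this example, src/util.c and lib/helpers.c have identical contents, yet the comparison reports one as removed and the other as added, because the files are paired by path only. This is the limitation noted above when the filenames or directory structures of the two file sets differ.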
When relatively large codebases are being examined without an explicit history of changes, however, existing techniques may be unable to effectively determine the differences between the codebases.