1. Field of the Invention
Embodiments of the invention relate to managing files. More specifically, the field of the invention relates to detecting duplicate documents using classification.
2. Description of the Related Art
Many applications today manage files. For example, file systems, web sites, and content repositories are often used to manage files. The files may include documents that are exact duplicates of one another. The files may also include documents that, while not exact copies, are near duplicates of one another. When searching or managing files, it may be useful to identify duplicates and near duplicates. When searching, it may be desirable to collapse a set of duplicates into a single result in a search results display. When managing content, it may be desirable to identify and eliminate duplicates from search results or from storage systems.
Some systems identify duplicates using metadata. For example, some systems may use metadata such as document title, document size, and document creation date (or some combination thereof) to identify duplicates. Other systems identify duplicates using hash algorithms. For example, some systems may use hash algorithms (e.g., Message-Digest algorithm 5 (MD5) or Secure Hash Algorithm (SHA)) to generate signatures of documents. The generated signatures may then be used to identify duplicates. Of course, when using a hash algorithm such as MD5 or SHA-1, even a single-bit difference in the binary representation of a document will result in a different hash value for that document. Thus, hash algorithms, while effective for identifying exact duplicates, are ineffective for identifying whether two documents are near duplicates of one another.
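By way of illustration only, the hash-based approach described above may be sketched in Python as follows. The sketch is a minimal example under stated assumptions and is not the approach of any particular system: the function names `signature` and `group_exact_duplicates`, the sample file names, and the sample document contents are hypothetical and chosen solely for this illustration. It shows that identical documents share a signature, whereas a document differing by a single bit receives a different signature and is therefore not detected as a near duplicate.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def signature(data: bytes, algorithm: str = "sha1") -> str:
    """Return the hex digest of a document's bytes (hypothetical helper)."""
    return hashlib.new(algorithm, data).hexdigest()


def group_exact_duplicates(documents: Dict[str, bytes]) -> List[List[str]]:
    """Group document names whose signatures are identical (hypothetical helper)."""
    groups = defaultdict(list)
    for name, data in documents.items():
        groups[signature(data)].append(name)
    # Only groups holding more than one document represent duplicates.
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Sample content assumed for illustration.
original = b"The quick brown fox jumps over the lazy dog."
# Flip the lowest bit of the final byte to create a single-bit difference.
one_bit_off = original[:-1] + bytes([original[-1] ^ 0x01])

docs = {
    "a.txt": original,
    "b.txt": original,      # exact duplicate of a.txt
    "c.txt": one_bit_off,   # near duplicate of a.txt
}

duplicate_groups = group_exact_duplicates(docs)
signatures_differ = signature(original) != signature(one_bit_off)
```

In this sketch, `duplicate_groups` contains only the pair `a.txt` and `b.txt`; `c.txt` is excluded even though it differs from the others by one bit, which reflects the limitation noted above.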