The present disclosure relates generally to automated document analysis and, in particular, to the identification and clustering of near-duplicate documents.
With the proliferation of computing devices and communication networks such as the Internet, an ever-increasing amount of information is stored in the form of electronic documents. Such documents might be generated using application software such as word processing programs, e-mail programs, and web page development tools. Electronic documents can also be generated by scanning paper documents and employing optical character recognition (“OCR”) or other techniques to create an electronic representation of the content.
It is often necessary to search through a large collection of electronic documents to find information relevant to a particular question. For example, a number of search services provide interfaces via which users can search electronic documents that are accessible via the World Wide Web. In another context, discovery in civil litigation usually involves the production of massive quantities of electronic documents that the producing and receiving parties must sift through.
Often, a large collection of documents will include multiple documents that are near-duplicates of each other. For example, in the context of electronic document discovery, a party may produce multiple drafts of a contract whose terms were being negotiated. The drafts will often be largely identical in content, but the wording in sections under discussion will vary from one draft to the next. As another example, multiple e-mail messages from the same discussion thread (including, e.g., replies and/or forwarded e-mails) may be identical except for the addition of a few words and changes in the message headers from one message to the next. As a further example, in the context of the World Wide Web, several pages on different sites may copy the same content from a single source (e.g., a public-domain source), and the pages may differ only in ancillary features such as layout, titles, and lists of related links.
Identifying near-duplicates of a document can be useful for a number of purposes. For example, in litigation, the electronic documents being produced often must be reviewed by human reviewers. Having the same reviewer handle a group of near-duplicate documents together improves the likelihood that the documents will be handled consistently. In addition, at times, reviewing each of the near-duplicates can yield interesting and potentially valuable information, such as the history of a contract negotiation. As another example, when a user is searching for a particular document, a single document from a group of near-duplicates can be used as representative of the group.