The present disclosure relates generally to automated document analysis and, in particular, to processing of documents based on composite locality-sensitive hashes.
With the proliferation of computing devices and communication networks such as the Internet, an ever-increasing amount of information is stored in the form of electronic documents. Such documents might be generated using application software such as word processing programs, e-mail programs, web page development tools, etc. Electronic documents can also be generated by scanning paper documents and employing optical character recognition (“OCR”) or other techniques to create an electronic representation of the content.
It is often necessary to search through a large collection of electronic documents to find information relevant to a particular question. For example, a number of search services provide interfaces via which users can search electronic documents that are accessible via the World Wide Web. In another context, discovery in civil litigation usually involves the production of massive quantities of electronic documents that the producing and receiving parties must sift through.
To facilitate review of a large corpus of documents, a number of analysis techniques have been developed that automatically determine properties of each document, e.g., by analyzing the patterns of occurrence of words. For example, semantic clustering attempts to group documents pertaining to the same topic, generally based on identifying words or combinations of words that tend to occur in documents within the cluster but not in documents outside the cluster. Automated language identification attempts to determine, e.g., from character sets or character sequences, what language a particular document is in.
Often, a large collection of documents will include multiple documents that are very similar or even identical to each other. For example, in the context of electronic document discovery, a party may produce multiple drafts of a contract whose terms were being negotiated. The drafts will often be largely identical in content, but the wording in sections under discussion will vary from one draft to the next. As another example, multiple e-mail messages from the same discussion thread (including, e.g., replies and/or forwarded e-mails) may be identical except for the addition of a few words and changes in the message headers from one message to the next. As yet another example, in the context of the World Wide Web, several pages on different sites may copy the same content from a single source (e.g., a public-domain source), and the pages may differ only in ancillary features such as layout, titles, lists of related links, etc. Reviewers can spend a considerable amount of time analyzing such similar or identical documents.