Almost all organizations now store a substantial amount of their information, including sensitive information that may contain intellectual property, as electronic files in a variety of formats. There are many reasons for this trend, including the low cost and widespread availability of computers, the ever-decreasing cost of electronic and magnetic storage media, access control, and the relative ease with which archival backups of information may be maintained.
One strong motivation for storing data electronically is the ease with which one can then efficiently query large quantities of files for specific information. Several algorithmic approaches have been proposed to address this problem. One widely known technique is limited to textual content and is most commonly used in Web-based search engines. In this approach, a user types a word or a set of words into a search engine, and the search engine then processes a pre-indexed image of a huge data collection to fetch documents that contain the word or words specified in the search criteria.
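The pre-indexed keyword lookup described above can be illustrated with a minimal sketch of an inverted index. The function names (`tokenize`, `build_index`, `search_all`), the sample documents, and the choice to require every query word are assumptions made for illustration; they are not taken from any particular search engine.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search_all(index, query):
    """Return ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample collection, indexed once before any query is run.
documents = {
    "ad1": "Boston truck dealer announces a sale this weekend",
    "ad2": "Truck dealer in Chicago opens a new showroom",
    "ad3": "Boston bakery holds a bread sale",
}
index = build_index(documents)
print(sorted(search_all(index, "dealer truck Boston sale")))  # ['ad1']
```

The index is built once, ahead of time, so each query only intersects a few small sets instead of rescanning every document.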
A refinement of this approach enables the user to input the information in a more user-friendly, human language form (as opposed to a set of words or word combinations linked with Boolean-style operators, e.g., “dealer AND truck AND Boston AND sale”). These so-called “natural language” interfaces permit a user to input a query such as “Which truck dealer in Boston area is currently advertising a sale?”. Other techniques, such as image pattern recognition and mathematical correlation, can be used for finding information in non-textual data collections, such as pictures (e.g., to determine whether a person whose face is captured by a security camera appears in a database of known criminals).
As technology has evolved, and as hardware has become more available and affordable, computer users have gained the ability (and often the preference) to keep multiple copies of the same document. Such copies often differ only by a small number of edits: text appended, removed or rearranged; images cropped; one document split into two, or a few documents merged. A document might also be converted to a different format, e.g., a text file with typesetting instructions can be converted into a print-ready form. These multiple copies of the same or a very similar document might be kept on the same computer. However, they may also be distributed among many computers connected to a local area network or wide area network, thus residing in different departments, or even in multiple locations that are physically many thousands of miles apart.
The ease with which multiple copies of the same document may be created, however, causes certain problems. Among these concerns are:
data security: the more copies of a document there are, the harder it is to control access to its content;
document classification: copies of similar documents may need to be processed in the same way, and it is desirable to be able to do this automatically, without user intervention;
genealogy: identifying the history of how a particular document evolved;
forensics: identifying who may have tampered with a document;
regulatory compliance: certain laws and rules in the health and financial industries now require that documents be access controlled and/or automatically destroyed after certain time periods.
Existing data mining algorithms are not efficient, accurate or scalable enough to calculate similarity between documents and reconstruct document distribution paths.
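To make "similarity between documents" concrete, one common baseline is the Jaccard similarity of overlapping word sequences (shingles). The sketch below is only an illustration of that general technique and is not presented as the approach this document describes; the function names, the shingle length `k=3`, and the sample strings are all assumptions.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word sequences that occur in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Very short texts yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(text_a, text_b, k=3):
    """Jaccard similarity of two texts' shingle sets, between 0.0 and 1.0."""
    set_a, set_b = shingles(text_a, k), shingles(text_b, k)
    if not set_a and not set_b:
        return 1.0  # Two empty texts are treated as identical.
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical example: a document and a copy with text appended.
original = "the quarterly report shows strong sales in the northeast region"
edited = original + " and the west"
unrelated = "minutes of the facilities committee meeting held on monday"

print(jaccard(original, edited) > jaccard(original, unrelated))  # True
```

A pairwise measure of this kind must compare every document against every other, which hints at why scalability becomes a concern for large, distributed collections.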