In many business situations, multiple versions of one or more documents are commonly created. Some businesses use tools such as Document Management Systems (DMS) or other content repositories to track and store each version of a document that is created. Even when such systems are in use, versions tend to be created or stored in locations outside the DMS when copies of the document are sent by email, received from third-party contributors, copied for offline editing, and so on. This problem is most acute for document formats that encourage editing (such as Microsoft™ Office™ documents) as opposed to formats that are largely used to present a final copy (such as Adobe™ PDF documents).
The problem facing a document author or collaborator is often this: having received or found a new version of a document, how do they decide what to do with it? Was the version that arrived in an email message created by editing the most recent version stored in the DMS? Was it created by editing an older version of the document? Is it just a duplicate of some other version of the document? Different actions are required depending on the answers to these questions. In the first case, where the document was created by editing the latest DMS version, it is likely to be enough simply to save the received version as a new version in the DMS. In the second case, the changes made to the received version probably need to be merged into the latest DMS version. In the last case, no action at all may be required.
In these circumstances, a software tool capable of automatically determining the genealogical relationships between document versions would be of great value: it would give the document author or collaborator the information needed to decide properly what action to take when new versions of a document are located or received. To be useful in the situations described above, the tool must be capable of determining genealogical relationships from the content of the documents alone. Other meta-information, such as DMS version information, file names, and file timestamps, may be absent or modified in some or all versions located outside the DMS. For instance, copied files may have altered names or timestamps, and files sent via email may have lost their original timestamp.
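As a purely illustrative sketch, and not the method of any particular tool, the three cases described earlier (duplicate, edit of the latest DMS version, edit of an older version) can be told apart using nothing but document content. The example below assumes plain-text content and uses word-level similarity from Python's standard `difflib` module as a rough proxy for ancestry. The names `classify_received`, `similarity`, and `SUGGESTED_ACTION` are hypothetical and introduced only for illustration.

```python
from difflib import SequenceMatcher
from typing import Sequence

# Hypothetical mapping from the detected relationship to a suggested action.
SUGGESTED_ACTION = {
    "duplicate": "no action required",
    "edited_latest": "save as a new version in the DMS",
    "edited_older": "merge changes into the latest DMS version",
}


def similarity(a: str, b: str) -> float:
    """Word-level similarity in [0, 1]; an assumed stand-in for real ancestry analysis."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def classify_received(dms_versions: Sequence[str], received: str) -> str:
    """Classify a received document against DMS versions ordered oldest to newest.

    Only document content is used; no file names, timestamps, or version metadata.
    """
    if not dms_versions:
        raise ValueError("at least one DMS version is required")

    # Case 3: the received text matches an existing version word for word.
    received_words = received.split()
    if any(received_words == v.split() for v in dms_versions):
        return "duplicate"

    # Otherwise assume the most similar stored version is the likely parent.
    # Ties are broken in favour of the newer version.
    scores = [similarity(v, received) for v in dms_versions]
    parent = max(range(len(scores)), key=lambda i: (scores[i], i))

    # Case 1 if the likely parent is the latest version, case 2 otherwise.
    return "edited_latest" if parent == len(dms_versions) - 1 else "edited_older"
```

Calling `classify_received(versions, received)` and looking the result up in `SUGGESTED_ACTION` yields the suggested action. Similarity alone is a heuristic: it can misjudge the parent when versions differ only slightly or when edits are extensive, and real office document formats would first need their text extracted.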
A tool capable of determining document genealogy from content alone would also be useful in document forensics, where large collections of documents and document versions have been gathered and investigators wish to piece together the history of the document or documents involved.