1. Field
This disclosure is generally related to the analysis of document similarities. More specifically, this disclosure is related to measuring document similarity by inferring the evolution of documents through the reuse of passage sequences.
2. Related Art
Modern workers often deal with large numbers of documents; some are self-authored, some are received from colleagues via email, and some are downloaded from websites. Many of these documents are related to one another, because a user may modify an existing document to generate a new one. For example, a worker may generate an annual report by combining a number of previously generated monthly reports. In a further example, a presenter at a meeting may reuse slides modified from an earlier presentation given at a different meeting.
Conventional methods for identifying similarities between documents include calculating the Levenshtein distance (or edit distance) between strings within the documents, or using string alignment algorithms, such as the Smith-Waterman algorithm, to perform sequence alignment on strings within the documents. However, such approaches do not consider the possible operations performed by a user when generating a new document from existing documents.
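For illustration only, the two conventional measures named above can be sketched as follows. This is a minimal sketch of the textbook algorithms, not an implementation taken from this disclosure; the function names and the scoring parameters (`match`, `mismatch`, `gap`) are assumptions chosen for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1,
                         gap: int = -1) -> int:
    """Best local-alignment score between a and b (Smith-Waterman
    with a linear gap penalty; the scoring values are illustrative)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Scores are floored at 0 so an alignment can start anywhere.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Hypothetical documents in which two passages were simply reordered.
doc_a = "Monthly totals. Summary of findings."
doc_b = "Summary of findings. Monthly totals."
print(levenshtein(doc_a, doc_b))
print(smith_waterman_score(doc_a, doc_b))
```

Both functions compare characters position by position. Neither has a notion of a user-level operation such as moving or copying a whole passage, so a reordering like the one in the hypothetical documents above is charged as many character-level edits.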