The present application describes systems and techniques relating to token stream differencing, for example, the comparison of text documents to identify changes between them.
Various techniques exist for comparing token streams. Such a comparison is commonly referred to as differencing or as a diff operation. Differencing typically involves comparing two versions of a token stream, commonly referred to as the original stream and the modified stream, and identifying the differences between them. In the context of text comparison, many differencing processes use individual text characters or words as the tokens. Such diff processes are generally effective at detecting additions to and deletions from a document. A typical output of a diff operation is an edit script, which is a sequential list of the modifications needed to convert the original stream into the modified stream.
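By way of illustration only, the sequential list of modifications described above can be sketched with word tokens and the Python standard library's difflib.SequenceMatcher. The function names edit_script and apply_script, and the (operation, count, tokens) step layout, are assumptions made for this sketch and are not taken from any particular differencing product.

```python
from difflib import SequenceMatcher


def edit_script(original_tokens, modified_tokens):
    """Build a sequential list of (op, count, tokens) steps.

    Hypothetical step layout used only for this sketch:
      ("keep", n, [])      copy the next n original tokens unchanged
      ("delete", n, [])    skip the next n original tokens
      ("insert", 0, toks)  emit toks, which come from the modified stream
    """
    matcher = SequenceMatcher(a=original_tokens, b=modified_tokens, autojunk=False)
    script = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            script.append(("keep", i2 - i1, []))
        if tag in ("delete", "replace"):
            script.append(("delete", i2 - i1, []))
        if tag in ("insert", "replace"):
            script.append(("insert", 0, modified_tokens[j1:j2]))
    return script


def apply_script(original_tokens, script):
    """Replay the steps against the original stream to rebuild the modified stream."""
    out = []
    pos = 0  # read position in the original stream
    for op, count, tokens in script:
        if op == "keep":
            out.extend(original_tokens[pos:pos + count])
            pos += count
        elif op == "delete":
            pos += count
        else:  # insert
            out.extend(tokens)
    return out


original = "the quick brown fox".split()
modified = "the slow brown fox jumps".split()
for step in edit_script(original, modified):
    print(step)
```

Replaying the steps with apply_script against the original tokens yields the modified tokens, which is the defining property of an edit script.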
Some existing differencing techniques, such as that employed by WinDiff, provided by Microsoft Corporation of Redmond, Wash., also support moved-text detection. Typically, these techniques operate at levels of token granularity above the word level; some use tokens that represent entire lines of text in a document. Such whole-line techniques work well with documents containing software source code because the lines of text in such documents are usually unique, and because changes to such documents rarely cause text reflows (e.g., adding a word that pushes some words onto the next line, which in turn pushes more words onto the following line, and so on in a domino effect).
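A minimal sketch of whole-line moved-text detection follows, assuming that every line in each document is unique. It is not the method used by WinDiff or any other product; the function name classify_lines and the rule "a line that is both deleted and inserted has moved" are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def classify_lines(original, modified):
    """Classify changed lines as moved, deleted, or inserted.

    Assumption for this sketch: lines are unique within each document,
    so a line reported as both deleted and inserted is treated as moved.
    """
    a = original.splitlines()
    b = modified.splitlines()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)

    deleted, inserted = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted.extend(a[i1:i2])
        if tag in ("insert", "replace"):
            inserted.extend(b[j1:j2])

    moved = [line for line in deleted if line in inserted]
    return {
        "moved": moved,
        "deleted": [line for line in deleted if line not in moved],
        "inserted": [line for line in inserted if line not in moved],
    }


original = "a = 1\nb = 2\nc = 3\nd = 4"
modified = "b = 2\nc = 3\nd = 4\na = 1\ne = 5"
print(classify_lines(original, modified))
```

The sketch depends on line uniqueness: if a document is reflowed, most lines change and few would be recognized as moved, which is consistent with the observation that whole-line tokens suit source code better than flowing text.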
Differencing techniques that operate at word-level token granularity, such as the technique employed by Acrobat® 5.0, provided by Adobe Systems Incorporated of San Jose, Calif., frequently misidentify moved blocks of text as additions and deletions. Moreover, when such techniques do identify moved blocks, the displayed results can be confusing: small additions and/or deletions within a moved block of text can create a checker-boarding effect in which moved and unmoved words are interleaved, thus cluttering the results report.
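The misidentification described above can be reproduced with a generic word-level diff that has no notion of moves. The sketch below uses Python's difflib.SequenceMatcher as a stand-in; it is not the algorithm used by Acrobat or any other product, and the function name word_diff is an assumption made for illustration.

```python
from difflib import SequenceMatcher


def word_diff(original, modified):
    """Report changes between two texts using word tokens and no move detection."""
    a = original.split()
    b = modified.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)

    report = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            report.append(("deleted", a[i1:i2]))
        if tag in ("insert", "replace"):
            report.append(("added", b[j1:j2]))
    return report


# The second half of the original is moved to the front; nothing is rewritten.
original = "alpha beta gamma delta epsilon zeta"
modified = "delta epsilon zeta alpha beta gamma"
for kind, words in word_diff(original, modified):
    print(kind, words)
```

Although no word was added to or removed from the document, the report contains one addition and one deletion covering the same words, because the comparison can keep only one of the two blocks aligned.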