When a text is written and edited by multiple authors, it is of vital interest to know the authorship of specific text passages in the final version of the text. On the one hand, authors have an interest in their specific contributions remaining identifiable, independently of the contributions or modifications made by other authors. Tracking this information makes it possible to estimate the value and extent of the contributions of different authors and to credit the authors accordingly. On the other hand, the exact provenance and authorship of specific text passages, or even specific words, is important information for readers, as it supports decisions and influences a reader's evaluation of a particular text.
An example illustrating these aspects is the well-known Wikipedia online encyclopedia, which enables virtually anyone to contribute to a text. The success of this multi-author model has been tremendous. However, for the reader of a Wikipedia text it is difficult, and often practically impossible, to reconstruct which author contributed which specific text passages or words. In particular, no conventional mechanisms are available to directly track and identify the authorship of a specific text fragment in a text from Wikipedia or from other collaborative text editing systems. Conventional text comparison mechanisms are incapable of integrating hundreds or even thousands of text versions in a consistent way. Potential authors may thus be less motivated to contribute to a collaborative text, since the authors cannot be credited according to their contributions. Furthermore, readers might refrain from using collaborative text resources because of the uncertainty about the origin of specific content.
Another example is the writing or editing of a text using a word processing application. Current state-of-the-art applications, such as the well-known Microsoft Word™, provide so-called track changes mechanisms, which record every atomic editing step made by the editing author. Although track changes mechanisms are popular and widely used, they are technically insufficient to directly identify which author first contributed a specific text passage to a text. Track changes mechanisms are insufficient, for example, if an author creates a text passage in an application other than the word processing application and then simply copies the text passage into the text in the word processing application. Using conventional track changes mechanisms, it is impossible to determine whether the inserted text passage was already present, in this or a modified form, in an earlier version of the text. In addition to this fundamental technical insufficiency, which will be set forth in more detail below with reference to FIG. 3, track changes mechanisms impair the readability of texts. For this reason, track changes are typically suppressed in the final version of a text.
Additionally, other methodologies exist for analyzing text to detect changes. In one example, word counts before and after an editing process are compared. Although such term frequencies can be used to identify new terms and deleted terms, they are insufficient to identify the position of insertions or deletions. Sophisticated edit distance measures (e.g., Levenshtein distance, Hamming distance, etc.) exhibit similar limitations, because they neglect the sequential flow and organization of natural language.
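The limitation of term frequencies noted above can be illustrated with a minimal sketch. The function name term_frequency_diff, the whitespace tokenization, and the example sentences are assumptions chosen purely for illustration and are not taken from any particular related-art system. The sketch shows that two edits inserting the same word at different positions may yield identical term-frequency differences, so that the position of the insertion cannot be recovered from the counts alone.

```python
from collections import Counter


def term_frequency_diff(before: str, after: str):
    """Return (added, removed) term counts between two text versions.

    Hypothetical helper for illustration: tokenizes on whitespace only.
    """
    old_counts = Counter(before.lower().split())
    new_counts = Counter(after.lower().split())
    # Counter subtraction keeps only terms with a positive remaining count.
    added = new_counts - old_counts
    removed = old_counts - new_counts
    return added, removed


original = "the cat sat on the mat"
edit_a = "the black cat sat on the mat"  # word inserted near the start
edit_b = "the cat sat on the black mat"  # same word inserted near the end

diff_a = term_frequency_diff(original, edit_a)
diff_b = term_frequency_diff(original, edit_b)

# Both edits report the same added term and no removed terms, although the
# edited texts differ: the counts carry no positional information.
print(diff_a == diff_b)
print(edit_a == edit_b)
```

In this sketch both differences consist of the single added term "black", even though the two edited versions are distinct texts.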
In the field of machine-based translation, the concept of alignment has been employed to identify equivalent lines of text in different languages. However, these methodologies are employed to assist in translation and provide no insight into how alignment might be employed to track the authorship of content in data.
Accordingly, various embodiments of the invention address one or more of the above-identified problems of the related art.