The problem addressed by this application is the identification in a document of user-significant phrases, as indicated by repetition of particular sequences of words. Typically, in documents created by precision-oriented users of various sorts, certain groups of words are used to convey a particular idea. Each time the user desires to express that idea, the user prefers to utilize the same phrasing over others in order to avoid confusion of meaning. In order to determine whether a significant phrase has already been used in the document, the user must review the document to extract the relevant sequence of words. Once a phrase has been extracted, the user may refer to it on a continual basis to ensure that, whenever that user desires to express a similar idea, a form of the extracted sequence of words is used.
Problems related to, and auxiliary to, the problem of extracting user-created standard phrases include: the identification of significant sequences of words nested within otherwise significant sequences of words; and establishing equivalence of nearly identical sequences of words where the only difference among the sequences relates to certain known, structural elements. Problems related to, and auxiliary to, the problem of extracting user-created phrases that are substantially similar to user-created standard phrases include: the computation of the phrase(s) that transform the substantially similar phrase into the user-created standard phrase or that transform the user-created standard phrase into the substantially similar phrase, standardizing the discrepancies of the two phrases while retaining the remainder of the attributes and content of the conformed phrase.
The task of standardizing phrasing, as described above, is currently performed only manually. The human user conducts a time-consuming review of a document for significant phrases. This review is made in an attempt to detect the standard way of phrasing an idea, so that later expressions of that idea conform to the earlier phrasing.
Further, the human reviewer seeks to identify similar yet non-identical phrases in order to conform them. There is generally no explicit extraction and designation of standard phrases; these phrases are left within their contexts and simply used as the standards to which similar expressions must conform. Similarly, there is no explicit extraction and designation of phrases substantially similar to standard phrases. These phrases are also left within their contexts and are either conformed to the significant phrases to which they are substantially similar or are used as the master phrasing to which other similar phrases are standardized, including even the phrasing that constitutes the user standard phrasing.
The construction of a suffix tree for a given element is a method of representing every distinct sub-sequence of items constituting a suffix of that element. This representation is heavily utilized in dictionary storage and compression, which are used in, among other things, spelling checkers. This representation enables compressed storage of the element represented on the tree and is typically used on the level of character strings, not words. The subject invention uses, inter alia, aspects of a modified suffix tree representation. However, the suffix tree constructed for this application is based on stemmed words and abstracted known elements, not character strings. Word-level representation is significant for two reasons: First, words, and not individual characters, are the natural atomic units of phrases. Second, higher level word-based analysis is more efficient than lower level character-based analysis.
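By way of illustration only, the following Python sketch shows what word-level pre-processing of this kind might look like before any tree is built. The toy stemmer `crude_stem`, the placeholder token `<NUM>`, and the choice of numbers as the "known element" to abstract are all assumptions made for the example; the source does not specify a stemming method or a set of known elements.

```python
import re

# Hypothetical placeholder for an abstracted "known element" (assumption).
NUMBER = "<NUM>"

def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(sentence):
    """Lowercase, split into words, abstract numbers, stem the rest."""
    tokens = re.findall(r"[a-z]+|\d+(?:\.\d+)?", sentence.lower())
    return [NUMBER if t[0].isdigit() else crude_stem(t) for t in tokens]

def word_suffixes(tokens):
    """Every suffix of the token list, taken at word boundaries only."""
    return [tokens[i:] for i in range(len(tokens))]

a = normalize("The Borrower shall repay 500 dollars")
b = normalize("The Borrower shall repay 750 dollars")
```

In this sketch the two sentences normalize to the same token list, so they would be treated as one repeated sequence, and the sentence contributes six word-level suffixes instead of one suffix per character.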
In addition, the suffix tree is usually used for the applications of storage, compression, and searching. In the subject application, the tree is used not for document or phrase storage, but rather for phrase identification by establishing word sequences that satisfy the criteria for length and recurrence in the document. In more detail, each node of the tree is associated with a record of the number of occurrences of the word sequence at that node; any such word sequence of sufficient length, where the number of occurrences exceeds the required threshold, is preliminarily designated a phrase. Inclusion on the final phrase list follows the post-processing steps outlined below.
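By way of illustration only, the counting scheme described above can be sketched in Python as follows. The sketch uses a plain, uncompressed trie of word-level suffixes in place of a true suffix tree, and the names `Node`, `build_trie`, `candidate_phrases`, `min_len`, and `min_count` are hypothetical. Whether the threshold comparison is strict or inclusive is also an assumption; the sketch uses an inclusive comparison.

```python
class Node:
    """One trie node; count = occurrences of the word sequence ending here."""
    def __init__(self):
        self.children = {}
        self.count = 0

def build_trie(sentences):
    """Insert every word-level suffix of every sentence into one trie."""
    root = Node()
    for tokens in sentences:
        for start in range(len(tokens)):
            node = root
            for word in tokens[start:]:
                node = node.children.setdefault(word, Node())
                node.count += 1
    return root

def candidate_phrases(root, min_len=2, min_count=2):
    """Word sequences that are long enough and recur often enough."""
    found = {}
    stack = [(root, ())]
    while stack:
        node, path = stack.pop()
        for word, child in node.children.items():
            # Counts never increase going down, so rare branches are pruned.
            if child.count >= min_count:
                seq = path + (word,)
                if len(seq) >= min_len:
                    found[seq] = child.count
                stack.append((child, seq))
    return found

sentences = [
    ["net", "income", "of", "the", "borrower"],
    ["net", "income", "of", "the", "lender"],
    ["the", "borrower", "shall", "pay"],
]
found = candidate_phrases(build_trie(sentences))
```

Each occurrence of a word sequence is a prefix of exactly one inserted suffix, which is why the per-node counter equals the number of occurrences. The result is only a preliminary candidate list; no post-processing is attempted here.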
The tree also serves to signal the occurrence of nested phrases wholly within and at the beginning of a nesting phrase. These phrases may be located on the suffix tree at no extra cost to efficiency or complexity. Such prefix phrases may be standard phrases in their own right, but in order to be designated as such, they must be of sufficient length and must occur independently of the nesting phrase a certain number of times.
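By way of illustration only, one possible reading of "occurring independently of the nesting phrase" is that the prefix's occurrence count, less the count of the nesting phrase, must reach a threshold. The Python sketch below follows that reading, which is an assumption. A dictionary keyed by word tuples stands in for the per-node counts of the tree, and the names `sequence_counts`, `independent_prefixes`, `min_len`, and `min_independent` are hypothetical.

```python
from collections import Counter

def sequence_counts(sentences):
    """Occurrences of every contiguous word sequence (stand-in for node counts)."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens) + 1):
                counts[tuple(tokens[i:j])] += 1
    return counts

def independent_prefixes(counts, nesting, min_len=2, min_independent=2):
    """Prefixes of `nesting` that also occur often enough outside it."""
    nesting = tuple(nesting)
    result = {}
    for k in range(min_len, len(nesting)):
        prefix = nesting[:k]
        # Occurrences of the prefix that are not part of the nesting phrase.
        independent = counts[prefix] - counts[nesting]
        if independent >= min_independent:
            result[prefix] = independent
    return result

sentences = [
    ["security", "interest", "in", "the", "collateral"],
    ["a", "security", "interest", "in", "the", "collateral"],
    ["grant", "a", "security", "interest"],
    ["the", "security", "interest", "lapses"],
]
counts = sequence_counts(sentences)
nested = independent_prefixes(
    counts, ["security", "interest", "in", "the", "collateral"]
)
```

In a tree, each prefix corresponds to an ancestor of the nesting phrase's node, so both counts are available on the path already walked; the dictionary lookup here only mimics that.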
An algorithm for the construction of a word-based suffix tree has been published by Andersson, et al. (Andersson, A., Larsson, N. Jesper, and Swanson, K., "Suffix Tree on Words," Department of Computer Science, Lund University, Jul. 12, 1995.) Andersson, et al. neither relates to nor contains aspects of the subject invention, because Andersson does not relate at all to the overall process that is the subject of this application, namely standardizing document phrasing. Further, Andersson deals only with the construction of a word-level suffix tree; it does not relate at all to the process of standard phrase extraction. Further, Andersson constructs its word-based suffix tree on the level of the entire document and does not innovate the sentence suffix tree structure that gives the subject method its unique combination of efficiency and non-complexity. Further, Andersson does not attempt to pre-process the text at all through stemming and abstraction of known characters. Lastly, Andersson does not address the related problem of nested phrases or any resolution thereof.