1. Technical Field
This invention relates generally to the field of language processing systems and, more particularly, to a method and apparatus for reconstructing a token from a token fragment, a token being a string of characters or symbols having meaning in a language of a language processing system.
2. Description of the Relevant Art
The problem solved by the principles underlying the present invention has its roots in information theory. A message having information content may be corrupted by noise or other adverse influences on the tokens from which the message may be constructed. A particular token may be practically unrecognizable once subjected to character or symbol deletion, addition, substitution, or transposition. The token may even be intentionally abbreviated or truncated to a fragment that is no longer recognizable.
Two characteristics of the problem are that:
(1) between the point at which a message originates and the point at which the message is received and analyzed, tokens may be corrupted into fragments that are typically shorter than the original tokens by a variety of causes, some of which are: electromechanical corruption caused by transmission system noise, truncation, or dropped bits; human corruption caused by mistyping, misspelling, or the use of arbitrary abbreviations; and
(2) messages of a given language follow syntactic conventions and use constrained vocabularies, i.e., the number of tokens that may be used at any given point within a message is limited. Command languages for computer systems are one example of language systems that possess these characteristics. Formatted military messages are another example.
The analysis at a receiver must remove corruption by reconstructing any corrupted tokens and by substituting the reconstructed tokens for the corrupted tokens in the message. This differs from the problem addressed by spelling checkers in that a single reconstructed token must be output, not a number of choices. In other words, a spelling checker need only test for membership of a string within a set of correctly-spelled strings. The present problem also differs from the problem addressed by spelling correction mechanisms in its assumptions: a spelling correction mechanism assumes the input string is complete, with or without spelling or typographic errors, while this invention makes no assumptions concerning the correctness or completeness of the input string. (For example, because such mechanisms assume completeness, they cannot reliably reconstruct tokens from arbitrary abbreviations formed by random character deletions.)
The present invention also addresses certain limitations of the current art of command language processing for real-time control systems. To ensure recognition of input tokens that are true elements of a command statement, current command language processors must do at least one of the following:
(1) force entry of complete, totally correct, and unambiguous tokens;
(2) limit commands to one or two keystrokes that, to avoid ambiguity, may be only arbitrarily mnemonic at best, e.g., `E` for Edit but `X` for execute; or
(3) force the application of a well-defined, non-arbitrary abbreviation scheme, for example, least-unambiguous truncation, which, like complete tokens, requires totally correct entry.
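By way of illustration, a least-unambiguous-truncation scheme of the kind referred to in item (3) can be sketched as follows. The code, the sample vocabulary, and the function name are hypothetical and serve only to illustrate the concept:

```python
# Illustrative sketch of least-unambiguous truncation: each token is
# abbreviated to its shortest prefix that is unique within the vocabulary.
def least_unambiguous_truncations(vocabulary):
    """Map each token to its shortest prefix shared by no other token."""
    result = {}
    for token in vocabulary:
        for n in range(1, len(token) + 1):
            prefix = token[:n]
            # The prefix is unambiguous if exactly one token begins with it.
            if sum(1 for t in vocabulary if t.startswith(prefix)) == 1:
                result[token] = prefix
                break
        else:
            # The token is itself a prefix of another token; keep it whole.
            result[token] = token
    return result

print(least_unambiguous_truncations(["edit", "execute", "exit"]))
# edit -> ed, execute -> exe, exit -> exi
```

Because each abbreviation must be entered exactly, any typing error or further character deletion defeats such a scheme, which illustrates the rigidity of approach (3).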
Other expressions of the present problem are data compression and expansion or data encryption and decryption. Data may be intentionally compressed for transmission, for example, for the purposes of accomplishing transmission within limited bandwidth. Also, data may be intentionally encrypted for transmission to avoid understanding except by those for whom the data are intended.
In addition, algorithms are known which are capable of searching an extensive set of records for a match with a string of characters or symbols in combination with, or within a predetermined proximity of, other input strings. Such algorithms are applied in the art of information retrieval to identify, for example, citations to publications which may be of interest, by searching abstracts or full texts of the publications. They address a problem distinct from, though related to, token reconstruction, although token reconstruction may be applied to advantage in editing an information retrieval query.
To reconstruct and substitute tokens for token fragments according to the present invention, it is necessary not only to identify possible reconstructions of the unreconstructed token from a given set of tokens, called the "vocabulary", but also to identify which of the vocabulary tokens is the most likely reconstruction. This is done by computing a "reconstruction index" that measures the relative likelihood that a given vocabulary token is a correct reconstruction of the original unreconstructed input token.
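The selection of a single most likely reconstruction can be sketched as follows. The scoring function used here (the ratio from Python's standard difflib module) is merely an illustrative stand-in; it is not the reconstruction index of the present invention:

```python
# Minimal sketch of reconstruction: every vocabulary token is scored
# against the fragment and the single best-scoring token is output.
# difflib's ratio is a stand-in score, not the invention's index.
import difflib

def reconstruct(fragment, vocabulary):
    """Return the one vocabulary token most similar to the fragment."""
    def score(token):
        return difflib.SequenceMatcher(None, fragment.lower(), token.lower()).ratio()
    return max(vocabulary, key=score)

print(reconstruct("exec", ["edit", "execute", "exit"]))  # -> execute
```

Note that, unlike a spelling checker, the function commits to exactly one output token rather than offering a list of candidates.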
In an article entitled "An Inductive Approach to Language Translation," published in the Communications of the ACM, November 1964, R. D. Faulk suggested three different measures of similarity: material, ordinal, and positional similarity. Material similarity relates to the extent of character matches between two strings. Ordinal similarity relates to the extent to which characters in two strings appear in the same order. Positional similarity relates to the extent to which characters in two strings are located at the same positions. Faulk suggests that a total similarity score may be formed as a normalized combination of these three functions, weighting each function at one third.
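The three measures can be sketched as follows. The normalizations chosen here are one plausible reading of the article and may not match Faulk's exact formulations:

```python
# Hedged sketch of Faulk's three similarity measures; the normalizations
# below are one reasonable choice, not necessarily the 1964 originals.
from collections import Counter

def material(a, b):
    """Extent of character matches, regardless of order or position."""
    common = Counter(a) & Counter(b)          # multiset intersection
    return 2 * sum(common.values()) / (len(a) + len(b))

def ordinal(a, b):
    """Extent to which characters appear in the same order (via LCS length)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] else max(dp[i][j + 1], dp[i + 1][j])
    return 2 * dp[m][n] / (m + n)

def positional(a, b):
    """Extent to which characters occupy the same positions."""
    matches = sum(x == y for x, y in zip(a, b))
    return 2 * matches / (len(a) + len(b))

def total_similarity(a, b):
    """Faulk's suggestion: weight each of the three measures at one third."""
    return (material(a, b) + ordinal(a, b) + positional(a, b)) / 3
```

For identical strings all three measures are 1.0; for "ab" versus "ba" the material similarity is 1.0, the ordinal similarity 0.5, and the positional similarity 0.0, showing how the three measures capture different aspects of likeness.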
Early methodology for matching fragments to candidates is exemplified by the publications of A. J. Szanser on pattern recognition, error correction, and elastic matching techniques. Generally, Szanser suggests the steps of quickly extracting all non-identical versions of a fragment string from a candidate list and then using bracketing techniques to reduce operational time. The developmental efforts of Szanser and others through 1980 are discussed in some detail in Hall and Dowling's article "Approximate String Matching," Computing Surveys, Vol. 12, No. 4, December 1980, at pages 381-402.
A commercially available software package from Proximity Technology Inc. is described in part by U.S. Pat. No. 4,490,811, which was issued to Yianilos et al. on Dec. 25, 1984. Yianilos et al. discloses a string comparator device, system, circuit, and method involving, as do most algorithms in the art, a forward scan and a reverse scan of an input string. The Yianilos method computes a measure of symmetric similarity between two strings. Consequently, the method may result in reconstruction by substitution of a symmetrically most similar substitute.
The Yianilos method is particularly well suited to spelling correction for large, unstructured vocabularies. However, the method of string comparison disclosed by Yianilos is computationally complex and hence very slow. Further, because the disclosed method computes a symmetric similarity measure rather than an asymmetric reconstruction index, it does not adequately or efficiently handle the problem of arbitrarily dropped characters and hence provides no general solution to the problem of abbreviations or truncations. Further, the method provides no means to introduce or to vary fuzziness criteria, for example, as defined by R. E. Kimbrell.
As used in the art, the term "fuzziness" is taken from the mathematical study of fuzzy sets, as explained by R. E. Kimbrell in his article "Fuzzy Data Retrieval", AI Expert, July, 1988, at pages 56-63. In classical set theory, an item either is or is not a member of a set; a test for membership in a set evaluates to either true or false, a binary or boolean result. Fuzzy set theory admits uncertainty about membership in a set. For example, if members of set A are blue, round, and large while members of set B are red, square, and small, into which set does an item Q characterized as blue, round, and small go? Fuzzy set theory allows the analysis to be non-binary, suggesting that there is a 2/3 chance that Q is a member of set A and a 1/3 chance that Q is a member of set B. In other words, Q is similar to both A and B in some respects; overall, Q is more similar to A, but the exact assignment of Q to set A or B is fuzzy, or not clear.
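The blue/round/small example can be expressed numerically by taking graded membership as the fraction of matching attributes, a deliberately simplified reading of fuzzy sets:

```python
# Numerical form of the Q/A/B example: membership in a set is graded as
# the fraction of the set's attributes that the item shares with it.
def membership(item, prototype):
    matches = sum(a == b for a, b in zip(item, prototype))
    return matches / len(prototype)

A = ("blue", "round", "large")   # attributes of members of set A
B = ("red", "square", "small")   # attributes of members of set B
Q = ("blue", "round", "small")   # the item to be classified

print(membership(Q, A))  # 2/3: Q is blue and round, but not large
print(membership(Q, B))  # 1/3: Q is small, but neither red nor square
```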
Similarly, in ordinary string comparison approaches, a string (or substring) either is or is not exactly the same as another string (or substring). In contrast, a fuzzy string comparison algorithm evaluates how much one string is like another when the strings are not exactly alike; this is called a fuzzy comparison. A fuzzy comparison yields a graded truth value, somewhere between true and false, i.e., between 0.00 and 1.00.
In the prior art, there are no fuzziness criteria and no control over how much one string must resemble another string; fuzziness evaluation is a fixed, given attribute of the algorithm used to compare two strings.
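The kind of caller-controlled fuzziness criterion that the prior art lacks can be sketched as follows. The threshold parameter is hypothetical, and difflib's ratio again serves only as a stand-in comparison function:

```python
# Sketch of a fuzzy comparison whose acceptance criterion is supplied by
# the caller rather than fixed by the algorithm (threshold is hypothetical).
import difflib

def fuzzy_equal(a, b, threshold=0.8):
    """Return the graded truth value and whether it clears the threshold."""
    score = difflib.SequenceMatcher(None, a, b).ratio()
    return score, score >= threshold

print(fuzzy_equal("receive", "recieve"))                  # high score; passes at 0.8
print(fuzzy_equal("receive", "recieve", threshold=0.95))  # same score; stricter criterion fails
```

Exposing the threshold to the user or system designer is precisely the flexibility whose absence in the prior art is noted above.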
Consequently, there remains a requirement in the art for a flexible, fast, and efficient algorithm and apparatus for token reconstruction which permits the user or system designer to adapt its application to specific vocabulary problems by changing input and/or output criteria, in particular by changing fuzziness criteria.