Approximate matching of character strings has applications in many areas. One is spell checking, in which words closely similar to a sequence of letters found in a text are located; correctly spelled suggestions in the chosen language are then displayed for the user to choose from. Another is query correction in a search engine. Approximate matching may also be used in non-language applications, for example where close matches are required in a database. Some of the known methods of approximate string matching are described below.
In a linear search, the target string is compared with each dictionary entry. Edit distance, n-grams, or other criteria can be used to reject candidates. The disadvantage is that the search time grows in proportion to the size of the dictionary, which makes this a very slow method of searching.
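As an illustration only, a linear search with an edit-distance rejection criterion might be sketched as follows. The function names and the sample word list are hypothetical, and plain Levenshtein distance is assumed as the criterion.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-symbol insertions,
    deletions, or replacements needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace or keep
        prev = curr
    return prev[-1]


def linear_search(target: str, dictionary: list[str], max_errors: int) -> list[str]:
    """Compare the target with every entry; keep those within max_errors."""
    return [word for word in dictionary
            if edit_distance(target, word) <= max_errors]


words = ["hello", "help", "world", "halo"]
print(linear_search("helo", words, 1))  # ['hello', 'help', 'halo']
```

Every dictionary entry is visited once per query, which is the cost the methods described below try to avoid.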
Partial enumeration using a hash function is another method. A special hash function is used which is invariant under certain types of mistakes. For example, the “soundex” function used by Oracle Corporation returns a phonetic representation of a string. This method inherits the common drawbacks of hashing: the quality depends on the function used, performance may deteriorate to linear when there is a high level of collisions, and additional space is required to build the hash table.
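The idea of a mistake-invariant hash can be sketched as below. The code assumes the classic American Soundex rules and is not claimed to reproduce the behaviour of any particular vendor's function; `build_phonetic_index` and `lookup` are hypothetical helper names.

```python
from collections import defaultdict

# Consonant classes of classic Soundex; vowels, h, w and y have no code.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Four-character phonetic key: first letter plus three digits."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    key = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        if ch not in "hw":  # h and w do not separate equal codes
            prev = code
    return (key + "000")[:4]


def build_phonetic_index(dictionary):
    """Hash table from phonetic key to the words sharing that key."""
    index = defaultdict(list)
    for word in dictionary:
        index[soundex(word)].append(word)
    return index


def lookup(target, index):
    """Candidates are the entries that collide with the target's key."""
    return index.get(soundex(target), [])


index = build_phonetic_index(["Robert", "Rupert", "Rubin"])
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(lookup("Robbert", index))              # ['Robert', 'Rupert']
```

The sketch also shows the dependence on the chosen function: a misspelling that changes the first letter produces a different key and is never found.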
The segmentation approach (n-gram method) is based on the assumption that the target string and the candidates should have common substrings. An index is built from substrings of a certain length (n-grams), so that in most cases only entries sharing n-grams with the target need to be examined. This makes the method faster than a linear search.
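A minimal sketch of such an index is given below, assuming bigrams, a `$` padding symbol at both ends of each word, and a threshold on the number of shared n-grams. All of these choices, and the function names, are illustrative assumptions.

```python
from collections import Counter, defaultdict


def ngrams(word: str, n: int = 2) -> set[str]:
    """Set of n-grams of the word, padded so that edges count too."""
    padded = f"${word}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def build_index(dictionary, n: int = 2):
    """Inverted index: n-gram -> set of words containing it."""
    index = defaultdict(set)
    for word in dictionary:
        for gram in ngrams(word, n):
            index[gram].add(word)
    return index


def candidates(target: str, index, n: int = 2, min_shared: int = 2) -> set[str]:
    """Words sharing at least min_shared n-grams with the target."""
    shared = Counter()
    for gram in ngrams(target, n):
        for word in index.get(gram, ()):
            shared[word] += 1
    return {word for word, count in shared.items() if count >= min_shared}


index = build_index(["hello", "help", "world"])
print(sorted(candidates("helo", index)))  # ['hello', 'help']
```

Only the posting lists of the target's own n-grams are read, so "world" is never compared with the target at all.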
The classic spell-checker method is based on modifying a target word according to known correction rules and performing a simple lookup in the dictionary. This method is suitable only for closed classes of applications. However, the method does have the advantage of bringing context to the process.
Trie data structures are used to carry out string searches, particularly through large texts. The term “trie” stems from the word “retrieval”. Trie structures are multi-way tree structures which are useful for storing strings over an alphabet, and they are used to store large dictionaries of words. The alphabet used in a trie structure can be defined for the given application, for example, {0,1} for binary files, {the 256 extended ASCII characters}, {a, b, c . . . x, y, z}, or another form of alphabet such as Unicode, which represents symbols of most world languages.
The concept of a trie data structure is that all strings with a common prefix propagate from a common node. A node has at most as many child nodes as there are characters in the alphabet, plus a terminator. A string can be followed from the root to the leaf at which a terminator ends the string. In this way a trie-based dictionary can be built for a lexicon. For example, an English-language dictionary can be stored in this way. A trie-based dictionary has the advantage that the data is compressed due to the common entries for prefixes and possibly postfixes. A method of scanning a trie-based dictionary in order to recover approximate matches is called a trie walker.
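The structure described above can be sketched as follows. The class and function names are hypothetical, and a boolean flag on the node stands in for the terminator; postfix sharing is not modelled.

```python
class TrieNode:
    """One node per distinct prefix; children are keyed by character."""

    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.terminal = False  # True if a stored string ends here


def insert(root: TrieNode, word: str) -> None:
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.terminal = True


def contains(root: TrieNode, word: str) -> bool:
    """Exact lookup: follow the word from the root, then check the terminator."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.terminal


def count_nodes(node: TrieNode) -> int:
    return 1 + sum(count_nodes(child) for child in node.children.values())


root = TrieNode()
for w in ["car", "card", "care", "cat"]:
    insert(root, w)

print(contains(root, "care"), contains(root, "ca"))  # True False
print(count_nodes(root))  # 7: the root plus c, a, r, d, e, t
```

The four words contain 14 characters in total but need only six character nodes, because the prefix "ca" is stored once.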
H. Shang and T. H. Merrett, “Tries for Approximate String Matching”, IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 4, August 1996, pp. 540-547, describes a method of approximate string matching based on the use of a trie-based dictionary. The lexicon of the trie-based dictionary is stored as a finite state machine, i.e. along a path in the digital tree or trie, as described in T. H. Merrett, H. Shang, and Xiaoyan Zhao, “Database Structures, Based on Tries, for Text, Spatial, and General Data”, Proceedings of the International Symposium on Cooperative Database Systems for Advanced Applications, Kyoto, 1996, World Scientific, 1996, pp. 507-515.
The computational complexity of approximate string matching is a recurrent problem. The approximate matching procedure is a non-deterministic trie walker with rejects. Its computational complexity for a given target word depends only on the target word length and the average distribution of the degree of a dictionary graph group, which is only weakly correlated with the dictionary size for natural languages. Thus, in practice the complexity remains sub-linear in the size of the dictionary for natural languages, i.e. as the dictionary grows, the number of operations tends to be proportional to the length of the target word rather than to the number of dictionary entries.
Approximate string matching in a trie-based dictionary also allows exact and approximate word matching to be combined. These advantages, together with the fact that trie indexes contract prefixes and possibly postfixes and are therefore compact in storage, are the reasons for the prevalence of this method of approximate matching.
Along with the advantages stated above, the trie-based method has drawbacks. One drawback is that the method cannot be used in its pure form for certain applications, because natively it operates out of context. The only context available is the dictionary lexicon, which is not enough. For example, in order to provide intelligent suggestions, spell-checkers should rely on a set of common phonetic errors for a certain language, on character similarity during optical character recognition (for example, i and l, cl and d, rn and m), or on the close layout of certain keys on a keyboard. This form of context is not provided in the known method.
Another example of an improvement provided by bringing context into the process is the “Did you mean?” functionality of search engines. A suggestion for an alternative query can be found by substituting different fragments of the query with relevant synonyms while performing an approximate match in the dictionary of previous queries. Thus, context-dependent correction rules are needed along with approximate match methods in order to bring in context and to improve intelligence by better narrowing and ranking the set of resulting suggestions.
However, combining the application of text correction rules and practical non-deterministic traversing of the trie is a complicated task which historically has been performed in several passes as described in U.S. Pat. No. 6,616,704.
Another drawback is related to the prevalent practice of storing word fragments along with stand-alone words, particularly, although not exclusively, in applications for natural language processing for languages like German, Dutch, Danish, Swedish, Norwegian, Icelandic, Afrikaans, etc. This practice permits the creation of compact dictionaries. A disadvantage of this approach is that the methods of approximate matching for compound words in dictionaries of word fragments do not match exact decompounding methods and require separate implementations.
A further drawback is that the implementation of suggestion gathering depends on the technique used for error value computation. While traversing the dictionary, the trie walker gathers suggestions that conform to a predefined error tolerance, as described in H. Shang and T. H. Merrett, “Tries for Approximate String Matching”, IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 4, August 1996, pp. 540-547. Paths where the error value exceeds the error tolerance are rejected by the trie walker.
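A trie walker with rejects can be sketched as below. This is a minimal illustration that assumes plain edit distance as the error value and keeps one row of the distance table per visited node; it is not claimed to be the procedure of the cited reference, and `build_trie`, `walk`, and the `END` marker are hypothetical names.

```python
END = ""  # terminator key; cannot collide with single-character keys


def build_trie(words):
    """Nested-dict trie: each node maps a character to a child node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def walk(trie, target: str, tolerance: int):
    """Gather (word, error) pairs whose edit distance to target is
    within tolerance, rejecting paths that can no longer qualify."""
    suggestions = []

    def visit(node, prefix, prev_row):
        # prev_row[j] = edit distance between prefix and target[:j]
        if END in node and prev_row[-1] <= tolerance:
            suggestions.append((prefix, prev_row[-1]))
        for ch, child in node.items():
            if ch == END:
                continue
            row = [prev_row[0] + 1]
            for j in range(1, len(target) + 1):
                cost = 0 if target[j - 1] == ch else 1
                row.append(min(row[j - 1] + 1,          # insertion
                               prev_row[j] + 1,         # deletion
                               prev_row[j - 1] + cost)) # replace or keep
            # Reject: no extension of this path can get back under tolerance.
            if min(row) <= tolerance:
                visit(child, prefix + ch, row)

    visit(trie, "", list(range(len(target) + 1)))
    return sorted(suggestions)


trie = build_trie(["hello", "help", "world", "halo"])
print(walk(trie, "helo", 1))  # [('halo', 1), ('hello', 1), ('help', 1)]
```

In this sketch the row update is tied to edit distance, which illustrates the dependence noted above: switching to a different error measure would change the code that gathers suggestions, not only the code that scores them.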
There are two prevalent techniques for error value computation in practice. For natural language applications, the notion of edit distance is used. Edit distance is the minimum number of changes, such as replacement, insertion or deletion of one symbol, which have to be made to match two strings. One more operation has to be considered for spell aid applications: transposition of two symbols. The second prevalent method of error value computation is sequence-oriented. It is based on the calculation of the number of common substrings of fixed length, or n-grams. This method is used in areas such as computational biology, in particular in DNA sequence matching, as described in R. C. Angell, G. E. Freund, and P. Willett, “Automatic spelling correction using a trigram similarity measure”, Information Processing and Management, 19:255-261, 1983. Thus, the choice of technique for error value computation in approximate string matching in a trie-based dictionary is open, and it is desirable to preserve this freedom of choice.
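The two kinds of error value can be sketched side by side. The first function assumes the restricted form of edit distance with adjacent transposition (often called optimal string alignment); the second counts common trigrams with multiplicity. Both function names are hypothetical, and other variants of either measure exist.

```python
from collections import Counter


def edit_distance_with_transposition(a: str, b: str) -> int:
    """Edit distance where replacement, insertion, deletion and
    swapping two adjacent symbols each cost one."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # replace or keep
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def common_ngrams(a: str, b: str, n: int = 3) -> int:
    """Number of fixed-length substrings the two strings share."""
    grams_a = Counter(a[i:i + n] for i in range(len(a) - n + 1))
    grams_b = Counter(b[i:i + n] for i in range(len(b) - n + 1))
    return sum((grams_a & grams_b).values())


print(edit_distance_with_transposition("teh", "the"))  # 1
print(common_ngrams("GATTACA", "ATTACCA"))             # 3 (ATT, TTA, TAC)
```

Note that the first measure is a distance (smaller is closer) while the second is a similarity (larger is closer), so a tolerance threshold is applied in opposite directions.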