Searching through large alpha and numeric data structures involving nodes, such as the Internet, or large directories or databases of data (each of which is referred to herein as a “Dictionary”), requires significant computing power due to the large number of possible locations in which potentially relevant information is stored. Additionally, the possibility that the search string or data has been mistyped or recorded erroneously presents further difficulties, particularly when considering large data sets such as those involving the Internet, or large directories or databases of data.
Previous pattern recognition techniques for symbol-based errors have used the Levenshtein metric as the basis of comparison. The inter-symbol distances can be of a 0/1 sort, parametric, or entirely symbol-dependent, in which case they are usually assigned in terms of the confusion probabilities. The metric which uses symbol-dependent distances is referred to herein as the Generalized Levenshtein Distance (GLD). A previously known method for correcting garbled words based on the Levenshtein metric proposed an efficient algorithm for computing this distance by utilizing the concepts of dynamic programming.
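By way of illustration only, the dynamic-programming computation of a distance with symbol-dependent costs can be sketched as follows. The function name gld and the cost parameters sub, ins and dele are hypothetical names chosen for this sketch and are not taken from the cited methods; the defaults reduce to the 0/1 case.

```python
from typing import Callable


def gld(x: str, y: str,
        sub: Callable[[str, str], float] = lambda a, b: 0.0 if a == b else 1.0,
        ins: Callable[[str], float] = lambda b: 1.0,
        dele: Callable[[str], float] = lambda a: 1.0) -> float:
    """Distance from x to y under symbol-dependent edit costs (illustrative)."""
    # prev[j] holds the cost of editing the processed prefix of x into y[:j].
    prev = [0.0] * (len(y) + 1)
    for j, b in enumerate(y, 1):
        prev[j] = prev[j - 1] + ins(b)
    for a in x:
        cur = [prev[0] + dele(a)]
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j] + dele(a),         # delete a
                           cur[j - 1] + ins(b),       # insert b
                           prev[j - 1] + sub(a, b)))  # substitute a by b
        prev = cur
    return prev[-1]


# Assumed confusion-based cost: 'm' and 'n' are easily confused, so cheaper.
def confusable(a: str, b: str) -> float:
    if a == b:
        return 0.0
    return 0.5 if {a, b} == {"m", "n"} else 1.0


print(gld("kitten", "sitting"))               # 0/1 costs
print(gld("form", "forn", sub=confusable))    # symbol-dependent costs
```

Only two rows of the table are kept at any time, so the sketch uses memory proportional to the length of y.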
Dictionary-based Approaches
Most of the time-efficient methods currently available require that the maximum number of errors be known a priori, and these schemes are optimized for the case when the edit distance costs are of a 0/1 form. For example, Du et al. (An approach to designing very fast approximate string matching algorithms, IEEE Transactions on Knowledge and Data Engineering, 6(4):620-633, (1994)) proposed an approach to designing a very fast algorithm for approximate string matching, which divided the dictionary into partitions according to the lengths of the words. They limited their discussion to cases where the error distance between the given string and its nearest neighbors in the dictionary was “small”.
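The general idea of partitioning a dictionary by word length can be sketched as follows. This is not the algorithm of Du et al.; it only illustrates why a length partition helps when a maximum number of errors k is known and 0/1 costs are used: the edit distance between two strings is never smaller than the difference of their lengths, so partitions whose length differs from the query length by more than k can be skipped. The function names are hypothetical.

```python
from collections import defaultdict


def partition_by_length(words):
    """Group dictionary words into partitions keyed by word length."""
    parts = defaultdict(list)
    for w in words:
        parts[len(w)].append(w)
    return dict(parts)


def candidate_words(parts, query, k):
    """Return only the words whose length is within k of the query length."""
    out = []
    for n in range(max(0, len(query) - k), len(query) + k + 1):
        out.extend(parts.get(n, []))
    return out


H = ["for", "form", "fort", "forget", "format",
     "formula", "fortran", "forward"]
parts = partition_by_length(H)
print(candidate_words(parts, "fom", 1))
```

The candidates returned still have to be compared against the query with an edit distance computation; the partition only reduces how many comparisons are made.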
Bunke (Fast approximate matching of words against a dictionary. Computing, 55(1):75-89, (1995)) proposed the construction of a finite state automaton for computing the edit distance for every string in the dictionary. These automata are combined into one “global” automaton that represents the dictionary, which, in turn, is used to calculate the nearest neighbor for the noisy string when compared against the active dictionary. This algorithm requires time which is linear in the length of the noisy string. However, the number of states of the automaton grows exponentially, so the algorithm needs excessive space, rendering it impractical. For example, for the English alphabet with 26 characters, the minimum number of possible states needed is 29,619! for processing a single string in the dictionary.
Oflazer (Error-tolerant finite state recognition with applications to morphological analysis and spelling correction, Computational Linguistics, 22(1):73-89, (1996)) considered another method that could easily deal with very large lexicons. The set of all dictionary words is treated as a regular language over the alphabet of letters. By providing a deterministic finite state automaton recognizing this language, Oflazer suggested that a variant of the Wagner-Fischer algorithm can be designed to control the traversal through the automaton in such a way that only those prefixes which could potentially lead to a correct candidate X+ (where GLD(X+, Y)<K) are generated. To achieve this, he used the notion of a cut-off edit distance, which measures the minimum edit distance between an initial substring of the incorrect input string and the (possibly partial) candidate correct string. The cut-off edit distance requires a priori knowledge of the maximum number of errors found in Y and that the inter-symbol distances are of a 0/1 sort, or a maximum error value when general distances are used.
Baeza-Yates et al. (Fast approximate string matching in a dictionary. In Proceedings of the 5th South American Symposium on String Processing and Information Retrieval (SPIRE'98), IEEE CS Press, pages 14-22, (1998)) proposed two speed-up techniques for on-line approximate searching in large indexed textual databases when the search is done on the vocabulary of the text. The first proposal required approximately 10% extra space and exploited the fact that consecutive strings in a stored dictionary tend to share a prefix. The second proposal required even more additional space. Its aim was to organize the vocabulary in such a way as to avoid the complete on-line traversal. The organization, in turn, was based on the fact that they sought only those elements of H which are at an edit distance of at most K units from the given query string. Clearly, the efficiency of this method depends on the number of errors allowed.
The literature also reports some methods that include a filtering step so as to decrease the number of words in the dictionary that need to be considered in the calculations.
Dictionaries Represented as Tries
For the purpose of this document, a “Trie” is a data structure that can be used to store a dictionary when the latter is represented as a set of words or strings. The symbols of each word or string are stored in the nodes, and branches connect the nodes such that each word or string of the dictionary lies on a path within the Trie, with all words or strings sharing a prefix being represented by paths branching from a common initial path. FIG. 1 shows an example of a Trie for a simple dictionary of words {for, form, fort, forget, format, formula, fortran, forward}.
In terms of notation, A is a finite alphabet, H is a finite (but possibly large) dictionary, and μ is the null string, distinct from λ, the null symbol. The left derivative of order one of any string Z=z1z2 . . . zk is the string Zp=z1z2 . . . zk−1, that is, Z with its last symbol removed. The left derivative of order two of Z is the left derivative of order one of Zp, and so on.
FIG. 1 illustrates the main advantage of the trie, namely that it maintains only the minimal prefix set of characters that is necessary to distinguish all the elements of H.
The trie has the following features:
1. The nodes of the trie correspond to the set of all the prefixes of H.
2. If X is a node in the trie, then Xp, the left derivative of order one, will be the parent node of X, and Xg, the left derivative of order two, will be the grandparent of X.
3. The root of the trie will be the node corresponding to μ, the null string.
4. The leaves of the trie will all be words in H, although the converse is not true.
With respect to the nodes of the Trie, a node is called a “Terminal” node if it represents the end of a word from the dictionary, even if that node is not a leaf node. Clearly, leaf nodes are necessarily also Terminal nodes.
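The trie properties described above can be illustrated with a minimal sketch built from the example dictionary of FIG. 1. The class and function names (TrieNode, build_trie, find, count_nodes) are hypothetical and chosen only for this illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # symbol -> child TrieNode
        self.terminal = False  # True if a dictionary word ends at this node


def build_trie(words):
    """Insert every word; words sharing a prefix share the initial path."""
    root = TrieNode()          # the root corresponds to the null string
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True
    return root


def find(root, prefix):
    """Return the node reached by following prefix, or None if absent."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


def count_nodes(node):
    """Count the nodes, i.e. the distinct prefixes of the dictionary."""
    return 1 + sum(count_nodes(c) for c in node.children.values())


H = ["for", "form", "fort", "forget", "format",
     "formula", "fortran", "forward"]
root = build_trie(H)
n = find(root, "for")
print(n.terminal, bool(n.children))  # Terminal node that is not a leaf
print(count_nodes(root))             # one node per distinct prefix
```

In this sketch the node for "for" is a Terminal node with children, whereas the node for "forget" is both a leaf and a Terminal node, which matches the distinction drawn above.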
With regard to traversal, the trie can be considered as a graph, which can be searched using any of the possible search strategies applicable to AI problems. The literature includes two possible strategies that have been applied to tries, namely the Breadth First Search strategy (see Kashyap et al. (An effective algorithm for string correction using generalized edit distances -i. description of the algorithm and its optimality, Inf. Sci., 23(2):123-142, (1981)), and Oommen et al. (Dictionary-based syntactic pattern recognition using tries, Proceedings of the Joint IARR International Workshops SSPR 2004 and SPR 2004, pages 251-259, (2004))) and the Depth First Search strategy (see Shang et al. (Tries for approximate string matching, IEEE Transactions on Knowledge and Data Engineering, 8(4):540-547, (1996))).
One of the first attempts to avoid repetitive computations for a finite dictionary was the one which took advantage of prefix information, as proposed by Kashyap et al. (An effective algorithm for string correction using generalized edit distances -i. description of the algorithm and its optimality, Inf. Sci., 23(2):123-142, (1981)). They proposed a new intermediate edit distance called the “pseudo-distance”, from which the final Generalized Levenshtein Distance can be calculated by using only a single additional computation. However, the algorithm of Kashyap et al. (1981) was computationally expensive, because it required set-based operations throughout its execution. An efficient Breadth First Search strategy has been recently proposed by Oommen et al. (Dictionary-based syntactic pattern recognition using tries, Proceedings of the Joint IARR International Workshops SSPR 2004 and SPR 2004, pages 251-259, (2004)), which demonstrated how the search can be executed by providing a feasible implementation of the concepts introduced by Kashyap et al. (1981). This was achieved by the introduction of a new data structure called the Linked Lists of Prefixes (LLP), which can be constructed when the dictionary is represented by a trie. The LLP permits the level-by-level traversal of the trie, and permits the Breadth First Search calculations using the pseudo-distance and the dynamic equations presented by Kashyap et al. (1981).
An extra memory location was needed at each node to store the pseudo-distances calculated so far, which were needed for further calculations.
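The level-by-level traversal mentioned above can be illustrated, in a simplified form, by grouping the prefixes of a trie by depth using a breadth-first walk. This sketch is not the LLP structure of Oommen et al., and it does not compute pseudo-distances; it only shows the order in which a level-by-level scheme would visit the prefixes. The nested-dictionary trie and the use of the key None as an end-of-word marker are assumptions made for brevity.

```python
from collections import deque


def build_trie(words):
    """Nested-dict trie; the key None marks the end of a word (assumed)."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[None] = True
    return root


def prefixes_by_level(root):
    """Return one list of prefixes per trie level, found breadth first."""
    levels = []
    queue = deque([("", root, 0)])
    while queue:
        prefix, node, depth = queue.popleft()
        if depth == len(levels):
            levels.append([])          # first prefix seen at this depth
        levels[depth].append(prefix)
        for ch in sorted(k for k in node if k is not None):
            queue.append((prefix + ch, node[ch], depth + 1))
    return levels


for depth, level in enumerate(prefixes_by_level(
        build_trie(["for", "form", "fort"]))):
    print(depth, level)
```

All prefixes of a given length are visited before any longer prefix, so a value stored for a prefix is available when its children are processed at the next level.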
Shang et al. (Tries for approximate string matching, IEEE Transactions on Knowledge and Data Engineering, 8(4):540-547, (1996)) used the trie data structure for both exact and approximate string searching. First of all, they presented a trie-based method whose cost was independent of the document size. They then proposed a k-approximate match algorithm on a text represented as a trie, which performed a Depth First Search (DFS) on the trie using the matrix involved in the dynamic programming equations used in the GLD computation (see Wagner et al. (The string-to-string correction problem, Journal of the Association for Computing Machinery (ACM), 21:168-173, (1974))). Besides the trie, they also needed to maintain a matrix to represent the Dynamic Programming (DP) matrix required to store the results of the calculations. The trie representation compresses the common prefixes into overlapping paths, and the corresponding column (in the DP matrix) needs to be evaluated only once. Shang et al. further applied an existing cutoff strategy, referred to as Ukkonen's cutoff (see Ukkonen (Algorithm for approximate string matching. Information and control, 64:100-118, (1985))), to optimize column calculations for the DP matrix and also to abort unsuccessful searches. However, in order to apply these cutoff principles, the user has to know the maximum number of errors, K, a priori, and also has to resort to using 0/1 costs for the inter-symbol edit distances, which is not a requirement in the present invention.
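A simplified sketch of a depth-first search over a trie that reuses dynamic programming results for shared prefixes is given below. It is not the algorithm of Shang et al. and does not implement Ukkonen's cutoff exactly; it uses a coarser pruning rule (abandon a branch when every entry of the current DP vector exceeds K) with 0/1 costs. The function names, the nested-dictionary trie, and the use of the key None as an end-of-word marker are assumptions made for this illustration.

```python
def build_trie(words):
    """Nested-dict trie; the key None marks the end of a word (assumed)."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[None] = True
    return root


def search(root, query, k):
    """Return (word, distance) pairs with 0/1 edit distance at most k."""
    results = []

    def dfs(node, prefix, row):
        # row[j] = distance between prefix and query[:j]
        if None in node and row[-1] <= k:
            results.append((prefix, row[-1]))
        for ch, child in node.items():
            if ch is None:
                continue
            # Extend the parent's DP vector by one trie symbol; every word
            # below this node reuses it, so it is computed only once.
            new = [row[0] + 1]
            for j, q in enumerate(query, 1):
                new.append(min(new[j - 1] + 1,           # insertion
                               row[j] + 1,               # deletion
                               row[j - 1] + (ch != q)))  # substitution
            if min(new) <= k:  # simplified cutoff: prune hopeless branches
                dfs(child, prefix + ch, new)

    dfs(root, "", list(range(len(query) + 1)))
    return sorted(results)


H = ["for", "form", "fort", "forget", "format",
     "formula", "fortran", "forward"]
print(search(build_trie(H), "fotm", 1))
```

The pruning is safe for 0/1 costs because the minimum of a DP vector never decreases as the prefix grows, which is also why the bound K must be supplied in advance.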