The present invention relates generally to speech and text recognition. More particularly, the invention relates to a system and method for constructing a computer-searchable dictionary. The system organizes the dictionary into clusters to facilitate rapid dictionary searches. The system constructs a clustered dictionary through a process that is much faster than conventional dictionary ordering processes. This enables much larger dictionaries to be clustered while minimizing processor overhead.
Speech recognition systems and text recognition systems often rely on comparing the input test speech or input text with entries in a dictionary. The dictionary is a database containing all possible entries that the recognition system is capable of processing. In a simple text-based system it is possible to organize the dictionary alphabetically, using a tree-structured data structure to store the dictionary. The alphabetically-sorted, tree-structured dictionary works well when the confusability between speech fragment utterances or text characters is low. However, the alphabetically-sorted, tree-structured dictionary does not work well when there is a moderately high likelihood of confusion among utterances or letters.
To demonstrate, in a keyboard entry text application, the identity of each keystroke can be treated as reliable. Thus, as a word is typed, the dictionary searching algorithm can traverse the alphabetically-sorted tree, letter by letter, thereby progressing to the dictionary entry as it is typed. In a speech recognition system, or in any text input system in which the individual speech fragment utterances or letters cannot be deemed reliable, the alphabetically-sorted, tree-structured dictionary gives poor results. For example, if the first input speech fragment utterance or character is confused or misinterpreted, an alphabetically-sorted, tree-structured algorithm will begin to traverse an incorrect path. Thus, if the letter J is mistaken for the letter G, the tree-structured algorithm will fail to find the entry JOHNSON, by having misconstrued the first letter.
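The failure mode described above can be illustrated with a minimal sketch of a letter-by-letter tree lookup. The nested-dictionary trie, the `"$"` end-of-word marker, and the function names `build_trie` and `lookup` are illustrative assumptions, not structures taken from this specification.

```python
def build_trie(words):
    """Store words in a nested-dict tree, one level per letter."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node["$"] = word  # hypothetical end-of-word marker
    return root


def lookup(trie, letters):
    """Follow the tree letter by letter; fail as soon as a branch is missing."""
    node = trie
    for letter in letters:
        if letter not in node:
            return None
        node = node[letter]
    return node.get("$")


trie = build_trie(["JOHNSON", "JACKSON", "GIBSON"])
found = lookup(trie, "JOHNSON")   # every letter recognized correctly
missed = lookup(trie, "GOHNSON")  # first letter J misrecognized as G
```

In this sketch the misrecognized input follows the G branch and reaches a dead end at the second letter, so the entry JOHNSON is never reached even though six of the seven letters were correct.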
Accordingly, in speech recognition and text recognition systems with moderately high probability of confusion, a clustered data structure is preferred. To build a conventional clustered data structure, the dictionary of all possible entries is mapped onto an N×N distance matrix (N being the number of entries in the dictionary) in which the distance of each entry from all other entries is calculated and recorded in the matrix. Distance can be computed using a variety of different metrics, the simplest being the geometric distance calculated using the Pythagorean theorem upon the individual parameters that are used to define the dictionary entries. For example, if each dictionary entry is represented by 3 parameters, a, b and c, the distance d between two entries (a_1, b_1, c_1) and (a_2, b_2, c_2) is computed according to the Pythagorean theorem as follows.

$$d = \sqrt{(a_1 - a_2)^2 + (b_1 - b_2)^2 + (c_1 - c_2)^2}$$
The conventional clustered data structure is constructed by examining the N×N distance matrix to find the center of the dictionary, then finding the point whose distance from all other points is maximum, and then reassigning all points to these two cluster centers. The cluster centers are then computed again and the process is repeated iteratively until the desired number of clusters is generated.
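As a rough illustration of the conventional distance matrix described above, the following sketch computes the geometric distance between entries represented by three parameters and fills the full matrix. The function names and the sample parameter values are assumptions chosen for illustration only.

```python
import math


def euclidean_distance(p, q):
    """Geometric distance between two entries given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def distance_matrix(entries):
    """Build the full N x N matrix; the work grows with the square of N."""
    n = len(entries)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = euclidean_distance(entries[i], entries[j])
            matrix[i][j] = matrix[j][i] = d  # the matrix is symmetric
    return matrix


# Hypothetical entries, each described by three parameters (a, b, c).
entries = [(0, 0, 0), (3, 4, 0), (1, 2, 2)]
matrix = distance_matrix(entries)
```

Because every entry is compared with every other entry, doubling the dictionary size roughly quadruples both the number of distance computations and the storage needed for the matrix.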
Specifically, these two points are used to divide the dictionary into two subsets corresponding to the two points identified. Each entry in the dictionary is assigned to one subset or the other, depending on which of the two points the entry is closest to. The procedure is then repeated again and again, until the original dictionary has been broken into the desired number of subsets or clusters.
The conventional technique for generating these clusters is suitable for small dictionaries, on the order of several thousand entries. However, as the dictionary size increases, the conventional cluster forming technique becomes geometrically slower and rapidly consumes too much memory. Whereas a small dictionary might be subdivided into a suitable number of clusters in a matter of minutes, a large dictionary could take days, weeks or even years to construct in this conventional fashion. The time required to construct the dictionary clusters can impose significant penalties on systems that are designed to periodically merge data from separate dictionaries into a single dictionary, as this may frequently necessitate recomputing the cluster centers.
The present invention takes a different approach to constructing a clustered dictionary. According to the invention, the initial dictionary is first subdivided according to a predefined rule-based system to provide a plurality of first cluster centers. By way of example, the rule-based system may sort the dictionary alphabetically and then divide the dictionary into groups based on the number of letters contained in each word entry. Within each length category, one word for each distinct first letter is used as a first cluster center. Each cluster center thus identifies the words or names of the same length that begin with the same letter. That is the presently preferred rule-based system for providing the first cluster centers. In general, however, any suitable rule-based system that produces the desired number of cluster centers may be employed. The number of cluster centers generated by the rule-based system is dictated largely by processor speed and desired time constraints. The presently preferred system selects on the order of 260 cluster centers for a 20,000 word dictionary.
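The rule-based selection described above might be sketched as follows. This is one reading of the rule, under the assumption that the first word in alphabetical order for each combination of word length and first letter serves as the center; the function name `initial_cluster_centers` and the sample names are hypothetical, and entries are assumed to be non-empty strings.

```python
def initial_cluster_centers(words):
    """Pick one center per (word length, first letter) pair.

    Assumption: the alphabetically first word of each group is the center.
    """
    centers = {}
    for word in sorted(words):
        key = (len(word), word[0])
        centers.setdefault(key, word)  # keep only the first word seen per group
    return centers


words = ["JOHNSON", "JACKSON", "GIBSON", "JONES", "GREEN", "JAMES"]
centers = initial_cluster_centers(words)
```

With 26 possible first letters and roughly ten distinct word lengths, a rule of this kind would yield on the order of 260 centers, which is consistent with the figure given above. No distance computation is needed to obtain them.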
Having selected the first cluster centers, all entries in the initial dictionary are compared to each of the cluster centers. At least a portion of the dictionary entries are assigned to the closest cluster center, thereby subdividing the initial dictionary into a plurality of first clusters.
Next, each of the first clusters is examined to identify its true cluster center, thereby defining a plurality of second cluster centers. The first cluster centers and the second cluster centers do not necessarily match one another even though they may be associated with the same one of the first clusters.
Having identified the second cluster centers, the initial dictionary is subdivided into a plurality of third clusters by comparing each entry of the initial dictionary to each of the second cluster centers, and assigning each entry to the one of the second cluster centers that represents the closest degree of similarity.
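The assign-and-recenter steps above can be sketched as a short loop. Several choices here are assumptions rather than details from this summary: edit distance stands in for the similarity metric, the "true" center of a cluster is taken to be the member with the smallest total distance to the other members, every entry is assigned (the text requires only a portion), and the names `assign`, `true_center`, and `refine` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance; a stand-in for the system's similarity metric."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def assign(words, centers):
    """Assign each word to its closest center (ties go to the earlier center)."""
    clusters = {c: [] for c in centers}
    for w in words:
        nearest = min(centers, key=lambda c: edit_distance(w, c))
        clusters[nearest].append(w)
    return clusters


def true_center(cluster):
    """Assumed definition: the member with the smallest total distance to the rest."""
    return min(cluster, key=lambda c: sum(edit_distance(c, w) for w in cluster))


def refine(words, centers, iterations=1):
    """Form first clusters, then recenter and reassign `iterations` times."""
    clusters = assign(words, centers)
    for _ in range(iterations):
        centers = [true_center(m) for m in clusters.values() if m]
        clusters = assign(words, centers)
    return clusters


words = ["JOHNSON", "JOHNSEN", "JACKSON", "GIBSON", "GIBBON"]
clusters = refine(words, ["JACKSON", "GIBSON"])
```

In this small example the first center JACKSON is replaced by JOHNSON after recentering, which shows how a first cluster center and the corresponding second cluster center can differ. Each assignment pass compares N entries against K centers rather than against all N entries; for 20,000 entries and 260 centers that is 5.2 million comparisons per pass, compared with 400 million cells in a full N×N matrix.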
The above process of identifying the cluster centers and reassigning each entry of the initial dictionary to the closest center may be repeated over a series of iterations, if desired. The system provides a clustered dictionary for electronic speech and text recognition applications, without the costly computational overhead found in conventional cluster generation systems. For a more complete understanding of the invention, its objects and advantages, refer to the following specification and to the accompanying drawings.