The present invention relates to the practice of generating word templates and, more specifically, to the practice of reducing data representing word templates in a speech recognition system.
In systems that require digital storage of an analog waveform, a significant amount of memory must be allocated for an accurate representation. In a speech recognition system, where word recognition depends on such accuracy, storing speech digitally requires an excessive amount of memory. This is especially true for speech recognition systems requiring large vocabularies. Each word in the vocabulary is typically represented by a word template. Each word template includes frames, segmented in equal time intervals, representing a spoken word. To practically implement a large vocabulary in a speech recognition system, two problems must be overcome.
The first problem is the extensive memory which is required to digitally store the vocabulary. Memory is expensive in cost and in circuit board real estate.
The second problem is the computation time required to process this representative data. In general, the computation time increases linearly with the amount of memory required for the template data. In systems utilizing large vocabularies, these two problems are an enormous burden for practical operation of a speech recognition system in real-time. Accordingly, the need to reduce the required template data is well recognized in the field of speech recognition.
Speech is typically segmented in time at equal intervals, each segment being referred to as a frame. Template data can be reduced where sounds within a word template are acoustically similar. For example, words which are spoken slowly often have consecutive frames of speech which are merely a long continuation of the same sound. Since frames having acoustically similar sounds do not need to be represented repetitively, there has been discussion of combining these frames into a representative frame. Combining frames in this manner is referred to as clustering.
When clustering any number of word template frames, the resultant frame is somewhat distorted with respect to the original frames due to slight variations of the representative data in each frame. Typically, when two or more frames are measured to be acoustically similar, clustering the frames is not expected to produce an excessive distortion. Techniques for determining an accurate similarity measure between frames are used to determine whether two or more frames should be clustered.
Similarity of frame information is usually measured using a distance calculation, such as the Hamming or Chebyshev distance, depending on the type of representative data. Two sequential frames from a word template can be clustered into a single frame if the "distance" between them is less than a predetermined distance. By clustering frames which have a small distance calculated between them, the data representing the speech can be reduced.
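The threshold test described above can be illustrated with a minimal sketch. It assumes that each frame is a list of numeric feature values, that the Chebyshev distance is the chosen measure, that each new frame is compared against the first frame of the current run, and that a cluster is represented by the element-wise average of its frames. The function names `chebyshev`, `average`, and `cluster_sequential` are hypothetical and chosen only for illustration.

```python
def chebyshev(a, b):
    """Largest absolute difference between corresponding feature values."""
    return max(abs(x - y) for x, y in zip(a, b))


def average(frames):
    """Element-wise mean of a list of equal-length frames (assumed representative)."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def cluster_sequential(frames, threshold):
    """Merge runs of consecutive frames whose distance to the first
    frame of the run is less than `threshold`; return one averaged
    frame per run."""
    clusters = []
    run = []
    for frame in frames:
        # Close the current run when the new frame is too far from its start.
        if run and chebyshev(run[0], frame) >= threshold:
            clusters.append(average(run))
            run = []
        run.append(frame)
    if run:
        clusters.append(average(run))
    return clusters


frames = [[1.0, 2.0], [1.0, 2.0], [5.0, 5.0]]
reduced = cluster_sequential(frames, threshold=1.0)
```

In this usage, the first two frames are identical and merge into one representative frame, while the third stands alone, so three frames are reduced to two. A greedy pass of this kind makes a single left-to-right decision per frame and does not by itself guarantee the fewest clusters.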
However, clustering frames in this manner becomes a problem when the number of frames in the word template is large. To "optimally" reduce the word template, a representative word template must be generated which has the fewest representative frames while satisfying a distortion criterion for each representative frame. Typically, this requires testing every possible clustering of frames in the word template. The clusters must be selected such that no other sequence of clusters will result in fewer clusters meeting the distortion criterion. The sequence of clusters is hereinafter referred to as a cluster path for the word template. The cluster path which results in the least distortion and the fewest clusters is the optimal cluster path. For a word template with a large number of frames, the search for the optimal cluster path results in an excessive amount of computation. For example, consider a word template comprised of 3 frames. There are a total of 4 possible cluster paths to consider: (1)(2)(3), (1 2)(3), (1)(2 3), and (1 2 3), each cluster being enclosed in parentheses. For a 5 frame word template, there are 16 possible cluster paths to consider. In general, for a word template comprised of N frames, there are 2^(N-1) possible paths to consider. A word template comprised of 15 frames requires that 16,384 possible cluster paths be considered, with probably only one cluster formation optimally reducing the template data. The computation required to consider each of these possibilities is not practical in a real-time environment.
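The growth in the number of cluster paths can be checked with a short enumeration. The sketch below assumes only that a cluster path is a division of frames 1 through N into runs of consecutive frames, so that each of the N-1 boundaries between adjacent frames is either a cut or not. The function name `cluster_paths` is hypothetical.

```python
from itertools import product


def cluster_paths(n):
    """Yield every division of frames 1..n into clusters of consecutive frames.

    Each path is a list of tuples, one tuple of frame numbers per cluster.
    """
    if n < 1:
        return
    # One cut/no-cut choice for each boundary between adjacent frames.
    for cuts in product((False, True), repeat=n - 1):
        path = []
        current = [1]
        for frame, cut in enumerate(cuts, start=2):
            if cut:
                path.append(tuple(current))
                current = []
            current.append(frame)
        path.append(tuple(current))
        yield path


# Count the paths for the template sizes mentioned in the text.
counts = {n: sum(1 for _ in cluster_paths(n)) for n in (3, 5, 15)}
```

Since every boundary is an independent binary choice, the count doubles with each added frame, which is why exhaustive testing of every path becomes impractical for long templates.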
Another problem encountered when clustering in this manner pertains to matching an appropriate clustering method to the particular type of feature data representing the speech. Typically, filter bank information or linear predictive coefficient (LPC) information is used to represent the speech. Clustering a group of frames represented by filter bank information will not always produce the same distortion that LPC information would produce. Hence, minimal cluster combinations for one type of feature data may not be minimal for another type of feature data.
What is needed is a clustering method for word template data that can generate the optimal cluster path efficiently for any type of feature data and distance measure used.