The present invention relates to the practice of generating reliable word templates from speech utterances for a speech recognition system and, more particularly, to the practice of generating reliable word templates by averaging repetitious speech during word training for a user dependent speech recognition system.
A typical method of word template generation is to use a single utterance of a given word as the word template. This is known to cause problems because the way the user speaks a word fluctuates from one utterance to the next. Words can be distorted by the speed at which they are spoken as well as by physical irregularities in the speaker's voice. For example, a word will be distorted when the user has a cold or when the user's lips smack.
Another typical method of word template generation is to use multiple training utterances stored as individual templates for a given word. Although this method can realize good accuracy in matching spoken words to the prestored word templates, it may be impractical due to the excessive computation time and extensive memory it requires.
Because of the problems related to these methods, a number of word template generation schemes have been developed which use the concept of averaging spoken words during voice training. Words spoken for training are referred to as tokens. Although the word "token" implies a single spoken word, a "token" can also represent a short phrase. The goal in averaging tokens in a word template generation scheme is to generate a word template which best typifies the word as spoken by the speech recognition system user. Prior art methods which average training tokens, however, allow some tokens to contribute disproportionately to the resultant word template.
In a paper by Y. J. Liu, entitled "On Creating Averaging Methods", ICASSP 1984 pp. 9.1.1 through 9.1.4, a few techniques for generating templates by averaging are described. The first technique is called `Dynamic Averaging`. The first requirement in this technique is to choose a token `whose length is representative`. Subsequent tokens are averaged into this chosen token, each mapped to the `representative length`. The disadvantage of this technique is twofold. The first disadvantage is that if one of the tokens is not of a similar length, mapping its length onto the time axis of the chosen token does not allow the token to contribute to the averaged token with respect to length. If the token is at all representative of the spoken word, it should be averaged in both data and length. If the token is not representative of the spoken word, it should not be averaged at all.
The second disadvantage of this technique is that the resultant word template is dependent on the order in which the tokens are averaged into the chosen token. Hence, the optimal resultant word template cannot be found since the optimal order in averaging the tokens cannot be determined.
The second method Liu describes is called `Dynamic Local Averaging`. This method also requires that a token be chosen which is of "representative length". This chosen word is then averaged with other spoken words according to the distances between frames within the words. Frames from words with minimum corresponding frame distances are averaged together to generate the representative word template. Accordingly, this method chooses a token, and then averages it with other tokens along a path in time, but not necessarily along a global time alignment path to produce the optimal alignment.
The last method Liu describes, `Linear Averaging`, requires choosing a `common length` and linearly changing the time axis of each token to that common length. The tokens are then averaged vertically, according to aligned frames. This technique may not generate a very representative word template.
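The `Linear Averaging` idea described above can be illustrated with a minimal sketch. It assumes each token is a list of frames and each frame is a list of feature values. The function names `resample_linear` and `linear_average` are hypothetical, and the use of linear interpolation between neighboring frames is an assumption made for illustration; the cited paper may map frames differently.

```python
def resample_linear(token, length):
    """Linearly map a token (a list of frames) onto `length` frames.

    Assumption: frames between original positions are obtained by
    linear interpolation of the two nearest original frames.
    """
    n = len(token)
    if n == 1 or length == 1:
        # Degenerate case: repeat the first frame.
        return [list(token[0]) for _ in range(length)]
    out = []
    for i in range(length):
        # Position of output frame i on the token's original time axis.
        pos = i * (n - 1) / (length - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(token[lo], token[hi])])
    return out


def linear_average(tokens, common_length):
    """Warp every token to `common_length`, then average aligned frames."""
    warped = [resample_linear(t, common_length) for t in tokens]
    dims = len(tokens[0][0])
    return [
        [sum(w[i][d] for w in warped) / len(warped) for d in range(dims)]
        for i in range(common_length)
    ]


# Two one-dimensional tokens of different lengths, averaged at length 3.
tokens = [[[0.0], [2.0]], [[0.0], [1.0], [2.0]]]
template = linear_average(tokens, 3)
```

Because the warp is purely linear, the sketch ignores where the speech events actually fall within each token, which is one way to see why such a template may be less representative than one built along a time alignment path.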
Notwithstanding the peculiarities of these prior art word template generation methods, each method requires an initial storage of all original tokens. Since most speech recognition systems are memory limited, it is usually desirable that any spare memory be used for vocabulary storage. This additional memory requirement of the above discussed methods only further burdens the speech recognition system.
When a user is training the speech recognition system, the tokens are each meant to be representative of the word as spoken by the user. Each token should have an equal contribution to the generated word template, unless, of course, one token happens to be entirely different from the others. In summary, a word template generation method is desired which produces a final word template with the following results:
The final word template should have an averaged time axis such that each token equally contributes to the final word template;

When the average length of all the tokens is fractional, the generated word template will have a length which is rounded off to the closest number of frames;

As the number of tokens used in the averaging process increases, each additional token should have a reduced effect on the word template being generated as a whole;

The memory required for generating the word template should be a minimum; and

The generated word template is independent of the order in which the tokens are averaged.
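The length-related results listed above can be illustrated with a small running-average sketch. It is not the method of the invention, only an illustration under stated assumptions: the class name `RunningLength` is hypothetical, only token lengths (in frames) are tracked, and ties are rounded upward because the text does not say how an exact half should be rounded. Averaging the frame data itself would additionally require a time alignment, which this sketch does not attempt.

```python
class RunningLength:
    """Running mean of token lengths, kept without storing the tokens."""

    def __init__(self):
        self.count = 0    # number of tokens averaged so far
        self.mean = 0.0   # mean token length in frames

    def add(self, token_length):
        self.count += 1
        # The new token shifts the mean by 1/count of its deviation, so
        # each additional token has a smaller effect than the one before.
        self.mean += (token_length - self.mean) / self.count

    def template_length(self):
        # Assumption: round to the closest frame count, halves upward.
        return int(self.mean + 0.5)


# The same three lengths in two different orders.
a = RunningLength()
for n in (10, 11, 12):
    a.add(n)

b = RunningLength()
for n in (12, 10, 11):
    b.add(n)
```

Only two numbers are stored regardless of how many tokens are seen, every token carries the same weight in the final mean, and the result does not depend on the order of the tokens apart from floating-point rounding.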