For large vocabulary ASR (automatic speech recognition), acoustic models are commonly trained on very large databases consisting of hundreds of hours of speech. These recorded databases are available for a price much less than the cost of repeating the collection effort to obtain new training data. Models trained from these data can yield impressive performance (error rates of less than 5%) on recognition tests of dictation with essentially open vocabularies.
Commonly the acoustic models trained from such a large database are “phonetic models”, since they break speech up into phoneme-like units. In a speech recognition system with phonetic word models, each word has one or more associated phonetic pronunciations, also called phonetic spellings, which indicate the sequence of one or more phonemes that define the word's pronunciation. The word's acoustic model is generated as needed from the concatenation of the acoustic models associated with the sequence of phonemes indicated by its phonetic spelling.
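As a minimal sketch of the concatenation described above, the following assumes a toy phoneme inventory and lexicon. The names `PHONEME_MODELS`, `LEXICON`, and `build_word_models` are hypothetical, and each acoustic node is reduced to a string label rather than a real statistical model.

```python
from typing import Dict, List

# Hypothetical toy inventory: each phoneme maps to its sequence of acoustic
# node labels. A real system would store statistical models, not strings.
PHONEME_MODELS: Dict[str, List[str]] = {
    "k": ["k_1", "k_2"],
    "ae": ["ae_1", "ae_2", "ae_3"],
    "t": ["t_1", "t_2"],
}

# Hypothetical lexicon: each word has one or more phonetic spellings.
LEXICON: Dict[str, List[List[str]]] = {
    "cat": [["k", "ae", "t"]],
    "tack": [["t", "ae", "k"]],
}


def build_word_models(word: str) -> List[List[str]]:
    """Return one concatenated node sequence per phonetic spelling of `word`."""
    models: List[List[str]] = []
    for spelling in LEXICON[word]:
        nodes: List[str] = []
        for phoneme in spelling:
            # The word model is generated on demand by appending the nodes
            # of each phoneme in the order given by the spelling.
            nodes.extend(PHONEME_MODELS[phoneme])
        models.append(nodes)
    return models
```

Because the word model is assembled on demand, only the phonetic spelling needs to be stored for each word; "cat" and "tack" in this sketch reuse the same three phoneme models in a different order.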
A common technique in the field is to represent the sound of each phoneme as a triphone: a phonetic model consisting of one or more acoustic nodes, and representing the sound of a given phoneme when it occurs in the context of the phoneme that precedes it, and the one that follows it. The way that a given phoneme will be realized depends on the phonemes that surround it: the term for this effect on a given speech sound caused by the sounds that precede or follow it is “coarticulation”. Because a triphone models a “phoneme-in-context”, that is, because it models the occurrence of a given first phoneme when it is preceded by a given second phoneme and followed by a given third phoneme, it does a reasonably good job of modeling the changes in the realization of a phoneme that result from coarticulation. By using several nodes, or stationary acoustic states, in the triphone model, one can represent the transitions in sound that occur between successive phonemes. These transitions (which are not instantaneous) occur as a result of the fact that the human vocal apparatus as an instrument is somewhat like a trombone, in the sense that its components—such as the tongue, teeth, and lips—that work together to create different sounds have to move in a continuous manner between the different positions associated with the formation of different phonemes.
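The phoneme-in-context idea can be illustrated by expanding a phonetic spelling into triphones. This is only a sketch: the function name `to_triphones` and the use of a `"sil"` boundary symbol for the word edges are assumptions made for illustration, not details taken from the description above.

```python
from typing import List, Tuple

# (preceding phoneme, phoneme being modeled, following phoneme)
Triphone = Tuple[str, str, str]


def to_triphones(spelling: List[str], boundary: str = "sil") -> List[Triphone]:
    """Expand a phoneme sequence into one triphone per phoneme.

    The `boundary` symbol (an assumption of this sketch) stands in for the
    missing context before the first phoneme and after the last one.
    """
    padded = [boundary] + spelling + [boundary]
    return [
        (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]
```

For the spelling `["k", "ae", "t"]`, the middle phoneme is modeled by the triphone `("k", "ae", "t")`, which is distinct from the model used for the same phoneme in any other context.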
Empirically, it has been determined that good speech recognition performance can be obtained when each individual acoustic node model in a triphone model is derived from a cluster of similar node models that occur in different triphones. Such cluster node models are derived by clustering the individual acoustic nodes of different triphones into node groups, deriving a statistical model of each such node group, and using the statistical model of each group as the model for each of the individual triphone nodes in that group.
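The last two steps (deriving one statistical model per node group and sharing it among the group's nodes) might be sketched as follows. The sketch assumes the grouping has already been computed, uses one-dimensional training frames, and models each group with a single Gaussian; the function name `tie_nodes` and these simplifications are illustrative assumptions, not the method of any particular system.

```python
from statistics import fmean, pvariance
from typing import Dict, List, Tuple

Gaussian = Tuple[float, float]  # (mean, variance)


def tie_nodes(
    node_frames: Dict[str, List[float]],
    node_to_group: Dict[str, str],
) -> Dict[str, Gaussian]:
    """Give every node the Gaussian estimated from its group's pooled frames.

    Assumes every group has at least one training frame in total; otherwise
    `fmean` raises `statistics.StatisticsError`.
    """
    # Pool the training frames of all nodes that belong to the same group.
    pooled: Dict[str, List[float]] = {}
    for node, group in node_to_group.items():
        pooled.setdefault(group, []).extend(node_frames.get(node, []))

    # Derive one statistical model per group from the pooled frames.
    group_models = {g: (fmean(x), pvariance(x)) for g, x in pooled.items()}

    # Every node in a group uses the group's model as its own.
    return {node: group_models[group] for node, group in node_to_group.items()}
```

In this sketch a node with few or no frames of its own still receives a model, because the estimate is made from all frames pooled across its group.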
Standard techniques exist for automatically clustering such nodes based on linguistic knowledge and/or statistical information. U.S. Pat. No. 5,715,367, entitled “Apparatuses And Methods For Developing And Using Models For Speech Recognition” issued to Gillick et al., on Feb. 3, 1998, provides a good description of triphone models and methods for automatically clustering their associated acoustic nodes. This U.S. patent is hereby incorporated herein by reference in its entirety.
The representation of acoustic nodes by statistical models of the node group to which they belong results in better estimates of acoustic model parameters for nodes, because it tends to cause each node to have more training data associated with it. Phonetic models work well even for words never seen in the training data, because they allow a given phoneme or phoneme node model that has received training data in the utterances of one or more words to be used in the models of other words that contain the phoneme or phoneme node, respectively. Furthermore, in systems in which triphone nodes are clustered using knowledge indicating which triphone nodes are likely to have similar sounds, workable models can be created for a triphone that has never been uttered in the training data based on other similar triphones or phoneme nodes for which training data has been received.
Virtually all large vocabulary systems use phonetic models, because their benefits are particularly important in large vocabulary systems. First, they can decrease the amount of data required to represent a large number of word models, since the acoustic models of words can be represented by phonetic spellings. Second, they can decrease the computation required for recognition of a large number of words, since they allow the scoring of speech sounds against a sequence of one or more phonemes to be used in the scoring of multiple words that share such a phonetic sequence. Third, and perhaps most importantly, they can greatly reduce the amount of training data necessary to train a large vocabulary or to adapt a large vocabulary to a particular speaker, as indicated above.
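One way to picture the computational saving from shared phonetic sequences is a prefix tree over phonetic spellings, in which words that begin with the same phonemes share the same arcs. This is a sketch under stated assumptions: the prefix-tree organization and the names `build_prefix_tree` and `count_arcs` are illustrative choices, and other ways of sharing scores are possible.

```python
from typing import Dict, List


def build_prefix_tree(lexicon: Dict[str, List[str]]) -> dict:
    """Build a tree in which each arc is a phoneme shared by all words below it."""
    root: dict = {"children": {}, "words": []}
    for word, spelling in lexicon.items():
        node = root
        for phoneme in spelling:
            node = node["children"].setdefault(
                phoneme, {"children": {}, "words": []}
            )
        node["words"].append(word)  # the word ends at this node
    return root


def count_arcs(node: dict) -> int:
    """Count phoneme arcs, i.e. distinct phoneme-in-prefix positions to score."""
    return sum(1 + count_arcs(child) for child in node["children"].values())
```

For a hypothetical lexicon containing "cat", "cap", and "can" (spelled `k ae t`, `k ae p`, `k ae n`), the tree has five arcs, whereas scoring the three spellings independently would involve nine phoneme positions.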
However, phonetic models are not optimal for all speech recognition systems, particularly smaller vocabulary speech recognition systems. Small vocabulary systems typically use “whole-word” models (i.e., non-phonetically spelled word models). In systems using whole-word models, a separate acoustic model is produced for each word. In the past, for this to work well for each of many different speakers, the training database has had to include recordings of each such word by many different people. Obtaining sufficient training data for a new vocabulary word generally requires a new, expensive, and time-consuming data collection. This is because even a large general-purpose speech database may not have any, or enough, samples of certain desired individual words. This is particularly true if the desired words are uncommon or made-up words or, in the case of a discrete utterance recognizer, correspond to a run-together sequence of words, as is common in many discrete utterance small-vocabulary command-and-control speech recognition applications.
One prior art approach, which can work well in some circumstances, is to use, as a whole-word (or non-phonetic) acoustic model of a desired word, the sequence of phonetic acoustic node models, derived from a large-vocabulary database, that corresponds to the phonetic spelling of the desired word.
One of the issues with using such a method to generate whole-word models is that it requires that the channel-normalization procedure used in speech recognition with such whole-word models be the same as that used in the training of the phonetic models. This is because differences in channel normalization can have such a profound effect on acoustic samples that word models trained upon acoustic data that has been normalized in one way will often be close to useless in the recognition of words in acoustic data that has been normalized a different way.
Most phonetic models are trained using stationary or quite slowly adapting channel normalization. Such relatively stable channel normalization is common in phonetically based systems. It tends to perform better on acoustic data recorded in an environment relatively free of changing background noise or changing recording characteristics. This is because slower channel normalization provides more time to accurately model background noise and the acoustic properties of a given recording set-up. It is also because slower channel normalization is less likely to mistake characteristics of the speech signal itself for channel characteristics that are to be normalized out.
But many applications, including many small vocabulary command-and-control applications, cannot use a slowly-varying channel-normalization scheme. This is either because the channel itself is likely to be changing rapidly, for example because of rapidly changing background noise or rapid changes in the user's position relative to the microphone, or because a typical interaction with the application is likely to be too short for slow channel normalization to form a good estimate of the channel. Also, rapid channel normalization can be useful in situations in which it is desirable, either for purposes of speed of response or computational efficiency, to be able to start recognizing the initial portions of an utterance before later parts of the utterance have been received.
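The difference between slow and rapid channel normalization can be illustrated with a running-mean subtraction whose adaptation rate is set by a single forgetting factor. This is a sketch under stated assumptions: the text above does not specify a normalization algorithm, so the exponentially weighted mean, the scalar frames (real systems normalize feature vectors), and the names `normalize_channel` and `alpha` are all illustrative.

```python
from typing import Iterable, List


def normalize_channel(
    frames: Iterable[float],
    alpha: float,
    initial_mean: float = 0.0,
) -> List[float]:
    """Subtract a running estimate of the channel offset from each frame.

    `alpha` close to 1.0 gives slow, stable adaptation; a smaller `alpha`
    tracks the channel quickly but is more easily swayed by the speech itself.
    Only past and current frames are used, so output is available immediately.
    """
    mean = initial_mean
    normalized: List[float] = []
    for x in frames:
        # Exponentially weighted running mean (an assumed channel estimator).
        mean = alpha * mean + (1.0 - alpha) * x
        normalized.append(x - mean)
    return normalized
```

With a constant offset of 10.0 over only 50 frames, a fast setting such as `alpha=0.5` has removed essentially all of the offset by the last frame, while a slow setting such as `alpha=0.99` has removed well under half of it. This mirrors the observation that a short interaction may not give slow normalization enough time to estimate the channel.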
Another problem with the prior art method of generating whole-word acoustic word models by concatenating acoustic models associated with phonetic spellings is that it limits the resulting whole-word acoustic models to ones having the same model structure as the phonetic models from which they are derived. For example, if the phonetic models are very high quality models using a large number of basis functions, such as Gaussians, to represent the probability distribution associated with each parameter value of a node, the resulting word model created by this method will have the same number of basis functions associated with each such parameter. This can be a disadvantage when it is desirable to have word models that require less computation to use or less memory space to store.