Automatic speech recognition (ASR) is usually accomplished by determining the words that were most likely spoken, given a speech signal. This is done by comparing a set of parameters describing the speech signal with a set of trained acoustic model parameters. Accurate speech recognition requires that the trained acoustic models be able to distinguish the spoken words successfully. Hence, much effort is expended to produce acoustic models that provide the level of performance desired. The units of trained acoustic models may correspond to words, monophones, biphones or triphones. For large vocabulary speech recognition applications, triphone acoustic modeling, which comprehends the prior and subsequent phone context of a given phone, outperforms monophone modeling, and so triphones are the acoustic models of choice in such applications.
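To make the notion of phone context concrete, the following minimal sketch expands a monophone sequence into triphone labels. The `left-center+right` label notation, the `sil` boundary symbol, and the function name `to_triphones` are assumptions made for illustration only; they are not prescribed by the techniques described here.

```python
def to_triphones(phones, boundary="sil"):
    """Expand a monophone sequence into context-dependent triphone labels."""
    labels = []
    for i, center in enumerate(phones):
        # Use the boundary symbol where no neighboring phone exists.
        left = phones[i - 1] if i > 0 else boundary
        right = phones[i + 1] if i + 1 < len(phones) else boundary
        labels.append(f"{left}-{center}+{right}")
    return labels


print(to_triphones(["k", "ae", "t"]))
# prints ['sil-k+ae', 'k-ae+t', 'ae-t+sil']
```

Each monophone yields a different label for every distinct pair of neighbors, which is why the triphone inventory grows so much faster than the monophone inventory.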
While triphones provide better large vocabulary recognition, the number of triphones is often larger than the number of monophones by two orders of magnitude. For example, if a language requires 50 monophones for its representation, there will likely be in the range of 5000 triphones in the language. Training thousands of triphones in any language is complex and time-consuming. Some steps are machine intensive, while others require a great deal of human intervention, which is error-prone. These factors increase the cost and time to market associated with training acoustic triphone models for any new language.
Current acoustic training techniques have been known and published for some time. See, for example, S. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev and P. Woodland, The HTK Book (Version 3.0), Cambridge, England, July 2000. Monophone seeding constitutes the foundation of any training operation. Ideally, monophone seeding provides the subsequent steps in the training algorithm with good monophone models in the language under consideration. Such monophone models can easily be estimated if one possesses a database that has been labeled and time marked all the way to the monophone level. This labeling and time marking requires extensive human intervention to ensure correct labeling of the monophones within an utterance and correct location of the acoustic signal corresponding to each monophone. Because of the need for human intervention and the need for large databases for triphone training, such labeling and time marking is costly and therefore rarely performed.
If such hand labeling is not available, seed monophones can be obtained through bootstrapping, which estimates the monophones from other, already trained acoustic models on the basis of their acoustic similarities. While this technique is useful if the monophone similarities can be clearly estimated, it often requires a great deal of human interaction, both to analyze which monophones are acoustically similar and to adapt the topology of the reference model to match that of the target model.
Other current methods adapt the acoustic information of an existing set of monophone models in a reference language using a small database in the target language. However, the time and cost advantage of the adaptation technique is usually obtained at the expense of reduced recognition performance in the target language, since the monophone models are not optimal for the new language.
If no other method is available, monophone seeding may use a simple “flat start” method, whereby one initial model is constructed based on global statistics of the entire target training database. This model is duplicated to form the model for all monophones. This technique is rarely used for high-end speech recognition systems because it significantly impacts recognition performance.
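The flat start procedure can be sketched as follows. This is a simplified illustration rather than a definitive implementation: it assumes each model is reduced to a single diagonal Gaussian (one mean and one variance per feature dimension), whereas practical systems typically use multi-state models. The function name `flat_start` and the dictionary layout are hypothetical.

```python
from statistics import fmean, pvariance


def flat_start(frames, monophones):
    """Give every monophone the same model, built from global statistics.

    frames: list of feature vectors pooled over the whole training database.
    monophones: names of the monophones to initialize.
    """
    dims = list(zip(*frames))  # transpose: one tuple per feature dimension
    mean = [fmean(d) for d in dims]
    var = [pvariance(d) for d in dims]
    # Duplicate the single global model for each monophone (separate copies).
    return {p: {"mean": list(mean), "var": list(var)} for p in monophones}


frames = [[1.0, 2.0], [3.0, 6.0]]
models = flat_start(frames, ["a", "b"])
print(models["a"])
# prints {'mean': [2.0, 4.0], 'var': [1.0, 4.0]}
```

Because every monophone starts from identical parameters, nothing in the seed distinguishes one phone from another; all discrimination must be learned in later training passes.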
Existing triphone training techniques require several steps. The first step is often to duplicate a set of trained monophone acoustic models for each triphone context, thus producing the initial triphone models. The triphone models can then be trained. However, the initial triphone models carry a significant amount of monophone acoustic context, which can result in suboptimally trained triphone models.
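The duplication step can be illustrated with a short sketch. It assumes triphone labels written as `left-center+right` and phone names that contain neither `-` nor `+`; the function name `clone_triphones` and the model layout are hypothetical.

```python
import copy


def clone_triphones(monophone_models, triphone_labels):
    """Initialize each triphone with a copy of its center monophone model."""
    initial = {}
    for label in triphone_labels:
        # "sil-k+ae" -> center phone "k"
        center = label.split("-")[1].split("+")[0]
        initial[label] = copy.deepcopy(monophone_models[center])
    return initial


monophones = {"k": {"mean": [0.5]}, "ae": {"mean": [1.5]}}
triphones = clone_triphones(monophones, ["sil-k+ae", "k-ae+t", "t-ae+k"])
print(triphones["k-ae+t"] == triphones["t-ae+k"])
# prints True
```

As the usage shows, two triphones with the same center phone start out identical regardless of their contexts, which is the monophone bias noted above.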
The large number of triphones results in an excessive number of model parameters that must be trained, which requires extremely large training databases in order to successfully estimate the parameters. In order to reduce the number of parameters needed to represent the triphone models, after preliminary training of the triphone models, another procedure clusters the parameters. During clustering, parameters of similar triphones are linked together to obtain a joint and therefore more robust estimate of the clustered triphone parameters. The success of clustering is based on correctly identifying the parameters that are correlated with each other and should be grouped.
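The benefit of linking parameters can be sketched with a joint estimate of a shared mean vector. This example assumes that each triphone in a cluster contributes a frame count and a mean vector, and that the shared value is the count-weighted average; the function name `tie_cluster` is hypothetical, and real systems also tie variances and other parameters.

```python
def tie_cluster(stats):
    """Pool per-triphone statistics into one shared mean vector.

    stats: list of (frame_count, mean_vector) pairs, one per triphone.
    """
    total = sum(count for count, _ in stats)
    dims = len(stats[0][1])
    return [
        sum(count * mean[d] for count, mean in stats) / total
        for d in range(dims)
    ]


cluster = [(30, [1.0, 2.0]), (10, [5.0, 6.0])]
print(tie_cluster(cluster))
# prints [2.0, 3.0]
```

The shared estimate rests on all 40 frames rather than on 30 or 10, which is the sense in which the joint estimate is more robust.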
Existing methods of clustering triphone model parameters require significant human involvement. Such techniques can be either data driven or tree based. In the first case, triphones that tend to produce similar speech features are clustered. One limitation of data driven clustering is that it does not deal with triphones for which there are no examples in the training data. In the second case, a phonetic binary decision tree is built, with a yes/no question attached at each node. All triphones in the same leaf node are then clustered. With such a tree, a model for any triphone in the language can be constructed, provided the tree questions are based on articulatory features of phones. Before any tree building can take place, all of the possible phonetic questions must be manually determined, depending on the specific set of phonemes characterizing the target language and their articulatory phonetic characteristics (e.g. voiced/unvoiced, place and manner of articulation, position of the tongue and jaw, strident, open jaw, round lips, long . . . ).
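The way a phonetic decision tree assigns a cluster to any triphone, including one absent from the training data, can be sketched as follows. The phone classes, the questions, the cluster names, and the tree shape are all hypothetical and hand-built for illustration; in practice the tree is grown from data using the manually prepared question set, typically with a separate tree per center phone.

```python
# Hypothetical articulatory classes for a toy phone set.
VOWELS = {"aa", "ae", "iy"}
NASALS = {"m", "n"}

# Internal node: (question, yes_subtree, no_subtree); a leaf is a cluster name.
TREE = (
    lambda left, right: left in NASALS,        # "Is the left context a nasal?"
    "cluster_nasal_left",
    (
        lambda left, right: right in VOWELS,   # "Is the right context a vowel?"
        "cluster_vowel_right",
        "cluster_other",
    ),
)


def find_leaf(tree, left, right):
    """Answer yes/no questions from the root until a leaf cluster is reached."""
    while not isinstance(tree, str):
        question, yes, no = tree
        tree = yes if question(left, right) else no
    return tree


print(find_leaf(TREE, "n", "t"))   # prints cluster_nasal_left
print(find_leaf(TREE, "k", "ae"))  # prints cluster_vowel_right
```

Because the questions refer to articulatory classes rather than to specific triphones, any left and right context can be routed to a leaf, even if that triphone never occurred in training.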
The disadvantage of applying these existing training techniques directly is the time and cost associated with human intervention, which must be repeated for each additional language. In addition, the resulting acoustic model sets are not optimized by selecting the best candidate from the large multitude of possible clustering candidates, resulting in degraded speech recognition performance and/or excessive model size.