Although a language contains very few phones, modeling only those few phones is not sufficient for speech recognition. Because of coarticulation, the acoustic realization of the same phone can differ greatly from one context to another. For example, English has about 40 to 50 phones and Spanish has a little more than 20. Training only 50 phonetic models for English is not sufficient to cover all the coarticulation effects. For this reason, context-dependent models are used for speech recognition, and context-dependent phonetic modeling has become standard practice for modeling the variations in the acoustics of a phone caused by phonetic context. However, even if only the immediate left and right contexts are considered, there are 50^3 = 125,000 models to be trained. Such a large number of models defeats the motivation for using phonetic models in the first place. Fortunately, some contexts result in large acoustic differences and some do not. The phonetic models can therefore be clustered, which not only reduces the number of models but also increases training robustness.
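The model count quoted above can be checked with a short calculation. The following is only an illustrative sketch: the function name and the context_width parameter are hypothetical, and it assumes every phone may occur with every left and right neighbor.

```python
def context_dependent_model_count(num_phones: int, context_width: int = 1) -> int:
    """Count phone-in-context models (hypothetical helper, for illustration).

    Each model is a center phone plus `context_width` neighboring phones on
    each side, and every position is assumed to range over all phones.
    """
    return num_phones ** (2 * context_width + 1)


# Immediate left and right context only, with 50 English phones.
print(context_dependent_model_count(50))  # 125000
```

With the roughly 20 phones cited for Spanish, the same assumption gives 8,000 models, which shows how quickly the count grows with the size of the phone set.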
Determining how to cluster phonetic models is one of the core research areas in the speech community for large vocabulary speech recognition. The clustering algorithm needs to achieve three goals: 1) maintain high acoustic resolution while achieving the most clustering, 2) ensure that every clustered unit can be trained well with the available speech data, and 3) be able to predict unseen contexts with the clustered models. Decision tree clustering using phonological rules has been shown to achieve these objectives. See, for example, D. B. Paul, "Extensions to Phone-state Decision-tree Clustering: Single Tree and Tagged Clustering," Proc. ICASSP 97, Munich, Germany, April 1997.
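The three goals can be illustrated with a toy sketch of decision tree clustering. It is not the method of the cited reference. It assumes one-dimensional acoustic features, two hypothetical phonological questions, and a simple variance-reduction split criterion standing in for the likelihood-based criteria used in practice. All names and data values are invented for illustration.

```python
from statistics import fmean

# Hypothetical phonological questions: (context side, phone class).
QUESTIONS = {
    "left_is_vowel": ("left", {"aa", "iy", "uw"}),
    "right_is_nasal": ("right", {"m", "n"}),
}


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)


def answer(context, question):
    """Answer a yes/no phonological question about a (left, right) context."""
    side, phone_class = QUESTIONS[question]
    left, right = context
    return (left if side == "left" else right) in phone_class


def build_tree(data, min_count=3, min_gain=1e-6):
    """Greedily split contexts; `data` maps (left, right) -> list of samples."""
    pooled = [v for vals in data.values() for v in vals]
    best = None
    for q in QUESTIONS:
        yes = {c: v for c, v in data.items() if answer(c, q)}
        no = {c: v for c, v in data.items() if not answer(c, q)}
        yes_vals = [v for vals in yes.values() for v in vals]
        no_vals = [v for vals in no.values() for v in vals]
        # Goal 2: each resulting cluster must keep enough training samples.
        if len(yes_vals) < min_count or len(no_vals) < min_count:
            continue
        # Goal 1: prefer the split that preserves the most acoustic detail.
        gain = sse(pooled) - sse(yes_vals) - sse(no_vals)
        if gain > min_gain and (best is None or gain > best[0]):
            best = (gain, q, yes, no)
    if best is None:
        return {"leaf": sorted(data), "mean": fmean(pooled)}
    _, q, yes, no = best
    return {
        "question": q,
        "yes": build_tree(yes, min_count, min_gain),
        "no": build_tree(no, min_count, min_gain),
    }


def lookup(tree, context):
    """Goal 3: any context, seen or unseen, reaches a leaf via the questions."""
    while "leaf" not in tree:
        tree = tree["yes"] if answer(context, tree["question"]) else tree["no"]
    return tree


# Invented samples for one center phone in four (left, right) contexts.
data = {
    ("aa", "t"): [1.0, 1.2],
    ("iy", "n"): [0.9, 1.1],
    ("s", "t"): [3.0, 3.2],
    ("t", "m"): [2.9, 3.1],
}
tree = build_tree(data)
unseen = lookup(tree, ("uw", "t"))  # the context (uw, t) is not in the data
```

In this toy data the left-vowel question separates the samples most cleanly, so it becomes the root. The unseen context (uw, t) is then assigned to the cluster shared by (aa, t) and (iy, n), because the questions refer to phone classes and not to individual phones.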
Previously, applicant reported on FeaturePhones, a phonetic context clustering method that defines context in terms of articulatory features and clusters the contexts at the phone level using decision trees. See Y. H. Kao et al., "Toward Vocabulary Independent Telephone Speech Recognition," ICASSP 1994, Vol. 1, pgs. 117-120, and K. Kondo et al., "Clustered Interphase or Word Context-Dependent Models for Continuously Read Japanese," Journal of Acoustical Society of Japan, Vol. 16, No. 5, pgs. 299-310, 1995. This proved to be an efficient clustering method when training data was scarce, but it was too restrictive to take advantage of significantly more training data.