Applications for very large vocabulary continuous speech recognition systems include multimedia indexing and call center automation. Such systems typically employ a single acoustic model, and a very large speech database is needed to train it. The acoustic model is typically speaker-independent and gender-independent; i.e., it is trained with data from many different speakers, both male and female. A major difficulty in modeling speaker-independent continuous speech is that significant variations in the speech signal are caused by inter-speaker variability, so that the spectral distributions have higher variance than the corresponding speaker-dependent distributions. As a result, the distributions of different speech units overlap, which weakens the discriminative power of the model.
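The variance inflation described above can be made precise with the law of total variance. The following is a sketch only, assuming x denotes a single feature dimension observed for a given speech unit and s indexes the speaker; both symbols are introduced here for illustration and do not appear in the original text.

```latex
\operatorname{Var}(x)
  = \underbrace{\mathbb{E}_{s}\!\left[\operatorname{Var}(x \mid s)\right]}_{\text{average speaker-dependent variance}}
  + \underbrace{\operatorname{Var}_{s}\!\left(\mathbb{E}[x \mid s]\right)}_{\text{spread of the speaker means}}
```

Because the second term is non-negative, the pooled (speaker-independent) variance is never smaller than the average speaker-dependent variance, and it grows as the speaker means spread further apart.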
Speaker adaptive training is a method of estimating the parameters of continuous density hidden Markov models (HMMs) for speaker-independent continuous speech recognition. It aims at reducing inter-speaker variability in order to obtain enhanced speaker-independent models. By reducing the inter-speaker variability, speaker adaptive training finds a speaker-independent acoustic model that can be seen as a compact central point in the database. This model is compact, has reduced variance, and is well suited for adaptation. However, although this method of constructing an acoustic model is a powerful one, the performance of speaker adaptive training soon reaches a limit on extremely large databases. Intuitively, it is impossible to find a single compact acoustic model that models the entire database accurately.
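The variance-reduction idea behind speaker adaptive training can be illustrated with a toy example. This is a minimal sketch under strong simplifying assumptions, not the training procedure itself: one feature dimension, one speech unit modeled by a single Gaussian, and a per-speaker additive bias standing in for the speaker transform. The function names, the synthetic data, and the bias-only transform are all hypothetical choices made for illustration. Actual speaker adaptive training jointly and iteratively estimates HMM parameters and speaker transforms; the bias-only single-Gaussian case used here happens to have a one-pass solution.

```python
import random
import statistics


def simulate(num_speakers=20, frames_per_speaker=200, seed=0):
    """Toy data: one feature, one speech unit, each speaker shifted by an offset."""
    rng = random.Random(seed)
    data = {}
    for spk in range(num_speakers):
        offset = rng.gauss(0.0, 2.0)  # hypothetical inter-speaker shift
        data[spk] = [rng.gauss(offset, 1.0) for _ in range(frames_per_speaker)]
    return data


def pooled_variance(data):
    """Variance of one Gaussian fit to all speakers' frames at once."""
    all_frames = [x for frames in data.values() for x in frames]
    return statistics.pvariance(all_frames)


def bias_normalized_variance(data):
    """Variance of the canonical model after removing per-speaker biases.

    Assumed model: x ~ N(mu + bias[speaker], var). The canonical mean mu is
    taken as the global mean, and each bias is that speaker's mean minus mu.
    """
    all_frames = [x for frames in data.values() for x in frames]
    mu = statistics.fmean(all_frames)
    residuals = []
    for frames in data.values():
        bias = statistics.fmean(frames) - mu
        residuals.extend(x - bias - mu for x in frames)
    return sum(r * r for r in residuals) / len(residuals)


if __name__ == "__main__":
    data = simulate()
    print("pooled variance:         ", pooled_variance(data))
    print("bias-normalized variance:", bias_normalized_variance(data))
```

In this toy setting the bias-normalized variance stays close to the within-speaker variance used to generate the data, whereas the pooled variance also absorbs the spread of the speaker offsets.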
Therefore, it is desirable to provide an improved technique for constructing compact acoustic models for use in a very large vocabulary continuous speech recognition system.