1. Field of the Invention
The present invention is directed generally to speech recognition systems and, more particularly, to speech recognition systems utilizing hierarchical connectionist acoustic models.
2. Description of the Background
Statistical speech recognition based on hidden Markov models (HMMs) is currently the dominant paradigm in the research community, even though several limitations of that technique have been repeatedly discussed. Connectionist acoustic models have proven able to overcome some of the drawbacks of HMMs. H. Bourlard et al., "Connectionist Speech Recognition--A Hybrid Approach", Kluwer Academic Press, 1994. In particular, connectionist acoustic models were shown to outperform traditional mixture-of-Gaussians based acoustic models on small, controlled tasks using context-independent HMMs.
However, widespread use of connectionist acoustic models is hindered by at least two issues: (1) training of connectionist acoustic models is much slower, leading to training times of several days, if not weeks, and (2) poor scalability of connectionist acoustic models to larger systems. Refinement of traditional mixture-of-Gaussians based acoustic modeling using phonetic decision trees for polyphonic context modeling has led to systems consisting of thousands of HMM states. Significant gains in recognition accuracy have been observed in such systems. Nevertheless, research in context-dependent connectionist acoustic models has long concentrated on comparably small systems because it was not clear how to reliably estimate posterior probabilities for thousands of states. Applying a single artificial neural network, as in context-independent modeling, leads to an unfeasibly large number of output nodes. Factoring posteriors based on context, monophone, or HMM state identity was shown to be capable of breaking down the global estimation problem into subproblems small enough to allow the application of multiple artificial neural networks. H. Franco, "Context-dependent connectionist probability estimation in a hybrid Hidden Markov Model--Neural Net speech recognition system", Computer Speech and Language, Vol. 8, No. 3, 1994; J. Fritsch, et al., "Context-Dependent Hybrid HME/HMM Speech Recognition using Polyphone Clustering Decision Trees", Proc. of ICASSP '97, Munich, 1997; D. J. Kershaw, et al., "Context-Dependent Classes in a Hybrid Recurrent Network HMM Speech Recognition System", Tech. Rep. CUED/F-INFENG/TR217, CUED, Cambridge, England, 1995.
Comparable gains in performance were achieved with context-dependent connectionist acoustic models based on that technique. However, factoring posteriors in terms of monophone and context identity appears to be limited to medium-size systems. In large systems, the non-uniform distribution of the number of context classes again leads to unfeasibly large numbers of output nodes for some of the context networks.
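The posterior factoring described above can be illustrated with a minimal numerical sketch. The sizes (two monophone classes, three context-dependent states each) and the network outputs are hypothetical, chosen only to show how the product of a monophone posterior P(m|x) and a per-monophone context posterior P(s|m,x) yields a proper distribution P(s|x) over all HMM states; the sketch is not taken from any of the cited systems.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical example: 2 monophone classes, each with 3 context-dependent
# states.  A "monophone network" estimates P(m | x); one small "context
# network" per monophone estimates P(s | m, x).  The factored state
# posterior is their product:
#     P(s | x) = P(m(s) | x) * P(s | m(s), x)
mono_post = softmax([1.0, 0.2])            # P(m | x) over 2 monophones
ctx_post = [softmax([0.5, 0.1, -0.3]),     # P(s | m=0, x) over 3 states
            softmax([0.0, 0.9, 0.4])]      # P(s | m=1, x)

state_post = [mono_post[m] * p
              for m in range(2)
              for p in ctx_post[m]]

# The factored posteriors still form a proper distribution over all 6 states,
# yet no single network needed 6 output nodes.
assert abs(sum(state_post) - 1.0) < 1e-9
```

Because each context network only ranges over the states of one monophone, the output layer of every individual network stays small even as the total number of HMM states grows, which is the scalability benefit the factoring approach provides.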
Another problem with current HMM-based speech recognition technology is that it suffers from domain dependence. Over the years, the community has validated and commercialized the technology based on standardized training and test sets in restricted domains, such as the Wall Street Journal (WSJ) (business newspaper texts), Switchboard (SWB) (spontaneous telephone conversations) and Broadcast News (BN) (radio/TV news shows). Performance of systems trained on such domains typically drops significantly when they are applied to a different domain, especially with a changing speaking style, e.g., when moving from read speech to spontaneous speech. D. L. Thomson, "Ten Case Studies of the Effect of Field Conditions on Speech Recognition Errors", Proceedings of the IEEE ASRU Workshop, Santa Barbara, 1997. For instance, performance of a recognizer trained on WSJ typically degrades severely when decoding SWB data. Several factors can be held responsible for the strong domain dependence of current statistical speech recognition systems. One is the constrained quality, type, or recording conditions of domain-specific speech data (read, conversational, or spontaneous speech; noisy or clean recordings; presence of acoustic background sources; etc.). Another is the vocabulary and language model dependence of phonetic context modeling based on phonetic decision trees, which implies a strong dependence of allophonic models on the specific domain. A third factor is domain-dependent optimization of the size of the acoustic model based on the amount of available training data and/or the size of the vocabulary. While the first of the above-mentioned factors is typically addressed by some sort of speaker and/or environment adaptation technique, the latter two factors are usually not adequately addressed in cross-domain applications.
Consider the scenario of porting a trained recognizer to a different domain within the same language. Usually, a phonetic dictionary for the new domain based on the set of phones modeled by the recognizer can be constructed relatively easily using a large background dictionary and, if necessary, applying a set of phone mapping rules. Also, we consider it justifiable to assume that enough text data is available, such that we can train a statistical language model for the new domain. What typically makes porting efforts expensive and time consuming is the adaptation of the acoustic model. The most common approach of applying supervised acoustic adaptation techniques requires large amounts of transcribed speech data from the new domain to capture the differing statistics reasonably well.
Thus, the need exists for an acoustic model which exhibits full scalability, avoids stability problems due to non-uniform prior distributions and is easily integrated into existing large vocabulary conversational speech recognition (LVCSR) systems. The need also exists for a trained acoustic model to be easily adapted in structure and size to unseen domains using only small amounts of adaptation data.