Aspects of the present invention relate to automated speech processing. Other aspects of the present invention relate to adaptive automatic speech recognition.
In a society that increasingly expects information to be available anywhere and at any time, voice-enabled solutions are often deployed to provide voice information services. For example, a telephone service company may offer a voice-enabled call center so that inquiries from customers may be automatically directed to appropriate agents. In addition, voice information services may be necessary for users who communicate using devices that lack a platform on which information can be exchanged in conventional textual form. In these applications, an automatic speech recognition system may be deployed to enable voice-based communications between a user and a service provider.
Automatic speech recognition systems usually rely on a plurality of automatic speech recognition models trained on a given corpus, which consists of a collection of speech from diverse speakers recorded in one or more acoustic environments. The speech models built from such a corpus capture both the characteristics of the spoken words and those of the acoustic environment in which the words are uttered. The accuracy of an automatic speech recognition system depends on the appropriateness of the speech models on which it relies. That is, if an automatic speech recognition system is deployed in an acoustic environment similar to the one in which the training corpus was collected, the recognition accuracy tends to be higher than when the system is deployed in a different acoustic environment. For example, if speech models are built from a training corpus collected in a studio environment, using these speech models to perform speech recognition in an outdoor environment may result in very poor accuracy.
An important issue in developing an automatic speech recognition system that may be deployed in an adverse acoustic environment is how to adapt the underlying speech models to that environment. Existing approaches to adapting an automatic speech recognition system fall into two main categories. One is to re-train the underlying speech models using new training data collected at the deployment site (that is, in the adverse acoustic environment). With this approach, both the original training corpus and the speech models established therefrom are completely abandoned. In addition, to ensure reasonable performance, re-training usually requires a new corpus of a size comparable to the original. This often means that a large amount of new training data needs to be collected in the adverse acoustic environment at the deployment site.
A different approach is to adapt, instead of re-train, the speech models established from an original corpus. To do so, a relatively small new corpus is collected in the adverse acoustic environment. The new corpus is then used to determine how to adapt the existing speech models (for example, by changing the parameters of the existing models). Although less effort may be required to collect new training data, the original corpus is again left unused.
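By way of illustration only, changing the parameters of an existing model using a small new corpus may be sketched as follows. The sketch assumes a model whose parameter of interest is a Gaussian mean vector and uses maximum a posteriori (MAP) mean adaptation, a commonly known technique chosen here as an assumption; it is not asserted to be the adaptation method of any particular existing system. The function name `map_adapt_mean` and the prior weight `tau` are hypothetical.

```python
from typing import List, Sequence


def map_adapt_mean(prior_mean: Sequence[float],
                   adaptation_frames: Sequence[Sequence[float]],
                   tau: float = 10.0) -> List[float]:
    """Shift an existing Gaussian mean toward newly collected feature frames.

    prior_mean        -- mean vector of the existing (original) model
    adaptation_frames -- feature vectors from the small new corpus
    tau               -- hypothetical prior weight; a larger value keeps the
                         adapted mean closer to the original model
    """
    dim = len(prior_mean)
    n = len(adaptation_frames)

    # Accumulate the per-dimension sum of the adaptation frames.
    sums = [0.0] * dim
    for frame in adaptation_frames:
        if len(frame) != dim:
            raise ValueError("frame dimension does not match the model")
        for d in range(dim):
            sums[d] += frame[d]

    # MAP estimate: weighted combination of the prior mean and the new data.
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dim)]


if __name__ == "__main__":
    original_mean = [0.0, 0.0]
    new_frames = [[2.0, 4.0]] * 10  # ten frames from the new environment
    print(map_adapt_mean(original_mean, new_frames, tau=10.0))
```

With few adaptation frames the adapted mean stays close to the original model, and with many frames it approaches the mean of the new data, which is why a smaller corpus can suffice for adaptation than for re-training.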
Collecting training data is known to be an expensive operation. The need to acquire new training data at every new deployment site not only increases cost but also often frustrates users. In some situations, collecting new data may even be impossible. For example, if a speech recognition system is installed on a handheld device used by military personnel in battlefield scenarios, it may simply not be possible to re-collect training data at every locale. Furthermore, abandoning an original corpus, which was collected at high cost and with considerable effort, wastes resources.