1. Field of the Invention
The present invention relates to the field of speech processing, and, more particularly, to resource-conservative, transformation-based unsupervised speaker adaptation.
2. Description of the Related Art
A central concern of many modern speech recognition systems is improving system accuracy. One accuracy-improving technique is to dynamically adapt a speech recognition system to a speaker at runtime, which is referred to as unsupervised speaker adaptation. Unlike historical techniques for learning speaker characteristics, which often required extensive training interactions, unsupervised speaker adaptation occurs transparently as a background process during speech-interactive sessions. Unsupervised speaker adaptation takes advantage of data available in an audio stream and of the likelihood that a user of the system is providing input within a domain of the system. It can result in significant accuracy gains. Unsupervised speaker adaptation is one specific type of adaptive acoustic modeling.
FIG. 1 (prior art) provides an overview of an adaptation/normalization scheme 100. In the scheme, speech recognition can be viewed as a combination of feature vectors of a feature space 110 and acoustic models in a model space 130. A mismatch exists when the two spaces 110, 130 do not belong to the same level 140-144. For instance, in the case of non-adaptive acoustic modeling, a strong mismatch can exist between test data XTest 132 and ΘTrain 134. This mismatch results in part from a requirement that a speaker-independent automatic speech recognition (SI-ASR) system cope with a significant amount of variability in an acoustic signal. Variability results from different transmission channels, different ambient noise environments, different vocal characteristics among different speakers, and the like.
Scheme 100 shows three abstract data levels 140, 142, and 144. The goal of adaptation scheme 100 is to overcome the mismatch that arises when feature vectors X and acoustic models Θ from different levels are combined. The mismatch can be reduced in the feature space (normalization, illustrated by the left side of scheme 100) or in the model space (adaptation, illustrated by the right side of scheme 100). In normalization, the chosen approach has to be applied to both the training data (XTrain) and the test data (XTest 132) to gain maximum performance. Adaptation schemes modify the parameters of the acoustic model directly in order to reduce a mismatch. Adaptation schemes can be capable of reducing the mismatch between XTest 132 and ΘTrain 134 by (ideally) transforming ΘTrain 134 into ΘTest 136.
Current adaptation and normalization approaches can be categorized into two classes: the maximum a posteriori (MAP) family and the transformation family. MAP follows the principle of Bayesian parameter estimation, where parameters of the acoustic model itself are modified. A MAP approach can involve a relatively large number of parameters and can require a relatively large amount of adaptation data to function. In contrast, a transformation approach transforms the feature vectors without affecting parameters of underlying acoustic or visual models (i.e., does not change Hidden Markov Model parameters).
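The distinction drawn above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the transformation is an affine map x' = Ax + b applied to each feature vector; the text does not specify the form of the transformation, and the function name, the matrix A, the offset b, and the sample values are all hypothetical. The point shown is that only the feature vectors change while the model parameters are never touched.

```python
from typing import List, Sequence

Vector = List[float]


def apply_affine_transform(features: Sequence[Sequence[float]],
                           A: Sequence[Sequence[float]],
                           b: Sequence[float]) -> List[Vector]:
    """Return A*x + b for every feature vector x.

    Assumption: an affine transform stands in for the (unspecified)
    transformation. No acoustic model parameters are read or modified.
    """
    transformed = []
    for x in features:
        # One output component per row of A, plus the matching offset.
        transformed.append([
            sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(A, b)
        ])
    return transformed


# Hypothetical two-dimensional feature vectors and transform values.
frames = [[1.0, 2.0], [0.5, -1.0]]
A = [[2.0, 0.0], [0.0, 0.5]]
b = [0.1, -0.2]

transformed = apply_affine_transform(frames, A, b)
```

Applying such a transform costs one small matrix-vector product per frame, which is consistent with the observation that applying a transform is inexpensive compared with computing it.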
The present invention is concerned with adaptation (from ΘTrain 134 to ΘTest 136) using a transformation approach. In a transformation approach, computing the transformation is a relatively resource-intensive operation. One reason for this cost is that conventional transformation techniques require that feature vector data representing an entire speech utterance be cached in memory. In an embedded system, the transformation computation can take as long as twenty-five percent of the utterance length (e.g., a four second utterance can have an associated transformation computation time of approximately one second). Additionally, the time that conventional approaches need to generate a transformation is a percentage of the utterance length, which makes the resource cost of creating the transformation difficult to predict. In comparison to the cost of creating a transform used during unsupervised speaker adaptation, applying the transform is a relatively inexpensive process.
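The arithmetic in the example above can be sketched as follows. The twenty-five percent figure is the upper bound stated in the text for an embedded system; the function name and the sample utterance lengths are hypothetical and serve only to show how the computation time varies with input length.

```python
def estimated_transform_time(utterance_seconds: float,
                             cost_fraction: float = 0.25) -> float:
    """Estimate transform computation time as a fraction of utterance length.

    cost_fraction defaults to 0.25, the upper bound quoted in the text;
    actual values depend on the device and are an assumption here.
    """
    if utterance_seconds < 0:
        raise ValueError("utterance length must be non-negative")
    return utterance_seconds * cost_fraction


# Hypothetical utterance lengths in seconds. The cost grows with the
# length of whatever the speaker happens to say.
for length in (1.0, 4.0, 12.0):
    cost = estimated_transform_time(length)
    print(f"{length:.1f} s utterance: about {cost:.2f} s to compute the transform")
```

Because the utterance length is not known until the speaker finishes, a cost that is proportional to it cannot be budgeted in advance, which is the unpredictability described above.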
The high resource cost of conventional transformation-based speaker adaptation and the relative unpredictability of its resource consumption have prevented unsupervised speaker adaptation from being implemented on resource-constrained devices, such as mobile phones, media playing devices, navigation systems, and the like. Additionally, unsupervised speaker adaptation is often not implemented on more robust devices (e.g., desktops and notebooks) that have adequate processing resources available, since its resource consumption lowers device performance, making even robust computing devices appear sluggish or non-responsive. What is needed is a new, resource-conservative technique for implementing unsupervised speaker adaptation principles, which will provide accuracy improvements without the hefty and unpredictable performance/resource costs.