Speech recognition systems commonly translate spoken words into text. Such systems are typically statistical pattern classifiers whose parameters are estimated from training data. However, acoustic mismatch between the speech data seen in deployment and the speech data used to train the speech recognizer can degrade performance. Such acoustic mismatch can arise from various sources of variability, including the environment and the speaker. Conventional approaches for reducing the acoustic mismatch and enhancing performance commonly involve acoustic model adaptation. However, techniques for environmental adaptation and for speaker adaptation have typically been developed independently.
Environmental adaptation, for instance, is commonly performed using techniques that rely on a parametric model of how clean speech is corrupted by additive and convolutional noise. Examples of such techniques include parallel model combination (PMC) and vector Taylor series (VTS) adaptation. These techniques can adapt parameters of the speech recognizer based on a small sample of the noise. However, for such approaches to operate, the acoustic model is commonly trained from clean speech or with specialized noise-adaptive training techniques, which can be impractical or costly.
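To make the parametric model concrete, the following is a minimal sketch of the mismatch function commonly associated with VTS-style adaptation, written in the log-spectral domain for simplicity. It is an illustration under stated assumptions rather than the method of any particular recognizer: the function name `corrupt_log_spectrum` and the variables `x` (clean speech), `h` (convolutional noise, i.e. the channel), and `n` (additive noise) are hypothetical, and the sketch assumes that speech and noise powers add in the linear domain. Practical systems typically operate on cepstral features, which introduces a DCT matrix that is omitted here.

```python
import math

def corrupt_log_spectrum(x, h, n):
    """Combine clean speech, channel, and additive noise in the log-spectral domain.

    x, h, n are equal-length lists of per-channel log-power values (hypothetical names).
    Assumes powers add in the linear domain: exp(y) = exp(x + h) + exp(n).
    """
    y = []
    for xi, hi, ni in zip(x, h, n):
        s = xi + hi        # channel-filtered clean speech
        m = max(s, ni)     # factor out the larger term for numerical stability
        y.append(m + math.log(math.exp(s - m) + math.exp(ni - m)))
    return y
```

Under this assumed model, the noisy value approaches `x + h` when the additive noise is far below the speech level and approaches `n` when the noise dominates. The nonlinearity between these two extremes is what VTS-style methods linearize around an expansion point.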
In contrast, common techniques for speaker adaptation are data-driven approaches in which model parameters are transformed in a manner that maximizes the likelihood of adaptation data. For example, various versions of maximum likelihood linear regression (MLLR) or constrained maximum likelihood linear regression (CMLLR) can use one or more affine transforms to modify Gaussian parameters of the speech recognizer. These techniques need not be specific to speaker adaptation and can be used to compensate for other types of acoustic mismatch, including environmental noise. Yet, because these conventional adaptation approaches are based on maximizing the likelihood of adaptation data, the sources of the acoustic mismatch for which the estimated transforms compensate are commonly unknown. This can prevent these transforms (and the adaptation data in general) from being reused for the same speaker in different acoustic environments, for instance.
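As an illustration of how an affine transform can modify Gaussian parameters, the sketch below applies a single shared transform to a set of Gaussian mean vectors, in the spirit of MLLR mean adaptation. The names `apply_affine_transform`, `adapt_means`, `A`, and `b` are hypothetical, and the transform values are made up for the example. Estimating `A` and `b` by maximizing the likelihood of adaptation data, which is the core of such techniques, is not shown.

```python
def apply_affine_transform(mean, A, b):
    """Return A @ mean + b for one Gaussian mean vector (plain lists, no NumPy)."""
    return [sum(a * m for a, m in zip(row, mean)) + b_i for row, b_i in zip(A, b)]

def adapt_means(means, A, b):
    """Apply one shared transform to every Gaussian mean in a group."""
    return [apply_affine_transform(mean, A, b) for mean in means]

# Illustrative values only: two 2-dimensional Gaussian means and one transform.
means = [[1.0, 3.0], [0.0, -2.0]]
A = [[2.0, 0.0], [0.0, 1.0]]   # linear part of the affine transform
b = [0.5, -0.5]                # bias part of the affine transform
adapted = adapt_means(means, A, b)
print(adapted)  # [[2.5, 2.5], [0.5, -2.5]]
```

Because the transform is estimated only to raise the likelihood of the adaptation data, nothing in `A` or `b` indicates whether it is compensating for the speaker, the environment, or both, which is the limitation described above.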
Another conventional technique provides for joint environment and speaker adaptation by combining Jacobian adaptation for noise compensation with MLLR for speaker adaptation. More recently, VTS adaptation was used to update both the means and variances of MLLR-compensated acoustic models, with the VTS noise parameters and the MLLR transforms jointly estimated using an iterative approach. However, these approaches combine different adaptation strategies for different sources of variability. Moreover, VTS assumes a clean acoustic model, which can make it very difficult and computationally expensive to use data collected from actual deployed applications. Further, some acoustic features may be incompatible with such techniques.