Robustness to speaker and environment variability is a crucial issue in speech recognition, especially where performance in real-world environments is concerned. Ideally, a speech recognition system should perform equally well for all speakers and in all acoustic environments. For this purpose, the acoustic model of a speech recognition system is usually trained on data from a very large collection of speakers, collected in various environments. However, speaker-independent systems, for example, do not tend to perform as well as speaker-adapted systems. Adaptation usually involves reducing the mismatch between the characteristics of speech features that are specific to a speaker and/or an acoustic environment and the characteristics of an acoustic model trained on general data. It can be done off-line if enrollment data for a specific speaker or environment is available.
Speech features are multi-dimensional vectors. An acoustic model usually includes a set of context-dependent subphone units, each of which is associated with a probability density function (pdf), usually a multi-dimensional Gaussian mixture with diagonal covariances. At run time, i.e., at recognition time, the probability of each input speech feature vector is evaluated with the pdf of each subphone in the model. This operation will be referred to as acoustic scoring. There are essentially two approaches to adaptation: (i) feature space adaptation, where the speech feature vectors are transformed to better match the parameters of the model's pdfs, and (ii) model adaptation, where the parameters of the model's pdfs are transformed to better characterize the input speech feature vectors.
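By way of illustration only, acoustic scoring of one feature vector against one diagonal-covariance Gaussian mixture can be sketched as follows. The function names and the toy parameter values are hypothetical and are not taken from any particular recognizer.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_gmm_score(x, weights, means, variances):
    """Log-likelihood of x under a Gaussian mixture, using log-sum-exp."""
    logs = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(logs)  # subtract the peak for numerical stability
    return peak + math.log(sum(math.exp(l - peak) for l in logs))

# Toy 2-dimensional mixture with two components (illustrative values only).
weights = [0.6, 0.4]
means = [[0.0, 0.0], [1.0, -1.0]]
variances = [[1.0, 1.0], [0.5, 2.0]]
score = log_gmm_score([0.0, 0.0], weights, means, variances)
```

In a full system this evaluation is repeated for every subphone pdf and every input frame, which is why the cost of the Gaussian computation dominates on low-resource devices.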
In feature space approaches, a feature space transform is learnt on the speech features extracted from the enrollment data. At run time, i.e., at recognition time, this transform is applied to the input speech features before they are scored against the acoustic model. A popular feature space adaptation technique is the so-called feature-space MLLR technique, where MLLR stands for “Maximum Likelihood Linear Regression” (see A. Sankar and C. H. Lee, A Maximum Likelihood Approach to Stochastic Matching for Robust Speech Recognition, IEEE Transactions on Speech and Audio Processing, 4:190-202, 1996), in which a linear transform is estimated by maximizing the likelihood of the enrollment data. In model adaptation approaches, a transform modifying the parameters of the pdfs is learnt on the speech features extracted from the enrollment data. At run time, the pdfs with modified parameters are used instead of the original pdfs.
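The difference in where the two approaches spend their computation can be sketched as follows. This is a minimal illustration that assumes an affine transform (A, b) has already been estimated from enrollment data and, for the model adaptation case, assumes that only the Gaussian means are transformed. The estimation step itself is not shown, and all names are hypothetical.

```python
def affine(A, b, v):
    """Return A v + b for a matrix A (list of rows) and a bias vector b."""
    return [sum(a * x for a, x in zip(row, v)) + bi for row, bi in zip(A, b)]

def adapt_features(A, b, frames):
    """Feature space adaptation: every incoming frame is transformed at run time."""
    return [affine(A, b, x) for x in frames]

def adapt_model_means(A, b, means):
    """Model adaptation (means only, by assumption): done once, before decoding."""
    return [affine(A, b, m) for m in means]

# Hypothetical 2-dimensional transform.
A = [[2.0, 0.0], [0.0, 1.0]]
b = [1.0, 0.0]
transformed_frames = adapt_features(A, b, [[1.0, 1.0], [0.0, 0.0]])
adapted_means = adapt_model_means(A, b, [[1.0, 1.0]])
```

The feature space variant costs one matrix-vector product per input frame for as long as recognition runs, whereas the model variant costs one product per Gaussian mean and is paid once.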
Adaptation is made more difficult in the context of speech recognition devices with low resources, where particular acoustic scoring schemes are needed to speed up the Gaussian computation while maintaining a low computational complexity. Such schemes commonly involve manipulating quantized and clustered versions of low-dimension Gaussian pdfs instead of the original Gaussian pdfs. In a scheme that is of interest in the context of the present invention, Gaussian components are “sliced” into Gaussians of smaller dimensions called “bands” (for example, Gaussians of dimension 39 can be sliced into bands of dimension 2, resulting in 19 Gaussians of dimension 2 and one Gaussian of dimension 1). The low-dimension Gaussians in each band are clustered into a smaller set of Gaussians called atoms. (By way of further explanation, “atoms” are the low-dimension Gaussian models that result from slicing the original Gaussian models into bands and from clustering these bands. Each atom includes a mean and a covariance matrix. Each band of each Gaussian in the original Gaussian models is mapped to a specific atom, while bands of different Gaussians can possibly be mapped to the same atom.) As will be explained below, a model structured into bands and atoms is formally equivalent to a CGM model.
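The band and atom structure described above can be sketched as follows. This is an illustrative sketch only: it assumes diagonal covariances within each atom, assumes that the mapping from Gaussians to atoms has already been obtained by some clustering procedure that is not shown, and uses hypothetical names throughout.

```python
import math

def band_slices(dim, band_size):
    """Index ranges (start, end) that slice a dim-dimensional vector into bands."""
    return [(s, min(s + band_size, dim)) for s in range(0, dim, band_size)]

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def score_with_atoms(x, slices, atoms, atom_map):
    """Score every Gaussian by summing per-band atom scores.

    atoms[b]    : list of (mean, var) atoms for band b
    atom_map[g] : for Gaussian g, the index of the atom used in each band
    """
    # Each atom is evaluated once per frame, however many Gaussians share it.
    table = [[log_gauss_diag(x[s:e], m, v) for m, v in atoms[b]]
             for b, (s, e) in enumerate(slices)]
    # A Gaussian's score is then a sum of table lookups, one per band.
    return [sum(table[b][a] for b, a in enumerate(idx)) for idx in atom_map]

slices_39 = band_slices(39, 2)  # 19 bands of dimension 2, one of dimension 1
```

Because the number of atoms per band is much smaller than the number of Gaussians, most of the per-frame work reduces to additions of precomputed values.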
In the context of an acoustic scoring scheme with clustered pdfs, feature space adaptation techniques have the advantage over model adaptation techniques of not requiring the pdfs to be re-clustered, since feature space techniques affect only the feature vectors. However, feature space techniques involve a significant computational overhead at run time, since the input speech features are transformed before acoustic scoring. To date, a model adaptation technique called “MLLR adaptation of atoms”, which operates directly on the atoms, has been proposed (see J. Huang and M. Padmanabhan, “Methods and apparatus for fast adaptation of a band-quantized speech decoding system,” U.S. Pat. No. 6,421,641 B1, Jul. 16, 2002). In an MLLR adaptation of atoms, the means of the atoms undergo a linear transform, while the variances of the atoms are not transformed. In addition, all the atoms corresponding to the same band are constrained to undergo the same linear transform.
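The constraints just described (one shared transform per band, means transformed, variances left untouched) can be sketched as follows. This is not the implementation of the cited patent: the per-band transforms are assumed to be given, their estimation from enrollment data is not shown, and the names and data layout are hypothetical.

```python
def adapt_atoms(atoms, transforms):
    """Apply one affine transform per band to the atom means of that band.

    atoms[b]      : list of (mean, var) atoms for band b
    transforms[b] : (A, bias) shared by every atom of band b
    """
    adapted = []
    for band_atoms, (A, bias) in zip(atoms, transforms):
        adapted.append([
            # New mean is A * mean + bias; the variance is passed through unchanged.
            ([sum(a * m for a, m in zip(row, mean)) + c
              for row, c in zip(A, bias)], var)
            for mean, var in band_atoms
        ])
    return adapted

# One band of dimension 2 holding two atoms (illustrative values only).
atoms = [[([1.0, 2.0], [1.0, 1.0]), ([0.0, 0.0], [2.0, 2.0])]]
transforms = [([[1.0, 0.0], [0.0, 2.0]], [0.5, 0.0])]
new_atoms = adapt_atoms(atoms, transforms)
```

Since the mapping from Gaussians to atoms is not modified, no re-clustering is needed after the atoms have been updated.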
Another aspect of adaptation in the context of speech recognition devices with low resources is the computational complexity of the adaptation algorithm itself. In general, adaptation algorithms require the computation of statistics from the enrollment data and the computation of the new model parameters from these statistics. For devices with low resources, it is important that neither of these two steps demands large resources, in terms of either computation or memory. In this regard, both the feature space transformation technique and the MLLR adaptation of atoms are unsuitable.
In view of the foregoing, a need has been recognized in connection with overcoming the shortcomings and disadvantages of conventional arrangements.