The present invention relates generally to speech recognition systems and, more particularly, to methods and apparatus for rapidly adapting such speech recognition systems to new acoustic conditions via cumulative distribution function matching techniques.
A real-world speech recognition system encounters several acoustic conditions in the course of its application. For instance, a speech recognition system that handles telephony transactions can be reached through a regular handset, a cellular phone or a speakerphone. Each represents a different acoustic environment. Currently, it is well known that a system trained only for a particular acoustic condition degrades drastically when it encounters a different acoustic condition. To avoid this problem, one normally trains a system with data representing all the possible acoustic environments. However, it is often difficult to anticipate all the different acoustic and channel conditions and, moreover, such a pooled system often becomes too large and, hence, computationally burdensome.
Earlier techniques to adapt the acoustic models to a specific environment may be roughly classified into "model transformation" and "feature space transformation" techniques. In these techniques, the test utterance is first decoded with a generic speaker independent system (first pass), and the resulting transcription, which contains errors, is used to compute the extent of the mismatch between the generic model and the specific environment.
A specific example of "model transformation" is MLLR (Maximum Likelihood Linear Regression) as described in C. J. Leggetter and P. C. Woodland, "Speaker Adaptation of Continuous Density HMM's Using Multivariate Linear Regression," ICSLP 1994, pp. 451-454, the disclosure of which is incorporated by reference herein. MLLR is based on the assumption that the model that is most suitable for transcribing the test speech is related to the generic model by means of a linear transform, i.e., the means and covariances of the Gaussians in the transformed model are related to the means and covariances of the Gaussians in the generic model by a linear transform. The parameters of the transformation are computed so that the likelihood of the test speech is maximized under the transformed system, assuming that the first pass transcription is the correct transcription of the test speech.
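As an illustration only, the mean part of such a linear model transform is commonly written as below. The symbols (A, b, the Gaussian index m, the test observations O, the first pass transcription W_1 and the transformed model λ) are notation assumed here for exposition and are not taken from the cited reference's text.

```latex
\hat{\mu}_m = A\,\mu_m + b,
\qquad
(A, b) = \arg\max_{A,\,b}\; p\bigl(O \mid W_1,\ \lambda(A, b)\bigr)
```

Here the maximization treats the first pass transcription as if it were correct, which is why a first pass decoding is needed before the transform can be estimated.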
In "feature space transformation" techniques, the feature space of the test utterance is assumed to be related to the generic feature space through a linear transformation, and the linear transformation is computed, as before, to maximize the likelihood of the test speech under the assumption that the first pass transcription is correct. See, e.g., A. Sankar and C. H. Lee, "A Maximum-likelihood Approach to Stochastic Matching for Robust Speech Recognition," IEEE Trans., ASSP, 1995, the disclosure of which is incorporated by reference herein.
Other techniques to implement "feature space transformation" also exist; see, for example, L. Neumeyer and M. Weintraub, "Probabilistic Optimum Filtering for Robust Speech Recognition," ICASSP, 1994, pp. 417-420; and F. H. Liu, A. Acero and R. M. Stern, "Efficient Joint Compensation of Speech for the Effect of Additive Noise and Linear Filtering," ICASSP, 1992, the disclosures of which are incorporated by reference herein. These techniques do not require a first pass decoding, but they do have the computational overhead of vector quantizing the acoustic space and finding the center that is closest to each test feature vector.
The present invention provides rapid, computationally inexpensive, nonlinear transformation methods and apparatus for adaptation of speech recognition systems to new acoustic conditions. The methodologies of the present invention may be considered as falling under the category of "feature space transformation" techniques. Such inventive techniques have the advantage of being computationally much less expensive than the conventional techniques described above, as the techniques of the invention require neither a first pass decoding nor a vector quantization computation.
Generally, the invention provides equalization via cumulative distribution function matching between training acoustic data and test acoustic data. The acoustic data is preferably in the form of cepstral vectors, although spectral vectors or even raw speech samples may be used. The present invention represents a more powerful and flexible transformation as the mapping of the test feature to the space of the training features is not constrained to be linear.
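One way to write cumulative distribution function matching for a single feature dimension is sketched below. The notation is assumed here for illustration: x_d is the d-th component of a test vector, F_test,d and F_train,d are the cumulative distribution functions of that dimension estimated from the test and training data, and D is the number of dimensions.

```latex
\hat{x}_d = F_{\mathrm{train},d}^{-1}\bigl(F_{\mathrm{test},d}(x_d)\bigr),
\qquad d = 1, \dots, D
```

Because the composed map is monotone but otherwise unconstrained in shape, it can represent nonlinear distortions that a single linear transform cannot.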
In an illustrative aspect of the invention, a method of adapting a speech recognition system to one or more acoustic conditions comprises the steps of: (i) computing cumulative distribution functions based on dimensions of speech vectors associated with training speech data provided to the speech recognition system; (ii) computing cumulative distribution functions based on dimensions of speech vectors associated with test speech data provided to the speech recognition system; (iii) computing a nonlinear transformation mapping based on the cumulative distribution functions associated with the training speech data and the cumulative distribution functions associated with the test speech data; and (iv) applying the nonlinear transformation mapping to speech vectors associated with the test speech data prior to recognition, wherein the speech vectors transformed in accordance with the nonlinear transformation mapping are substantially similar to speech vectors associated with the training speech data.
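Steps (i) through (iv) may be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names `fit_cdf_mapping` and `apply_cdf_mapping`, the representation of each cumulative distribution function by a fixed grid of empirical quantiles, and the default of 101 quantile points are all hypothetical choices made for the example.

```python
import numpy as np


def fit_cdf_mapping(train, test, n_quantiles=101):
    """Steps (i)-(iii): estimate per-dimension CDFs and pair their quantiles.

    train, test: arrays of shape (num_frames, num_dims), e.g. cepstral vectors.
    Returns (test_q, train_q), each of shape (n_quantiles, num_dims). Row k of
    each array holds the value at which that data set's empirical CDF reaches
    probability k / (n_quantiles - 1), so the pair defines the mapping.
    (Assumption: a quantile grid stands in for the full CDF.)
    """
    probs = np.linspace(0.0, 1.0, n_quantiles)
    train_q = np.quantile(train, probs, axis=0)  # step (i)
    test_q = np.quantile(test, probs, axis=0)    # step (ii)
    return test_q, train_q                       # step (iii)


def apply_cdf_mapping(x, test_q, train_q):
    """Step (iv): map test vectors toward the training distribution.

    Each dimension is transformed independently by piecewise-linear
    interpolation from test quantiles to training quantiles. Values outside
    the observed test range are clipped to the training extremes by np.interp.
    """
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    for d in range(x.shape[1]):
        out[:, d] = np.interp(x[:, d], test_q[:, d], train_q[:, d])
    return out
```

A typical use would be to call `fit_cdf_mapping(train_vectors, test_vectors)` once per test condition and then pass the transformed vectors from `apply_cdf_mapping` to the recognizer. No decoding pass or vector quantization is involved; the cost is dominated by sorting each dimension to obtain the quantiles.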
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.