1. Field of the Invention
The present invention is concerned with the technical field of pattern recognition and specifically speech recognition. More particularly, the present invention is concerned with speech recognition in noisy environments.
2. Description of the Related Art
Speech recognition is a technique which relies on the use of trained models such as Hidden Markov Models (HMMs) to decode an audio speech signal into recognisable words which can either be displayed or further processed. Further processing may include outputting the text into a language translation device or converting it into an understandable instruction for use by voice controlled apparatus.
Generally, the models are trained in a noise-free environment. However, in practice, the systems are generally used in environments which are relatively noisy compared to the laboratory training environment.
Two successful techniques have been developed for speech recognition in noisy environments. The first of these is the vector Taylor series (VTS) method. The VTS method is described in Acero et al: “HMM adaptation using vector Taylor series for noisy speech recognition”, In ICSLP-2000, vol. 3, 869-872. The VTS method compensates the HMM at the level of each Gaussian mixture. The system uses the mean value of each mixture as the Taylor expansion point and calculates the Taylor expansion matrices for each mixture. The likelihood during recognition is then expressed as:
$$p(y \mid m) = \mathcal{N}\left(y;\ \mu_y^m,\ \Sigma_y^m\right) \qquad \text{(a)}$$
where $p(y \mid m)$ is the likelihood of the Gaussian mixture m given the noisy speech feature y, and $\mu_y^m$ and $\Sigma_y^m$ are the mean and variance of that Gaussian mixture.
In the VTS, it is assumed that the relationship between noisy and clean features is as follows:
$$y = x + h + g(x, n, h) = x + h + C \ln\left(1 + e^{C^{-1}(n - x - h)}\right) \qquad \text{(b)}$$
where y is the noisy speech feature, x the corresponding clean speech feature, C the discrete cosine transform matrix and n and h the static features for additive and convolutional noise respectively.
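The behaviour of relation (b) is easiest to see in one dimension. The following is a minimal sketch, assuming a scalar feature with C = 1 (an assumption made purely for illustration); the function name `mismatch_1d` is hypothetical. It shows that the noisy feature follows the clean speech when the speech dominates and follows the noise when the noise dominates.

```python
import math

def mismatch_1d(x: float, n: float, h: float = 0.0) -> float:
    """Scalar form of relation (b) with C = 1 (illustrative assumption):
    y = x + h + ln(1 + exp(n - x - h))."""
    u = n - x - h
    # Evaluate ln(1 + e^u) without overflow for large positive u,
    # using ln(1 + e^u) = u + ln(1 + e^-u).
    if u > 0:
        g = u + math.log1p(math.exp(-u))
    else:
        g = math.log1p(math.exp(u))
    return x + h + g

# Speech well above the noise: y stays close to x + h.
y_speech_dominated = mismatch_1d(x=10.0, n=0.0)
# Noise well above the speech: y stays close to n.
y_noise_dominated = mismatch_1d(x=0.0, n=10.0)
# Equal levels: y = x + ln 2.
y_equal = mismatch_1d(x=5.0, n=5.0)
```

In both dominated cases the result is within a small fraction of the larger of the two inputs, which is the non-linear behaviour that the Taylor expansion subsequently approximates.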
Given a Taylor expansion point $(x_e, n_e, h_e)$, the above non-linear relation can be linearly approximated by the first-order Taylor series as:
$$y \approx x_e + h_e + g(x_e, n_e, h_e) + W(x - x_e) + (I - W)(n - n_e) + W(h - h_e), \qquad W = I + \nabla_x g(x_e, n_e, h_e) \qquad \text{(c)}$$
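The first-order approximation can be checked numerically. The following is a minimal sketch, assuming that C is a square orthonormal DCT-II matrix (so that its inverse is its transpose) and that the coefficient of (n − n_e) is I − W, which follows from differentiating relation (b) with respect to n. All function names and the three-dimensional test values are hypothetical.

```python
import math

def dct_matrix(d):
    """Orthonormal DCT-II matrix (d x d); its inverse equals its transpose."""
    rows = []
    for k in range(d):
        scale = math.sqrt(1.0 / d) if k == 0 else math.sqrt(2.0 / d)
        rows.append([scale * math.cos(math.pi * (j + 0.5) * k / d) for j in range(d)])
    return rows

def transpose(a):
    return [list(col) for col in zip(*a)]

def matvec(a, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]

def log_domain_gap(c_inv, x, n, h):
    """u = C^-1 (n - x - h)."""
    return matvec(c_inv, [ni - xi - hi for xi, ni, hi in zip(x, n, h)])

def mismatch(c, c_inv, x, n, h):
    """g(x, n, h) = C ln(1 + exp(C^-1 (n - x - h)))."""
    u = log_domain_gap(c_inv, x, n, h)
    return matvec(c, [math.log1p(math.exp(ui)) for ui in u])

def jacobian_w(c, c_inv, xe, ne, he):
    """W = I + grad_x g = C diag(1 / (1 + exp(u))) C^-1 at the expansion point."""
    u = log_domain_gap(c_inv, xe, ne, he)
    s = [1.0 / (1.0 + math.exp(ui)) for ui in u]
    d = len(u)
    return [[sum(c[i][k] * s[k] * c_inv[k][j] for k in range(d)) for j in range(d)]
            for i in range(d)]

def exact_y(c, c_inv, x, n, h):
    g = mismatch(c, c_inv, x, n, h)
    return [xi + hi + gi for xi, hi, gi in zip(x, h, g)]

def linearised_y(c, c_inv, x, n, h, xe, ne, he):
    """First-order Taylor approximation of y around (xe, ne, he)."""
    w = jacobian_w(c, c_inv, xe, ne, he)
    g_e = mismatch(c, c_inv, xe, ne, he)
    dx = [a - b for a, b in zip(x, xe)]
    dn = [a - b for a, b in zip(n, ne)]
    dh = [a - b for a, b in zip(h, he)]
    w_dx, w_dn, w_dh = matvec(w, dx), matvec(w, dn), matvec(w, dh)
    # (I - W) dn is computed as dn - W dn.
    return [xe[i] + he[i] + g_e[i] + w_dx[i] + (dn[i] - w_dn[i]) + w_dh[i]
            for i in range(len(x))]

# Illustrative values only.
C = dct_matrix(3)
C_INV = transpose(C)
xe, ne, he = [1.0, 0.5, -0.2], [0.3, 0.1, 0.0], [0.0, 0.0, 0.0]
x, n, h = [1.01, 0.48, -0.19], [0.32, 0.11, -0.01], [0.01, 0.0, -0.01]

approx = linearised_y(C, C_INV, x, n, h, xe, ne, he)
exact = exact_y(C, C_INV, x, n, h)
max_error = max(abs(a - b) for a, b in zip(approx, exact))
```

For small displacements from the expansion point, `max_error` is of second order in the displacement, which is the accuracy expected of a first-order expansion. In practice the DCT matrix used in a recogniser front end is often truncated, in which case a pseudo-inverse would be needed in place of the transpose.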
By using the above relations, it is possible to relate the mean and variance of a Gaussian for clean speech to the mean and variance of a Gaussian for noisy speech. This can be done for the static, delta and delta-delta parts of the received signal. By applying these conversions, it is possible to adapt the trained clean model for the noisy environment.
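For the static parameters, one commonly used form of these conversions can be sketched as follows. It is obtained by taking the expectation and the covariance of the first-order approximation (c) at the expansion point $(\mu_x^m, \mu_n, \mu_h)$, under the assumptions (not stated above) that the clean speech and the additive noise are independent and that the convolutional noise h is treated as a fixed offset. The symbols $\mu_n$, $\Sigma_n$ and $\mu_h$ for the noise statistics are introduced here for illustration only.

```latex
\begin{align}
\mu_y^m    &\approx \mu_x^m + \mu_h + g(\mu_x^m, \mu_n, \mu_h) \\
\Sigma_y^m &\approx W\, \Sigma_x^m\, W^{T} + (I - W)\, \Sigma_n\, (I - W)^{T}
\end{align}
```

Because W depends on the expansion point, and the expansion point is the mean of the individual mixture, a separate W has to be evaluated for every Gaussian.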
The above method suffers from the problem that it is computationally very expensive, since conversion parameters need to be calculated for each Gaussian in the HMM. Generally, in this procedure, only the first order Taylor series expansion is used.
An alternative method is the so-called joint uncertainty decoding (JUD) method, which is described in Liao, H./Gales, M. J. F. (2005): “Joint uncertainty decoding for noise robust speech recognition”, In INTERSPEECH-2005, 3129-3132. The JUD method calculates the output probability for the mixture m as follows:
$$p(Y \mid m) = |A_r|\, \mathcal{N}\left(A_r Y + b_r;\ \Lambda_x^m,\ \Xi_x^m + \Xi_b^r\right) \qquad \text{(d)}$$
It is assumed that mixture m belongs to the rth regression class, and the method is performed on a class-by-class basis. This means that the JUD transforms relating to the same regression class are defined as:
$$A_r = \Xi_x^r \left(\Xi_{yx}^r\right)^{-1}, \qquad b_r = \Lambda_x^r - A_r \Lambda_y^r, \qquad \Xi_b^r = A_r\, \Xi_y^r\, A_r^{T} - \Xi_x^r \qquad \text{(e)}$$
where $\Lambda_x^r$, $\Xi_x^r$, $\Lambda_y^r$ and $\Xi_y^r$ are respectively the mean and covariance for clean and noisy speech in regression class r, and $\Xi_{yx}^r$ is the cross covariance matrix. The calculation of $\Xi_{yx}^r$ is costly from a computational point of view and is often approximated by a first order Taylor expansion.
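The JUD transforms and the resulting output probability can be illustrated in one dimension. The following is a minimal sketch, assuming scalar features and a toy corruption model y = x + e, with e an independent Gaussian offset; this toy model is an assumption chosen so that the exact answer is known, and is not the mismatch function used in practice. All function names and numeric values are hypothetical.

```python
import math

def gaussian_pdf(v: float, mean: float, var: float) -> float:
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def jud_transform(mean_x, var_x, mean_y, var_y, cov_yx):
    """Scalar form of relation (e) for one regression class r."""
    a = var_x / cov_yx          # A_r = Xi_x (Xi_yx)^-1
    b = mean_x - a * mean_y     # b_r = Lambda_x - A_r Lambda_y
    var_b = a * var_y * a - var_x  # Xi_b = A_r Xi_y A_r^T - Xi_x
    return a, b, var_b

def jud_likelihood(y, a, b, var_b, mean_xm, var_xm):
    """Scalar form of relation (d) for a mixture m in class r."""
    return abs(a) * gaussian_pdf(a * y + b, mean_xm, var_xm + var_b)

# Toy class statistics (illustrative): x ~ N(1.0, 2.0), e ~ N(0.5, 0.3), y = x + e.
mean_x, var_x = 1.0, 2.0
mean_e, var_e = 0.5, 0.3
mean_y, var_y = mean_x + mean_e, var_x + var_e
cov_yx = var_x  # because e is independent of x

a, b, var_b = jud_transform(mean_x, var_x, mean_y, var_y, cov_yx)

# One clean-speech mixture m belonging to this class (illustrative values).
mean_xm, var_xm = 0.8, 1.5
y_obs = 1.7
jud_value = jud_likelihood(y_obs, a, b, var_b, mean_xm, var_xm)

# For this toy model the noisy mixture is exactly N(mean_xm + mean_e, var_xm + var_e).
exact_value = gaussian_pdf(y_obs, mean_xm + mean_e, var_xm + var_e)
```

The transform (a, b, var_b) is computed once per regression class and then reused for every mixture in that class, which is where the computational saving over a per-Gaussian compensation comes from.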
After manipulating the results, it can be seen that the JUD is essentially equivalent to the VTS method since they both involve the calculation of a first order Taylor series. However, in the VTS, the first order calculation is performed for each Gaussian whereas in the JUD, it is performed for each regression class. This means that the JUD method is computationally advantageous over the VTS method. However, the JUD method has considerably lower accuracy than VTS.