Speech recognition relies on trained models, such as Hidden Markov Models (HMMs), to decode an audio speech signal into recognisable words, which can then be displayed or processed further. Further processing may include passing the text to a language translation device or converting it into an instruction for use by voice-controlled apparatus.
The models are generally trained in a noise-free environment. In use, however, the systems typically operate in environments that are considerably noisier than the laboratory in which the models were trained.
Two successful techniques have been developed for speech recognition in noisy environments. The first of these is the vector Taylor series (VTS) method, described in Acero et al: “HMM adaptation using vector Taylor series for noisy speech recognition”, In ICSLP-2000, vol. 3, 869-872. The VTS method compensates the HMM at the level of each Gaussian mixture. The system uses the mean value of each mixture as the Taylor expansion point and calculates the Taylor expansion matrices for each mixture. The likelihood during recognition is then expressed as:

$$p(y \mid m) = N\left(y;\ \mu_y^m,\ \Sigma_y^m\right) \tag{a}$$

where $p(y \mid m)$ is the likelihood of the noisy speech feature $y$ given Gaussian mixture $m$, and $\mu_y^m$ and $\Sigma_y^m$ are the mean and variance of that Gaussian mixture.
In VTS, the relationship between noisy and clean features is assumed to be:

$$y = x + h + g(x, n, h) = x + h + C \ln\left(1 + e^{C^{-1}(n - x - h)}\right) \tag{b}$$

where $y$ is the noisy speech feature, $x$ the corresponding clean speech feature, $C$ the discrete cosine transform matrix, and $n$ and $h$ the static features for additive and convolutional noise respectively.
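As an illustration of relation (b), the following Python sketch evaluates the mismatch function with NumPy. It assumes a square, orthonormal DCT-II matrix so that C is invertible; practical front ends usually truncate the cepstrum, in which case a pseudo-inverse would be needed instead. The names `dct_matrix` and `mismatch` are illustrative and are not taken from the cited paper.

```python
import numpy as np

def dct_matrix(size):
    """Orthonormal DCT-II matrix (size x size), so that C^{-1} equals C.T.

    Assumption: no cepstral truncation, which keeps C square and invertible.
    """
    k = np.arange(size)[:, None]   # cepstral index (rows)
    i = np.arange(size)[None, :]   # log-spectral channel index (columns)
    C = np.sqrt(2.0 / size) * np.cos(np.pi * k * (2 * i + 1) / (2 * size))
    C[0, :] /= np.sqrt(2.0)        # rescale the first row for orthonormality
    return C

def mismatch(x, n, h, C):
    """Noisy static feature y = x + h + C ln(1 + exp(C^{-1}(n - x - h)))."""
    C_inv = np.linalg.inv(C)
    return x + h + C @ np.log1p(np.exp(C_inv @ (n - x - h)))
```

Mapping the result back through C^{-1} and exponentiating gives exp(C^{-1}(x + h)) + exp(C^{-1}n), which shows that relation (b) amounts to channel-filtered speech and noise adding in the linear spectral domain.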
Given a Taylor expansion point $(x_e, n_e, h_e)$, the above non-linear relation can be linearly approximated by the first-order Taylor series as:

$$y \approx x_e + h_e + g(x_e, n_e, h_e) + W(x - x_e) + (I - W)(n - n_e) + W(h - h_e)$$
$$W = I + \nabla_x g(x_e, n_e, h_e) \tag{c}$$
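The first-order approximation can be sketched in Python as follows. Differentiating g gives the closed form W = C diag(1 / (1 + exp(u))) C^{-1} with u = C^{-1}(n_e - x_e - h_e), so the Jacobian is W with respect to x and h, and I - W with respect to n. The function names are hypothetical, and C is assumed to be square and invertible with its inverse supplied by the caller.

```python
import numpy as np

def mismatch(x, n, h, C, C_inv):
    """Exact relation y = x + h + C ln(1 + exp(C^{-1}(n - x - h)))."""
    return x + h + C @ np.log1p(np.exp(C_inv @ (n - x - h)))

def vts_jacobian(xe, ne, he, C, C_inv):
    """W = I + grad_x g evaluated at the expansion point (xe, ne, he)."""
    u = C_inv @ (ne - xe - he)
    return C @ np.diag(1.0 / (1.0 + np.exp(u))) @ C_inv

def linearised(x, n, h, xe, ne, he, C, C_inv):
    """First-order Taylor approximation of the mismatch function."""
    W = vts_jacobian(xe, ne, he, C, C_inv)
    I = np.eye(len(xe))
    y_e = mismatch(xe, ne, he, C, C_inv)  # value at the expansion point
    return y_e + W @ (x - xe) + (I - W) @ (n - ne) + W @ (h - he)
```

At the expansion point the approximation coincides with the exact value, and its error grows with the distance of (x, n, h) from that point.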
By using the above relations, it is possible to relate the mean and variance of a Gaussian for clean speech to the mean and variance of a Gaussian for noisy speech. This can be done for the static, delta and delta-delta parts of the received signal. By applying these conversions, it is possible to adapt the trained clean model for the noisy environment.
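The text above does not spell out the conversion formulas. The Python sketch below uses the form commonly given for VTS model compensation, which is an assumption here: the expansion point is taken at the clean mean and the noise means, the convolutional noise h is treated as a deterministic offset, and the dynamic noise features are assumed to have zero mean. The function names and argument layout are illustrative only.

```python
import numpy as np

def compensate_static(mu_x, Sigma_x, mu_n, Sigma_n, mu_h, C, C_inv):
    """Map a clean static Gaussian (mu_x, Sigma_x) to its noisy counterpart.

    Assumed form: mu_y = mu_x + mu_h + g(mu_x, mu_n, mu_h)
                  Sigma_y = W Sigma_x W^T + (I - W) Sigma_n (I - W)^T
    """
    u = C_inv @ (mu_n - mu_x - mu_h)
    W = C @ np.diag(1.0 / (1.0 + np.exp(u))) @ C_inv
    I = np.eye(len(mu_x))
    mu_y = mu_x + mu_h + C @ np.log1p(np.exp(u))
    Sigma_y = W @ Sigma_x @ W.T + (I - W) @ Sigma_n @ (I - W).T
    return mu_y, Sigma_y, W

def compensate_dynamic(mu_dx, Sigma_dx, Sigma_dn, W):
    """Same mapping for delta or delta-delta parameters, reusing W.

    Assumption: the dynamic noise features have zero mean.
    """
    I = np.eye(len(mu_dx))
    mu_dy = W @ mu_dx
    Sigma_dy = W @ Sigma_dx @ W.T + (I - W) @ Sigma_dn @ (I - W).T
    return mu_dy, Sigma_dy
```

Because W depends on the clean mean of the Gaussian being compensated, these quantities have to be recomputed for every Gaussian in the model.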
The drawback of this method is that it is computationally very expensive, since conversion parameters need to be calculated for each Gaussian in the HMM. Generally, only the first-order Taylor series expansion is used in this procedure.
An alternative method is the so-called joint uncertainty decoding (JUD) method, which is described in Liao, H./Gales, M. J. F. (2005): “Joint uncertainty decoding for noise robust speech recognition”, In INTERSPEECH-2005, 3129-3132. The JUD method calculates the output probability for the mixture $m$ as follows:

$$p(Y \mid m) = |A_r|\, N\left(A_r Y + b_r;\ \Lambda_x^m,\ \Xi_x^m + \Xi_b^r\right) \tag{d}$$
It is assumed that mixture $m$ belongs to the $r$th regression class, and the method is performed on a class-by-class basis. The JUD transforms for a given regression class are therefore defined as:

$$A_r = \Xi_x^r \left(\Xi_{yx}^r\right)^{-1}, \qquad b_r = \Lambda_x^r - A_r \Lambda_y^r, \qquad \Xi_b^r = A_r \Xi_y^r A_r^T - \Xi_x^r \tag{e}$$
where $\Lambda_x^r$, $\Xi_x^r$, $\Lambda_y^r$ and $\Xi_y^r$ are respectively the means and covariances for clean and noisy speech in regression class $r$, and $\Xi_{yx}^r$ is the cross-covariance matrix.
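The JUD transforms of (e) and the output probability of (d) can be sketched in Python as follows. This is a minimal full-matrix illustration with hypothetical function names; it assumes the cross covariance is ordered as Cov(y, x), and it works in the log domain for numerical stability.

```python
import numpy as np

def jud_transforms(Lam_x, Xi_x, Lam_y, Xi_y, Xi_yx):
    """Per-regression-class JUD transforms A_r, b_r and Xi_b^r.

    Assumption: Xi_yx = E[(y - Lam_y)(x - Lam_x)^T].
    """
    A = Xi_x @ np.linalg.inv(Xi_yx)
    b = Lam_x - A @ Lam_y
    Xi_b = A @ Xi_y @ A.T - Xi_x
    return A, b, Xi_b

def jud_log_likelihood(Y, A, b, Xi_b, Lam_xm, Xi_xm):
    """log p(Y|m) = log|A_r| + log N(A_r Y + b_r; Lam_x^m, Xi_x^m + Xi_b^r)."""
    z = A @ Y + b                  # feature transformed once per class
    S = Xi_xm + Xi_b               # clean covariance plus class bias term
    diff = z - Lam_xm
    _, logabsdet_A = np.linalg.slogdet(A)
    _, logdet_S = np.linalg.slogdet(S)
    quad = diff @ np.linalg.solve(S, diff)
    d = len(Y)
    return logabsdet_A - 0.5 * (d * np.log(2.0 * np.pi) + logdet_S + quad)
```

The transformed feature A_r Y + b_r is computed once per regression class and frame, and only the bias Xi_b^r is added to each component covariance. With full matrices, however, Xi_x^m + Xi_b^r is non-diagonal, so every component evaluation needs a full-covariance solve.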
As JUD transforms are usually obtained by Taylor expansion, JUD is essentially the same as VTS, except that JUD computes the Taylor expansion only once for each regression class rather than for each Gaussian. Furthermore, most of the compensation in JUD is applied to the feature vectors instead of the HMM parameters, which makes the adaptation process independent of the size of the HMM. JUD adaptation is therefore much faster than VTS adaptation.
However, one problem with JUD is the difficulty of applying non-diagonal transforms, because they result in non-diagonal covariance matrices for decoding and the computational cost becomes extremely high. As a consequence, JUD implementations often use diagonal transforms, and the resulting performance is observed to be much worse than that of VTS.