Current automatic speech recognition systems perform reasonably well in laboratory conditions, but degrade rapidly when used in real-world applications. One of the important factors influencing recognizer performance in real-world applications is the presence of environmental noise that corrupts the speech signal. A number of methods, such as spectral subtraction or parallel model combination, have been developed to address the noise problem. However, these solutions are either too limited or too computationally expensive.
Recently, a Jacobian adaptation method has been proposed to deal with additive noise, where the noise changes from noise A to noise B. For example, U.S. Pat. No. 6,026,359 to Yamaguchi describes such a scheme for model adaptation in pattern recognition, based on storing Jacobian matrices of a Taylor expansion that expresses model parameters. However, for this model to perform well, it is necessary to have noise A and noise B close to one another in terms of character and level. For example, the Jacobian adaptation technique is likely to work well where noise A is measured within the passenger compartment of a given vehicle traveling on a smooth road at 30 miles per hour, and where noise B is of a similar character, such as the noise measured inside the same vehicle on the same road traveling at 45 miles per hour.
The known Jacobian adaptation technique begins to fail when noise A and noise B lie far apart from one another, such as when noise A is measured inside the vehicle described above at 30 miles per hour and noise B is measured in the vehicle with windows down or at 60 miles per hour.
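The dependence on noise A and noise B being close can be illustrated with a minimal sketch. It is not the scheme of the cited patent; it assumes a simplified log-spectral model in which noisy speech is log(exp(s) + exp(n)) per frequency bin, and applies a first-order Taylor update around noise A. The function names and the numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def noisy_log_spectrum(s, n):
    # Assumed model: log of (speech power + noise power), per frequency bin.
    return np.logaddexp(s, n)

def jacobian_adapt(s, n_a, n_b):
    # First-order Taylor expansion of the noisy log-spectrum around noise A.
    y_a = noisy_log_spectrum(s, n_a)
    jac = np.exp(n_a - y_a)  # diagonal Jacobian dy/dn at n_a, i.e. N / (S + N)
    return y_a + jac * (n_b - n_a)

# Hypothetical clean-speech and noise-A log-spectra (three bins).
s = np.array([2.0, 1.0, 0.0])
n_a = np.zeros(3)

def adaptation_error(shift):
    # Largest per-bin gap between the linear update and the exact model
    # when noise B is noise A raised by `shift` in the log domain.
    n_b = n_a + shift
    approx = jacobian_adapt(s, n_a, n_b)
    exact = noisy_log_spectrum(s, n_b)
    return float(np.max(np.abs(approx - exact)))

for shift in (0.2, 3.0):
    print(f"shift={shift}: max error={adaptation_error(shift):.4f}")
```

Under these assumptions the linear update tracks the exact model closely for the small shift and departs from it noticeably for the large one, which mirrors the behavior described above for noise conditions that lie far apart.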
This shortcoming of the known Jacobian noise adaptation technique limits its usefulness in many practical applications because it is often difficult to anticipate at training time the noise that may be present at testing time (when the system is in use). Also, improvements to Jacobian noise adaptation techniques are of limited use in many applications because the computational expense they require (processing time and/or memory) makes them impractical.
Another concern relates to compensation of convolutional noise. Convolutional noise can be distinguished from the above-discussed additive noise in that convolutional noise results from the speech channel. For example, changes in the distance from the speaker to the microphone, microphone imperfections, and even a telephone line over which the signal is transmitted all contribute to convolutional noise. Additive noise, on the other hand, typically results from the environment in which the speaker is speaking.
An important characteristic of convolutional noise is that it is multiplicative when the speech signal is in the spectral domain, whereas additive noise is additive in the spectral domain. These characteristics cause particular difficulties with respect to noise compensation. In fact, most conventional approaches deal either with convolutional noise or additive noise, but not both.
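The distinction can be checked numerically with a short sketch. It assumes a toy signal model in which the observed signal is the speech convolved with a channel impulse response, plus additive noise; the signals, the three-tap channel, and the function name are hypothetical stand-ins rather than anything taken from a particular recognizer.

```python
import numpy as np

def spectra(speech, channel, noise):
    # Length of the full linear convolution of speech with the channel.
    n = len(speech) + len(channel) - 1
    # Zero-padded spectra of each component.
    S = np.fft.rfft(speech, n)
    H = np.fft.rfft(channel, n)
    N = np.fft.rfft(noise, n)
    # Time domain: the channel acts by convolution, the noise by addition.
    observed = np.convolve(speech, channel) + noise
    Y = np.fft.rfft(observed, n)
    return S, H, N, Y

rng = np.random.default_rng(0)
speech = rng.standard_normal(256)        # stand-in for a speech frame
channel = np.array([1.0, 0.5, 0.25])     # hypothetical channel impulse response
noise = 0.1 * rng.standard_normal(258)   # additive noise, same length as the convolution

S, H, N, Y = spectra(speech, channel, noise)
# Spectral domain: channel is multiplicative, environment noise is additive.
print(np.allclose(Y, S * H + N))
```

Because the observed spectrum has the form S·H + N, taking a logarithm separates the channel term only when N is negligible, and subtracting N alone leaves the channel factor in place, which is one way to see why a single compensation step rarely handles both kinds of noise.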