The present invention relates to methods and apparatus for processing speech signals, and more particularly for methods and apparatus for removing channel distortion in speech systems such as speech and speaker recognition systems.
Cepstral mean normalization (CMN) is an effective technique for removing communication channel distortion in automatic speaker recognition systems. To work effectively, the speech processing windows in CMN systems must be very long to preserve phonetic information. Unfortunately, when dealing with non-stationary channels, it would be preferable to use smaller windows that cannot be dealt with as effectively in CMN systems. Furthermore, CMN techniques are based on an assumption that the speech mean does not carry phonetic information or is constant during a processing window. When short windows are utilized, however, the speech mean may carry significant phonetic information.
The problem of estimating a communication channel affecting a speech signal falls into a category known as blind system identification. When only one version of the speech signal is available (i.e., the xe2x80x9csingle microphonexe2x80x9d case), the estimation problem has no general solution. Oversampling may be used to obtain the information necessary to estimate the channel, but if only one version of the signal is available and no oversampling is possible, it is not possible to solve each particular instance of the problem without making assumptions about the signal source. For example, it is not possible to perform channel estimation for telephone speech recognition, when the recognizer does not have access to the digitizer, without making assumptions about the signal source.
One configuration of the present invention therefore provides a method for blind channel estimation of a speech signal corrupted by a communication channel. The method includes converting a noisy speech signal into either a cepstral representation or a log-spectral representation; estimating a temporal correlation of the representation of the noisy speech signal; determining an average of the noisy speech signal; constructing and solving, subject to a minimization constraint, a system of linear equations utilizing a correlation structure of a clean speech training signal, the correlation of the representation of the noisy speech signal, and the average of the noisy speech signal; and selecting a sign of the solution of the system of linear equations to estimate an average clean speech signal over a processing window.
Another configuration of the present invention provides an apparatus for blind channel estimation of a speech signal corrupted by a communication channel. The apparatus is configured to convert a noisy speech signal into either a cepstral representation or a log-spectral representation; estimate a temporal correlation of the representation of the noisy speech signal; determine an average of the noisy speech signal; construct and solve, subject to a minimization constraint, a system of linear equations utilizing a correlation structure of a clean speech training signal, the correlation of the representation of the noisy speech signal, and the average of the noisy speech signal; and select a sign of the solution of the system of linear equations to estimate an average clean speech signal over a processing window.
Yet another configuration of the present invention provides a machine readable medium or media having recorded thereon instructions configured to instruct an apparatus including at least one of a programmable processor and a digital signal processor to: convert a noisy speech signal into a cepstral representation or a log-spectral representation; estimate a temporal correlation of the representation of the noisy speech signal; determine an average of the noisy speech signal; construct and solve, subject to a minimization constraint, a system of linear equations utilizing a correlation structure of a clean speech training signal, the correlation of the representation of the noisy speech signal, and the average of the noisy speech signal; and select a sign of the solution of the system of linear equations to estimate an average clean speech signal over a processing window.
Configurations of the present invention provide effective and efficient estimations of speech communication channels without removal of phonetic information.
Further areas of applicability of the present invention will become apparent from the detailed description provided hereinafter. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.