The present invention relates to a speech recognition system comprising an audio input unit arranged in a terminal and a speech recognizer, as well as to a speech recognition enhancer for such a speech recognition system.
Speech recognition systems are used in a wide range of application environments, and their reliability degrades strongly in the presence of background noise. Many applications are needed in poor acoustic environments, for example telematic systems in cars or vans, speech control systems at stations, airports and other public places, and mobile phones in nearly every environment.
To counter the degradation of reliability caused by background noise, ETSI ES 202 050, V1.1.2 (2003-10) introduces selective spectral subtraction methods for noise reduction.
The input signal from the audio input part of a DSR terminal (DSR = Distributed Speech Recognition) is processed by the front-end of the terminal. The terminal front-end derives feature vectors from a speech waveform sampled at different rates, wherein each feature vector consists of 13 static cepstral coefficients and a log-energy coefficient. In the terminal part, speech features are computed from the input signal in the feature extraction part. Then, the features are compressed and further processed for transmission to the server side. In the feature extraction part, noise reduction is performed first. Then, waveform processing is applied to the de-noised signal and cepstral features are calculated. At the server side, bit-stream decoding, error mitigation and feature decompression are applied.
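By way of illustration only, the composition of such a feature vector may be sketched as follows. The sketch assumes that the log filter-bank energies of a frame are already available; the function names, the floor value and the number of filter-bank channels used in the usage note are hypothetical and are not taken from the cited standard.

```python
import math

NUM_CEPSTRA = 13  # static cepstral coefficients c0..c12, as stated in the text


def log_energy(frame, floor=1e-10):
    """Natural log of the frame energy; the floor (an assumption) avoids log(0)."""
    return math.log(max(sum(x * x for x in frame), floor))


def cepstra(log_band_energies, num_cepstra=NUM_CEPSTRA):
    """Unnormalised DCT-II of the log filter-bank energies."""
    k = len(log_band_energies)
    return [
        sum(f * math.cos(math.pi * i * (j + 0.5) / k)
            for j, f in enumerate(log_band_energies))
        for i in range(num_cepstra)
    ]


def feature_vector(frame, log_band_energies):
    """13 cepstral coefficients followed by one log-energy coefficient."""
    return cepstra(log_band_energies) + [log_energy(frame)]
```

Calling feature_vector with one frame of samples and, for example, 23 log filter-bank energies yields a vector of 14 coefficients; the compression and transmission steps are not shown.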
Noise reduction is based on a Wiener filter. After framing the input signal, the linear spectrum of each frame is estimated. In a Power Spectral Density Mean block, the spectrum of the signal is smoothed along the time index. Then, in the Wiener Filter Design block, frequency domain Wiener filter coefficients are calculated by using both the current frame spectrum estimation and the noise spectrum estimation. The noise spectrum is estimated from noise frames, which are detected by a voice activity detector. The linear Wiener filter coefficients are further smoothed along the frequency axis by using a Mel Filter-Bank. The impulse response of this Mel-warped Wiener filter is obtained by applying a Mel-warped inverse discrete cosine transform. Finally, the signal is filtered in an Apply Filter block. The noise reduction comprises two stages: the input signal of the second stage is the output signal of the first stage, and the second stage comprises a Spectrum Estimation block, a Power Spectral Density Mean block, a Wiener Filter Design block, a Mel Filter-Bank block, a Gain Factorization block, a Mel-warped inverse discrete cosine transform block and an Apply Filter block.
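By way of illustration only, the design of frequency domain Wiener filter coefficients from a time-smoothed frame spectrum and a noise spectrum estimate may be sketched as follows. This is a simplified sketch under stated assumptions and not the procedure of the cited standard: the voice activity decisions are supplied by the caller, the noise estimate is a plain running mean, the gain rule is the textbook form SNR / (1 + SNR), and the Mel Filter-Bank smoothing, the Mel-warped inverse discrete cosine transform and the second stage are omitted. All function and parameter names are hypothetical.

```python
import math


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (adequate for a short sketch)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        spec.append((re * re + im * im) / n)
    return spec


def wiener_gains(frames, is_noise, floor=1e-12):
    """Return one list of per-bin gains in [0, 1] for each frame.

    frames   : list of equal-length lists of samples
    is_noise : booleans from a voice activity detector (assumed to be given)
    """
    noise_psd, noise_count, prev_psd = None, 0, None
    gains = []
    for frame, noise_flag in zip(frames, is_noise):
        psd = power_spectrum(frame)
        # Smooth the spectrum along the time index (mean of two frames).
        smoothed = psd if prev_psd is None else [(a + b) / 2 for a, b in zip(psd, prev_psd)]
        prev_psd = psd
        # Update the running-mean noise estimate on frames flagged as noise.
        if noise_flag:
            noise_count += 1
            if noise_psd is None:
                noise_psd = list(smoothed)
            else:
                noise_psd = [n + (s - n) / noise_count for n, s in zip(noise_psd, smoothed)]
        if noise_psd is None:
            gains.append([1.0] * len(smoothed))  # no noise estimate yet: pass through
            continue
        frame_gains = []
        for s, n in zip(smoothed, noise_psd):
            snr = max(s - n, 0.0) / max(n, floor)  # estimated a priori SNR per bin
            frame_gains.append(snr / (1.0 + snr))
        gains.append(frame_gains)
    return gains
```

Bins dominated by the estimated noise receive a gain near zero, while bins in which the signal clearly exceeds the noise estimate receive a gain near one; multiplying the frame spectrum by these gains corresponds to the filtering step.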
The disadvantages of such an approach, which counters the degradation by means of selective spectral subtraction methods, are the high computational and memory requirements and the inflexibility of the system.