Speech recognition systems are designed to undertake the difficult task of extracting recognized speech from an audio signal, e.g., a natural language signal. The speech recognizer within such a system must account for diverse acoustic characteristics of speech arising from factors such as vocal tract size, age, gender, and dialect. Such systems are typically implemented using powerful processors with large memory capacity to execute the complex algorithms required to extract the recognized speech.
To further complicate the speech recognition process, the audio signal is often captured in a noisy environment, e.g., in a moving vehicle or a crowded restaurant, which degrades the quality of the input audio signal. To address such background noise or environmental contamination, the speech recognizer can be implemented with various noise compensation algorithms.
Noise compensation schemes include Parallel Model Combination (PMC) and other model adaptation techniques. However, these schemes often require large amounts of memory and are computationally intensive. To illustrate, the PMC method combines a Hidden Markov Model (HMM) trained on speech recorded in a noise-free environment (the speech HMM) with an HMM trained on noise (the noise HMM). PMC assumes that speech and noise are additive in the linear spectral domain. In contrast, HMMs often use log-spectral parameters, such as cepstral coefficients, as speech features. The PMC method therefore converts the parameters of the speech HMM and the noise HMM into the linear spectral domain, where they are added and combined. After the speech and noise are combined, an inverse transform returns the combined parameters from the linear spectral domain to the cepstral domain, thereby yielding a noise-superimposed speech HMM. Although PMC is effective against additive noise, it is computationally expensive because this nonlinear conversion must be applied to every model. As a result, the amount of computation and the processing time are very large, which may make PMC unsuitable for real-time applications or for portable applications where processing resources and memory capacity are limited.
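As an illustration only, the following Python sketch shows the cepstral-to-linear-to-cepstral mapping at the heart of PMC, using the simplified "log-add" approximation that adapts only the Gaussian mean vectors. It assumes an orthonormal DCT between log filterbank energies and cepstra of equal dimension (no cepstral truncation), and the names dct_matrix, pmc_log_add, pmc_adapt_all and the gain parameter are hypothetical. Full PMC, which also maps the covariances (e.g., via a log-normal approximation), is not shown.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix C: cepstrum = C @ log_spectrum, and C.T inverts it.

    Assumption: cepstra and log filterbank energies share the same dimension n.
    """
    k = np.arange(n)[:, None]  # cepstral index (rows)
    m = np.arange(n)[None, :]  # filterbank channel index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    c[0, :] /= np.sqrt(2.0)  # rescale first row so that C is orthonormal
    return c


def cepstrum_to_linear(cep: np.ndarray, c: np.ndarray) -> np.ndarray:
    # Inverse DCT yields the log spectrum; exp maps it to the linear spectral domain.
    return np.exp(c.T @ cep)


def linear_to_cepstrum(lin: np.ndarray, c: np.ndarray) -> np.ndarray:
    # log returns to the log-spectral domain; the DCT returns to the cepstral domain.
    return c @ np.log(lin)


def pmc_log_add(speech_mean_cep, noise_mean_cep, gain: float = 1.0) -> np.ndarray:
    """Log-add PMC for one mean vector (hypothetical helper, means only).

    `gain` is a hypothetical level-matching factor applied to the speech term.
    """
    speech = np.asarray(speech_mean_cep, dtype=float)
    noise = np.asarray(noise_mean_cep, dtype=float)
    c = dct_matrix(len(speech))
    s_lin = cepstrum_to_linear(speech, c)
    n_lin = cepstrum_to_linear(noise, c)
    combined = gain * s_lin + n_lin  # additivity assumed in the linear spectral domain
    return linear_to_cepstrum(combined, c)  # noise-superimposed mean, cepstral domain


def pmc_adapt_all(speech_means, noise_mean_cep, gain: float = 1.0):
    # The nonlinear mapping is repeated for every Gaussian mean of every model.
    return [pmc_log_add(mu, noise_mean_cep, gain) for mu in speech_means]
```

Because the exponential and logarithm must be evaluated for every mean of every state of every model (and, in full PMC, for the covariances as well), the cost grows directly with model size, which is the computational burden described in the preceding paragraph.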
Therefore, a need exists for a fast and computationally inexpensive method of speech recognition in noisy environments that does not require a prior recognition pass or a large memory capacity.