1. Field of the Invention
The present invention relates to an automatic speech recognition method and to an automatic speech recognition apparatus, using non-linear envelope detection of signal power spectra.
2. Description of the Related Art
Automatic Speech Recognition (ASR) systems are designed to automatically associate words selected from a given dictionary with acoustic speech signals generated by either a human or a synthetic speaker. ASR systems are normally provided with equipment to process speech signals by performing so-called front-end steps and back-end steps.
Front-end processing generally includes speech signal detection and extraction of features on which recognition is to be based. In back-end processing, extracted features are compared to speech patterns and, based on decision rules, words are associated with detected speech signals.
Owing to their learning capability and to training, ASR systems can be adapted to operate in specific configurations, e.g., with specific sets of acoustic sensors or in specific environmental conditions. Quite high rates of correctly recognized words may thus be achieved in the absence of interference.
However, an acoustic sensor may show an ill-behaved frequency response and noise sources may be present, such as voiced or unvoiced noises, echoes and reverberations. These and other similar conditions may seriously affect the performance of an ASR system.
Front-end and back-end techniques have been proposed to reduce the sensitivity of ASR systems to interference. Back-end techniques aim at providing robust characterization of an acoustic model and include model adaptation techniques (such as Maximum Likelihood Linear Regression or Maximum A Posteriori estimation), model compensation techniques and robust statistics. Examples of front-end techniques, which normally require less intensive computational effort, include methods based on Mel Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding and Perceptual Linear Predictive Coding. In particular, MFCC-based techniques require spectral power estimation for the speech signals and power spectrum envelope detection. Additive noise affects regions of low spectral power density (valleys) more seriously than regions of high spectral power density (peaks), because the signal-to-noise ratio is poor in valleys. For this reason, non-linear processing is used for the purpose of noise suppression in power spectrum valleys.
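By way of illustration only, the effect of additive noise on peaks and valleys may be sketched numerically as follows. The sketch estimates the power spectrum of a short frame with a naive discrete Fourier transform, compares the signal-to-noise ratio in a high-power bin and in a low-power bin, and applies a simple flooring non-linearity. The frame length, the test signal, the flat noise level and the function names `power_spectrum`, `snr_db` and `floor_valleys` are assumptions introduced for this example; the flooring rule in particular is a hypothetical stand-in and is not asserted to be the non-linear processing used by any specific known technique.

```python
import cmath
import math


def power_spectrum(frame):
    """One-sided power spectrum |X[k]|^2 / N of a real frame (naive DFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio of a single bin, in decibels."""
    return 10.0 * math.log10(signal_power / noise_power)


def floor_valleys(power, rel_threshold=0.1):
    """Hypothetical non-linearity: raise every bin lying below
    rel_threshold * max(power) up to that floor value."""
    floor = rel_threshold * max(power)
    return [max(p, floor) for p in power]


# Assumed test frame: a strong component in bin 2 (peak) and a
# component ten times weaker in amplitude in bin 5 (low-power region).
N = 16
frame = [
    math.cos(2 * math.pi * 2 * t / N) + 0.1 * math.cos(2 * math.pi * 5 * t / N)
    for t in range(N)
]
clean = power_spectrum(frame)

# Assumed flat additive noise power per bin.
noise_level = 0.04

peak_snr = snr_db(clean[2], noise_level)    # approximately 20 dB
valley_snr = snr_db(clean[5], noise_level)  # approximately 0 dB

# After flooring, the low-power bin no longer carries its own (noisy) value.
floored = floor_valleys(clean)
```

With the same noise power in every bin, the high-power bin retains a margin of about 20 dB while the low-power bin is at about 0 dB, so the information in the latter is essentially buried in noise. Flooring is only one of many possible non-linear treatments of the valleys.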
It would be desirable to further adapt existing techniques to more closely simulate the human aural system, thereby effectively exploiting information associated with speech signals.