1. Field of the Invention
This invention relates generally to electronic speech recognition systems, and relates more particularly to a method for performing speech recognition in cyclostationary noise environments.
2. Description of the Background Art
Implementing an effective and efficient method for system users to interface with electronic devices is a significant consideration of system designers and manufacturers. Automatic speech recognition is one promising technique that allows a system user to effectively communicate with selected electronic devices, such as digital computer systems. Speech typically consists of one or more spoken utterances which may each include a single word or a series of closely-spaced words forming a phrase or a sentence.
An automatic speech recognizer typically builds a comparison database for performing speech recognition when a potential user “trains” the recognizer by providing a set of sample speech. Speech recognizers tend to significantly degrade in performance when a mismatch exists between training conditions and actual operating conditions. Such a mismatch may result from various types of acoustic distortion.
Conditions with significant ambient background-noise levels present additional difficulties when implementing a speech recognition system. Examples of such noisy conditions may include speech recognition in automobiles or in certain other mechanical devices. In such user applications, in order to accurately analyze a particular utterance, a speech recognition system may be required to selectively differentiate between a spoken utterance and the ambient background noise.
Referring now to FIG. 1(a), an exemplary waveform diagram for one embodiment of clean speech 112 is shown. In addition, FIG. 1(b) depicts an exemplary waveform diagram for one embodiment of noisy speech 114 in a particular operating environment. In FIGS. 1(a) and 1(b), waveforms 112 and 114 are presented for purposes of illustration only. A speech recognition process may readily incorporate various other embodiments of speech waveforms.
From the foregoing discussion, it therefore becomes apparent that compensating for various types of ambient noise remains a significant consideration of designers and manufacturers of contemporary speech recognition systems.
In accordance with the present invention, a method is disclosed for performing speech recognition in cyclostationary noise environments. In one embodiment of the present invention, original cyclostationary noise from an intended operating environment of a speech recognition device may initially be provided to a characterization module, which may then preferably perform a cyclostationary noise characterization process to generate target stationary noise.
In certain embodiments, the original cyclostationary noise may preferably be provided to a Fast Fourier Transform (FFT) of the characterization module. The FFT may then preferably generate frequency-domain data by converting the original cyclostationary noise from the time domain to the frequency domain to produce a cyclostationary noise frequency-power distribution. The cyclostationary noise frequency-power distribution may include an array file with groupings of power values that each correspond to a different frequency, wherein the groupings each correspond to a different time frame.
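By way of illustration only, the frame-by-frame frequency-power distribution described above may be sketched as follows. The sketch assumes non-overlapping frames and uses a direct discrete Fourier transform in place of an FFT for brevity; the function names, the frame length, and the test signal are hypothetical and are not taken from the disclosure.

```python
import cmath
import math

def power_spectrum(frame):
    """Return the power |X[k]|^2 for bins k = 0..N/2 of one frame.

    A direct DFT (O(N^2)) stands in for the FFT; the result is the same.
    """
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def frequency_power_distribution(signal, frame_len):
    """Return one grouping (row) of power values per time frame.

    Assumption: frames do not overlap and a trailing partial frame is dropped.
    """
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    return [power_spectrum(frame) for frame in frames]

# Hypothetical cyclostationary-like input: a tone at bin 2 whose amplitude
# alternates between 1.0 and 0.5 from one frame to the next.
frame_len = 8
signal = []
for frame_index in range(4):
    amp = 1.0 if frame_index % 2 == 0 else 0.5
    signal += [amp * math.cos(2 * math.pi * 2 * t / frame_len) for t in range(frame_len)]

distribution = frequency_power_distribution(signal, frame_len)
```

Each row of `distribution` corresponds to one time frame, and each column to one frequency, which matches the array layout described in the text.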
An averaging filter from the characterization module may then access the cyclostationary noise frequency-power distribution, and responsively generate an average cyclostationary noise frequency-power distribution using any effective techniques or methodologies. For example, the averaging filter may calculate an average cyclostationary power value for each frequency of the cyclostationary noise frequency-power distribution across the different time frames to thereby produce the average cyclostationary noise frequency-power distribution which includes stationary characteristics of the original cyclostationary noise.
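One possible form of the averaging step is sketched below, assuming that the frequency-power distribution is held as a list of rows (one row per time frame, one column per frequency). The function name and the sample values are hypothetical, and a simple arithmetic mean is only one of the techniques the averaging filter might use.

```python
def average_power_distribution(distribution):
    """Average the power at each frequency across all time frames."""
    if not distribution:
        raise ValueError("distribution must contain at least one frame")
    num_frames = len(distribution)
    num_bins = len(distribution[0])
    return [sum(frame[k] for frame in distribution) / num_frames
            for k in range(num_bins)]

# Hypothetical distribution: the power at the middle frequency alternates
# from frame to frame, while the other frequencies stay constant.
distribution = [
    [0.0, 16.0, 2.0],
    [0.0, 4.0, 2.0],
    [0.0, 16.0, 2.0],
    [0.0, 4.0, 2.0],
]
average = average_power_distribution(distribution)
```

The frame-to-frame variation at the middle frequency collapses to a single mean value, so only the stationary characteristics of the noise remain in `average`.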
Next, white noise with a flat power distribution across a frequency range may preferably be provided to the Fast Fourier Transform (FFT) of the characterization module. The FFT may then preferably generate frequency-domain data by converting the white noise from the time domain to the frequency domain to produce a white noise frequency-power distribution that may preferably include a series of white noise power values that each correspond to a different frequency.
A modulation module of the characterization module may preferably access the white noise frequency-power distribution, and may also access the foregoing average cyclostationary noise frequency-power distribution. The modulation module may then modulate white noise power values of the white noise frequency-power distribution with corresponding cyclostationary power values from the average cyclostationary noise frequency-power distribution to advantageously generate a target stationary noise frequency-power distribution.
In certain embodiments, the modulation module may preferably generate individual target stationary power values of the target stationary noise frequency-power distribution by multiplying individual white noise power values of the white noise frequency-power distribution with corresponding individual cyclostationary power values from the average cyclostationary noise frequency-power distribution on a frequency-by-frequency basis. An Inverse Fast Fourier Transform (IFFT) of the characterization module may then preferably generate target stationary noise by converting the target stationary noise frequency-power distribution from the frequency domain to the time domain.
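The frequency-by-frequency multiplication and the subsequent inverse transform may be sketched as follows. The text does not state how phase is handled when returning to the time domain, so this sketch assumes that the phase of the white noise is retained and that each complex white noise bin is scaled by the square root of the corresponding average power value, which makes the resulting power equal to the product of the two power values. Gaussian samples stand in for the white noise, and all names and values are hypothetical.

```python
import cmath
import math
import random

def dft(x):
    """Direct discrete Fourier transform (stands in for the FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Direct inverse discrete Fourier transform (stands in for the IFFT)."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def shape_white_noise(white, avg_power):
    """Modulate white noise so that, at every frequency,
    target power = white noise power * average cyclostationary power.

    avg_power holds one value for each bin 0..N/2 (assumed layout).
    """
    n = len(white)
    spectrum = dft(white)
    shaped = []
    for k in range(n):
        # Bins above N/2 are negative frequencies; reuse the mirrored value
        # so the spectrum stays conjugate-symmetric and the output is real.
        bin_index = k if k <= n // 2 else n - k
        shaped.append(spectrum[k] * math.sqrt(avg_power[bin_index]))
    return [value.real for value in idft(shaped)]

random.seed(0)  # fixed seed so the sketch is repeatable
white = [random.gauss(0.0, 1.0) for _ in range(8)]
avg_power = [0.0, 10.0, 2.0, 0.5, 0.1]  # hypothetical averaged power values
target = shape_white_noise(white, avg_power)
```

Here `target` is a time-domain signal of the same length as `white` whose power at each frequency is the white noise power multiplied by the corresponding value in `avg_power`.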
A conversion module may preferably access an original training database that was recorded for training a recognizer of the speech recognition device based upon an intended speech recognition vocabulary of the speech recognition device. The conversion module may then preferably generate a modified training database by utilizing the target stationary noise to modify the original training database. In practice, the conversion module may add the target stationary noise to the original training database to produce the modified training database that then advantageously incorporates characteristics of the original cyclostationary noise to improve performance of the speech recognition device.
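The addition of the target stationary noise to the training data may be sketched as follows. The sketch assumes that each training utterance is a list of samples keyed by a label, that the noise is repeated when an utterance is longer than the noise, and that an optional gain controls the noise level; none of these details, nor any of the names used, are specified in the text.

```python
def add_noise(utterance, noise, gain=1.0):
    """Add noise to an utterance sample by sample.

    Assumption: the noise is looped if the utterance is longer than it.
    """
    if not noise:
        raise ValueError("noise must not be empty")
    return [sample + gain * noise[i % len(noise)]
            for i, sample in enumerate(utterance)]

def convert_database(original_db, noise, gain=1.0):
    """Return a modified copy of the database; the original is left unchanged."""
    return {label: add_noise(samples, noise, gain)
            for label, samples in original_db.items()}

# Hypothetical two-word training database and a short noise segment.
original_db = {
    "yes": [0.5, -0.5, 0.25, 0.0, 0.125],
    "no": [1.0, 0.0],
}
noise = [0.5, -0.25]
modified_db = convert_database(original_db, noise)
```

Because `convert_database` builds new lists, the original training database remains available alongside the modified one.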
A training module may then access the modified training database for training the recognizer. Following the foregoing training process, the speech recognition device may then effectively utilize the trained recognizer to optimally perform various speech recognition functions. The present invention thus efficiently and effectively performs speech recognition in cyclostationary noise environments.