1. Field of the Invention
This invention relates generally to electronic speech recognition systems, and relates more particularly to a method for performing microphone conversions in a speech recognition system.
2. Description of the Background Art
Implementing an effective and efficient method for system users to interface with electronic devices is a significant consideration of system designers and manufacturers. Automatic speech recognition is one promising technique that allows a system user to effectively communicate with selected electronic devices, such as digital computer systems. Speech typically consists of one or more spoken utterances which each may include a single word or a series of closely-spaced words forming a phrase or a sentence.
An automatic speech recognizer typically builds a comparison database for performing speech recognition when a potential user "trains" the recognizer by providing a set of sample speech. Speech recognizers tend to significantly degrade in performance when a mismatch exists between training conditions and actual operating conditions. Such a mismatch may result from various types of acoustic distortion. One source of acoustic distortion is the convolutive distortion that arises when different microphones are used during the training process and the actual speech recognition process.
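The term "convolutive distortion" may be illustrated with a minimal sketch, assuming that a microphone can be approximated as a short linear filter applied to the speech signal. That assumption, the sample values, and the names `speech`, `mic_response`, `dft`, and `circular_convolve` are illustrative and do not come from the present description. The sketch uses circular convolution for simplicity and checks that convolution in the time domain corresponds to a per-frequency multiplication after a Fourier transform.

```python
import cmath

def dft(x):
    """Plain discrete Fourier transform of a short sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def circular_convolve(x, h):
    """Circular convolution of signal x with a (shorter) impulse response h."""
    n = len(x)
    return [sum(h[m] * x[(t - m) % n] for m in range(len(h))) for t in range(n)]

# Hypothetical toy data: one short speech frame and a 3-tap microphone response.
speech = [0.0, 1.0, 0.5, -0.3, -0.8, 0.2, 0.6, -0.1]
mic_response = [1.0, 0.4, 0.1]

# What the microphone "records" under the linear-filter assumption.
recorded = circular_convolve(speech, mic_response)

# Zero-pad the response so all three transforms have the same length.
padded_response = mic_response + [0.0] * (len(speech) - len(mic_response))
S, H, R = dft(speech), dft(padded_response), dft(recorded)

# In the frequency domain the distortion is a per-bin multiplication.
max_error = max(abs(R[k] - S[k] * H[k]) for k in range(len(speech)))
```

Under this assumption, two different microphones apply two different per-frequency gains to the same utterance, which is one way to understand the mismatch between training conditions and operating conditions.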
Referring now to FIG. 1(a), an exemplary waveform diagram for one embodiment of speech 112 recorded with an original training microphone is shown. In addition, FIG. 1(b) depicts an exemplary waveform diagram for one embodiment of speech 114 recorded with a final microphone used in the actual speech recognition process. In practice, speech 112 of FIG. 1(a) and speech 114 of FIG. 1(b) typically exhibit mismatched characteristics, even when recording an identical utterance. This mismatch typically results in significantly degraded performance of a speech recognizer. In FIGS. 1(a) and 1(b), waveforms 112 and 114 are presented for purposes of illustration only. A speech recognition process may readily incorporate various other embodiments of speech waveforms.
From the foregoing discussion, it therefore becomes apparent that compensating for the use of different microphones is a significant consideration of designers and manufacturers of contemporary speech recognition systems.
In accordance with the present invention, a method is disclosed for performing microphone conversions in a speech recognition system. In one embodiment of the present invention, a speech module preferably initially captures an input signal with an original microphone, and simultaneously captures the same input signal with a final target microphone. In certain embodiments, the foregoing two recorded versions of the same input signal may be stored as speech data in a memory device.
The speech module preferably then accesses the recorded input signals using a feature extractor that separately processes the recorded input signals as recorded by the original microphone, and also as recorded by the final target microphone. A characterization module may preferably then perform a characterization process by analyzing the two versions of the same recorded input signal, and then responsively generating characterization values corresponding to the original microphone and the final microphone.
In certain embodiments, the characterization module may perform the foregoing characterization process by accessing the recorded input data as it is processed by the feature extractor in a frequency-energy domain following a fast Fourier transform procedure. In certain other embodiments, the characterization module may perform the foregoing characterization process further downstream by accessing the recorded input data as it is processed by the feature extractor in a cepstral domain following a frequency cosine transform process.
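The characterization process described above may be sketched as follows. The present description does not state how the characterization values are computed, so the formulas here are assumptions chosen for illustration: a per-bin ratio of average energies in the frequency-energy domain, and a per-coefficient difference of average values in the cepstral domain. The function names and the toy frames are hypothetical, and the frames are assumed to be time-aligned feature vectors taken from the two simultaneous recordings.

```python
def column_means(frames):
    """Average each feature dimension over all frames."""
    n_frames = len(frames)
    n_dims = len(frames[0])
    return [sum(frame[k] for frame in frames) / n_frames for k in range(n_dims)]

def characterize_energy(orig_frames, final_frames, floor=1e-12):
    """Assumed frequency-energy characterization: per-bin ratio of
    average energy (final microphone over original microphone)."""
    orig_mean = column_means(orig_frames)
    final_mean = column_means(final_frames)
    # The floor avoids division by zero for silent bins.
    return [f / max(o, floor) for o, f in zip(orig_mean, final_mean)]

def characterize_cepstral(orig_frames, final_frames):
    """Assumed cepstral characterization: per-coefficient offset
    (final microphone mean minus original microphone mean)."""
    orig_mean = column_means(orig_frames)
    final_mean = column_means(final_frames)
    return [f - o for o, f in zip(orig_mean, final_mean)]

# Hypothetical features for the same input signal, two frames of two values each.
orig_frames = [[1.0, 2.0], [3.0, 2.0]]
final_frames = [[2.0, 1.0], [6.0, 1.0]]

energy_ratios = characterize_energy(orig_frames, final_frames)
cepstral_offsets = characterize_cepstral(orig_frames, final_frames)
```

A ratio suits the frequency-energy domain because a linear microphone response scales each frequency bin, while an offset suits the cepstral domain because the logarithm taken before the cosine transform turns that scaling into an addition. Both choices are design assumptions of this sketch.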
The speech module preferably then utilizes the feature extractor to process an original training database that was initially recorded using the original microphone. Next, a conversion module may preferably convert the original training database into a final training database by utilizing the characterization values that were previously generated by the characterization module.
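The conversion of the original training database may be sketched as follows, assuming that the characterization values are either per-bin energy ratios or per-coefficient cepstral offsets. Both forms are assumptions of this sketch, as are the dictionary layout of the database and the names `convert_database`, `convert_energy_frames`, and `convert_cepstral_frames`. The present description does not specify the conversion arithmetic.

```python
def convert_energy_frames(frames, ratios):
    """Scale each frequency-energy bin by its assumed per-bin ratio."""
    return [[e * r for e, r in zip(frame, ratios)] for frame in frames]

def convert_cepstral_frames(frames, offsets):
    """Shift each cepstral coefficient by its assumed per-coefficient offset."""
    return [[c + o for c, o in zip(frame, offsets)] for frame in frames]

def convert_database(database, values, domain="cepstral"):
    """Apply the characterization values to every utterance in the database.

    database: dict mapping an utterance id to a list of feature frames
              extracted from original-microphone recordings.
    values:   characterization values for the chosen domain.
    """
    if domain == "cepstral":
        convert = convert_cepstral_frames
    elif domain == "energy":
        convert = convert_energy_frames
    else:
        raise ValueError("domain must be 'cepstral' or 'energy'")
    return {utt_id: convert(frames, values) for utt_id, frames in database.items()}

# Hypothetical original training database with a single utterance.
original_db = {"utt1": [[1.0, 2.0], [3.0, 4.0]]}

final_db_cepstral = convert_database(original_db, [0.5, -1.0], domain="cepstral")
final_db_energy = convert_database(original_db, [2.0, 0.5], domain="energy")
```

The converted features approximate what the final microphone would have produced for the same utterances, so a recognizer may be trained on them without re-recording the training database.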
A recognizer training program may then utilize the final training database to train a recognizer in the speech module. Finally, the speech module may advantageously utilize the trained recognizer in a speech recognition system that utilizes the final microphone to capture input data for optimized speech recognition, in accordance with the present invention. The present invention thus efficiently and effectively performs microphone conversions in a speech recognition system.