Speech recognition is generally well understood in the art. In a typical approach, audio samples (which may or may not contain actual speech) are digitized to facilitate digital processing. Various speech recognition features (such as, for example, cepstral coefficients) are extracted from the digitized audio samples, and at least some of these features are used in a pattern matching process to facilitate recognition of any speech content that may be contained in the audio samples.
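By way of illustration only, the feature extraction step described above can be sketched as follows. This is a minimal sketch and not a definitive implementation: it assumes a single short frame of digitized samples, computes a power spectrum with a naive discrete Fourier transform, takes the logarithm, and applies an unnormalized cosine transformation (DCT-II) to obtain cepstral coefficients. The function names, the `num_coeffs` parameter, and the `floor` value are hypothetical and do not come from the text above; practical systems usually add windowing and a perceptually motivated filter bank, which are omitted here.

```python
import math


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 of a real-valued frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        spectrum.append(re * re + im * im)
    return spectrum


def dct2(values):
    """Unnormalized DCT-II (the cosine transformation) of a sequence."""
    m = len(values)
    return [
        sum(v * math.cos(math.pi * k * (j + 0.5) / m) for j, v in enumerate(values))
        for k in range(m)
    ]


def cepstral_coefficients(frame, num_coeffs=4, floor=1e-10):
    """Log power spectrum followed by a DCT-II; keeps the first num_coeffs values.

    `floor` (an assumed value) prevents taking the logarithm of zero.
    """
    log_spectrum = [math.log(max(p, floor)) for p in power_spectrum(frame)]
    return dct2(log_spectrum)[:num_coeffs]
```

For example, `cepstral_coefficients([math.sin(2 * math.pi * 3 * t / 16) for t in range(16)])` returns four coefficients for a 16-sample frame containing a single sinusoid; these coefficients would then serve as inputs to the pattern matching process.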
Many prior art speech recognition approaches work relatively successfully in a laboratory setting where little or no ambient noise exists. Unfortunately, when used during more normal operating conditions, and particularly when used where audible ambient noise (i.e., non-speech content) exists, the recognition reliability of such approaches frequently drops considerably.
To mitigate the impact of such noise on recognition, many prior art suggestions revolve around trying to suppress the noise content before extracting the speech recognition features and/or conducting the pattern matching. So-called Wiener filtering and spectral subtraction, for example, seek to estimate the noise contribution (at least within the power spectrum) and to then effectively subtract that contribution from the sample.
Such approaches tend to treat noise as a relatively simple phenomenon. For example, these prior art techniques usually assume that noise is stationary over at least short periods of time and further assume that noise constitutes a purely additive component of the final audio sample. More usually, however, noise behaves in more complicated and unpredictable ways. As a result, the effectiveness of such prior art approaches can vary considerably (and often unpredictably) from moment to moment, platform to platform, and environment to environment.
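For illustration, spectral subtraction in the power spectrum, together with the two assumptions noted above, can be sketched as follows. This is a simplified sketch under stated assumptions rather than any particular prior art implementation: the noise estimate is taken as the average of frames presumed to contain only noise (the stationarity assumption), and that estimate is subtracted bin by bin from the noisy power spectrum (the additivity assumption). The function names and the `floor` parameter are hypothetical.

```python
def estimate_noise(noise_frames):
    """Average per-bin power over frames assumed to contain only noise.

    Relies on the stationarity assumption: the average is taken to remain
    representative of the noise in later frames.
    """
    if not noise_frames:
        raise ValueError("at least one noise frame is required")
    count = len(noise_frames)
    return [sum(bins) / count for bins in zip(*noise_frames)]


def spectral_subtraction(noisy_power, noise_estimate, floor=0.0):
    """Subtract the estimated noise power from each bin of a noisy frame.

    Relies on the additivity assumption: noisy power = speech power + noise
    power. Results are clamped at `floor` because power cannot be negative.
    """
    if len(noisy_power) != len(noise_estimate):
        raise ValueError("spectra must have the same number of bins")
    return [max(p - n, floor) for p, n in zip(noisy_power, noise_estimate)]
```

When the actual noise departs from the averaged estimate, as with non-stationary noise, the subtraction either leaves residual noise in a bin or removes speech energy from it, which is consistent with the variable effectiveness described above.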
In general, the contribution of ambient noise remains a stubborn and significant known problem that acutely impacts the reliability of most or all speech recognition techniques.
Another prior art problem stems from the speech recognition feature extraction process itself. In general, many such techniques process the spectral content of the audio sample in a way that permits noisy or otherwise unreliable content in a small portion of the sample to ultimately influence the accuracy of most or even all of the resultant extracted features. For example, the cosine transformation that characterizes the extraction of cepstral coefficients readily spreads noise-induced errors in a small portion of the sample across all of the corresponding coefficients for that sample.
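The spreading effect described above can be demonstrated with a short sketch. It assumes an unnormalized DCT-II as the cosine transformation and uses made-up log-spectral values chosen purely for illustration; the function names are hypothetical. A single spectral bin is perturbed, and the resulting change in every output coefficient is measured.

```python
import math


def dct2(values):
    """Unnormalized DCT-II (the cosine transformation) of a sequence."""
    m = len(values)
    return [
        sum(v * math.cos(math.pi * k * (j + 0.5) / m) for j, v in enumerate(values))
        for k in range(m)
    ]


def coefficient_changes(clean, corrupted):
    """Absolute change in each DCT coefficient between two spectra."""
    return [abs(a - b) for a, b in zip(dct2(clean), dct2(corrupted))]


# Illustrative (made-up) log-spectral values for one frame.
clean_spectrum = [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0]

# Corrupt exactly one bin, standing in for noise confined to a narrow band.
corrupted_spectrum = list(clean_spectrum)
corrupted_spectrum[2] += 1.0

changes = coefficient_changes(clean_spectrum, corrupted_spectrum)
```

Because each output coefficient is a weighted sum over every input bin, all eight entries of `changes` are nonzero in this example even though only one of the eight input bins was altered.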
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help to improve understanding of various embodiments of the present invention. Also, common but well-understood elements that are useful or necessary in a commercially feasible embodiment are typically not depicted in order to facilitate a less obstructed view of these various embodiments of the present invention.