Automatic speech recognition (ASR) technologies enable microphone-equipped computing devices to interpret speech and thereby provide an alternative to conventional human-to-computer input devices such as keyboards or keypads. A typical ASR system includes several basic elements. A microphone and an acoustic interface receive a user's utterance and digitize it into acoustic data. An acoustic pre-processor parses the acoustic data into information-bearing acoustic features. A decoder uses acoustic models to decode the acoustic features into utterance hypotheses. The decoder generates a confidence value for each hypothesis, reflecting the degree to which that hypothesis phonetically matches a subword of the utterance, and uses the confidence values to select a best hypothesis for each subword.
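The final decoder step described above, selecting a best hypothesis for each subword by confidence value, can be illustrated with a minimal sketch. The `Hypothesis` structure, its field names, the 0.0 to 1.0 confidence scale, and the `select_best` function are hypothetical and chosen only for illustration; they are not taken from any particular ASR system.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    subword_index: int  # position of the subword within the utterance (assumed field)
    text: str           # candidate subword produced by the decoder (assumed field)
    confidence: float   # assumed scale: 0.0 (poor match) to 1.0 (strong match)


def select_best(hypotheses):
    """Return the highest-confidence hypothesis for each subword, in utterance order."""
    best = {}
    for hyp in hypotheses:
        current = best.get(hyp.subword_index)
        # Keep this hypothesis if it is the first for its subword or beats the current best.
        if current is None or hyp.confidence > current.confidence:
            best[hyp.subword_index] = hyp
    return [best[index] for index in sorted(best)]
```

Given several competing hypotheses per subword, `select_best` returns one hypothesis per subword position. A real decoder may also weigh language-model or context scores, which this sketch leaves out.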
Speech recognition performance suffers when the sampling rate of incoming speech does not match the sampling rate used to create the acoustic models. For example, telephonic audio systems typically use an 8 kHz sampling rate over a 4 kHz spectral range, and automotive ASR systems normally use a 16 kHz sampling rate over an 8 kHz spectral range; in each case the usable spectral range is half the sampling rate. Thus, when a higher-resolution 16 kHz ASR system receives lower-resolution 8 kHz audio, the incoming audio includes acoustic features for a spectral range of 0 to 4 kHz but lacks acoustic features from the upper spectral range of 4 to 8 kHz. Because the ASR acoustic models are built for an overall 0 to 8 kHz spectral range, the lack of upper-range acoustic features degrades recognition performance, especially for fricative speech.
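The arithmetic behind this mismatch can be sketched briefly. The example assumes only that the usable spectral range of sampled audio extends up to half its sampling rate; the function names `nyquist_limit_hz` and `missing_band_hz` are hypothetical.

```python
def nyquist_limit_hz(sample_rate_hz):
    """Highest frequency representable at the given sampling rate (half the rate)."""
    return sample_rate_hz / 2.0


def missing_band_hz(audio_rate_hz, model_rate_hz):
    """Return the (low, high) band the models expect but the audio cannot supply.

    Returns None when the audio covers the full spectral range of the models.
    """
    audio_top = nyquist_limit_hz(audio_rate_hz)
    model_top = nyquist_limit_hz(model_rate_hz)
    if audio_top >= model_top:
        return None
    return (audio_top, model_top)


# 8 kHz telephonic audio presented to acoustic models built at 16 kHz.
print(missing_band_hz(8000, 16000))  # (4000.0, 8000.0)
```

For 8 kHz audio and 16 kHz models, the result is the 4 to 8 kHz band noted above. When the two rates match, the function returns `None`, meaning no band is missing.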
Therefore, in current ASR implementations, a different set of acoustic models is empirically developed for each sampling rate. But this approach requires a multitude of different and unnecessarily complex acoustic models, which can delay model development, increase required computing memory and power, and yield an unacceptable level of recognition latency.