ASR technologies enable microphone-equipped computing devices to interpret speech and thereby provide an alternative to conventional human-to-computer input devices such as keyboards or keypads. A typical ASR system includes several basic elements. A microphone and an acoustic interface receive an utterance of a word from a user and digitize the utterance into acoustic data. An acoustic pre-processor parses the acoustic data into information-bearing acoustic features. A decoder uses acoustic models to decode the acoustic features into utterance hypotheses. The decoder generates a confidence value for each hypothesis that reflects the degree to which the hypothesis phonetically matches a subword of the utterance, and uses the confidence values to select a best hypothesis for each subword. Using language models, the decoder concatenates the subwords into an output word corresponding to the user-uttered word.
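The selection and concatenation steps described above can be illustrated with a minimal sketch. The `Hypothesis` type, the confidence scale of 0 to 1, and the sample subwords are hypothetical and chosen only for illustration. The sketch also simplifies the concatenation step to plain string joining and does not model the language models that a real decoder would consult.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    subword: str       # candidate subword label (hypothetical)
    confidence: float  # assumed score in [0, 1]; higher means a closer phonetic match


def select_best(hypotheses):
    """Return the highest-confidence hypothesis for one subword position."""
    return max(hypotheses, key=lambda h: h.confidence)


def decode_word(slots):
    """Pick the best hypothesis per subword position and join them into a word."""
    return "".join(select_best(slot).subword for slot in slots)


# Two subword positions, each with competing hypotheses (illustrative values).
slots = [
    [Hypothesis("di", 0.82), Hypothesis("ti", 0.41)],
    [Hypothesis("al", 0.77), Hypothesis("ol", 0.30)],
]
print(decode_word(slots))
```

With these illustrative values, the highest-confidence hypotheses "di" and "al" are selected and joined into the output word.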
Users of ASR systems sometimes utter commands before the system is ready to receive them. For example, a user activates an ASR system, the system plays back a “Ready” prompt indicating that it is ready to receive commands, and a short time later the system initiates a listening period during which it is able to receive and record commands. When a user enunciates a command before the listening period has begun, the system hears only a portion of the uttered command and thus has difficulty understanding the utterance.
The present inventors discovered that premature enunciation causes ASR parameters to become maladjusted. ASR decoders assume that a first few frames of acoustic data after the Ready prompt are merely ambient noise. So when those first few frames instead include a partial utterance, actual values for noise suppression, channel compensation, and speech/silence detection parameters diverge from expected parameter values. This divergence causes an extended time-out period including decoder readjustment, and an error response of “Slower Please” followed by replay of the Ready prompt. The present inventors also discovered that the problem is exacerbated by such long delays, which cause users to speak even more prematurely and much louder.
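The effect of the ambient-noise assumption can be illustrated with a minimal sketch of an energy-based speech/silence detector. This is not the decoder of any particular ASR system: the function names, the number of initial frames (`n_init`), the threshold `ratio`, and the sample values are all assumptions made for illustration. The noise floor is estimated from the first few frames, so a partial utterance in those frames inflates the estimate and raises the detection threshold.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def estimate_noise_floor(frames, n_init=3):
    """Average energy of the first n_init frames, which are assumed to be ambient noise."""
    init = frames[:n_init]
    return sum(frame_energy(f) for f in init) / len(init)


def detect_speech(frames, noise_floor, ratio=4.0):
    """Flag each frame whose energy exceeds ratio * noise_floor."""
    return [frame_energy(f) > ratio * noise_floor for f in frames]


noise = [0.01, -0.01, 0.01, -0.01]   # low-amplitude ambient frame (illustrative)
speech = [0.5, -0.5, 0.5, -0.5]      # high-amplitude speech frame (illustrative)

# User waits for the listening period: the leading frames really are noise.
on_time = [noise, noise, noise, speech, speech]
# User speaks early: the leading frames contain the tail of the utterance.
early = [speech, speech, noise, speech, speech]

for label, frames in (("on time", on_time), ("early", early)):
    floor = estimate_noise_floor(frames)
    print(label, round(floor, 4), detect_speech(frames, floor))
```

In the on-time case the noise floor stays low and only the speech frames are flagged. In the early case the floor is dominated by speech energy, so the threshold rises above the energy of the speech frames and none of them are flagged, which is one way the speech/silence detection parameters could diverge from their expected values.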