Computing devices can be used to process a user's spoken commands, requests, and other utterances into written transcriptions. In a typical implementation, an automatic speech recognition system may take an audio signal (or data derived therefrom) as input and, using various models, determine the most likely sequence of words spoken by the user. The results can then be used by various applications to initiate a computing task, store a record of the transcription, or the like.
Many automatic speech recognition systems suffer from a lack of robustness when processing utterances in the presence of additional sounds, such as reverberation, acoustic echo, interfering speech, and environmental noise. Reverberation occurs when sound propagates in an environment (e.g., a room or other enclosed space), causing a build-up of reflections of the sound. The sound reflections are typically detected within about 30 milliseconds of the original sound (in contrast with acoustic echoes, which are usually detected more than 30 milliseconds after the original sound). When the sound source is removed, the reflected sound is absorbed by the environment and the sound level decays. A common measurement of reverberation, known as “RT60,” is the amount of time it takes for the intensity of a sound to decay by 60 decibels.
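The RT60 definition above can be illustrated with a minimal sketch. It assumes an idealized decay in which the sound level falls at a constant rate in decibels per second once the source is removed; real rooms only approximate this. The function names are hypothetical and chosen for illustration only.

```python
def rt60_from_decay_rate(decay_db_per_second):
    """Return RT60 in seconds for a level that falls at a constant dB/s rate.

    Assumption: the decay is linear in decibels (exponential in intensity).
    """
    if decay_db_per_second <= 0:
        raise ValueError("decay rate must be positive")
    # RT60 is the time needed for the level to drop by 60 dB.
    return 60.0 / decay_db_per_second


def level_drop_db(elapsed_seconds, rt60_seconds):
    """Return how many dB the level has dropped after elapsed_seconds."""
    if rt60_seconds <= 0:
        raise ValueError("RT60 must be positive")
    # Under the same idealized model, the drop grows linearly with time.
    return 60.0 * elapsed_seconds / rt60_seconds
```

Under this model, a level that decays at 120 decibels per second corresponds to an RT60 of 0.5 seconds, and halfway through that interval the level has dropped by 30 decibels.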
Automatic speech recognition systems configured to process audio signals captured in the presence of a particular level of reverberation may not produce satisfactory results when processing audio signals captured in the presence of a different level of reverberation. To compensate for this limitation, some systems determine the level of reverberation (e.g., the RT60 value) and adjust processing accordingly. For example, a system may play a recording with known acoustic characteristics, and then compute the level of reverberation based on an input signal captured by a microphone during and after playback of the recording. The detected reverberation level can then be used to adjust processing of one or more system components to improve speech recognition performance.
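One way such a measurement might be computed is sketched below. The sketch uses Schroeder backward integration, a commonly used technique that is not necessarily what any particular system employs. It assumes that a room impulse response has already been derived from the captured microphone signal (for example, by comparing it with the known recording); that derivation step is not shown. The function names, the parameter names, and the choice of fitting the decay between -5 and -25 decibels are illustrative assumptions.

```python
import math


def schroeder_decay_db(impulse_response):
    """Return the energy decay curve in dB, normalized to 0 dB at time zero."""
    energy = [x * x for x in impulse_response]
    curve = [0.0] * len(energy)
    remaining = 0.0
    # Backward integration: energy still to arrive from each sample onward.
    for i in range(len(energy) - 1, -1, -1):
        remaining += energy[i]
        curve[i] = remaining
    if not curve or curve[0] <= 0.0:
        raise ValueError("impulse response contains no energy")
    total = curve[0]
    return [10.0 * math.log10(c / total) if c > 0.0 else float("-inf")
            for c in curve]


def estimate_rt60(impulse_response, sample_rate_hz,
                  start_db=-5.0, end_db=-25.0):
    """Estimate RT60 in seconds by fitting a line to part of the decay curve.

    Assumption: the decay between start_db and end_db is roughly linear in
    dB, so its slope can be extrapolated to a full 60 dB drop.
    """
    curve = schroeder_decay_db(impulse_response)
    points = [(i / sample_rate_hz, level)
              for i, level in enumerate(curve)
              if end_db <= level <= start_db]
    if len(points) < 2:
        raise ValueError("not enough decay range to fit a slope")
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_l = sum(level for _, level in points) / n
    # Ordinary least-squares slope of level (dB) against time (seconds).
    num = sum((t - mean_t) * (level - mean_l) for t, level in points)
    den = sum((t - mean_t) ** 2 for t, _ in points)
    slope_db_per_second = num / den
    if slope_db_per_second >= 0.0:
        raise ValueError("signal does not decay over the fitted range")
    return -60.0 / slope_db_per_second
```

The fit deliberately skips the first 5 decibels of decay, which are dominated by the direct sound, and stops well above the tail of the response, where truncation and noise tend to bend the curve. For a synthetic impulse response whose amplitude falls by 60 decibels over 0.5 seconds, `estimate_rt60` should return a value close to 0.5.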