Speech recognition servers can receive and recognize voice input. Typically, speech recognition servers reside in a cloud-based computing resource and receive the input sent to them over one or more wired and/or wireless networks in real time. Some mobile devices require the user to press a button to signal the mobile device to activate speech recognition. After speech recognition is activated, the user can speak to the device. Various mobile devices allow the user to speak a wakeup keyword (e.g., “ok Google”) to activate speech recognition on the mobile device. After uttering a command (e.g., “when is the next 49ers game?”), the user can expect a quick response.
Users sometimes have to utter commands in noisy conditions, such as when there are other voices in the background. In such conditions, the speech recognition (SR) engine may receive microphone input that includes speech from both the speaker (the user) and other speakers in the background. As a result, the SR engine may not recognize the speech of the speaker accurately.
In particular, an issue may arise with some pre-processing algorithms that use multiple microphones and take time to adjust their parameters to optimal values when a voice comes from a new direction. This can occur, for example, when a user changes his/her position relative to the device (for example, the user moves to a different part of a room relative to a tablet or a TV set, or changes his/her hand orientation while holding a cellphone). When a talker (speaker) first speaks from the new position, the processor/algorithm can adapt many of its internal parameters to account for the change (for example, an explicit or implicit direction-of-arrival estimate, and noise estimates) and then settle on optimal parameters for the new orientation. During this transitional time, however, the processing is not optimal and may even degrade the speech signal. Thus, the beginning of the utterance can be distorted or, at best, processed with less noise removed, until the processor/algorithm settles on the optimal parameters.
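The transitional period described above can be illustrated with a toy simulation. The sketch below is not the pre-processing algorithm of any particular device; it assumes, purely for illustration, that the direction-of-arrival estimate is updated once per audio frame by a simple first-order recursive smoother. The function names, the step size, the 5-degree tolerance, and the frame counts are all hypothetical.

```python
def track_direction(observed_angles, step_size=0.1, initial_estimate=0.0):
    """Track a direction of arrival (in degrees) with a first-order smoother.

    Assumption: each frame yields a noiseless observation of the talker's
    true angle, and the estimate moves a fixed fraction (step_size) of the
    remaining error toward it. Real multi-microphone algorithms are more
    elaborate; this only models the adaptation lag.
    """
    estimate = initial_estimate
    estimates = []
    for angle in observed_angles:
        estimate += step_size * (angle - estimate)
        estimates.append(estimate)
    return estimates


def frames_to_settle(estimates, target, tolerance_deg=5.0):
    """Return the first frame index from which the estimate stays within
    tolerance_deg of target, or None if it never settles."""
    for i in range(len(estimates)):
        if all(abs(e - target) <= tolerance_deg for e in estimates[i:]):
            return i
    return None


# Hypothetical scenario: the talker is at 0 degrees for 50 frames,
# then moves so that the voice arrives from 60 degrees.
observed = [0.0] * 50 + [60.0] * 100
estimates = track_direction(observed)

# Count the frames after the move during which the estimate is still
# far from the new direction (the transitional period).
settle = frames_to_settle(estimates[50:], target=60.0)
print("frames before the estimate settles:", settle)
```

In this toy model, every frame before the settling point is processed with an estimate that still points partly toward the old position, which corresponds to the degraded beginning of the utterance. A larger step size shortens the transition, but in a real system it would typically make the estimate more sensitive to noise.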