User interfaces have traditionally relied on input devices such as keyboards, which require physical manipulation by a user. Increasingly, however, it is desirable to detect and respond to more natural user input such as speech. Indeed, automated speech recognition has become a viable technology in certain environments, allowing users to provide spoken input to computerized systems. However, automatic speech recognition can be challenging in spacious environments and in other environments where it is difficult to isolate the voice of a user from other sounds, including ambient noise and the voices of other users.
Audio source separation may be used in some situations in an attempt to produce an audio signal that is focused on a particular area of the environment. For example, multiple microphones may be distributed throughout an environment in order to obtain audio signals from corresponding regions of the environment, and input from the microphones may be selected to emphasize a certain area. As another example, multiple directional microphones may be used to generate audio signals corresponding to different parts of an environment, and a particular area can be chosen by selecting one or more of the directional microphones. In some cases, microphone directionality may be dynamically configured using beamforming techniques in conjunction with a microphone array.
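As one illustration of how microphone directionality might be configured with a microphone array, the following is a minimal sketch of delay-and-sum beamforming, a common textbook technique that the passage above does not itself name. It assumes a linear array along one axis, a far-field source, whole-sample delays, and a nominal speed of sound; the function names `steering_delays` and `delay_and_sum` are hypothetical and chosen only for this example.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second; assumed nominal value for air


def steering_delays(mic_positions_m, angle_deg, sample_rate_hz):
    """Return per-microphone delays, in whole samples, that time-align a
    plane wave arriving from angle_deg (measured from broadside, positive
    toward +x) across a linear array laid out along the x-axis."""
    theta = math.radians(angle_deg)
    # Relative arrival time at each microphone: microphones further along
    # +x are closer to the source and hear the wavefront earlier.
    arrivals = [-(x * math.sin(theta)) / SPEED_OF_SOUND for x in mic_positions_m]
    latest = max(arrivals)
    # Delay every channel so that it lines up with the latest arrival.
    return [round((latest - t) * sample_rate_hz) for t in arrivals]


def delay_and_sum(channels, delays):
    """Shift each channel by its delay and average the shifted channels.
    Sound from the steered direction adds coherently; sound from other
    directions is misaligned and is attenuated by the averaging."""
    length = len(channels[0])
    output = [0.0] * length
    for signal, delay in zip(channels, delays):
        for n in range(delay, length):
            output[n] += signal[n - delay]
    return [value / len(channels) for value in output]
```

Changing `angle_deg` re-steers the array without moving any hardware, which is the sense in which directionality can be configured dynamically. A practical system would likely use fractional delays or frequency-domain weights rather than whole-sample shifts.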
In situations where an audio signal may be tuned or selected to emphasize sound from different parts or areas of an environment, it may be possible to isolate spoken audio by detecting the location of the user within the environment and configuring the audio signal to focus on that location. If the user can be reliably located, this technique can improve the results of automatic speech recognition. However, it can be difficult to identify the user's location, particularly in situations where the user may be moving or where multiple users may speak at different times or even at the same time.
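The idea of configuring the audio signal to focus on a detected user location can be sketched, under simplifying assumptions, as choosing the distributed microphone nearest to that location. The sketch assumes the user's position has already been estimated by some other means, which is the hard part noted above, and that microphone and user positions share one coordinate system; `nearest_microphone` is a hypothetical helper, not part of any described system.

```python
import math


def nearest_microphone(mic_positions, user_position):
    """Return the index of the microphone closest to user_position.
    Positions are coordinate tuples in the same units (assumed metres)."""
    return min(
        range(len(mic_positions)),
        key=lambda i: math.dist(mic_positions[i], user_position),
    )


# Hypothetical room layout: three microphones and a user who moves.
mics = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
for user in [(3.5, 0.5), (0.2, 2.0)]:
    # Re-run the selection whenever a new location estimate arrives.
    print("user at", user, "-> microphone", nearest_microphone(mics, user))
```

Repeating the selection each time the location estimate changes lets the chosen signal follow a moving user. With several simultaneous speakers, one selection per estimated location would be needed.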