1. Field of the Invention
The present invention generally relates to systems that automatically determine the position of a desired audio source, such as a talker, based on audio input received via an array of microphones.
2. Background
As used herein, the term audio source localization refers to a technique for automatically determining the position of a desired audio source, such as a talker, in a room or other area. FIG. 1 is a block diagram of an example system 100 that performs audio source localization. System 100 may represent, for example and without limitation, a speakerphone, an audio teleconferencing system, a video game system, or other system capable of both capturing and playing back audio signals.
As shown in FIG. 1, system 100 includes receive processing logic 102 that processes an audio signal for playback via speakers 104. The audio signal processed by receive processing logic 102 may be received from a remote audio source such as a far-end speaker in a speakerphone or audio teleconferencing scenario. Additionally or alternatively, the audio signal processed by receive processing logic 102 may be generated by system 100 itself or some other source connected locally thereto.
As further shown in FIG. 1, system 100 includes an array of microphones 106 that converts sound waves produced by local audio sources into audio signals. These audio signals are then processed by audio source localization logic 108. In particular, audio source localization logic 108 periodically processes the audio signals generated by microphone array 106 to estimate a current position of a desired audio source 114. Desired audio source 114 may represent, for example, a near-end talker in a speakerphone or audio teleconferencing scenario. The estimated current position of desired audio source 114 as determined by audio source localization logic 108 may be defined, for example, in terms of an estimated current direction of arrival of sound waves emanating from desired audio source 114.
System 100 also includes a steerable beamformer 110 that is configured to process the audio signals generated by microphone array 106 to produce a single output audio signal. In producing the output audio signal, steerable beamformer 110 performs spatial filtering based on the estimated current position of desired audio source 114 such that signal components attributable to sound waves emanating from positions other than the estimated current position of desired audio source 114 are attenuated relative to signal components attributable to sound waves emanating from the estimated current position of desired audio source 114. This tends to have the beneficial effect of attenuating undesired audio sources relative to desired audio source 114, thereby improving the overall quality and intelligibility of the output audio signal. In a speakerphone or audio teleconferencing scenario, the output audio signal produced by steerable beamformer 110 is transmitted to a far-end listener.
The information produced by audio source localization logic 108 may also be useful for applications other than steering a beamformer used for acoustic transmission. For example, the information produced by audio source localization logic 108 may be used in a video game system to integrate the estimated current position of a player within a room into the context of a game. Various other beneficial applications of audio source localization also exist. These applications are generally represented in system 100 by the element labeled “other applications” and marked with reference numeral 112.
One problem for system 100 and other systems that perform audio source localization is the presence of acoustic echo 116. Acoustic echo 116 is generated when system 100 plays back audio signals, an echo of which is picked up by microphone array 106. In a speakerphone or audio teleconferencing system, such echo may be attributable to speech signals representing the voices of one or more far-end speakers that are played back by the system. In a video game system, acoustic echo may also be attributable to music, sound effects, and/or other audio content produced by a game, as well as the voices of other players when online interaction with remote players is supported. It is noted, however, that many systems exist that implement audio source localization but do not play back audio signals. For these systems, the presence of acoustic echo is not an issue.
Another problem for system 100 and other systems that perform audio source localization is the presence of noise and/or interference 118 in the environment of desired audio source 114. As used herein, the term noise generally refers to undesired audio that tends to be stationary in nature while the term interference generally refers to undesired audio that tends to be non-stationary in nature.
The presence of echo, noise and/or interference can cause audio source localization logic 108 to perform poorly, since the logic may not be able to adequately distinguish between desired audio source 114 whose position is to be determined and the echo, noise and/or interference. This may cause audio source localization logic 108 to incorrectly estimate the current position of desired audio source 114.
One known technique for performing audio source localization is termed the Steered Response Power (SRP) technique. SRP is widely considered to be the most robust approach for performing audio source localization in the presence of noise. SRP typically involves using a microphone array to steer beams generated using the well-known delay-and-sum beamforming technique so that the beams are pointed in different directions in space (referred to herein as the “look” directions of the beams). The delay-and-sum beams may be spectrally weighted. The look direction associated with the delay-and-sum beam that provides the maximum response power is then chosen as the direction of arrival of sound waves emanating from the desired audio source. The delay-and-sum beam that provides the maximum response power may be determined, for example, by finding the index i that satisfies:
$$\underset{i}{\operatorname{argmax}} \; \sum_{f} \left| B_i(f,t) \right|^{2} \cdot W(f), \quad \text{for } i = 1 \ldots n,$$

wherein n is the total number of delay-and-sum beams, $B_i(f,t)$ is the response of delay-and-sum beam i at frequency f and time t, $|B_i(f,t)|^2$ is the power of the response of delay-and-sum beam i at frequency f and time t, and W(f) is a spectral weight associated with frequency f. Note that in this particular approach the response power constitutes the sum of a plurality of spectrally-weighted response powers determined at a plurality of different frequencies.
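By way of illustration only, the beam selection expressed above may be sketched in Python as follows. The function names `srp_response_powers` and `srp_best_beam` and the list-of-lists layout of the beam responses are assumptions made for this sketch and do not appear in the description above. Note also that the sketch returns a zero-based index, whereas the expression above numbers the beams from 1 to n.

```python
def srp_response_powers(B, W):
    """Return the spectrally-weighted response power of each beam.

    B is a list of n beams; B[i][k] is the (possibly complex) response of
    beam i at the k-th frequency for a single time t (hypothetical layout).
    W[k] is the spectral weight associated with the k-th frequency.
    """
    # For each beam, sum |B_i(f, t)|^2 * W(f) over all frequencies.
    return [sum(abs(b) ** 2 * w for b, w in zip(beam, W)) for beam in B]


def srp_best_beam(B, W):
    """Return the zero-based index of the beam with maximum response power."""
    powers = srp_response_powers(B, W)
    return max(range(len(powers)), key=powers.__getitem__)


if __name__ == "__main__":
    # Three beams, two frequencies; values are made up for illustration.
    responses = [[1, 1], [2, 0], [0, 1j]]
    weights = [1.0, 1.0]
    print(srp_best_beam(responses, weights))
```

The look direction associated with the returned index would then be taken as the estimated direction of arrival. Changing the spectral weights can change which beam is selected, since each weight scales the contribution of its frequency to the summed power.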
There are certain problems associated with using SRP, as that technique is conventionally implemented, for performing audio source localization. For example, delay-and-sum beams are often not directive enough to provide good spatial resolution. To help illustrate this, FIG. 2 shows the level of response of five delay-and-sum beams having different look directions as a function of sound wave direction of arrival. The beams were generated using a linear array of five microphones spaced 2 cm apart and the response was measured at 1000 Hertz (Hz). The look direction of each beam and the directions of arrival are expressed in terms of angular difference from a reference direction, which in this case is the broadside direction of the microphone array. As shown in FIG. 2, the relevant delay-and-sum beams have look directions corresponding to −90°, −45°, 0°, 45° and 90°. The curves show that the delay-and-sum beams lead to correct SRP properties in that the maximum response power for each beam is obtained at or around the look direction of the beam.
However, the curves also show that the delay-and-sum beams provide poor directivity. For example, three out of the five delay-and-sum beams provide levels of response that are within about 0.3 dB for directions of arrival near ±30°. Such a small separation between response levels can lead to problems identifying the best beam if the resolution used to represent response levels and/or the difference between response levels is too coarse. Furthermore, to accommodate such small separation in response levels, systems may be implemented that use very small thresholds to determine when to switch between beams. In such systems, minor variability in response levels can lead to frequent and undesired switching between beams.
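The small separation between response levels may be explored numerically with the following Python sketch, which models an idealized, uniformly-weighted delay-and-sum beam for a linear array of five microphones spaced 2 cm apart at 1000 Hz. The sketch rests on several assumptions that are not stated above: a far-field plane wave, identical omnidirectional microphones, and a speed of sound of 343 m/s. The function name `das_response_db` is hypothetical. The sketch is not asserted to reproduce the curves of FIG. 2 exactly.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # meters per second; assumed value


def das_response_db(look_deg, arrival_deg, n_mics=5, spacing_m=0.02,
                    freq_hz=1000.0, c=SPEED_OF_SOUND):
    """Level of response (dB) of an idealized delay-and-sum beam.

    Angles are measured from the broadside direction of a uniform linear
    array. The beam is steered to look_deg and the plane wave arrives
    from arrival_deg.
    """
    # Residual phase step between adjacent microphones after steering.
    psi = (2.0 * math.pi * freq_hz * spacing_m / c) * (
        math.sin(math.radians(arrival_deg)) - math.sin(math.radians(look_deg))
    )
    # Uniformly-weighted sum of the steered microphone signals.
    total = sum(cmath.exp(1j * psi * m) for m in range(n_mics))
    magnitude = abs(total) / n_mics
    # Floor the magnitude to avoid taking the logarithm of zero at a null.
    return 20.0 * math.log10(max(magnitude, 1e-12))


if __name__ == "__main__":
    for look in (-90, -45, 0, 45, 90):
        print(look, round(das_response_db(look, 30.0), 2))
```

Under these assumptions, the beams with look directions of 0°, 45° and 90° yield levels of response that lie within a fraction of a decibel of one another for a direction of arrival of 30°, which is consistent with the poor directivity described above.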
In addition to failing to provide good spatial resolution, delay-and-sum beams are generally not the type of beams used for performing acoustic transmission as performed by, for example, many speakerphones and audio teleconferencing systems. Rather, minimum variance distortionless response (MVDR) beams or other super-directive beams that are more suitable for reducing noise, interference and/or acoustic coupling with a loudspeaker are often used. Thus, if SRP is performed to determine the direction of arrival of sound waves emanating from a desired audio source and that direction of arrival is then used to steer an MVDR beamformer or other super-directive beamformer for the purposes of acoustic transmission, then it will be impossible to know in advance what consequence an audio source localization error will have on the quality of the audio signal obtained for acoustic transmission.
What is needed, then, is a system for performing audio source localization that addresses one or more of the aforementioned shortcomings associated with conventional approaches.