This invention relates to audio signal processing and, in particular, to a circuit that estimates direction of arrival using plural microphones.
As used herein, “telephone” is a generic term for a communication device that utilizes, directly or indirectly, a dial tone from a licensed service provider. For the sake of simplicity, the invention is described in the context of a telephone but has broader utility, e.g., in communication devices that do not utilize a dial tone, such as radio frequency transceivers or intercoms.
The present disclosure finds use in many applications where the internal electronics are essentially the same but the external appearance of the device is different. FIG. 1 illustrates a conference phone or speaker phone 10 such as found in business offices. Telephone 10 may include a plurality of microphones 11, 12, 13, and a speaker 15 in a sculptured case.
FIG. 2 illustrates what is sometimes referred to as a hands-free kit 20 for providing audio coupling to a cellular telephone (not shown). Hands-free kits come in a variety of implementations but generally include a case 16, a powered speaker 17 and a plug 18, which may couple to an accessory outlet or a cigarette lighter socket in a vehicle. Case 16 may contain more than one microphone or one of the microphones (not shown) may be separate and may plug into case 16. The external microphone may be for placement as close to a user as possible, e.g., clipped to a visor in a vehicle. Hands-free kit 20 may also include a cable for connection to a cellular telephone or have a wireless connection, such as a BLUETOOTH® interface, for example. A hands-free kit in the form of a head set may be powered by internal batteries but may be electrically similar to the apparatus illustrated in FIG. 2.
Communication with telephones, hands-free devices, and other communication systems is often attempted in noisy acoustical environments. For example, communications with a multi-microphone telephone (e.g., telephone 10) may take place in a conference room or office with poor acoustics and significant background noise. Hands-free kits (e.g., hands-free kit 20) may often be used in even harsher acoustic environments, such as automobiles, airports, and restaurants.
As used herein, “noise” refers to any unwanted sound, whether the unwanted sound is periodic, purely random, or somewhere in between. As such, noise includes background music, voices (herein referred to as “babble”) of people other than the desired speaker, tire noise, wind noise, etc.
Many digital signal processing techniques have been proposed for reducing noise. In products with a single microphone, reducing noise is quite difficult when the desired speech and the noise share the same frequency spectrum. It is difficult for these techniques to remove noise without damaging the desired speech. However, if the origin of the noise and the origin of the desired speech are spatially separated, then one can theoretically extract a clean speech signal from a noisy speech signal. One approach to spatially separating the origin of noise and the origin of desired speech is known as beam forming. Beam forming may be employed in a communication device having two or more microphones. With beam forming, one or more beams may be formed by a processing device (e.g., microprocessor, digital signal processor, etc.) of the communication device, wherein each beam acts as a spatial filter that passes acoustic energy from some spatial directions while filtering out acoustic energy from other directions. By forming a beam that points at or near a desired source of acoustic energy (e.g., a person who is speaking), the desired acoustic energy of the speaker may be passed by the spatial filter implemented by a beam while acoustic energy from noise sources or reflections of the desired source may be rejected or attenuated. In this manner, audio quality of the communication device may be improved.
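By way of illustration only, the spatial filtering performed by a beam may be sketched as a simple delay-and-sum beamformer. The sketch below is not taken from this disclosure; it assumes two microphones, integer-sample steering delays, and hypothetical function names (`delay_and_sum`, `delayed`, `energy`) chosen for this example.

```python
import math


def delay_and_sum(channels, steering_delays):
    """Average the microphone channels after advancing each channel by its
    steering delay (in whole samples). Sound arriving from the steered
    direction adds coherently; sound from other directions adds with a
    phase mismatch and is attenuated."""
    n = len(channels[0])
    out = [0.0] * n
    for samples, delay in zip(channels, steering_delays):
        for i in range(n):
            j = i + delay
            if 0 <= j < n:
                out[i] += samples[j]
    return [value / len(channels) for value in out]


def delayed(signal, delay):
    """Return a copy of signal shifted later by delay samples (zero-padded)."""
    n = len(signal)
    return [signal[i - delay] if 0 <= i - delay < n else 0.0 for i in range(n)]


def energy(signal):
    """Sum of squared samples."""
    return sum(value * value for value in signal)


# Assumed test tone with a period of 8 samples.
tone = [math.sin(2.0 * math.pi * i / 8.0) for i in range(64)]

# Desired source: reaches microphone 2 two samples after microphone 1.
desired = [tone, delayed(tone, 2)]
# Interfering source on the opposite side: reaches microphone 1 two samples late.
interferer = [delayed(tone, 2), tone]

# Steer the beam toward the desired source.
steering = [0, 2]
desired_out = delay_and_sum(desired, steering)
interferer_out = delay_and_sum(interferer, steering)
```

In this assumed geometry the two copies of the desired tone line up after steering, whereas the two copies of the interfering tone end up half a period apart and largely cancel, so `energy(desired_out)` is much greater than `energy(interferer_out)`. A practical beamformer would generally need fractional delays and broadband signals, which this sketch does not address.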
Such improvement in audio quality may only be realized if the beam is pointed at or near the desired source (or alternatively, if the null of the beam is pointed at or near a noise source). However, this presents challenges in a speakerphone or videoconference environment, because the location of a desired source (e.g., a person talking) may not be known ahead of time and some method of desired source localization may be needed. This localization often takes the form of a “direction of arrival” estimation, wherein the angle of arrival of the desired acoustic energy (or the undesired noise) is estimated. In a speakerphone or videoconference environment, a desired source may move, or another desired source in a different spatial location may also exist (e.g., a second person begins speaking in another part of a room). Accordingly, desired sources must be tracked to maintain the beam pointing in a correct direction.
An existing approach to desired source localization is cross-correlation. In cross-correlation, the delay between receipt of sounds at various microphones is calculated, and, because microphone geometry is typically known in advance, the direction of arrival may be determined based on such delay. However, cross-correlation may have many deficiencies. First, cross-correlation may be computationally expensive, especially if there are more than two microphones, because a cross-correlation must be performed between each microphone and all other microphones, requiring significant processing resources. In addition, cross-correlation typically has significant latency, which limits the rate at which a desired source can be tracked or the beam switched to another desired source. Furthermore, cross-correlation suffers from variation in spatial resolution, in that cross-correlation resolves desired source location well when the desired source is about the same distance from each microphone, but as the desired source moves closer to one microphone, such resolution diminishes.
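By way of illustration only, the cross-correlation approach described above may be sketched for a single pair of microphones as follows. The sketch is not taken from this disclosure; it assumes a far-field source, a speed of sound of 343 m/s, a brute-force time-domain search over integer lags, and hypothetical names (`estimate_delay_samples`, `direction_of_arrival`) chosen for this example.

```python
import math

SPEED_OF_SOUND = 343.0  # meters per second; assumed value for air


def estimate_delay_samples(ref, sig, max_lag):
    """Return the lag, in samples, that maximizes the cross-correlation of
    ref and sig. A positive lag means sig is a delayed copy of ref."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(len(ref)):
            j = i + lag
            if 0 <= j < len(sig):
                score += ref[i] * sig[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def direction_of_arrival(lag, sample_rate, mic_spacing):
    """Convert an inter-microphone lag into an angle of arrival in degrees,
    measured from broadside (0 degrees means the source is equidistant
    from both microphones)."""
    tau = lag / sample_rate
    ratio = SPEED_OF_SOUND * tau / mic_spacing
    ratio = max(-1.0, min(1.0, ratio))  # guard against rounding past +/-1
    return math.degrees(math.asin(ratio))


# Assumed example: a short pulse that reaches the second microphone
# three samples after the first.
ref = [0.0] * 32
ref[10:15] = [1.0, 2.0, 3.0, 2.0, 1.0]
sig = [0.0] * 3 + ref[:-3]

lag = estimate_delay_samples(ref, sig, max_lag=4)
angle = direction_of_arrival(lag, sample_rate=16000.0, mic_spacing=0.1)
```

The search visits every candidate lag for every sample, and with M microphones it would be repeated for each of the M(M-1)/2 microphone pairs, which is consistent with the processing cost noted above. The arcsine mapping also makes the angular step per sample of lag grow as the source moves away from broadside, which is consistent with the variation in spatial resolution noted above.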