It is generally recognized that most speakerphones provide inferior voice quality to people who are listening at the far end of a conversation. The listeners may experience impairments such as a cave-like sound caused by reverberation, an annoying amount of background noise such as equipment noise (e.g., cell phones ringing, air conditioners, copying machines, and so on), interfering voices from unintended talkers, and the like. In the case of a device that performs speech recognition, the error rate may be increased by reverberation, interfering sounds, and persistent noise. The speech recognition accuracy can be improved by reducing reverberation and reducing or eliminating sounds from interfering talkers.
To achieve performance improvements, it is common to use several microphones in a cooperative way to improve the speech signal. A configuration that uses several microphones is called a microphone array. Microphone arrays may include a number of geometrically arranged microphone sensors for receiving sound signals (such as speech signals) and converting the sound signals to electrical signals. The electrical signals may be digitized by analog-to-digital converters (ADCs), which convert the analog output of each microphone sensor into digital signals that may be further processed by software that runs on a processor (such as a microprocessor or digital signal processor). Compared with a single microphone, the multiple sound signals received by a microphone array allow for processing such as noise reduction, speech enhancement, sound source separation, de-reverberation, spatial sound recording, source localization and tracking, and so on. The processed digital signals may be packetized for transmission over communication channels, converted back to analog signals using a digital-to-analog converter (DAC), or provided to a speech recognition algorithm to detect human speech. Microphone arrays are typically configured for beamforming, or directional sound signal reception.
Additive microphone arrays are a configuration of microphones that can achieve signal enhancement and noise suppression based on delay-and-sum principles. In some configurations, there may not be a need for a delay element, resulting in an additive-only type of processing, and so the phrase “delay and sum” may be used interchangeably with the term “additive.” To achieve better acoustic noise suppression, additive microphone arrays may include a large inter-sensor distance. Additive microphone arrays can be effective when the spacing between the microphones is approximately one half of the wavelength of the signal of interest. Unfortunately, speech is very broadband, spanning many octaves. To be effective at low frequencies, the microphone elements have to be spread out so far that the device would be bulky. At high frequencies, the main beam may be very narrow and there may be many strong side lobes. Consequently, additive microphone configurations are limited to a small range of frequencies. An advantage of additive microphone arrays, however, is that they are simple to implement and the mere act of adding the microphone signals together reduces the self-noise (sensor noise) of the microphone elements, where the self-noise is caused by uncorrelated electrical noise that emanates from each microphone element.
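The self-noise benefit of summing can be illustrated with a short simulation. The following is a minimal sketch, assuming independent, unit-variance Gaussian noise at each microphone and equal 1/N weights; the sample count, the seed, and the names noise_power and improvement_db are illustrative choices and not part of any described system. Under these assumptions the uncorrelated noise power drops by a factor of N, or about 3 dB for two microphones.

```python
import math
import random

def noise_power(samples):
    """Mean-square value of a zero-mean noise sequence."""
    return sum(s * s for s in samples) / len(samples)

rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples = 20000
n_mics = 2

# Assumed self-noise model: independent unit-variance Gaussian noise per microphone.
mic_noise = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)] for _ in range(n_mics)]

# Additive array with equal 1/N weights, so a coherent signal keeps its amplitude
# while the uncorrelated noise partially averages out.
summed = [sum(mic[i] for mic in mic_noise) / n_mics for i in range(n_samples)]

improvement_db = 10.0 * math.log10(noise_power(mic_noise[0]) / noise_power(summed))
expected_db = 10.0 * math.log10(n_mics)  # theoretical reduction for N microphones

print(f"measured {improvement_db:.2f} dB, theory {expected_db:.2f} dB")
```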
In contrast, differential microphone arrays (DMAs) allow for small inter-microphone distance, and may be made very compact. DMAs include an array of microphone sensors that are responsive to the spatial derivatives of the acoustic pressure field. A disadvantage of DMAs is that they are sensitive to electrical self-noise that comes from the microphone element. Unlike environmental noise, the microphone sensor noise is inherent to the microphone sensors and therefore is present even in a soundproof environment such as a soundproof booth. In addition, DMAs usually perform equalization to compensate for the fact that taking the difference of the microphone sensors distorts the frequency response, which needs to be inverted to result in a flat frequency response. The equalization is only perfect if the direction of the talker is exactly in line with the intended direction of the DMA beam. As used herein, the word “equalization” may be used interchangeably with “compensator.”
Several microphone array systems can pick up sound in all directions. For example, Polycom® Soundstation® speakerphones have been designed with directional microphones generally located in one of three legs of the Polycom® speakerphone device. In another instance, the LifeSize® phone was a conferencing phone that used twelve omni-directional microphones arranged around the circular perimeter of the circular device. In yet another instance, the Amazon Echo® product used a seven microphone array with six microphones arranged in a circle of diameter of 3.25 inches (82 mm) with one microphone located in the center with all the microphones located on top of the cylindrical device.
In the case of speakerphone devices that use directional microphones, such as the Polycom® Soundstation® products, the devices can be bulky because the directional microphones require space around each of the uni-directional microphones to create a sound field that will allow the directional microphones to operate directionally. The directional microphones, also called pressure gradient microphones, require that the front and rear ports of the microphone detect sound waves that have not been distorted by nearby surfaces.
In each of these speakerphone devices, a search is made to determine the direction of the active talker. A decision is made as to which way to point. Either a directional microphone is selected or a beam is formed to pick up the sound in the active direction. This is how sound is picked up with the least amount of background noise or reverberation. It is not possible to pick up sound in all directions at the same time without letting in more noise and reverberation. In the case where the direction-finding algorithm cannot make an absolute decision on which direction to pick up sound, then the algorithm may compromise and either pick up from two microphones, in the case of a speakerphone with uni-directional microphones, or it may cast a beam with a broad lobe if a beam is formed using several microphones. In the case where there is no determination at all about the direction, then it is possible for the direction-finding algorithm to fall back and use merely a single microphone on a temporary basis until sound arrives in a definitive direction.
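The three cases described above (a definitive direction, an ambiguous direction, and no determination at all) can be sketched as a simple decision rule. This is only an illustration: the per-direction confidence scores, the thresholds clear_margin and min_score, and the function name choose_pickup are hypothetical and are not taken from any of the products mentioned.

```python
def choose_pickup(scores, clear_margin=0.3, min_score=0.2, fallback_index=0):
    """Pick a pickup strategy from hypothetical per-direction confidence scores.

    scores: one confidence value in [0, 1] per candidate direction (assumed input).
    Returns a (mode, directions) tuple.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best, second = ranked[0], ranked[1]

    if scores[best] < min_score:
        # No determination about direction: fall back to a single microphone.
        return ("single_microphone", [fallback_index])
    if scores[best] - scores[second] < clear_margin:
        # Ambiguous: pick up from two directions (or cast a broad lobe over them).
        return ("broad", [best, second])
    # Definitive direction: select one directional microphone or one narrow beam.
    return ("narrow", [best])

print(choose_pickup([0.9, 0.2, 0.1]))     # one clear talker direction
print(choose_pickup([0.6, 0.5, 0.1]))     # two competing directions
print(choose_pickup([0.1, 0.05, 0.1]))    # nothing definitive
```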
Instead of using directional microphones, it is possible to use omni-directional microphones in a microphone array to achieve directionality. Omni-directional microphones are the same as pressure microphones. That is, they only sense sound pressure and not the gradient of sound pressure. A challenge is that the range of frequencies necessary to represent voice is very large, spanning approximately six octaves. For modern speech communication and speech recognition, it is often desirable to be able to pick up sound between 100 Hz and 7000 Hz. A problem for microphone arrays is that this is a very large range of wavelengths to support. At 100 Hz, the wavelength is approximately 3.4 meters. At 7000 Hz, the wavelength is approximately 0.049 meters, a ratio of 70:1. As noted above, a delay and sum array configuration is not able to support that ratio without a huge number of widely spaced microphones. Consequently, it is necessary to use a differential array. Differential arrays work by measuring the gradient of sound, and hence they measure the spatial rate of change of sound pressure. To present a flat frequency response, it is necessary to equalize (or compensate) the result of the difference, which itself can create a high level of noise, especially at low frequencies. To combat the noise, it is possible to move the microphones further apart, but this limits the ability of the differential array to work at high frequencies.
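The wavelength figures above follow from the relation wavelength = c/f. The short sketch below reproduces them, assuming a speed of sound of 343 m/s (a typical value for air at about 20 °C; the text does not state the value used). The function name wavelength is illustrative.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees C

def wavelength(freq_hz, c=SPEED_OF_SOUND):
    """Return the acoustic wavelength in meters for a frequency in Hz."""
    return c / freq_hz

low_hz, high_hz = 100.0, 7000.0
lam_low = wavelength(low_hz)            # roughly 3.4 m
lam_high = wavelength(high_hz)          # roughly 0.049 m
ratio = lam_low / lam_high              # 70:1
octaves = math.log2(high_hz / low_hz)   # a little over six octaves

print(f"{lam_low:.2f} m, {lam_high:.3f} m, ratio {ratio:.0f}:1, {octaves:.2f} octaves")
```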
Configurations of basic two-microphone arrays for beamforming include: a broadside delay and sum array (“broadside array”) shown in FIG. 1; an end-fire delay and sum array (“end-fire array”) shown in FIG. 2; and a differential cardioid array (“differential array”) shown in FIG. 3.
Referring to FIG. 1, a broadside array can provide directivity for a narrow range of frequencies of the desired speech. The output of each microphone sensor is weighted by ½ and summed. The ½ weighting preserves the original amplitude of the incoming sound at the output of the summer. A broadside array configuration can achieve significant directionality (a null at 90°) when the spacing between the microphones equals one half the wavelength of the sound. The formula for this frequency is: fb=c/(2×d), where c is the speed of sound and d is the distance between the microphones. The overall directivity at the frequency fb is only about 4 dB. Above this frequency, there may be more directivity, but at a cost of very significant side lobes. This array is generally useful between approximately 0.8×fb and 1.5×fb.
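The null at half-wavelength spacing can be checked numerically. The sketch below assumes a far-field plane wave, two ideal omni-directional microphones, a speed of sound of 343 m/s, and a hypothetical spacing of 5 cm; the angle is measured from the broadside direction, so 90° lies along the line joining the microphones. The function name broadside_gain is illustrative.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed

def broadside_gain(freq_hz, spacing_m, angle_deg, c=SPEED_OF_SOUND):
    """Magnitude response of a two-microphone sum with 1/2 weights and no delay.

    angle_deg is measured from broadside (0 deg is perpendicular to the line
    joining the microphones; 90 deg is along that line).
    """
    path_diff = spacing_m * math.sin(math.radians(angle_deg))
    # |0.5 * (1 + exp(-j*2*pi*f*path_diff/c))| simplifies to |cos(pi*f*path_diff/c)|.
    return abs(math.cos(math.pi * freq_hz * path_diff / c))

d = 0.05                         # hypothetical spacing in meters
fb = SPEED_OF_SOUND / (2 * d)    # frequency at which d equals half a wavelength

for angle in (0, 30, 60, 90):
    print(f"{angle:3d} deg: gain {broadside_gain(fb, d, angle):.3f}")
```

At fb the gain is 1 at broadside and falls to essentially zero at 90°, which is the null described above.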
Referring to FIG. 2, an end-fire array can provide better directivity than the broadside array, and can be more effective at lower frequencies of the desired speech. The polar response of the end-fire array resembles the familiar cardioid pattern when the distance between the microphones is one fourth of the wavelength. The formula for this frequency is: fe=c/(4×d), where c is the speed of sound and d is the distance between the microphones. Like the broadside array, an end-fire array is generally useful within a limited range, between approximately 0.8×fe and 1.5×fe.
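A similar numerical check applies to the end-fire case. The sketch below assumes that the front microphone signal is delayed by d/c before summing with ½ weights, so that sound arriving along the array axis adds in phase; the spacing of 5 cm and the speed of sound of 343 m/s are assumed values, and the function name endfire_gain is illustrative. The angle is measured from the look direction along the array axis.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed

def endfire_gain(freq_hz, spacing_m, angle_deg, c=SPEED_OF_SOUND):
    """Magnitude response of a two-microphone end-fire delay-and-sum array.

    angle_deg is measured from the look direction along the array axis.
    The front microphone is assumed to be delayed by d/c before summing.
    """
    # Path difference left over after the steering delay has been applied.
    residual = spacing_m * (1.0 - math.cos(math.radians(angle_deg)))
    return abs(math.cos(math.pi * freq_hz * residual / c))

d = 0.05                         # hypothetical spacing in meters
fe = SPEED_OF_SOUND / (4 * d)    # frequency at which d equals a quarter wavelength

for angle in (0, 90, 180):
    print(f"{angle:3d} deg: gain {endfire_gain(fe, d, angle):.3f}")
```

At fe the response is 1 toward the front, about 0.71 to the side, and essentially zero toward the rear, which gives the single rear null characteristic of a cardioid-like pattern.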
Referring to FIG. 3A, a differential array can potentially operate over the whole frequency range if the individual microphones are close together. Using two microphones to make a cardioid polar response, a differential array can produce a nearly cardioid response up to fd=c/(4×d), where c is the speed of sound and d is the distance between microphones. Above this frequency, the cardioid shape starts to bulge out in the sideways directions relative to the direction of the main beam. In addition, it may become necessary to add more gain to the compensation; e.g., via a compensating filter. For example, below 0.5×fd, it may be necessary to add gain because the differential array acts as a differentiator, which calls for a compensating filter that resembles an integrator. The compensating filter may be a gain and phase equalizer whose gain rises at 6 dB per octave as the frequency decreases toward 0 Hz. This filter can dramatically raise the self-noise at low frequencies, and the resulting level of noise may be unacceptable. In order to lower the noise floor, one may use microphones that have lower self-noise, or employ a design that spreads the microphones farther apart.
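The 6 dB per octave behavior of the compensation can be illustrated with a simple model. The sketch below assumes the cardioid is formed by subtracting the rear microphone signal, delayed by d/c, from the front microphone signal, and that the compensation simply inverts the on-axis magnitude; the spacing of 2 cm, the speed of sound of 343 m/s, and the function names are assumptions made for illustration only.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed

def differential_raw_gain(freq_hz, spacing_m, angle_deg=0.0, c=SPEED_OF_SOUND):
    """Uncompensated magnitude of (front) minus (rear delayed by d/c).

    angle_deg is measured from the look direction (0 deg is the front).
    """
    cos_theta = math.cos(math.radians(angle_deg))
    # |1 - exp(-j*2*pi*f*(d/c)*(1 + cos(theta)))| = 2*|sin(pi*f*d*(1 + cos(theta))/c)|
    return 2.0 * abs(math.sin(math.pi * freq_hz * spacing_m * (1.0 + cos_theta) / c))

def compensation_gain_db(freq_hz, spacing_m, c=SPEED_OF_SOUND):
    """Gain in dB needed to restore a unity on-axis response."""
    return -20.0 * math.log10(differential_raw_gain(freq_hz, spacing_m, 0.0, c))

d = 0.02                         # hypothetical spacing in meters
fd = SPEED_OF_SOUND / (4 * d)    # upper limit of the nearly cardioid region

for f in (fd, 400.0, 200.0, 100.0, 50.0):
    print(f"{f:7.1f} Hz: compensation {compensation_gain_db(f, d):6.2f} dB")
```

Well below fd, each halving of the frequency adds about 6 dB of compensation gain. In this model the same gain is applied to the microphones' uncorrelated self-noise, which is why the noise floor rises at low frequencies.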
Referring to FIG. 3B, a generalized microphone array can be utilized which can make any style of beam by adjusting the coefficients and the delays. A goal in making a beam is to make a beam with a reasonable main lobe, low sidelobes, low self-noise, and low sensitivity to variations in microphone sensitivity that is effective across a wide range of frequencies. This may be accomplished by designing a beam that is a hybrid of additive beam and differential beams.
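The idea that one structure can produce different beams by changing only the coefficients and delays can be sketched as follows. This assumes a far-field plane wave, ideal omni-directional microphones placed on a line, and a speed of sound of 343 m/s; the function name array_response, the microphone positions, and the two parameter sets are illustrative and are not the design of FIG. 3B. Here the angle is measured from the array axis, so broadside is 90°.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed

def array_response(freq_hz, angle_deg, positions_m, weights, delays_s, c=SPEED_OF_SOUND):
    """Magnitude response of a weighted, delayed sum of microphones on a line.

    angle_deg is measured from the array axis (0 deg points toward
    increasing position).
    """
    cos_theta = math.cos(math.radians(angle_deg))
    total = 0j
    for p, w, tau in zip(positions_m, weights, delays_s):
        # A plane wave from angle_deg reaches position p earlier by p*cos(theta)/c;
        # the applied delay tau is added on top of that.
        total += w * cmath.exp(-2j * math.pi * freq_hz * (tau - p * cos_theta / c))
    return abs(total)

d = 0.02                        # hypothetical spacing in meters
mics = [+d / 2, -d / 2]         # front and rear microphone positions

# Same structure, two parameter sets: an additive beam and a differential beam.
additive = {"weights": [0.5, 0.5], "delays_s": [0.0, 0.0]}
differential = {"weights": [1.0, -1.0], "delays_s": [0.0, d / SPEED_OF_SOUND]}

fb = SPEED_OF_SOUND / (2 * d)
print(array_response(fb, 90, mics, **additive))           # broadside, full gain
print(array_response(1000.0, 180, mics, **differential))  # rear null of the cardioid
```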
Since there is no perfect beamforming technique, some systems may select a different beamforming technique for different frequencies; e.g., broadside, end-fire, differential beams, or hybrids of these techniques. In U.S. Pat. No. 7,970,151, for example, the disclosure uses a large number of omni-directional microphones, twelve. The disclosure describes the use of an additive array at the high frequencies and a differential microphone array at low frequencies. A disadvantage is that the geometry may force the beam created by the additive array to be very narrow with significant side lobes. The narrow beam would make the system very fragile when the talker is moving or if the direction-finding was not accurate. The number of microphones, twelve, can add significant cost to the device.
These and other issues are addressed by embodiments of the present disclosure, individually and collectively.