A wide variety of acoustic transducers, such as microphones, are commonly used to acquire sounds from a target audio source, such as speech from a human speaker. The quality of the sound acquired by microphones is adversely affected by a variety of factors, such as attenuation over the distance between the target audio source and the microphone(s), interference from other acoustic sources (particularly in high-noise environments), and sound wave reverberation and echo.
One way to mitigate these effects is to use a directional audio system, such as a shotgun microphone, a parabolic dish microphone, or a microphone array beamformer. All three approaches create constructive and destructive interference among the sounds arriving at them, producing directional audio pickup patterns that discriminate among sounds based upon their angles of arrival. Beamforming broadly describes a class of array processing techniques that combine the signals of multiple microphones to form an interference pattern (i.e., a “beam”) and thereby a directional pickup pattern. Beamforming techniques may be broadly classified as either data-independent (i.e., where the directional pickup pattern is fixed until re-steered) or data-dependent (i.e., where the directional pickup pattern automatically adapts its shape depending on the angles from which target and non-target sounds arrive). Prior art microphone array beamforming systems include, broadly, a plurality of microphone transducers that are arranged in a spatial configuration relative to each other. Some embodiments allow electronic steering of the directional audio pickup pattern by applying electronic time delays to the signal produced by each microphone transducer before the signals are combined. Combining the signals may be accomplished by various means, including acoustic waveguides (e.g., U.S. Pat. No. 8,831,262 to McElveen), analog electronics (e.g., U.S. Pat. No. 9,723,403 to McElveen), and digital electronics (e.g., U.S. Pat. No. 9,232,310 to Huttunen et al.). The digital systems include a microphone array interface for converting the microphone transducer output signals into a different form suitable for processing by a digital computing device.
The digital systems also include a computing device such as a digital processor or computer that receives and processes the converted microphone transducer output signals and a computer program that includes computer readable instructions, which when executed processes the signals. The computer, the computer readable instructions when executed, and the microphone array interface form structural and functional modules for the microphone array beamforming system.
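By way of illustration only, the time-delay steering described above can be sketched as a basic delay-and-sum beamformer. The sketch below is not taken from any of the cited patents; it assumes a far-field plane wave, a speed of sound of 343 m/s, and known microphone positions, and the function name `delay_and_sum` and its parameters are hypothetical.

```python
import numpy as np

def delay_and_sum(signals, mic_positions, direction, fs, c=343.0):
    """Steer an array toward `direction` by delaying and averaging the channels.

    signals:       (n_mics, n_samples) array of microphone signals
    mic_positions: (n_mics, 3) microphone coordinates in meters
    direction:     3-vector pointing from the array toward the target source
    fs:            sampling rate in Hz
    c:             assumed speed of sound in m/s
    """
    signals = np.asarray(signals, dtype=float)
    n_mics, n_samples = signals.shape
    u = np.asarray(direction, dtype=float)
    u = u / np.linalg.norm(u)

    # A plane wave from direction u reaches a microphone at position p
    # earlier than the origin by (p . u) / c seconds.
    lead = np.asarray(mic_positions, dtype=float) @ u / c

    # Delay each channel by its lead so that all channels line up.
    # The delay is applied as a linear phase shift in the frequency domain,
    # which allows fractional-sample delays (circular within the block).
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    spectra = np.fft.rfft(signals, axis=1)
    phase = np.exp(-2j * np.pi * freqs[None, :] * lead[:, None])
    aligned = np.fft.irfft(spectra * phase, n=n_samples, axis=1)

    # Averaging reinforces the steered direction and attenuates others.
    return aligned.mean(axis=0)

# Hypothetical test scene: four microphones on a line, 500 Hz tone arriving
# along the +x axis (the tone is periodic in the block, so the circular
# delay is exact here).
fs = 16000
t = np.arange(1600) / fs
mics = np.array([[0.00, 0, 0], [0.05, 0, 0], [0.10, 0, 0], [0.15, 0, 0]])
look = np.array([1.0, 0.0, 0.0])
lead = mics @ look / 343.0
source = np.sin(2 * np.pi * 500 * t)
received = np.sin(2 * np.pi * 500 * (t[None, :] + lead[:, None]))

on_target = delay_and_sum(received, mics, look, fs)    # steered at the source
off_target = delay_and_sum(received, mics, -look, fs)  # steered away from it
```

Steering toward the source realigns the channels so that they add constructively, while steering in the opposite direction leaves them misaligned and the averaged output is weaker. Re-steering only requires a different `direction`, which is what makes the pickup pattern electronically steerable.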
Apart from sound acquisition enhancement from selected sound source directions in an acoustic space, a further advantage of microphone array systems in general is the ability to locate and track prominent sound sources in the acoustic space. Two common techniques of sound source location are known as the time difference of arrival (TDOA) method and the steered response power (SRP) method, which can be used either alone or in combination.
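As a minimal sketch of the TDOA idea, the delay between two microphone signals can be estimated from the peak of their cross-correlation. This is one common way to estimate a time difference of arrival, not necessarily the method used in any particular prior art system; the function name `estimate_tdoa`, the equal-length assumption, and the synthetic noise signal are illustrative assumptions.

```python
import numpy as np

def estimate_tdoa(x_ref, x_other, fs):
    """Estimate how much `x_other` lags `x_ref`.

    Both signals are assumed to have the same length. Returns the lag in
    samples and in seconds; a positive value means the sound reached the
    reference microphone first.
    """
    x_ref = np.asarray(x_ref, dtype=float)
    x_other = np.asarray(x_other, dtype=float)

    # Full cross-correlation; output index i corresponds to lag i - (N - 1).
    corr = np.correlate(x_other, x_ref, mode="full")
    lag_samples = int(np.argmax(corr)) - (len(x_ref) - 1)
    return lag_samples, lag_samples / fs

# Hypothetical example: the second microphone hears the same noise-like
# signal 7 samples later than the first.
fs = 16000
rng = np.random.default_rng(0)
ref = rng.standard_normal(2048)
true_delay = 7
other = np.concatenate([np.zeros(true_delay), ref[:-true_delay]])

lag, tau = estimate_tdoa(ref, other, fs)
```

For a far-field source and a microphone pair spaced a distance d apart, the delay relates to the arrival angle θ (measured from broadside) by τ = d·sin(θ)/c, so delays from several pairs can be combined to locate a source. The SRP method instead steers a beamformer over candidate directions and selects the direction with the greatest output power.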
As mentioned above, microphone array beamforming techniques are commonly used to reduce the amount of reverberation captured by the transducers. Excessive reverberation negatively affects the intelligibility and quality of captured audio as perceived by human listeners, as well as the performance of automatic speech recognition and speech biometric systems. Microphone array beamformers reduce reverberation by attenuating sounds received from directions other than the target direction (i.e., where the “beam” is directed).
In scenarios having multiple sound sources, such as when a group of speakers is engaged in conversation (e.g., around a table), the sound source location or active speaker position in relation to the microphone array changes. In addition, more than one speaker may speak at a given time, producing a significant amount of simultaneous speech from different speakers in different directions relative to the array. Furthermore, more than one sound source may be located in the same general direction relative to the array and therefore cannot be discriminated solely using direction of arrival techniques, such as microphone array beamforming. In such a complex environment, the effective acquisition of target sound sources requires simultaneous beamforming in multiple directions in the reception space around the microphone array using the aforementioned data-dependent techniques. This requires fast and accurate processing techniques to enable sound source location and robust beamforming techniques to mitigate the deleterious effects listed above. Even with an ideal implementation, if sound sources lie in the same direction relative to the array, these techniques will not suffice to discriminate between the sources, and real-world implementations still fall far short of the ideal.
Equally spaced array configurations (where the inter-element distances between the transducers are approximately equal) are known to have inherent limitations arising from the geometrical symmetry of their transducer arrangements, including increased pickup of sounds from untargeted directions through side lobes in their pickup patterns. These issues may be alleviated by using microphone arrays having asymmetric geometries. For example, U.S. Pat. No. 9,143,879 to McElveen provides for a directional microphone array having an asymmetric transducer geometry based on a mathematical sequence configured to enable scaling the array while maintaining asymmetric geometry. Prior art solutions have attempted to provide for distributed or non-equally spaced microphone arrays to improve sound acquisition from multiple sound sources falling outside an array plane. For example, U.S. Pat. No. 8,923,529 to McCowan provides for an array of microphone transducers that are arranged relative to each other in N-fold rotational symmetry and a beamformer that includes beamformer weights associated with one of a plurality of spatial reception sectors corresponding to the N-fold rotational symmetry of the microphone array. However, such solutions require additional prior knowledge and control of the array, such as the spatial locations of the array elements, and do not effectively accommodate real-world acoustic conditions, such as large reflective surfaces in the acoustic space.
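One well-known instance of the limitation of equal spacing is the grating lobe: when the element spacing exceeds the acoustic wavelength, a uniformly spaced array responds to certain off-target directions as strongly as to the target direction. The sketch below illustrates this for a far-field, narrowband model of a line array steered to broadside. The function name `array_response`, the element positions, and the test frequency are illustrative assumptions and are not taken from the cited patents.

```python
import numpy as np

def array_response(positions_m, angles_deg, freq_hz, c=343.0):
    """Normalized magnitude response of a line array steered to broadside.

    positions_m: element positions along the array axis, in meters
    angles_deg:  arrival angles measured from broadside, in degrees
    freq_hz:     frequency of the narrowband plane wave
    c:           assumed speed of sound in m/s
    """
    positions = np.asarray(positions_m, dtype=float)
    theta = np.deg2rad(np.asarray(angles_deg, dtype=float))
    # Phase of a far-field plane wave at each element for each angle.
    phase = 2 * np.pi * freq_hz / c * np.outer(np.sin(theta), positions)
    # Unweighted sum of the element signals, normalized to 1 at broadside.
    return np.abs(np.exp(1j * phase).mean(axis=1))

freq = 3430.0                                   # wavelength = 0.1 m
uniform = np.arange(4) * 0.2                    # equal spacing of 0.2 m
irregular = np.array([0.0, 0.13, 0.31, 0.6])    # same aperture, unequal spacing
angles = np.array([0.0, 30.0])                  # target and one off-target angle

resp_uniform = array_response(uniform, angles, freq)
resp_irregular = array_response(irregular, angles, freq)
```

In this example the equally spaced array passes a sound arriving at 30 degrees at full strength, because every element sees it a whole number of wavelengths apart, whereas the unequally spaced array of the same length attenuates it. The particular irregular positions were chosen by hand for illustration and are not an optimized design.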
The design of beamforming arrays needs to take into account multiple factors, such as the range of audio frequencies that need to be beamformed; the amount of ambient, reverberant noise that is anticipated; the distances to the nearest and farthest target sources; the need for fixed, user-selected, or automatic steering; the horizontal and vertical angles from which sounds may arrive at the array; and the spatial resolution of the pickup pattern (i.e., how wide the main lobe of the pickup pattern is). As a consequence, beamforming arrays that are designed to operate in loud, cluttered, or dynamic environments from a distance of more than approximately an arm's length tend to include tens or even hundreds of transducers.
The pickup patterns of real-world microphone beamformer arrays are known to be significantly different from estimations used in their design due to variations between microphones. Consequently, microphone arrays require calibration, which involves additional time, complication, and expense.
Another way that has been explored to mitigate the effects of simultaneous noises, including co-speech, is through the use of what are known as blind source separation (BSS) algorithms. Several BSS approaches have been attempted over the last several decades, including principal component analysis, independent component analysis (ICA), spatio-temporal analysis, and sparse component analysis. At the current time, most real-world embodiments implement some variation of ICA. BSS algorithms are grouped according to whether they are over-determined (i.e., having more microphones than the number of real and virtual (reflected) interferers) or under-determined (i.e., having fewer microphones than the number of real and virtual interferers). In a highly reverberant acoustical environment, a few “real” sources can be quickly reflected into what appears to human hearing and mathematical algorithms as being a large number of sound sources because each reflection of a real source becomes, in effect, a “virtual” source and, thus, an additional interferer. In a mathematical sense, the problem referred to above that beamformers have in reverberation is related to that faced by blind source separation approaches: a multitude of interferers requires a large number of microphones to overcome. In mathematics, this problem is also found in solving simultaneous equations: for every unknown variable one is trying to solve for, one needs an independent equation with that variable. In terms of solving cocktail party problems, for every real or virtual acoustic source, one needs an independent (i.e., spatially separated in a physical sense and without other dependency, such as cross-talk, between the microphones) acoustic recording of it.
The real-world effect of this underlying mathematical problem is that blind source separation algorithms require a relatively large number of microphones to perform well in crowded, reverberant environments and may suffer from a significant amount of processing delay (also known as lag) in trying to unmix the various sound sources. In under-determined cases, BSS either does not work at all or results in very high levels of noise and distortion.
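The simultaneous-equations analogy can be made concrete with a simplified numerical sketch. It assumes instantaneous (non-reverberant) linear mixing and, unlike a true blind method such as ICA, a known mixing matrix, so it illustrates only the counting argument and not a BSS algorithm. The matrices and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 1000
sources = rng.standard_normal((2, n_samples))    # two unknown source signals

# Determined case: two microphones, two sources.
# Each row is one microphone's (assumed known) mixture of the sources,
# i.e. one independent equation per unknown.
A_full = np.array([[1.0, 0.6],
                   [0.4, 1.0]])
mics_full = A_full @ sources
est_full = np.linalg.solve(A_full, mics_full)    # exact recovery

# Under-determined case: one microphone, two sources.
# One equation with two unknowns has infinitely many solutions; least
# squares returns the minimum-norm one, which is not the true sources.
A_under = np.array([[1.0, 0.6]])
mics_under = A_under @ sources
est_under, _, rank, _ = np.linalg.lstsq(A_under, mics_under, rcond=None)
```

With as many independent recordings as sources, the sources are recovered exactly. With a single recording, the estimate still reproduces the microphone signal but does not match the original sources. In a reverberant room each strong reflection adds another unknown, which is why the required number of microphones grows so quickly.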
Another way that has been explored to mitigate the effects of simultaneous noises, including co-speech, is through the use of what is known as computational auditory scene analysis (CASA), which attempts to replicate or mimic the abilities of the human auditory system to separate (unmix) sound sources using computing devices. By convention, CASA algorithms constrain themselves to only one or two microphones, based on the corresponding limitations in humans, and therefore focus on the mathematically under-determined case. CASA algorithms are known to perform well only in situations where the target talker signal level is high relative to the level of the background noise, which includes co-speech and reverberation (i.e., high signal-to-noise ratio (SNR) situations).
Through applied effort, ingenuity, and innovation, Applicant has developed a solution that addresses a number of the deficiencies and problems with prior microphone array systems, associated microphone array processing methods, prior blind source separation methods, and prior methods that mimic the human auditory system. Applicant's solution is embodied by the present invention, which is described in detail below.