It is well known that a human being can focus attention on a single source of sound even in an environment that contains many such sources. This phenomenon is often called the "cocktail-party effect."
Considerable effort has been devoted in the prior art to reproducing the cocktail-party effect, both in physical devices and in computational simulations of such devices. One prior technique is to separate sound based on auditory scene analysis. In this analysis, heavy use is made of assumptions regarding the nature of the sources present. It is assumed that a sound can be decomposed into small elements such as tones and bursts, which in turn can be grouped according to attributes such as harmonicity and continuity in time. Auditory scene analysis can be performed using information from a single microphone or from several microphones. For an early example of auditory scene analysis, see Weintraub (1984, 1985, 1986). Other prior-art work related to sound separation by auditory scene analysis is due to Parsons (1976), von der Malsburg and Schneider (1986), Naylor and Porter (1991), and Mellinger (1991).
Techniques involving auditory scene analysis, although interesting from a scientific point of view as models of human auditory processing, are currently far too computationally demanding and specialized to be considered practical techniques for sound separation, and will remain so until fundamental progress is made.
Other techniques for separating sounds operate by exploiting the spatial separation of their sources. Devices based on this principle vary in complexity. The simplest such devices are microphones that have highly selective, but fixed patterns of sensitivity. A directional microphone, for example, is designed to have maximum sensitivity to sounds emanating from a particular direction, and can therefore be used to enhance one audio source relative to others (see Olson, 1967). Similarly, a close-talking microphone mounted near a speaker's mouth rejects distant sources (see, for example, the Knowles CF 2949 data sheet).
Microphone-array processing techniques that separate sounds by exploiting the spatial separation of their sources are also well known and have been of interest for several decades. In one early class of microphone-array techniques, nonlinear processing is employed. For each output stream, some source direction of arrival, a "look direction," is assumed. The microphone signals are delayed to remove differences in time of flight from the look direction. Signals from any direction other than the look direction are thus misaligned in time. The signal in the output stream is formed, in essence, by "gating" sound fragments from the microphones. At any given instant, the output is chosen to be equal to one of the microphone signals. These techniques, exemplified by Kaiser and David (1960), by Mitchell et al. (1971), and by Lyon (1983), perform best when the undesired sources consist predominantly of impulse trains, as is the case with human speech. While these nonlinear techniques can be very computationally efficient and are of scientific interest as models of human cocktail-party processing, they do not have practical or commercial significance because of their inherent inability to bring about full suppression of unwanted sources. This inability originates from the incorrect assumption that at every instant in time, at least one microphone contains only the desired signal.
One widely known class of techniques in the prior art for linear microphone-array processing is often referred to as "classical beamforming" (Flanagan et al., 1985). As with the nonlinear techniques mentioned above, processing begins with the removal of time-of-flight differences among the microphone signals with respect to the look direction. In place of the "gating" scheme, the delayed microphone signals are simply averaged together. Thus, any signal from the look direction is represented in the output with its original power, whereas signals from other directions are relatively attenuated.
Classical beamforming was employed in a patented directional hearing aid invented by Widrow and Brearley (1988). The degree to which a classical beamformer is able to attenuate undesired sources relative to the desired source is limited by (1) the number of microphones in the array, and (2) the spatial extent of the array relative to the longest wavelength of interest present in the undesired sources. In particular, a classical beamformer cannot provide relative attenuation of frequency components whose wavelengths are larger than the array. For example, an array one foot wide cannot greatly attenuate frequency components below approximately 1 kHz.
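The wavelength limitation described above can be illustrated with a short numerical sketch. It is not drawn from any of the cited references. It assumes a hypothetical four-microphone line array one foot wide, a look direction perpendicular to the array, a speed of sound of approximately 1130 feet per second, and plane-wave arrival; the function name and parameters are illustrative only.

```python
import cmath
import math

SPEED_OF_SOUND_FT_S = 1130.0  # assumed approximate speed of sound in air


def delay_and_sum_gain(freq_hz, angle_deg, num_mics=4, width_ft=1.0):
    """Magnitude response of an averaging beamformer steered broadside.

    Microphones are spaced evenly along a line of total length width_ft.
    angle_deg is the arrival angle measured from the look direction.
    """
    spacing = width_ft / (num_mics - 1)
    sin_a = math.sin(math.radians(angle_deg))
    total = 0j
    for m in range(num_mics):
        # Time-of-flight difference remaining at microphone m after steering.
        tau = m * spacing * sin_a / SPEED_OF_SOUND_FT_S
        total += cmath.exp(-2j * math.pi * freq_hz * tau)
    # Averaging the aligned signals gives unit gain in the look direction.
    return abs(total) / num_mics
```

Under these assumptions, a source in the look direction (angle 0) passes with unit gain at every frequency. A source arriving at 90 degrees is attenuated substantially at 2 kHz, where the wavelength is shorter than the array, but only slightly at 200 Hz, where the wavelength is several times the array width.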
Also known from the prior art is a class of active-cancellation algorithms, which is related to sound separation. However, these algorithms need a "reference signal," i.e., a signal derived from only one of the sources. For example, active noise-cancellation techniques (see data sheets for Bose® Aviation Headset, NCT proACTIVE® Series, and Sennheiser HDC451 Noiseguard® Mobile Headphone) reduce the contribution of noise to a mixture by filtering a known signal that contains only the noise, and subtracting it from the mixture. Similarly, echo-cancellation techniques such as those employed in full-duplex modems (Kelly and Logan, 1970; Gritton and Lin, 1984) improve the signal-to-noise ratio of an outgoing signal by removing noise due to echoes from the known incoming signal.
Techniques for active cancellation that do not require a reference signal are called "blind." They are classified here according to the degree of realism of the underlying assumptions regarding the acoustic processes by which the unwanted signals reach the microphones. To understand the practical significance of this classification, recall a feature common to the principles by which active-cancellation techniques operate: the extent to which a given undesired source can be canceled by subtracting processed microphone signals depends ultimately on the exactness with which copies of the undesired source in the different microphones can be made to match one another. This depends, in turn, on how accurately the signal processing models the acoustic processes by which the unwanted signals reach the microphones.
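The principle of cancellation with a reference signal can be sketched as follows. This is a deliberately simplified illustration and not the method of any cited product or reference: the "filter" is reduced to a single gain chosen by least squares, whereas practical systems typically use adaptive multi-tap filters. The test signals, the mixing gain of 0.8, and the function name are assumptions made for the example.

```python
import math


def cancel_with_reference(mixture, reference):
    """Subtract the least-squares-scaled reference from the mixture."""
    num = sum(m * r for m, r in zip(mixture, reference))
    den = sum(r * r for r in reference)
    gain = num / den
    cleaned = [m - gain * r for m, r in zip(mixture, reference)]
    return cleaned, gain


N = 400
# Two sinusoids with whole numbers of cycles in N samples (uncorrelated).
desired = [math.sin(2 * math.pi * 5 * t / N) for t in range(N)]
noise = [math.sin(2 * math.pi * 37 * t / N) for t in range(N)]
mixture = [d + 0.8 * n for d, n in zip(desired, noise)]

# The reference contains only the noise, as the technique requires.
cleaned, gain = cancel_with_reference(mixture, noise)
```

Because the reference is uncorrelated with the desired signal in this example, the estimated gain matches the mixing gain and the cleaned output matches the desired signal to within rounding error.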
One class of blind active-cancellation techniques may be called "gain-based": it is presumed that the waveform produced by each source is received by the microphones simultaneously, but with varying relative gains. (Directional microphones must be employed to produce the required differences in gain.) Thus, a gain-based system attempts to cancel copies of an undesired source in different microphone signals by applying relative gains to the microphone signals and subtracting, but never applying time delays or otherwise filtering. Numerous gain-based methods for blind active cancellation have been proposed; see Herault and Jutten (1986), Bhatti and Bibyk (1991), Cohen (1991), Tong et al. (1991), and Molgedey and Schuster (1994).
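The gain-based model can be sketched for two sources and two microphones. For illustration only, the cross-gains a and b are treated as known; a blind technique would have to estimate them from the microphone signals alone, and that estimation step is not shown. All names are hypothetical.

```python
def mix_gain_based(s1, s2, a, b):
    """Simultaneous mixing: x1 = s1 + a*s2, x2 = b*s1 + s2."""
    x1 = [u + a * v for u, v in zip(s1, s2)]
    x2 = [b * u + v for u, v in zip(s1, s2)]
    return x1, x2


def unmix_gain_based(x1, x2, a, b):
    """Cancel each cross term by scaling and subtracting, with no delays."""
    scale = 1.0 - a * b  # must be nonzero for the mixture to be separable
    y1 = [(p - a * q) / scale for p, q in zip(x1, x2)]
    y2 = [(q - b * p) / scale for p, q in zip(x1, x2)]
    return y1, y2
```

Subtracting a times the second microphone signal from the first leaves (1 - a*b) times the first source, so the division by that factor restores the original level.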
The assumption of simultaneity is violated when microphones are separated in space. A class of blind active-cancellation techniques that can cope with non-simultaneous mixtures may be called "delay-based": it is assumed that the waveform produced by each source is received by the various microphones with relative time delays, but without any other filtering. (See Morita, 1991 and Bar-Ness, 1993.) Under anechoic conditions, this assumption holds true for a microphone array that consists of omnidirectional microphones. However, this simple model of acoustic propagation from the sources to the microphones is violated when echoes and reverberation are present.
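The delay-based model can be sketched in the same way. The example assumes anechoic conditions, whole-sample delays, a desired source that reaches both microphones simultaneously, and an undesired source whose delay D between the microphones is already known; a blind technique would have to estimate D. The signals and names are hypothetical.

```python
def delay(signal, d):
    """Delay a sequence by d whole samples, padding the start with zeros."""
    return [0.0] * d + signal[: len(signal) - d]


def cancel_delayed_source(x1, x2, d):
    """Remove a source that reaches microphone 2 d samples after microphone 1."""
    return [q - p for p, q in zip(delay(x1, d), x2)]


desired = [1.0, 0.0, -1.0, 2.0, 0.5, -0.5, 1.0, 0.0]
undesired = [0.3, -0.7, 0.9, 0.1, -0.4, 0.6, -0.2, 0.8]
D = 3  # assumed inter-microphone delay of the undesired source, in samples

x1 = [s + u for s, u in zip(desired, undesired)]
x2 = [s + u for s, u in zip(desired, delay(undesired, D))]
output = cancel_delayed_source(x1, x2, D)
```

In this sketch the undesired source is removed completely, and the output equals the desired signal minus a copy of itself delayed by D samples. If an echo of the undesired source were added to either microphone, a single delay could no longer align the two copies and the cancellation would be incomplete.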
When the signals involved are narrowband, some gain-based techniques for blind active cancellation can be extended to employ complex gain coefficients (see Strube (1981), Cardoso (1989, 1991), Lacoume and Ruiz (1992), Comon et al. (1994)) and can therefore accommodate, to a limited degree, time delays due to microphone separation as well as echoes and reverberation. These techniques can be adapted for use with audio signals, which are broadband, if the microphone signals are divided into narrowband components by means of a filter bank. The frequency bands produced by the filter bank can be processed independently, and the results summed (for example, see Strube (1981) or the patent of Comon (1994)). However, techniques adapted in this way are computationally intensive relative to the present invention because of the duplication of structures across frequency bands, are slow to adapt in changing situations, are prone to statistical error, and are extremely limited in their ability to accommodate echoes and reverberation.
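The reason a complex gain can stand in for a time delay when the signal is narrowband can be shown with a brief sketch. It models a narrowband component as a single complex exponential, which is an idealization, and the frequencies and delay used are arbitrary illustrative values.

```python
import cmath
import math


def complex_gain_for_delay(freq_hz, delay_s, gain=1.0):
    """Complex coefficient equivalent to a gain and a delay at one frequency."""
    return gain * cmath.exp(-2j * math.pi * freq_hz * delay_s)


def narrowband(freq_hz, t, delay_s=0.0):
    """Complex exponential at freq_hz, evaluated at time t - delay_s."""
    return cmath.exp(2j * math.pi * freq_hz * (t - delay_s))
```

At 500 Hz, multiplying the undelayed component by complex_gain_for_delay(500.0, 0.0004) reproduces the component delayed by 0.4 ms. The same coefficient applied to a 700 Hz component gives the wrong phase, which is why a broadband signal needs a separate coefficient, and hence a duplicated structure, for each band of the filter bank.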
The most realistic active-cancellation techniques currently known are "convolutive": the effect of acoustic propagation from each source to each microphone is modeled as a convolutive filter. These techniques are more realistic than gain-based and delay-based techniques because they explicitly accommodate the effects of inter-microphone separation, echoes and reverberation. They are also more general since, in principle, gains and delays are special cases of convolutive filtering.
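The convolutive model, and the sense in which gains and delays are special cases of it, can be sketched as follows. Each path from a source to a microphone is represented by a short list of filter taps; the tap values and function names are hypothetical, and real room responses would require far longer filters.

```python
def convolve(signal, taps):
    """Causal FIR filtering: y[t] = sum over k of taps[k] * signal[t - k]."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if t - k >= 0:
                acc += h * signal[t - k]
        out.append(acc)
    return out


def mix_convolutive(sources, filters):
    """filters[i][j] holds the taps from source j to microphone i."""
    mics = []
    for row in filters:
        paths = [convolve(s, h) for s, h in zip(sources, row)]
        mics.append([sum(vals) for vals in zip(*paths)])
    return mics
```

With this representation, the taps [0.5] describe a pure gain, the taps [0.0, 0.0, 1.0] describe a pure delay of two samples, and taps such as [1.0, 0.0, 0.0, 0.4] describe a direct path followed by a single weaker echo.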
Convolutive active-cancellation techniques have recently been described by Jutten et al. (1992), by Van Compernolle and Van Gerven (1992), by Platt and Faggin (1992), and by Dinc and Bar-Ness (1994). While these techniques have been used to separate mixtures constructed by simulation using oversimplified models of room acoustics, to the best of our knowledge none of them has been applied successfully to signals mixed in a real acoustic environment. The simulated mixtures used by Jutten et al., by Platt and Faggin, and by Dinc and Bar-Ness differ from those that would arise in a real room in two respects. First, the convolutive filters used in the simulations are much shorter than those appropriate for modeling room acoustics; they allow for significant indirect propagation of sound over only one or two feet, compared with tens of feet typical of echoes and reverberation in an office. Second, the mixtures used in the simulations were partially separated to begin with, i.e., the crosstalk between the channels was weak. In practice, the microphone signals must be assumed to contain strong crosstalk unless the microphones are highly directional and the geometry of the sources is constrained.
To overcome some of the limitations of the convolutive active-cancellation techniques named above, the present invention employs a two-stage architecture. This architecture differs substantially from other two-stage architectures found in the prior art.
A two-stage signal processing architecture is employed in a Griffiths-Jim beamformer (Griffiths and Jim, 1982). The first stage of a Griffiths-Jim beamformer is delay-based: two microphone signals are delayed to remove time-of-flight differences with respect to a given look direction, and in contrast with classical beamforming, these delayed microphone signals are subtracted to create a reference noise signal. In a separate channel, the delayed microphone signals are added, as in classical beamforming, to create a signal in which the desired source is enhanced relative to the noise. Thus, the first stage of a Griffiths-Jim beamformer produces a reference noise signal and a signal that is predominantly the desired source. The noise reference is then employed in the second stage, using standard active noise-cancellation techniques, to improve the signal-to-noise ratio in the output.
The Griffiths-Jim beamformer suffers from the flaw that under reverberant conditions, the delay-based first stage cannot construct a reference noise signal devoid of the desired signal, whereas the second stage relies on the purity of that noise reference. If the noise reference is sufficiently contaminated with the desired source, the second stage suppresses the desired source, not the noise (Van Compernolle, 1990). Thus, the Griffiths-Jim beamformer incorrectly suppresses the desired signal under conditions that are normally considered favorable: when the signal-to-noise ratio in the microphones is high.
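The sum and difference channels of the first stage, and the contamination of the noise reference under reverberation, can be illustrated with a minimal sketch. It is not taken from Griffiths and Jim (1982). It assumes that the two microphone signals have already been delayed to align the look direction, omits the noise source entirely so that any leakage of the desired signal is plainly visible, scales the sum by one half, and models reverberation as a single hypothetical echo at one microphone.

```python
def first_stage(x1, x2):
    """Sum and difference channels for two already-steered microphone signals."""
    enhanced = [0.5 * (p + q) for p, q in zip(x1, x2)]
    noise_ref = [p - q for p, q in zip(x1, x2)]
    return enhanced, noise_ref


def with_echo(signal, lag, strength):
    """Add one delayed, scaled copy of the signal (a single assumed echo)."""
    return [v + (strength * signal[t - lag] if t >= lag else 0.0)
            for t, v in enumerate(signal)]


desired = [1.0, -1.0, 2.0, 0.5, -0.5, 1.5]

# Anechoic case: both microphones receive identical aligned copies.
_, ref_anechoic = first_stage(desired, desired)

# Reverberant case: the second microphone also receives an echo.
_, ref_reverberant = first_stage(desired, with_echo(desired, 2, 0.5))
```

In the anechoic case the noise reference is identically zero, so it contains none of the desired signal. In the reverberant case the reference contains a delayed, scaled copy of the desired signal, and a second stage that subtracts a filtered version of this reference would therefore tend to remove part of the desired source.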
Another two-stage architecture is described by Najar et al. (1994). Its second stage employs blind convolutive active cancellation. However, its first stage differs significantly from the first stage of the Griffiths-Jim beamformer. It attempts to produce separated outputs by adaptively filtering each microphone signal in its own channel. When the sources are spectrally similar, filters that produce partially separated outputs after the first stage are unlikely to exist.
Thus, it is desirable to provide an architecture for separation of sources that avoids the difficulties exhibited by existing techniques.