Sound source separation is the process of separating two or more sound sources into separate signals, given at least that many recorded microphone signals. For example, within a conference room, there may be five different people talking, and five microphones placed around the room to record their conversations. In this instance, sound source separation involves separating the five recorded microphone signals into a signal for each of the speakers. Sound source separation is used in a number of different applications, such as speech recognition. For example, in speech recognition, the speaker's voice is desirably isolated from any background noise or other speakers, so that the speech recognition process uses the cleanest signal possible to determine what the speaker is saying.
The diagram 100 of FIG. 1 shows an example environment in which sound source separation may be used. The voice of the speaker 104 is recorded by a number of differently located microphones 106, 108, 110, and 112. Because the microphones are located at different positions, they will record the voice of the speaker 104 at different times, at different volume levels, and with different amounts of noise. The goal of the sound source separation in this instance is to isolate in a single signal just the voice of the speaker 104 from the recorded microphone signals. Typically, the speaker 104 is modeled as a point source, although it is more diffuse in reality. Furthermore, the microphones 106, 108, 110, and 112 can be said to make up a microphone array.
One approach to sound source separation is to use a microphone array in combination with the response characteristics of each microphone. This approach is referred to as delay-and-sum beamforming. For example, a particular microphone may have the pickup pattern 200 of FIG. 2. The microphone is located at the intersection of the x axis 210 and the y axis 212, which is the origin. The lobes 202, 204, 206, and 208 indicate where the microphone is most sensitive. That is, the lobes indicate where the microphone has the greatest response, or gain. For example, the microphone modeled by the graph 200 has the greatest response where the lobe 202 intersects with the y axis 212 in the negative y direction. The pickup pattern 200 of FIG. 2 tends to be less selective at lower frequencies.
By using the pickup pattern of each microphone, along with the location of each microphone relative to the fixed position of the speaker, delay-and-sum beamforming can be used to separate the speaker's voice as an isolated signal. This is because the incidence angle between each microphone and the speaker can be determined a priori, as well as the relative delay in which the microphones will pick up the speaker's voice, and the degree of attenuation of the speaker's voice when each microphone records it. Together, this information is used to separate the speaker's voice as an isolated signal.
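The delay-and-sum idea described above can be illustrated with a minimal sketch. It assumes an anechoic scene in which the per-microphone delays (in samples) and attenuations are known a priori. The function name `delay_and_sum`, the delay and gain values, and the noise level are hypothetical and chosen only for illustration.

```python
import numpy as np

def delay_and_sum(mic_signals, delays_samples, gains):
    """Align each microphone signal by its known delay, undo its gain, and average."""
    mic_signals = np.asarray(mic_signals, dtype=float)
    n_mics, n_samples = mic_signals.shape
    out = np.zeros(n_samples)
    for sig, d, g in zip(mic_signals, delays_samples, gains):
        aligned = np.zeros(n_samples)
        # Advance the signal by d samples so the speaker's wavefront lines up.
        aligned[: n_samples - d] = sig[d:]
        out += aligned / g
    return out / n_mics

# Hypothetical scene: one source, four microphones, no reverberation.
rng = np.random.default_rng(0)
n = 1000
source = np.sin(2 * np.pi * 5 * np.arange(n) / n)
delays = [0, 3, 7, 12]          # assumed known a priori, in samples
gains = [1.0, 0.8, 0.6, 0.5]    # assumed known attenuation per microphone

mics = []
for d, g in zip(delays, gains):
    delayed = np.zeros(n)
    delayed[d:] = source[: n - d]
    mics.append(g * delayed + 0.3 * rng.standard_normal(n))  # independent noise

estimate = delay_and_sum(mics, delays, gains)

# Compare mean squared error over the samples where every microphone is aligned.
valid = slice(0, n - max(delays))
err_single = np.mean((mics[0][valid] - source[valid]) ** 2)
err_beam = np.mean((estimate[valid] - source[valid]) ** 2)
```

In this sketch the beamformed estimate has a lower error than the closest single microphone because the independent noise terms partially average out, while the aligned speech adds coherently.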
However, the delay-and-sum beamforming approach to sound source separation is useful primarily in soundproof rooms and other near-ideal environments where no reverberation is present. Reverberation, or “reverb,” is the bouncing of sound waves off surfaces such as walls, tables, and windows. Delay-and-sum beamforming assumes that no reverb is present. Where reverb is present, which is the case in most real-world situations where sound source separation is desired, this approach loses much of its accuracy.
An example of reverb is depicted in the graph 300 of FIG. 3. The graph 300 depicts the sound signals picked up by a microphone over time, as indicated by the time axis 302. The volume axis 304 indicates the relative amplitude of the volume of the signals recorded by the microphone. The original signal is indicated as the signal 306. Two reverberations are shown as a first reverb signal 308, and a second reverb signal 310. The presence of the reverb signals 308 and 310 limits the accuracy of the sound source separation using the delay-and-sum beamforming approach.
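As a rough illustration of the situation depicted in FIG. 3, reverberation can be modeled as delayed, attenuated copies of the original signal added to it. The helper `add_reverb` and the delay and attenuation values below are assumptions made for this sketch, not values taken from the figure.

```python
import numpy as np

def add_reverb(direct, echoes):
    """Return the direct signal plus delayed, attenuated copies of it.

    echoes is a list of (delay_in_samples, attenuation) pairs (hypothetical values).
    """
    direct = np.asarray(direct, dtype=float)
    out = direct.copy()
    for delay, atten in echoes:
        out[delay:] += atten * direct[: len(direct) - delay]
    return out

# A unit impulse stands in for the original signal 306.
impulse = np.zeros(16)
impulse[0] = 1.0

# Two reflections stand in for the reverb signals 308 and 310.
recorded = add_reverb(impulse, [(4, 0.5), (9, 0.25)])
```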
Another approach to sound source separation is known as independent component analysis (ICA) in the context of instantaneous mixing. This technique is also referred to as blind source separation (BSS). BSS means that no information regarding the sound sources is known a priori, apart from their assumed mutual statistical independence. In laboratory conditions, ICA in the context of instantaneous mixing achieves signal separation up to a permutation limitation. That is, the approach can separate the sound sources correctly, but cannot identify which output signal is the first sound source, which is the second sound source, and so on. However, BSS also fails in real-world conditions where reverberation is present, since it does not take into account reverb of the sound sources.
Mathematically, ICA for instantaneous mixing assumes that R microphone signals yi[n], collected in the vector y[n]=(y1[n], y2[n], . . . , yR[n]), are obtained by a linear combination of R sound source signals xi[n], collected in the vector x[n]=(x1[n], x2[n], . . . , xR[n]). This is written as:

$$y[n] = V\,x[n] \qquad (1)$$

for all n, where V is the R×R mixing matrix. The mixing is instantaneous in that the microphone signals at any time n depend on the sound source signals at the same time, but at no earlier time. In the absence of any information about the mixing, the BSS problem estimates a separating matrix $W = V^{-1}$ from the recorded microphone signals alone. The sound source signals are recovered by:

$$x[n] = W\,y[n]. \qquad (2)$$
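Equations (1) and (2), together with the permutation limitation noted earlier, can be checked numerically with a short sketch. The mixing matrix `V` and the signal sizes are hypothetical, and `W` is computed directly from `V` here only to show the ideal case; in blind separation `V` is not available.

```python
import numpy as np

rng = np.random.default_rng(1)
R, N = 3, 8
x = rng.laplace(size=(R, N))            # source signals, one row per source

V = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.4],
              [0.1, 0.6, 1.0]])         # hypothetical R x R mixing matrix

y = V @ x                               # equation (1), applied to all n at once
W = np.linalg.inv(V)                    # ideal separating matrix W = V^-1
x_rec = W @ y                           # equation (2)

# Permutation limitation: a row-permuted W separates equally well,
# but its outputs come out in a different order.
P = np.eye(R)[[2, 0, 1]]
x_perm = (P @ W) @ y
```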
A criterion is selected to estimate the unmixing matrix W. One solution is to use the probability density function (pdf) of the source signals, px(x[n]), such that the pdf of the recorded microphone signals is:

$$p_y(y[n]) = |W|\,p_x(W y[n]). \qquad (3)$$

Because each sound source sample x[n] is assumed to be independent of the samples x[n+i], i≠0, at other times, the joint probability is:
$$
\begin{aligned}
e^{\psi} &= p_y(y[0], y[1], \ldots, y[N-1]) \\
&= \prod_{n=0}^{N-1} p_y(y[n]) \\
&= |W|^{N} \prod_{n=0}^{N-1} p_x(W y[n]).
\end{aligned}
\qquad (4)
$$

The gradient of ψ, scaled by 1/N, is:

$$
\frac{1}{N}\frac{\partial \psi}{\partial W} = (W^{T})^{-1} + \frac{1}{N} \sum_{n=0}^{N-1} \phi(W y[n])\,(y[n])^{T}, \qquad (5)
$$

where φ(x) is:

$$
\phi(x) = \frac{\partial \ln p_x(x)}{\partial x}. \qquad (6)
$$
From equations (4), (5), and (6), a gradient descent solution, known as the infomax rule, can be obtained for W given px(x). That is, given the probability density function of the sound source signals, the separating matrix W can be obtained. The density function px(x) may be Gaussian, Laplacian, a mixture of Gaussians, or another type of prior, depending on the degree of separation desired. For example, a Laplacian prior or a mixture of Gaussian priors generally yields better separation of the sound source signals from the recorded microphone signals than a Gaussian prior does.
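The following sketch illustrates one way the gradient of equation (5) might be iterated for a Laplacian prior, for which equation (6) gives φ(x) = −sign(x). Because the likelihood is being maximized, the update moves along the gradient. The step size, iteration count, mixing matrix, and function names are assumptions chosen for illustration, not the method of any particular implementation.

```python
import numpy as np

def phi_laplacian(x):
    """Equation (6) for a Laplacian prior p_x(x) proportional to exp(-|x|)."""
    return -np.sign(x)

def avg_log_likelihood(W, y):
    """psi / N from equation (4) for the Laplacian prior, up to an additive constant."""
    return np.log(abs(np.linalg.det(W))) - np.mean(np.sum(np.abs(W @ y), axis=0))

def gradient(W, y):
    """Equation (5): (W^T)^-1 + (1/N) * sum_n phi(W y[n]) y[n]^T."""
    n_samples = y.shape[1]
    return np.linalg.inv(W.T) + phi_laplacian(W @ y) @ y.T / n_samples

def infomax(y, step=0.05, iterations=2000):
    """Plain gradient iteration on W, starting from the identity (assumed settings)."""
    W = np.eye(y.shape[0])
    for _ in range(iterations):
        W = W + step * gradient(W, y)   # move uphill on the likelihood
    return W

# Hypothetical test case: two Laplacian sources, instantaneous mixing.
rng = np.random.default_rng(0)
x = rng.laplace(size=(2, 5000))
V = np.array([[1.0, 0.6],
              [0.4, 1.0]])
y = V @ x

W = infomax(y)
before = avg_log_likelihood(np.eye(2), y)
after = avg_log_likelihood(W, y)
overall = W @ V   # close to a scaled permutation matrix when separation succeeds
```

Inspecting `overall` shows how much of each source leaks into the other output; small off-diagonal entries relative to the diagonal indicate good separation in this reverberation-free setting.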
As has been indicated, however, although the ICA approach in the context of instantaneous mixing does achieve sound source signal separation in environments where reverberation is non-existent, the approach is unsatisfactory where reverb is present. Because reverb is present in most real-world situations, the instantaneous mixing ICA approach is of limited practical use. An approach that does take into account reverberation is known as convolutional mixing ICA. Convolutional mixing takes into consideration the transfer functions between the sound sources and the microphones created by environmental acoustics. By considering environmental acoustics, convolutional mixing thus takes into account reverberation.
The primary disadvantage of convolutional mixing ICA is that, because it operates in the frequency domain instead of in the time domain, the permutation limitation of ICA occurs on a per-frequency component basis. This means that the reconstructed sound source signals may have frequency components belonging to different sound sources, resulting in incomprehensible reconstructed signals. For example, in the diagram 400 of FIG. 4, the output sound source signal 402 is reconstructed by convolutional mixing ICA from two sound source signals, a first sound source signal 404, and a second sound source signal 406. Each of the signals 402, 404, and 406 has a frequency spectrum from a low frequency fL to a high frequency fH. The output signal 402 is meant to reconstruct either the first signal 404 or the second signal 406.
However, in actuality, the first frequency component 408 of the output signal 402 is that of the second signal 406, and the second frequency component 410 of the output signal 402 is that of the first signal 404. That is, rather than the output signal 402 having the first and the second components 412 and 410 of the first signal 404, or the first and the second components 408 and 414 of the second signal 406, it has the first component 408 from the second signal 406, and the second component 410 from the first signal 404. To the human ear, and for applications such as speech recognition, the reconstructed output sound source signal 402 is meaningless.
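The per-frequency permutation described above can be mimicked with a small sketch. Two hypothetical sources each have one low and one high frequency component, and the output takes its low bins from the second source and its high bins from the first, as in FIG. 4. The bin numbers and the split point are assumptions chosen for illustration.

```python
import numpy as np

n = 64
t = np.arange(n)

# Hypothetical sources: each has one low- and one high-frequency component.
first = np.cos(2 * np.pi * 4 * t / n) + np.cos(2 * np.pi * 20 * t / n)
second = np.cos(2 * np.pi * 6 * t / n) + np.cos(2 * np.pi * 26 * t / n)

F1 = np.fft.rfft(first)
F2 = np.fft.rfft(second)
split = 16   # assumed boundary between "low" and "high" frequency bins

# Per-bin permutation: low bins from the second source, high bins from the first.
out_spec = np.where(np.arange(F1.size) < split, F2, F1)
output = np.fft.irfft(out_spec, n)

# The two strongest bins of the output come from different sources.
peaks = sorted(np.argsort(np.abs(out_spec))[-2:].tolist())
```

Here `peaks` holds bin 6 (from the second source) and bin 20 (from the first source), so the output matches neither source.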
Mathematically, convolutional mixing ICA is described with respect to two sound sources and two microphones, although the approach can be extended to any number R of sources and microphones. An example environment is shown in the diagram 500 of FIG. 5, in which the voices of a first speaker 502 and a second speaker 504 are recorded by a first microphone 506 and a second microphone 508. The first speaker 502 is represented as the point sound source x1[n], and the second speaker 504 is represented as the point sound source x2[n]. The first microphone 506 records the microphone signal y1[n], whereas the second microphone 508 records the microphone signal y2[n]. The input signals x1[n] and x2[n] are said to be filtered with filters gij[n] to generate the microphone signals, where the filters gij[n] take into account the position of the microphones, room acoustics, and so on. Reconstruction filters hij[n] are then applied to the microphone signals y1[n] and y2[n] to recover the original input signals, as the output signals x̂1[n] and x̂2[n].
This model is shown in the diagram 600 of FIG. 6. The voice of the first speaker 502, x1[n], is affected by environmental and other factors indicated by the filters 602a and 602b, represented as g11[n] and g12[n]. The voice of the second speaker 504, x2[n], is affected by environmental and other factors indicated by the filters 602c and 602d, represented as g21[n] and g22[n]. The first microphone 506 records a microphone signal y1[n] equal to x1[n]*g11[n]+x2[n]*g21[n], where * represents the convolution operator defined as
$$
y[n] = x[n] * h[n] = \sum_{m=-\infty}^{\infty} x[m]\,h[n-m].
$$

The second microphone 508 records a microphone signal y2[n] equal to x2[n]*g22[n]+x1[n]*g12[n]. The first microphone signal y1[n] is input into the reconstruction filters 604a and 604b, represented by h11[n] and h12[n]. The second microphone signal y2[n] is input into the reconstruction filters 604c and 604d, represented by h21[n] and h22[n]. The reconstructed source signal 502′ is determined by solving x̂1[n]=y1[n]*h11[n]+y2[n]*h21[n]. Similarly, the reconstructed source signal 504′ is determined by solving x̂2[n]=y2[n]*h22[n]+y1[n]*h12[n].
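The mixing half of the model of FIG. 6 can be written out as a short sketch using discrete convolution. The filter coefficients and the test signals are hypothetical, and all four filters are assumed to have the same length so that the convolved signals can be added directly.

```python
import numpy as np

def mix(x1, x2, g11, g12, g21, g22):
    """Return (y1, y2) with y1 = x1*g11 + x2*g21 and y2 = x2*g22 + x1*g12."""
    y1 = np.convolve(x1, g11) + np.convolve(x2, g21)
    y2 = np.convolve(x2, g22) + np.convolve(x1, g12)
    return y1, y2

# Hypothetical two-tap mixing filters (direct path plus one reflection).
g11 = np.array([1.0, 0.3])
g12 = np.array([0.4, 0.1])
g21 = np.array([0.3, 0.1])
g22 = np.array([1.0, 0.2])

# Impulses at different times stand in for the two speakers.
x1 = np.array([1.0, 0.0, 0.0, 0.0])
x2 = np.array([0.0, 0.0, 1.0, 0.0])

y1, y2 = mix(x1, x2, g11, g12, g21, g22)
```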
The reconstruction filters 604a, 604b, 604c, and 604d, or hij[n], completely recover the original signals of the speakers 502 and 504, or xi[n], if and only if their z-transforms are the inverse of the z-transforms of the mixing filters 602a, 602b, 602c, and 602d, or gij[n]. Mathematically, this is:
$$
\begin{pmatrix} H_{11}(z) & H_{12}(z) \\ H_{21}(z) & H_{22}(z) \end{pmatrix}
= \begin{pmatrix} G_{11}(z) & G_{12}(z) \\ G_{21}(z) & G_{22}(z) \end{pmatrix}^{-1}
= \frac{1}{G_{11}(z)\,G_{22}(z) - G_{12}(z)\,G_{21}(z)}
\begin{pmatrix} G_{22}(z) & -G_{12}(z) \\ -G_{21}(z) & G_{11}(z) \end{pmatrix}.
\qquad (7)
$$
The mixing filters 602a, 602b, 602c, and 602d, or gij[n], can be assumed to be finite impulse response (FIR) filters, having a length that depends on environmental and other factors. These factors may include room size, microphone position, wall absorbance, and so on. This means that the reconstruction filters 604a, 604b, 604c, and 604d, or hij[n], have an infinite impulse response. Since using an infinite number of coefficients is impractical, the reconstruction filters are assumed to be FIR filters of length q, which means that the original signals from the speakers 502 and 504, xi[n], will not be recovered exactly as x̂i[n]. That is, xi[n]≠x̂i[n], but xi[n]≈x̂i[n].
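When the mixing filters are known, the inverse of a 2×2 filter matrix and its truncation to FIR filters of length q can be illustrated as follows. This sketch evaluates the matrix inverse on a K-point frequency grid and takes q = K. The filter coefficients are hypothetical and were chosen so that the exact inverse is causal and decays quickly, which makes the truncation error negligible here. With real room responses the error would generally be larger, and in blind separation the gij[n] are not known at all.

```python
import numpy as np

K = 64  # DFT size, used here as the assumed FIR length q

# Hypothetical two-tap mixing filters g_ij[n].
g = {"11": [1.0, 0.3], "12": [0.4, 0.1], "21": [0.3, 0.1], "22": [1.0, 0.2]}
G = {k: np.fft.fft(v, K) for k, v in g.items()}   # G_ij sampled on the unit circle

# Closed-form 2x2 inverse, evaluated independently at every frequency bin.
det = G["11"] * G["22"] - G["12"] * G["21"]
H = {"11": G["22"] / det, "12": -G["12"] / det,
     "21": -G["21"] / det, "22": G["11"] / det}

# Length-K FIR approximations of the infinite-length reconstruction filters.
h = {k: np.real(np.fft.ifft(v)) for k, v in H.items()}

# Mix two random sources, then reconstruct them with the truncated filters.
rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
y1 = (np.convolve(x1, g["11"]) + np.convolve(x2, g["21"]))[:n]
y2 = (np.convolve(x2, g["22"]) + np.convolve(x1, g["12"]))[:n]
x1_hat = (np.convolve(y1, h["11"]) + np.convolve(y2, h["21"]))[:n]
x2_hat = (np.convolve(y2, h["22"]) + np.convolve(y1, h["12"]))[:n]
```

Reducing `K` shortens the reconstruction filters and increases the gap between xi[n] and x̂i[n], which is the approximation the paragraph above describes.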
The convolutional mixing ICA approach achieves sound separation by estimating the reconstruction filters hij[n] from the microphone signals yj[n] using the infomax rule. Reverberation is accounted for, as well as other arbitrary transfer functions. However, estimation of the reconstruction filters hij[n] using the infomax rule still represents a less than ideal approach to sound separation, because, as has been mentioned, permutations can occur on a per-frequency component basis in each of the output signals x̂i[n]. Whereas the BSS and instantaneous mixing ICA approaches achieve proper sound separation but cannot take into account reverb, the convolutional mixing infomax ICA approach can take into account reverb but achieves improper sound separation.
For these and other reasons, therefore, there is a need for the present invention.