Field of the Invention
The present invention relates to a sound source separation technique.
Description of the Related Art
Recently, moving images can be captured not only by video cameras but also by digital cameras, and opportunities to pick up (record) sound at the same time are increasing. This poses the problem that sounds other than a target sound are mixed in when the target sound is picked up. Research has therefore been conducted on extracting only a desired signal from a sound signal in which sounds from a plurality of sound sources are mixed. For example, sound source separation techniques based on array signal processing using a plurality of microphone signals, such as beamforming and independent component analysis (ICA), have been studied extensively.
Unfortunately, sound source separation by conventional array signal processing has the problem (the under-determined problem) that it cannot simultaneously separate more sound sources than there are microphones. A sound source separation method using a multi-channel Wiener filter is known as a method that solves this problem. This technique is disclosed in the following document.
N. Q. K. Duong, E. Vincent, R. Gribonval, “Under-Determined Reverberant Audio Source Separation Using a Full-rank Spatial Covariance Model”, IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1830-1840, September 2010.
This document is briefly explained below. Assume that M (≥2) microphones pick up sound source signals sj (j=1, 2, ..., J) generated by J sound sources. To simplify the explanation, assume that the number of microphones is two. An observation signal X obtained by the two microphones can be written as follows:
$$X(t) = [x_1(t)\ \ x_2(t)]^T$$
where [ ]^T represents the transpose of a matrix, and t represents time.
Performing time-frequency conversion on this observation signal yields:
$$X(n,f) = [x_1(n,f)\ \ x_2(n,f)]^T$$
where f represents a frequency bin and n represents the frame number (n=1, 2, ..., N).
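As a rough illustration of the time-frequency conversion above, the following Python sketch computes a short-time Fourier transform for each microphone channel. The function name `stft_multichannel`, the Hann window, and the default frame length and hop size are assumptions made for this example only; the analysis settings of the cited document are not stated here.

```python
import numpy as np

def stft_multichannel(x, frame_len=512, hop=256):
    """Short-time Fourier transform of a multi-channel signal.

    x: real array of shape (num_samples, M), one column per microphone.
       num_samples is assumed to be at least frame_len.
    Returns X of shape (N, F, M), so that X[n, f] is the observation
    vector [x1(n,f), ..., xM(n,f)] for frame n and frequency bin f.
    """
    x = np.asarray(x, dtype=float)
    num_samples, num_mics = x.shape
    window = np.hanning(frame_len)               # assumed analysis window
    num_frames = 1 + (num_samples - frame_len) // hop
    num_bins = frame_len // 2 + 1                # one-sided spectrum
    X = np.empty((num_frames, num_bins, num_mics), dtype=complex)
    for n in range(num_frames):
        segment = x[n * hop : n * hop + frame_len, :]
        # Window each channel, then transform along the time axis.
        X[n] = np.fft.rfft(segment * window[:, None], axis=0)
    return X
```

With a two-microphone recording stored as an array of shape (num_samples, 2), `stft_multichannel(x)[n, f]` would then play the role of X(n,f) in the text.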
Letting hj(f) be the transfer characteristic from sound source j to the microphones, and cj(n,f) be the signal of each sound source as observed by the microphones (hereinafter referred to as a source image), the observation signal can be written as the superposition of the signals of the sound sources as follows:
$$X(n,f) = \sum_j c_j(n,f) = \sum_j s_j(n,f)\, h_j(f) \tag{1}$$
It is assumed that the sound source position does not move during the sound pickup time, and the transfer characteristic hj(f) from a sound source to a microphone does not change with time.
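The superposition in equation (1) can be sketched in Python as follows. The function name `mix_sources` and the array layouts are assumptions made for illustration; the sketch simply multiplies each source signal by its time-invariant transfer characteristic and sums over the sources.

```python
import numpy as np

def mix_sources(s, h):
    """Build source images and the observation according to equation (1).

    s: complex array (J, N, F), source signals sj(n,f).
    h: complex array (J, F, M), transfer characteristics hj(f),
       assumed constant over all frames n.
    Returns (c, X): source images c of shape (J, N, F, M) and the
    observation X of shape (N, F, M).
    """
    # cj(n,f) = sj(n,f) * hj(f): broadcast h over the frame axis.
    c = s[:, :, :, None] * h[:, None, :, :]
    # X(n,f) is the sum of the source images over all sources j.
    X = c.sum(axis=0)
    return c, X
```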
Furthermore, letting Rcj(n,f) be the correlation matrix of a source image, vj(n,f) be the variance of the sound source signal in each time-frequency bin, and Rj(f) be a time-independent spatial correlation matrix of each sound source, assume that the following relationship holds:
$$R_{c_j}(n,f) = v_j(n,f)\, R_j(f) \tag{2}$$
for
$$R_{c_j}(n,f) = c_j(n,f)\, c_j(n,f)^H$$
where ( )^H represents the Hermitian transpose.
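One way to see how a relationship of the form (2) can arise is to substitute the model of equation (1), cj(n,f) = sj(n,f) hj(f), into the definition of the correlation matrix. Identifying vj(n,f) with |sj(n,f)|^2 and Rj(f) with hj(f) hj(f)^H in the step below is an illustrative assumption that yields a rank-1 spatial correlation matrix; as its title indicates, the cited document works with a more general full-rank Rj(f).

```latex
\begin{aligned}
R_{c_j}(n,f) &= c_j(n,f)\, c_j(n,f)^H \\
             &= \bigl(s_j(n,f)\, h_j(f)\bigr)\bigl(s_j(n,f)\, h_j(f)\bigr)^H \\
             &= \underbrace{|s_j(n,f)|^2}_{v_j(n,f)}\;
                \underbrace{h_j(f)\, h_j(f)^H}_{R_j(f)}
\end{aligned}
```

The first factor depends on both time and frequency, while the second depends only on frequency because hj(f) is assumed not to change with time.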
By using the above relationship, the probability of obtaining the observation signal as the superposition of all source images is given, and parameter estimation is performed using an EM algorithm. In the E-step:
$$W_j(n,f) = R_{c_j}(n,f)\, R_x^{-1}(n,f) \tag{3}$$
$$\hat{c}_j(n,f) = W_j(n,f)\, X(n,f) \tag{4}$$
$$\hat{R}_{c_j}(n,f) = \hat{c}_j(n,f)\, \hat{c}_j^H(n,f) + \bigl(I - W_j(n,f)\bigr)\, R_{c_j}(n,f) \tag{5}$$
In the M-step:
$$v_j(n,f) = \frac{1}{M}\, \mathrm{tr}\!\left( R_j^{-1}(f)\, \hat{R}_{c_j}(n,f) \right) \tag{6}$$
$$R_j(f) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{v_j(n,f)}\, \hat{R}_{c_j}(n,f) \tag{7}$$
$$R_x(n,f) = \sum_j v_j(n,f)\, R_j(f) \tag{8}$$
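The E-step and M-step above can be sketched as a single Python function, assuming the array layouts noted in the docstring. The function name `em_iteration`, the small constant `eps` added to keep the matrix inverse and the division well defined, and the choice to use the updated vj(n,f) in equation (7) are assumptions of this sketch and are not taken from the cited document.

```python
import numpy as np

def em_iteration(X, v, R, eps=1e-10):
    """One EM iteration following equations (3) to (8).

    X: complex (N, F, M) observation.
    v: real (J, N, F) variances vj(n,f).
    R: complex (J, F, M, M) spatial correlation matrices Rj(f).
    Returns the updated (v, R, Rx).
    """
    num_mics = X.shape[-1]
    eye = np.eye(num_mics)

    # Current model: Rcj = vj * Rj (2) and Rx = sum_j vj * Rj (8).
    Rc = v[..., None, None] * R[:, None]        # (J, N, F, M, M)
    Rx = Rc.sum(axis=0) + eps * eye             # eps*I is an added safeguard

    # E-step: Wiener gain (3), source image estimate (4), correlation (5).
    W = Rc @ np.linalg.inv(Rx)                  # (J, N, F, M, M)
    c_hat = (W @ X[None, ..., None])[..., 0]    # (J, N, F, M)
    Rc_hat = (c_hat[..., :, None] * c_hat[..., None, :].conj()
              + (eye - W) @ Rc)

    # M-step: variance (6), spatial correlation (7), mixture correlation (8).
    R_inv = np.linalg.inv(R)[:, None]           # (J, 1, F, M, M)
    v_new = np.trace(R_inv @ Rc_hat, axis1=-2, axis2=-1).real / num_mics
    v_new = np.maximum(v_new, eps)              # avoid division by zero in (7)
    R_new = (Rc_hat / v_new[..., None, None]).mean(axis=1)   # average over n
    Rx_new = (v_new[..., None, None] * R_new[:, None]).sum(axis=0)
    return v_new, R_new, Rx_new
```

In use, the function would be called repeatedly, feeding the returned `v` and `R` back in, starting from some initial guess such as unit variances and identity spatial correlation matrices.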
By iterating the above calculations, the parameters Rcj(n,f) (=vj(n,f)*Rj(f)) and Rx(n,f) used to generate the multi-channel Wiener filter that performs the sound source separation are obtained. An estimate of the source image cj(n,f), that is, of the observation signal of each sound source, is output by using the calculated parameters as follows:
$$\hat{c}_j(n,f) = R_{c_j}(n,f)\, R_x^{-1}(n,f)\, X(n,f) \tag{9}$$
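The final filtering step of equation (9) can be sketched in Python as follows, assuming the parameters vj(n,f) and Rj(f) have already been estimated. The function name `wiener_separate` and the array layouts are assumptions made for illustration.

```python
import numpy as np

def wiener_separate(X, v, R):
    """Apply the multi-channel Wiener filter of equation (9).

    X: complex (N, F, M) observation.
    v: real (J, N, F) variances vj(n,f), assumed positive.
    R: complex (J, F, M, M) spatial correlation matrices Rj(f),
       assumed invertible.
    Returns the estimated source images, shape (J, N, F, M).
    """
    Rc = v[..., None, None] * R[:, None]     # Rcj = vj * Rj, (J, N, F, M, M)
    Rx = Rc.sum(axis=0)                      # Rx = sum_j Rcj, (N, F, M, M)
    W = Rc @ np.linalg.inv(Rx)               # Wiener gain per source
    # Apply each gain matrix to the observation vector X(n,f).
    return (W @ X[None, ..., None])[..., 0]
```

Because the Rcj(n,f) sum to Rx(n,f), the gain matrices sum to the identity matrix, so the estimated source images add up to the observation itself. This makes a convenient sanity check on an implementation.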
In the above-mentioned conventional method, it is assumed that the sound source position does not move during the sound pickup time so that the spatial correlation matrix can be obtained stably. This poses the problem that stable sound source separation cannot be performed if the relative positions of a sound source and the sound pickup device change (for example, when the sound source itself moves, or when a sound pickup device such as a microphone array rotates or moves).