1. Field of the Invention
The present invention relates to a signal processing apparatus, a signal processing method, and a program. More specifically, the present invention relates to a signal processing apparatus, a signal processing method, and a program, which separate signals in which a plurality of signals are mixed by using Independent Component Analysis (ICA), and further estimate the sound source direction.
2. Description of the Related Art
The present invention relates to a technique for estimating the sound source direction (the direction from which a sound arrives as viewed from a microphone; also referred to as Direction Of Arrival (DOA)). In particular, the present invention relates to a technique for estimating the sound source direction in real-time by using Independent Component Analysis (ICA).
Since the present invention concerns both sound source direction estimation and ICA, both will be described as the related art. Then, problems associated with incorporating sound source direction estimation into real-time ICA will be described. The description will be given in the order of A and B below.
A. Description of the Related Art
B. Problems of the Related Art
[A. Description of the Related Art]
First, a description will be given of A1 to A4 below as the related art.
A1. Sound Source Direction Estimation Using Phase Difference Between Microphones
A2. Description of ICA
A3. Sound Source Direction Estimation Using ICA
A4. Real-Time Implementation of ICA
[A1. Sound Source Direction Estimation Using Phase Difference Between Microphones]
The most basic scheme for estimating the sound source direction is to exploit the arrival-time difference, phase difference, or the like by using a plurality of microphones. This scheme is introduced as the related art also in Japanese Unexamined Patent Application Publication No. 2005-77205, Japanese Unexamined Patent Application Publication No. 2005-49153, and Japanese Patent No. 3069663.
This basic sound source direction estimation scheme will be described with reference to FIG. 1. As shown in FIG. 1, an environment is considered in which there is a single sound source 11, and two microphones 12 and 13 are installed at a spacing d. If the distance from each of the microphones 12 and 13 to the sound source 11 is sufficiently large relative to the microphone spacing d, sound waves can be approximated by plane waves, in which case it can be assumed that sound arrives at the two microphones at the same angle θ. In FIG. 1, the angle θ is defined with the direction orthogonal to the line connecting the two microphones (microphone-pair direction) taken as 0.
Comparing the path from the sound source 11 to the microphone 1 (12) with the path from the sound source 11 to the microphone 2 (13), the latter can be approximated as being longer by [d·sin θ], so the signal observed at the microphone 2 (13) lags in phase by an amount proportional to [d·sin θ] relative to the signal observed at the microphone 1 (12).
Alternatively, when the angle [θ′] formed by the microphone-pair direction and the sound wave propagation direction is used instead of θ, the difference in distance to the sound source between the two microphones can be represented as [d·cos θ′].
Letting the observed signals at the microphone 1 and the microphone 2 be x1(t) and x2(t), respectively, and the path difference from the sound source 11 to the microphones 1 and 2 be [d·sin θ], the resulting relational expression is represented by Equation [1.1] below.
$$x_2(t) = x_1\!\left(t - \frac{d \sin\theta}{C} F\right) \qquad \text{[1.1]}$$
t: sample index
d: microphone spacing
C: sound velocity
F: sampling frequency
In the above-mentioned equation, t denotes the discrete time, and indicates the sample index. Since the signal x2(t) observed at the microphone 2 lags behind x1(t) by an amount corresponding to its greater distance from the sound source, x2(t) can be represented as Equation [1.1] if attenuation is ignored.
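As a rough numerical illustration of Equation [1.1], the lag in samples can be computed from d, θ, C, and F. The following is only a sketch: the function names `delay_in_samples` and `delayed_copy` and all numerical values are assumptions made for illustration, and a fractional lag is simply rounded to the nearest sample.

```python
import math

def delay_in_samples(d, theta, C, F):
    """Lag of x2 behind x1 in samples: (d * sin(theta) / C) * F, as in Equation [1.1]."""
    return d * math.sin(theta) / C * F

def delayed_copy(x1, lag):
    """Return x2(t) = x1(t - lag) for an integer lag; samples outside x1 are taken as zero."""
    return [x1[t - lag] if 0 <= t - lag < len(x1) else 0.0 for t in range(len(x1))]

# Illustrative values only (not taken from the specification).
d = 0.17                    # microphone spacing in meters
C = 340.0                   # sound velocity in m/s
F = 16000.0                 # sampling frequency in Hz
theta = math.radians(30.0)  # angle measured from the direction orthogonal to the pair

lag = delay_in_samples(d, theta, C, F)   # about 4 samples for these values
x1 = [0.0, 1.0, 0.5, -0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0]
x2 = delayed_copy(x1, round(lag))        # fractional lags are rounded in this sketch
```

With these assumed values the lag is about four samples, so x2 is x1 shifted by four positions. In practice the lag is generally not an integer, which is one reason the phase difference is evaluated in the time-frequency domain.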
Therefore, if a phase lag can be detected on the basis of the observed signals at the respective microphones, the angle [θ] that determines the sound source direction can be computed in accordance with the above-mentioned equation. In the following, a description will be given of a method of computing the sound source direction by transforming observed signals into the time-frequency domain.
Let X1(ω, t) and X2(ω, t) represent the results of applying a Short-time Fourier Transform (STFT) described later to a plurality of observed signals x1 and x2, respectively, where ω and t denote the frequency bin index and the frame index, respectively. The observed signals [x1, x2] before undergoing the transform are referred to as time domain signals, and the signals [X1, X2] after undergoing the short-time Fourier transform are referred to as time-frequency domain (or STFT domain) signals.
Since a phase lag in the time domain corresponds to a complex number multiple in the time-frequency domain, the relational expression [1.1] in the time domain can be represented as Equation [1.2] below.
$$X_2(\omega, t) = \exp\!\left(-j\pi \frac{\omega - 1}{M - 1} \frac{d \sin\theta}{C} F\right) X_1(\omega, t) \qquad \text{[1.2]}$$
t: frame index
ω: frequency bin index
M: total number of frequency bins
j: imaginary unit
To extract the term containing the angle [θ] indicating the sound source direction, an operation represented as Equation [1.3] may be performed.
$$\operatorname{angle}\!\left(\frac{X_1(\omega, t)}{X_2(\omega, t)}\right) = \operatorname{angle}\!\left(X_1(\omega, t)\,\overline{X_2(\omega, t)}\right) = \pi \frac{\omega - 1}{M - 1} \frac{d \sin\theta}{C} F \qquad \text{[1.3]}$$
In the above-mentioned equation, angle( ) denotes a function for finding the argument of a complex number within the range of −π to +π, and X2 with an overline denotes a complex conjugate of X2. Lastly, the sound source direction can be estimated by Equation [1.4] below.
$$\hat{\theta}(\omega) = \operatorname{asin}\!\left(\frac{(M - 1)\,C}{\pi(\omega - 1)\,dF}\,\operatorname{angle}\!\left(X_1(\omega, t)\,\overline{X_2(\omega, t)}\right)\right) \qquad \text{[1.4]}$$
In this equation, asin denotes the inverse function of sin.
Also, hat(θ(ω)) means that the angle θ in the frequency bin ω is a value estimated from an observed value. It should be noted that hat as used herein means the symbol (^).
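Equations [1.2] to [1.4] can be checked numerically for a single frequency bin. The sketch below synthesizes X2 from X1 with Equation [1.2] and recovers θ with Equation [1.4]. The function names and all parameter values are illustrative assumptions; the clamp before asin is an added guard against rounding and is not part of Equation [1.4].

```python
import cmath
import math

def steering_phase(omega, M, d, theta, C, F):
    """Phase term pi * (omega-1)/(M-1) * (d*sin(theta)/C) * F of Equations [1.2] and [1.3]."""
    return math.pi * (omega - 1) / (M - 1) * d * math.sin(theta) / C * F

def estimate_theta(X1, X2, omega, M, d, C, F):
    """Equation [1.4]: asin((M-1)C / (pi(omega-1)dF) * angle(X1 * conj(X2)))."""
    phase = cmath.phase(X1 * X2.conjugate())   # argument in the range -pi to +pi
    arg = (M - 1) * C / (math.pi * (omega - 1) * d * F) * phase
    arg = max(-1.0, min(1.0, arg))             # guard against rounding (not in Equation [1.4])
    return math.asin(arg)

# Illustrative values only (not taken from the specification).
M, omega = 257, 33            # number of frequency bins and a 1-based bin index
d, C, F = 0.04, 340.0, 16000.0
theta_true = math.radians(25.0)

X1 = cmath.rect(0.8, 0.3)     # arbitrary observation at the microphone 1
X2 = cmath.exp(-1j * steering_phase(omega, M, d, theta_true, C, F)) * X1   # Equation [1.2]

theta_hat = estimate_theta(X1, X2, omega, M, d, C, F)
```

Two limitations are visible in the formula itself. The bin ω = 1 makes the denominator zero and has to be skipped. In addition, because angle( ) only returns values between −π and +π, the estimate is unambiguous only while the phase term of Equation [1.3] stays within that range.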
While Equation [1.4] mentioned above is for a given specific ω (frequency bin index) and t (frame index), by calculating the sound source directions with respect to a plurality of ω's and t's and then taking the mean, a stable value of θ can be obtained. In addition, it is also possible to prepare n (n>2) microphones, and calculate the sound source direction with respect to each of the n(n−1)/2 pairs.
Equations [1.5] to [1.7] below are equations for cases in which a plurality of microphones and a plurality of frames are used.
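$$X(\omega, t) = \begin{bmatrix} X_1(\omega, t) \\ \vdots \\ X_n(\omega, t) \end{bmatrix} \qquad \text{[1.5]}$$

$$E_t\!\left[X(\omega, t)\right] = \begin{bmatrix} E_t\!\left[X_1(\omega, t)\right] \\ \vdots \\ E_t\!\left[X_n(\omega, t)\right] \end{bmatrix} \qquad \text{[1.6]}$$

$$E_t\!\left[X_k(\omega, t)\right] = \frac{1}{T_1 - T_0 + 1} \sum_{t = T_0}^{T_1} X_k(\omega, t) \qquad \text{[1.7]}$$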
A vector whose elements are the observed signals at the individual microphones is defined by Equation [1.5], and the mean of the vector is defined by Equation [1.6]. It should be noted that Et[ ] in this equation denotes the mean over the frames in a given segment, and is defined by Equation [1.7]. By using Et[Xi(ω, t)] and Et[Xm(ω, t)] from Equation [1.6] in place of X1(ω, t) and X2(ω, t) in Equation [1.4], the mean sound source direction over the T0-th to T1-th frames, calculated from the pair of the i-th microphone and the m-th microphone, is found.
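The use of frame means and microphone pairs can be sketched as follows. This is only an illustration under stated assumptions: the microphones are assumed to lie on one line so that the spacing of a pair is the difference of two positions, the names `frame_mean` and `pairwise_directions` are hypothetical, and the test signal is synthesized from a single source.

```python
import cmath
import math
from itertools import combinations

def frame_mean(X_k, T0, T1):
    """Equation [1.7]: mean of X_k(omega, t) over frames T0..T1 inclusive."""
    return sum(X_k[T0:T1 + 1]) / (T1 - T0 + 1)

def pairwise_directions(X, positions, omega, M, C, F, T0, T1):
    """Apply Equation [1.4] to the frame means of every microphone pair (i, m).

    X[k][t] is the observation of microphone k at bin omega and frame t.
    positions[k] is the coordinate of microphone k on an assumed linear array.
    """
    means = [frame_mean(X_k, T0, T1) for X_k in X]        # Equation [1.6]
    directions = {}
    for i, m in combinations(range(len(X)), 2):           # n(n-1)/2 pairs
        d = positions[m] - positions[i]
        phase = cmath.phase(means[i] * means[m].conjugate())
        arg = (M - 1) * C / (math.pi * (omega - 1) * d * F) * phase
        directions[(i, m)] = math.asin(max(-1.0, min(1.0, arg)))
    return directions

# Illustrative values only (not taken from the specification).
M, omega, C, F = 257, 33, 340.0, 16000.0
positions = [0.0, 0.04, 0.08]
theta_true = math.radians(25.0)
source = [1.0 + 0.5j, 0.8 - 0.2j, 1.2 + 0.1j, 0.9 + 0.3j]   # one value per frame

def lag_phase(p):
    """Phase lag of a microphone at coordinate p, following Equation [1.2]."""
    return math.pi * (omega - 1) / (M - 1) * p * math.sin(theta_true) / C * F

X = [[cmath.exp(-1j * lag_phase(p)) * s for s in source] for p in positions]
directions = pairwise_directions(X, positions, omega, M, C, F, 0, len(source) - 1)
mean_direction = sum(directions.values()) / len(directions)
```

For three microphones this yields three pairs, and in this noise-free single-source setting every pair returns the same angle. With real observations the per-pair estimates differ, and averaging them is what stabilizes the result.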
Since this sound source estimation method using a phase difference involves less processing cost than that in the case of a scheme using ICA described later, the sound source direction can be computed in real-time (with a delay of less than one frame) and with high frequency (frame by frame).
On the other hand, in environments in which a plurality of sound sources are playing simultaneously, it is not possible to find the sound source direction. In addition, even when a single sound source is present, in environments where large reflections and reverberations exist, the accuracy of the direction decreases.
[A2. Description of ICA]
Now, before describing a sound source estimation method using Independent Component Analysis (ICA), the ICA itself will be described first.
ICA refers to a type of multivariate analysis, and is a technique for separating multidimensional signals by exploiting statistical properties of the signals. For details on ICA itself, reference should be made to, for example, “Introduction to the Independent Component Analysis” (Noboru Murata, Tokyo Denki University Press).
Hereinbelow, a description will be given of ICA for sound signals, in particular, ICA in the time-frequency domain.
As shown in FIG. 2, a situation is considered in which different sounds are being played from N sound sources, and those sounds are observed at n microphones. The sounds (source signals) produced from the sound sources are subject to time lags, reflections, and so on before arriving at the microphones. Therefore, the signal observed at a microphone k (observed signal) can be represented as an equation that sums up, over all sound sources, the convolutions between the source signals and the transfer functions, as indicated by Equation [2.1] below. Hereinafter, these mixtures will be referred to as “convolutive mixtures”.
$$x_k(t) = \sum_{j=1}^{N} \sum_{l=0}^{L} a_{kj}(l)\, s_j(t - l) = \sum_{j=1}^{N} \left\{ a_{kj} * s_j \right\} \qquad \text{[2.1]}$$
Observed signals for all microphones can be represented by a single equation as in Equation [2.2] below.
$$x(t) = A_{[0]}\, s(t) + \dots + A_{[L]}\, s(t - L) \qquad \text{[2.2]}$$

where

$$s(t) = \begin{bmatrix} s_1(t) \\ \vdots \\ s_N(t) \end{bmatrix}, \quad x(t) = \begin{bmatrix} x_1(t) \\ \vdots \\ x_n(t) \end{bmatrix}, \quad A_{[l]} = \begin{bmatrix} a_{11}(l) & \cdots & a_{1N}(l) \\ \vdots & \ddots & \vdots \\ a_{n1}(l) & \cdots & a_{nN}(l) \end{bmatrix}$$
Here, x(t) and s(t) are column vectors having xk(t) and sk(t) as elements, respectively. A[l] is an n×N matrix having akj(l) as elements. In the following description, it is assumed that n=N.
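The convolutive mixture of Equation [2.1] can be written out directly as nested sums. The following is a minimal sketch with hypothetical names and toy values: `a[k][j]` holds the L+1 filter coefficients from the sound source j to the microphone k, and source samples before t = 0 are assumed to be zero.

```python
def convolutive_mix(a, s):
    """Equation [2.1]: x_k(t) = sum over j and l of a[k][j][l] * s[j][t - l].

    a[k][j] is the list of filter coefficients a_kj(0..L) from source j to microphone k.
    s[j] is the source signal j; samples before t = 0 are treated as zero.
    """
    n, N, T = len(a), len(s), len(s[0])
    x = [[0.0] * T for _ in range(n)]
    for k in range(n):
        for j in range(N):
            for l, coeff in enumerate(a[k][j]):
                for t in range(l, T):
                    x[k][t] += coeff * s[j][t - l]
    return x

# Toy example with n = N = 2 and L = 1 (values chosen only for illustration).
s = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 2.0, 0.0, 0.0]]
a = [[[1.0, 0.5], [0.0, 1.0]],    # filters into the microphone 1
     [[0.0, 1.0], [1.0, 0.0]]]    # filters into the microphone 2
x = convolutive_mix(a, s)
```

Each observed signal contains delayed and scaled copies of every source, which is why the sources cannot be recovered by a single per-sample matrix inversion in the time domain.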
It is common knowledge that convolutive mixtures in the time domain are represented by instantaneous mixtures in the time-frequency domain. An analysis that exploits this characteristic is ICA in the time-frequency domain.
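The correspondence between convolution and per-frequency multiplication can be verified on a single frame. The sketch below uses a plain discrete Fourier transform and a circular convolution, for which the relation is exact; these are simplifying assumptions, since with a short-time Fourier transform the relation holds only approximately, and the function names are hypothetical.

```python
import cmath

def dft(x):
    """Plain O(n^2) discrete Fourier transform of a list of numbers."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * w * t / n) for t in range(n))
            for w in range(n)]

def circular_convolve(a, s):
    """x(t) = sum over l of a(l) * s((t - l) mod n), a convolution that wraps around the frame."""
    n = len(s)
    return [sum(a[l] * s[(t - l) % n] for l in range(len(a))) for t in range(n)]

# Toy source frame and transfer function (illustrative values).
s = [1.0, -0.5, 0.25, 2.0, 0.0, -1.0, 0.5, 0.75]
a = [1.0, 0.6, 0.3]

x = circular_convolve(a, s)
S = dft(s)
A = dft(a + [0.0] * (len(s) - len(a)))   # zero-pad the filter to the frame length
X = dft(x)

# In every frequency bin, X(w) equals A(w) * S(w) up to rounding error.
max_err = max(abs(X[w] - A[w] * S[w]) for w in range(len(s)))
```

Because each frequency bin reduces to a single complex multiplication, a mixture of several sources becomes one matrix product per bin, which is the instantaneous mixture referred to above.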
Concerning the time-frequency domain ICA itself, reference should be made to, for example, Japanese Unexamined Patent Application Publication No. 2005-49153 “19.2.4 Fourier Transform Methods” of “Explanation of Independent Component Analysis” and Japanese Unexamined Patent Application Publication No. 2006-238409 “AUDIO SIGNAL SEPARATING APPARATUS/NOISE REMOVAL APPARATUS AND METHOD”.
Hereinbelow, features that are relevant to the present invention will be mainly described.
Application of a short-time Fourier transform on both sides of Equation [2.2] mentioned above yields Equation [3.1] below.
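$$X(\omega, t) = A(\omega)\, S(\omega, t) \qquad \text{[3.1]}$$

$$X(\omega, t) = \begin{bmatrix} X_1(\omega, t) \\ \vdots \\ X_n(\omega, t) \end{bmatrix} \qquad \text{[3.2]}$$

$$A(\omega) = \begin{bmatrix} A_{11}(\omega) & \cdots & A_{1N}(\omega) \\ \vdots & \ddots & \vdots \\ A_{n1}(\omega) & \cdots & A_{nN}(\omega) \end{bmatrix} \qquad \text{[3.3]}$$

                                         S          ⁡                      (                          ω              ,              t                        )                          =                  [                                                                                          S                    1                                    ⁡                                      (                                          ω                      ,                      t                                        )        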
                                                                                              ⋮                                                                                                          S                    N                                    ⁡                                      (                                          ω                      ,                      t                                        )                                                                                ]                                    [        3.4        ]                                          Y          ⁡                      (                          ω              ,              t                        )                          =                              W            ⁡                          (              ω              )                                ⁢                      X            ⁡                          (                              ω                ,                t                            )                                                          [        3.5        ]                                          Y          ⁡                      (                          ω              ,              t                        )                          =                  [                                                                                          Y                    1                                    ⁡                                      (                                          ω                      ,                      t                                        )                                                                                                      ⋮                                                                                                          Y                    n                                    ⁡                                      (                                          ω    
                  ,                      t                                        )                                                                                ]                                    [        3.6        ]                                          W          ⁡                      (            ω            )                          =                  [                                                                                          W                    11                                    ⁡                                      (                    ω                    )                                                                              …                                                                                  W                                          1                      ⁢                      n                                                        ⁡                                      (                    ω                    )                                                                                                      ⋮                                            ⋱                                            ⋮                                                                                                          W                                          n                      ⁢                                                                                          ⁢                      1                                                        ⁡                                      (                    ω                    )                                                                              …                                                                                  W                    nn                                    ⁡                                      (                    ω                    )                                                                                ]         
                           [        3.7        ]            
In Equation [3.1], ω is the frequency bin index and t is the frame index.
If ω is fixed, Equation [3.1] can be regarded as an instantaneous mixture (a mixture with no time lag). Accordingly, to separate the observed signals, Equation [3.5] for computing the separated results Y is prepared, and a separating matrix W(ω) is then determined such that the individual components of the separated results Y(ω, t) are maximally independent.
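The instantaneous-mixture view for a single frequency bin can be sketched numerically as follows. This is only an illustration under stated assumptions: the source spectra, the mixing matrix, the array sizes, and the variable names are all hypothetical, and the separating matrix is taken as the inverse of the known mixing matrix, which is not available in practice (ICA has to estimate W(ω) from the observations alone).

```python
import numpy as np

rng = np.random.default_rng(0)
n_src, n_mic, n_frames = 2, 2, 1000   # assumed sizes: N sources, n microphones, T frames

# Hypothetical complex source spectra S(omega, t) for one fixed bin omega: shape (N, T)
S = rng.laplace(size=(n_src, n_frames)) + 1j * rng.laplace(size=(n_src, n_frames))

# Hypothetical mixing matrix A(omega): shape (n, N)
A = rng.normal(size=(n_mic, n_src)) + 1j * rng.normal(size=(n_mic, n_src))

X = A @ S               # Equation [3.1]: X(omega, t) = A(omega) S(omega, t)
W = np.linalg.inv(A)    # ideal separating matrix, usable here only because A is known
Y = W @ X               # Equation [3.5]: Y(omega, t) = W(omega) X(omega, t)

print(np.allclose(Y, S))  # checks whether the separated results reproduce the sources
```

Because every frame t is multiplied by the same matrix, no time lag enters the model within one bin, which is what allows the separation to be written as a single matrix product.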
In the case of time-frequency domain ICA according to the related art, a so-called permutation problem occurs, in which “which component is separated into which channel” differs for each frequency bin. This permutation problem was almost entirely solved by the configuration disclosed in Japanese Unexamined Patent Application Publication No. 2006-238409 “AUDIO SIGNAL SEPARATING APPARATUS/NOISE REMOVAL APPARATUS AND METHOD”, which is a patent application previously filed by the same inventor as the present application. Since this method is also employed in an embodiment of the present invention, a brief description will be given of the technique for solving the permutation problem disclosed in Japanese Unexamined Patent Application Publication No. 2006-238409.
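The permutation problem can be illustrated with a small sketch. It assumes two channels and two frequency bins with hypothetical random sources and mixing matrices, and it imitates a learning result by hand: in one bin the separating matrix is the inverse of the mixing matrix, and in the other it is that inverse with its rows swapped. Both are equally valid when each bin is judged on independence alone.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 2, 500                       # assumed: 2 channels, 500 frames

def mix_one_bin():
    """Return hypothetical sources S, mixing matrix A, and X = A S for one bin."""
    S = rng.laplace(size=(n, T)) + 1j * rng.laplace(size=(n, T))
    A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return S, A, A @ S

P = np.array([[0, 1], [1, 0]])      # permutation matrix that swaps the two channels

# Bin 1: suppose learning returned the "straight" solution W = A^-1.
S1, A1, X1 = mix_one_bin()
Y1 = np.linalg.inv(A1) @ X1

# Bin 2: suppose learning returned the swapped solution W = P A^-1.
S2, A2, X2 = mix_one_bin()
Y2 = (P @ np.linalg.inv(A2)) @ X2

print(np.allclose(Y1[0], S1[0]))    # does output channel 1 hold source 1 in bin 1?
print(np.allclose(Y2[0], S2[1]))    # does output channel 1 hold source 2 in bin 2?
```

If both checks hold, output channel 1 contains a different source in each bin, so the per-bin results cannot simply be concatenated into one spectrogram per source.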
In Japanese Unexamined Patent Application Publication No. 2006-238409, to find a separating matrix W(ω), Equations [4.1] to [4.3] below are iterated until the separating matrix W(ω) converges (or a predetermined number of times).
\[ Y(\omega, t) = W(\omega)\, X(\omega, t) \quad (t = 1, \ldots, T;\ \omega = 1, \ldots, M) \qquad [4.1] \]

\[ \Delta W(\omega) = \left\{ I + E_t\!\left[ \varphi_\omega(Y(t))\, Y(\omega, t)^H \right] \right\} W(\omega) \qquad [4.2] \]

\[ W(\omega) \leftarrow W(\omega) + \eta\, \Delta W(\omega) \qquad [4.3] \]

\[ Y(t) = \begin{bmatrix} Y_1(1, t) \\ \vdots \\ Y_1(M, t) \\ \vdots \\ Y_n(1, t) \\ \vdots \\ Y_n(M, t) \end{bmatrix} = \begin{bmatrix} Y_1(t) \\ \vdots \\ Y_n(t) \end{bmatrix} \qquad [4.4] \]

\[ \varphi_\omega(Y(t)) = \begin{bmatrix} \varphi_\omega(Y_1(t)) \\ \vdots \\ \varphi_\omega(Y_n(t)) \end{bmatrix} \qquad [4.5] \]

\[ \varphi_\omega(Y_k(t)) = \frac{\partial}{\partial Y_k(\omega, t)} \log P(Y_k(t)) \qquad [4.6] \]

\[ P(Y_k(t)) : \text{probability density function (PDF) of } Y_k(t) \]

\[ P(Y_k(t)) \propto \exp\!\left( -\gamma \left\| Y_k(t) \right\|_2 \right) \qquad [4.7] \]

\[ \left\| Y_k(t) \right\|_m = \left\{ \sum_{\omega=1}^{M} \left| Y_k(\omega, t) \right|^m \right\}^{1/m} \qquad [4.8] \]

\[ \varphi_\omega(Y_k(t)) = -\gamma\, \frac{Y_k(\omega, t)}{\left\| Y_k(t) \right\|_2} \qquad [4.9] \]
In the following, such iteration will be referred to as “learning”. It should be noted, however, that Equations [4.1] to [4.3] are computed for all frequency bins, and further, Equation [4.1] is computed for all the frames of the accumulated (stored) observed signals. In addition, in Equation [4.2], Et[·] denotes the mean over all frames. The superscript H attached at the upper right of Y(ω, t) indicates the Hermitian transpose (taking the transpose of a vector or a matrix and replacing each element with its complex conjugate).
The separated results Y(t) are represented by Equation [4.4], and denote a vector in which the elements of all the channels and all the frequency bins of the separated results are arranged. Also, φω(Y(t)) is a vector represented by Equation [4.5]. Each element φω(Yk(t)) is called a score function, and is the logarithmic derivative of the multidimensional (multivariate) probability density function (PDF) of Yk(t) (Equation [4.6]). As the multidimensional PDF, for example, the function represented by Equation [4.7] can be used, in which case the score function φω(Yk(t)) can be represented as Equation [4.9]. It should be noted, however, that ∥Yk(t)∥2 is the L-2 norm of Yk(t) (obtained by summing the squared absolute values of all elements and then taking the square root of the resulting sum). The L-m norm, a generalized form of the L-2 norm, is defined by Equation [4.8]. In Equation [4.7] and Equation [4.9], γ denotes a term for adjusting the scale of Yk(ω, t), for which an appropriate positive constant, for example, sqrt(M) (the square root of the number of frequency bins), is substituted. In Equation [4.3], η is a small positive value (for example, about 0.1) called a learning rate or learning coefficient. It is used for gradually reflecting ΔW(ω) calculated in Equation [4.2] in the separating matrix W(ω).
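The learning iteration of Equations [4.1] to [4.3], with the score function of Equation [4.9], might be sketched as below. This is a minimal sketch rather than the implementation of the cited publication. The array layout (bins, channels, frames), the function names `score` and `learn`, the identity initial value of W(ω), the fixed iteration count, and the small constant `eps` are all assumptions introduced for illustration.

```python
import numpy as np

def score(Y, gamma, eps=1e-12):
    """Equation [4.9]: phi_omega(Y_k(t)) = -gamma * Y_k(omega, t) / ||Y_k(t)||_2.

    Y has shape (M, n, T): frequency bins, channels, frames.
    eps is an assumed guard against division by zero, not part of the source.
    """
    # L-2 norm over all frequency bins (Equation [4.8] with m = 2), per channel and frame
    norm = np.sqrt(np.sum(np.abs(Y) ** 2, axis=0, keepdims=True))
    return -gamma * Y / (norm + eps)

def learn(X, n_iter=100, eta=0.1):
    """Iterate Equations [4.1] to [4.3] a fixed number of times; X has shape (M, n, T)."""
    M, n, T = X.shape
    gamma = np.sqrt(M)                                  # example constant named in the text
    W = np.tile(np.eye(n, dtype=complex), (M, 1, 1))    # assumed initial value: identity per bin
    for _ in range(n_iter):
        Y = W @ X                                       # Eq. [4.1] for all bins and frames
        phi = score(Y, gamma)
        # E_t[phi_omega(Y(t)) Y(omega, t)^H]: mean over frames, shape (M, n, n)
        E = phi @ Y.conj().transpose(0, 2, 1) / T
        dW = (np.eye(n) + E) @ W                        # Eq. [4.2]
        W = W + eta * dW                                # Eq. [4.3]
    return W

# Hypothetical random data, used only to exercise the mechanics of the update.
rng = np.random.default_rng(0)
M, n, T = 4, 2, 2000
S = rng.laplace(size=(M, n, T)) + 1j * rng.laplace(size=(M, n, T))
A = rng.normal(size=(M, n, n)) + 1j * rng.normal(size=(M, n, n))
X = A @ S
W = learn(X, n_iter=50)
print(W.shape)
```

Note that the norm in `score` is taken over all frequency bins of one channel, so the update for each bin depends on the separated results of every other bin; this coupling is the point at which the multivariate PDF enters the computation.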
While Equation [4.1] represents separation in one frequency bin (see FIG. 3A), it is also possible to represent separation in all frequency bins by a single equation (see FIG. 3B).
This may be accomplished by using the separated results Y(t) for all frequency bins represented by Equation [4.4] described above, the observed signals X(t) represented by Equation [4.11] below, and the separating matrix W for all frequency bins represented by Equation [4.10]. By using these vectors and this matrix, separation can be represented by Equation [4.12]. According to an embodiment of the present invention, Equation [4.1] and Equation [4.12] are used selectively as necessary.
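The equivalence between the per-bin form of Equation [4.1] and the stacked form Y(t) = WX(t) can be checked numerically with a short sketch. The helper names `stack_separating_matrices` and `stack_signals`, the array layout (bins, channels, frames), and the random test data are assumptions made for illustration. The stacking order follows Equation [4.4]: all bins of channel 1 first, then all bins of channel 2, and so on.

```python
import numpy as np

def stack_separating_matrices(W):
    """Build the (n*M) x (n*M) matrix W from per-bin matrices W(omega).

    W has shape (M, n, n). Block (k, j) of the result is the diagonal
    matrix diag(W_kj(1), ..., W_kj(M)).
    """
    M, n, _ = W.shape
    W_big = np.zeros((n * M, n * M), dtype=W.dtype)
    for k in range(n):
        for j in range(n):
            W_big[k * M:(k + 1) * M, j * M:(j + 1) * M] = np.diag(W[:, k, j])
    return W_big

def stack_signals(X):
    """Reorder X from shape (M, n, T) into the stacked vector form of shape (n*M, T)."""
    M, n, T = X.shape
    return X.transpose(1, 0, 2).reshape(n * M, T)

# Hypothetical random data for the check.
rng = np.random.default_rng(2)
M, n, T = 3, 2, 5
W = rng.normal(size=(M, n, n)) + 1j * rng.normal(size=(M, n, n))
X = rng.normal(size=(M, n, T)) + 1j * rng.normal(size=(M, n, T))

Y_per_bin = W @ X                                              # Eq. [4.1], bin by bin
Y_stacked = stack_separating_matrices(W) @ stack_signals(X)    # single-equation form

print(np.allclose(Y_stacked, stack_signals(Y_per_bin)))
```

The stacked matrix is mostly zeros, so the per-bin form is the cheaper one to compute; the stacked form is mainly convenient as notation when an expression has to refer to all frequency bins at once.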
\[ W = \begin{bmatrix}
W_{11}(1) & & 0 & & W_{1n}(1) & & 0 \\
 & \ddots & & \cdots & & \ddots & \\
0 & & W_{11}(M) & & 0 & & W_{1n}(M) \\
 & \vdots & & \ddots & & \vdots & \\
W_{n1}(1) & & 0 & & W_{nn}(1) & & 0 \\
 & \ddots & & \cdots & & \ddots & \\
0 & & W_{n1}(M) & & 0 & & W_{nn}(M)
\end{bmatrix} \qquad [4.10] \]

\[ X(t) = \begin{bmatrix} X_1(1, t) \\ \vdots \\ X_1(M, t) \\ \vdots \\ X_n(1, t) \\ \vdots \\ X_n(M, t) \end{bmatrix} \qquad [4.11] \]

\[ Y(t) = W X(t) \qquad [4.12] \]

ΔW(ω) = {I − Et[Y(ω, t) Y(ω, t)^H] + Et[φω(Y(t)) Y(ω, t)^H] − Et[Y(ω, t)
                  ⁢                                                                                                    φ                            ω                                                    ⁡                                                      (                                                          Y                              ⁡                                                              (                                t                                )                                                                                      )                                                                          H                                                              ]                                                              }                        ⁢                          W              ⁡                              (                ω                )                                                                        [        4.13        ]                                                          ⁢                                            E              t                        ⁡                          [                                                Y                  ⁡                                      (                                          ω                      ,                      t                                        )                                                  ⁢                                                      Y                    ⁡                                          (                                              ω                        ,                        t                                            )                                                        H                                            ]                                =                                    W              ⁡                              (                ω                )                                      
⁢                                          E                t                            ⁡                              [                                                      X                    ⁡                                          (                                              ω                        ,                        t                                            )                                                        ⁢                                                            X                      ⁡                                              (                                                  ω                          ,                          t                                                )                                                              H                                                  ]                                      ⁢                                          W                ⁡                                  (                  ω                  )                                            H                                                          [        4.14        ]                                                          ⁢                                            E              t                        ⁡                          [                                                Y                  ⁡                                      (                                          ω                      ,                      t                                        )                                                  ⁢                                                                            φ                      ω                                        ⁡                                          (                                              Y                        ⁡                                                  (                          t                          )                                             
                       )                                                        H                                            ]                                =                                                    E                t                            ⁡                              [                                                                            φ                      ω                                        ⁡                                          (                                              Y                        ⁡                                                  (                          t                          )                                                                    )                                                        ⁢                                                            Y                      ⁡                                              (                                                  ω                          ,                          t                                                )                                                              H                                                  ]                                      H                                              [        4.15        ]                                          Y          ⁡                      (                          ω              ,              t                        )                          ←                              diag            (                                          1                                                                            E                      t                                        ⁡                                          [                                                                                                                                                            Y                              1                                                        ⁡            
                                              (                                                              ω                                ,                                t                                                            )                                                                                                                                2                                            ]                                                                                  ,              …              ⁢                                                          ,                              1                                                                            E                      t                                        ⁡                                          [                                                                                                                                                            Y                              n                                                        ⁡                                                          (                                                              ω                                ,                                t                                                            )                                                                                                                                2                                            ]                                                                                            )                    ⁢                      Y            ⁡                          (                              ω                ,                t                            )                                                          [        4.16        ]                                          W          ⁡                      (            ω            )                          ←                              diag            (   
                                       1                                                                            E                      t                                        ⁡                                          [                                                                                                                                                            Y                              1                                                        ⁡                                                          (                                                              ω                                ,                                t                                                            )                                                                                                                                2                                            ]                                                                                  ,              …              ⁢                                                          ,                              1                                                                            E                      t                                        ⁡                                          [                                                                                                                                                            Y                              n                                                        ⁡                                                          (                                                              ω                                ,                                t                                                            )                                                                                                                                2                                            ]                                                 
                                           )                    ⁢                      W            ⁡                          (              ω              )                                                          [        4.17        ]            
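Equations [4.14] and [4.15] follow directly from Y(ω, t)=W(ω)X(ω, t) and from the properties of the Hermitian transpose, so they can be checked numerically. The following is a minimal sketch for a single frequency bin with two channels. The random data, the helper names, and in particular the stand-in used for φω(Y(t)) are assumptions made only for illustration; they are not the score function defined in this specification.

```python
import random


def apply(W, x):
    """Multiply matrix W (nested lists) by column vector x."""
    return [sum(w * xc for w, xc in zip(row, x)) for row in W]


def mean_outer(a_frames, b_frames):
    """Average over frames of a(t) b(t)^H, i.e. E_t[a b^H]."""
    T = len(a_frames)
    n, m = len(a_frames[0]), len(b_frames[0])
    return [[sum(a[r] * b[c].conjugate() for a, b in zip(a_frames, b_frames)) / T
             for c in range(m)] for r in range(n)]


def matmul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]


def hermitian(A):
    """Conjugate transpose of A."""
    return [[A[c][r].conjugate() for c in range(len(A))]
            for r in range(len(A[0]))]


def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


rng = random.Random(0)


def crand():
    return complex(rng.gauss(0, 1), rng.gauss(0, 1))


W = [[crand() for _ in range(2)] for _ in range(2)]   # arbitrary separating matrix
X = [[crand() for _ in range(2)] for _ in range(50)]  # 50 frames of observed signals
Y = [apply(W, x) for x in X]                          # Y = W X for each frame

# Hypothetical stand-in for phi_omega(Y(t)); any function of Y works for [4.15].
phi = [[-y / abs(y) for y in frame] for frame in Y]

lhs_414 = mean_outer(Y, Y)
rhs_414 = matmul(matmul(W, mean_outer(X, X)), hermitian(W))
lhs_415 = mean_outer(Y, phi)
rhs_415 = hermitian(mean_outer(phi, Y))
```

Both sides of each identity agree up to floating-point rounding. In particular, Equation [4.14] shows that the covariance of the separated results can be obtained from the covariance of the observed signals without recomputing Y(ω, t) for every frame.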
The diagrams of X1 to Xn and Y1 to Yn shown in FIGS. 3A and 3B are called spectrograms, in which the results of short-time Fourier transform (STFT) are arranged in the frequency bin direction and the frame direction. The vertical direction represents the frequency bin, and the horizontal direction represents the frame. While lower frequencies are drawn at the top in Equation [4.4] and Equation [4.11], lower frequencies are drawn at the bottom in the spectrograms.
ICA is a type of Blind Source Separation (BSS), and has such characteristics that in the separation process, no knowledge about signal sources (the frequency distributions, the directions of arrival, and the like of sound sources) is necessary, and separation can be performed even without any prior knowledge about the sound velocity, microphone spacing, microphone gain, transfer function, and the like. Further, an assumption other than independence (for example, sparseness or the like) is also unnecessary, and separation is possible even when a plurality of sound sources keep playing and mixtures are occurring in the frequency domain (that is, even when the assumption of sparseness does not hold). These characteristics prove advantageous for the scheme that involves “sound source separation after direction estimation” described later.
[A3. Sound Source Direction Estimation Using ICA]
A scheme for estimating the sound source direction by using ICA is described in, for example, Japanese Patent No. 3881367.
Hereinbelow, a description will be given of a method of finding the sound source direction from the separating matrix W(ω), as disclosed in Japanese Patent No. 3881367 mentioned above. The separating matrix W(ω) obtained after learning has converged represents the inverse of the mixing process expressed by a mixing matrix A(ω). Therefore, if the inverse matrix W(ω)^{-1} of W(ω) is found, it can be regarded as indicating the transfer functions corresponding to the frequency bin ω.
As shown in Equation [5.1] below, let the inverse matrix of the separating matrix W(ω) be B(ω), and further, individual elements of B(ω) be defined as follows.
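\[ B(\omega) = W(\omega)^{-1} \qquad [5.1] \]

\[ B(\omega) = \begin{bmatrix} B_{11}(\omega) & \cdots & B_{1n}(\omega) \\ \vdots & \ddots & \vdots \\ B_{n1}(\omega) & \cdots & B_{nn}(\omega) \end{bmatrix} \qquad [5.2] \]

\[ X_{mk}(\omega,t) = B_{mk}(\omega)\, Y_k(\omega,t) \qquad [5.3] \]

\[ \hat{\theta}_{imk}(\omega) = \operatorname{asin}\!\left( \frac{(M-1)\,C}{\pi(\omega-1)\, d_{im}\, F}\, \operatorname{angle}\!\left( B_{ik}(\omega)\, \overline{B_{mk}(\omega)} \right) \right) \qquad [5.4] \]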
Here, attention is directed to the k-th row (horizontal array) of the separating matrix W(ω) and the k-th column (vertical array) of its inverse matrix B(ω) (see FIG. 4A). The k-th row of the separating matrix W(ω) is a vector that yields the separated results (=estimated sound sources) Yk(ω, t) for the k-th channel from a vector X(ω, t) of observed signals, whereas the k-th column of the inverse matrix B(ω), which serves as an estimated transfer function matrix, represents the transfer functions (see FIG. 4B) from the estimated sound sources Yk(ω, t) to the individual microphones.
That is, of the observed signals Xm(ω, t) observed at the m-th microphone, letting the components derived from the sound sources Yk(ω, t) be written as component signals Xmk(ω, t), the component signals Xmk(ω, t) can be represented as Equation [5.3] below.

\[ X_{mk}(\omega,t) = B_{mk}(\omega)\, Y_k(\omega,t) \qquad [5.3] \]
By using the inverse matrix B(ω) of the separating matrix W(ω), the components of the observed signals derived from individual sound sources can be found. Therefore, the sound source direction can be computed by the same method as in the case of “A1. Sound Source Direction Estimation Using Phase Difference Between Microphones” described at the beginning of this specification.
For example, to find hat(θimk(ω)) as the sound source direction of the separated results Yk(ω, t) by using the i-th microphone and the m-th microphone, the component signals Xik(ω, t) and Xmk(ω, t) computed from Equation [5.3] mentioned above may be substituted for the short-time Fourier transformed observed signals X1(ω, t) and X2(ω, t) corresponding to the observed signals at the two microphones in Equation [1.4] described above. As a result, Equation [5.4] below is obtained.
\[ \hat{\theta}_{imk}(\omega) = \operatorname{asin}\!\left( \frac{(M-1)\,C}{\pi(\omega-1)\, d_{im}\, F}\, \operatorname{angle}\!\left( B_{ik}(\omega)\, \overline{B_{mk}(\omega)} \right) \right) \qquad [5.4] \]
In this equation, dim denotes the distance between the i-th microphone and the m-th microphone.
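The computation of Equations [5.1] and [5.4] can be sketched as follows for two microphones and two sound sources. This is a minimal illustration rather than the implementation of the cited patent. It assumes that M is the number of frequency bins, F is the sampling frequency, C is the sound velocity, and ω is a 1-based frequency bin index, as these symbols are presumably defined with Equation [1.4]. The function names and the plane-wave mixing matrix used to generate test data are hypothetical, and the sign convention of the phase simply follows Equation [5.4] as written.

```python
import cmath
import math


def inverse_2x2(W):
    """Invert a 2x2 complex matrix given as nested lists (Equation [5.1])."""
    (a, b), (c, d) = W
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def estimate_direction(W, omega, k, i, m, d_im, M, C, F):
    """Equation [5.4]: direction (radians) of the k-th separated result,
    using microphones i and m. k, i, m are 0-based here; omega is 1-based
    and must be greater than 1 (the lowest bin carries no phase information).
    """
    B = inverse_2x2(W)
    phase = cmath.phase(B[i][k] * B[m][k].conjugate())
    s = (M - 1) * C / (math.pi * (omega - 1) * d_im * F) * phase
    # Clip to the domain of asin to guard against rounding or model error.
    return math.asin(max(-1.0, min(1.0, s)))


def plane_wave_mixing(thetas, omega, d, M, C, F, gains=(1.0, 1.0)):
    """Hypothetical mixing matrix for plane waves arriving from `thetas`,
    built so that it is consistent with Equation [5.4]. `gains` models
    per-microphone gain differences.
    """
    scale = math.pi * (omega - 1) * d * F / ((M - 1) * C)
    row0 = [gains[0] * cmath.exp(1j * scale * math.sin(th)) for th in thetas]
    row1 = [gains[1] + 0j for _ in thetas]
    return [row0, row1]


# Example values (assumptions): 257 bins, 16 kHz sampling, bin 33 (1 kHz),
# microphones 5 cm apart, sound velocity 340 m/s.
params = dict(omega=33, M=257, C=340.0, F=16000.0)
A = plane_wave_mixing([0.5, -0.3], d=0.05, gains=(2.0, 0.5), **params)
W = inverse_2x2(A)  # an ideal separating matrix for this mixing process
estimates = [estimate_direction(W, k=k, i=0, m=1, d_im=0.05, **params)
             for k in range(2)]
```

In this sketch the two microphones are given different gains, yet the estimated directions match the directions used to build the mixing matrix, because only the phase of Bik(ω) multiplied by the conjugate of Bmk(ω) enters Equation [5.4].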
It should be noted that in Equation [5.4] mentioned above, the frame index [t] has disappeared. That is, as long as the separated results Y(ω, t)=W(ω)X(ω, t) are calculated using the same separating matrix W(ω), the same sound source direction is computed for any frame t.
Unlike in the case of “A1. Sound Source Direction Estimation Using Phase Difference Between Microphones”, the above-described method of computing the sound source direction from the inverse matrix of a separating matrix of ICA has an advantage in that even when a plurality of sound sources are playing simultaneously, the directions of the individual sound sources can be found.
In addition, there is also an advantage in that since the separated results and the sound source direction are both calculated from the separating matrix W(ω), the correspondence in terms of which channel of the separated results is associated with which direction is established from the beginning. That is, in a case where sound source separation means and sound source direction estimating means other than ICA are used in combination as multiple-source direction estimating means, depending on the combination of schemes, it is necessary to establish the correspondence between channels separately in subsequent stages, and there is a possibility that an error may occur at that time. In schemes using ICA, however, it is unnecessary to establish such correspondence in subsequent stages.
As for the sound velocity C, the microphone spacing dim, and the like which appear in Equation [5.4], the values used in the equation differ from those in the actual environment in some cases. In such cases, in schemes using ICA, although an estimation error occurs in the sound source direction itself, the error in direction does not affect the accuracy of sound source separation. In addition, microphone gain is subject to individual differences among microphones, and depending on the sound source separation or direction estimation scheme, such variations in gain adversely affect the accuracy of separation or estimation in some cases. In schemes using ICA, however, variations in microphone gain affect neither the separation performance nor the direction estimation.
[A4. Real-Time Implementation of ICA]
The learning process described in the section “A2. Description of ICA”, in which Equation [4.1] to Equation [4.3] are iterated until the separating matrix W(ω) converges (or for a predetermined number of times), is performed as a batch process. That is, as described above, learning refers to the process of iterating Equation [4.1] to Equation [4.3] after the whole of the observed signals has been charged (accumulated).
This batch process can be applied to real-time (low-delay) sound source separation through some contrivance. As an example of a sound source separation process realizing a real-time processing scheme, a description will be given of the configuration disclosed in “Japanese Patent Application No. 2006-331823: REAL-TIME SOUND SOURCE SEPARATION APPARATUS AND METHOD”, which is a patent application previously filed by the same inventor as the present application.
As shown in FIG. 5, the processing scheme disclosed in Japanese Patent Application No. 2006-331823 splits observed signal spectrograms into a plurality of overlapped blocks 1 to N, and learning is performed for each block to find a separating matrix. The reason why the blocks are overlapped is to achieve both the accuracy and the frequency of updates of the separating matrix.
In the case of real-time ICA (blockwise ICA) disclosed prior to Japanese Patent Application No. 2006-331823, blocks are not overlapped. Therefore, to shorten the update interval of the separating matrix, it is necessary to shorten the block length (=the time for which observed signals are charged). However, there is a problem in that a shorter block length results in lower separation accuracy.
A separating matrix found from each block is applied to subsequent observed signals (not applied to the same block) to generate the separated results. Herein, such a scheme will be referred to as “staggered application”.
FIG. 6 illustrates the “staggered application”. Suppose that at the current time, observed signals X(t) 62 in the t-th frame are inputted. At this point, the separating matrix corresponding to the block containing the observed signals X(t) (for example, an observed signal block 66 containing the current time) has not been obtained yet. Accordingly, instead of the separating matrix for the block 66, the observed signals X(t) are multiplied by the separating matrix learned from a learning data block 61 that precedes the block 66, thereby generating the separated results corresponding to X(t), that is, separated results Y(t) 64 at the current time. It is assumed that the separating matrix learned from the learning data block 61 has already been obtained at the point in time of the frame t.
As described above, a separating matrix is considered to represent a process reverse to a mixing process.
Hence, if the mixing process is the same (for example, if the positional relation between sound sources and microphones has not changed) between the observed signals in the learning data block setting segment 61 and the observed signals 62 at the current time, signal separation can be performed even when a separating matrix learned in a different segment is applied, thereby making it possible to realize separation with little delay.
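The “staggered application” can be sketched as follows. This is a simplified illustration under several assumptions: the blocks are not overlapped, learning is treated as finishing instantly at the end of each block, and `learn` is a hypothetical callable standing in for the iteration of Equations [4.1] to [4.3]. The point shown is only that each frame is separated with a matrix learned from earlier data, never from the block the frame itself belongs to.

```python
def apply_matrix(W, x):
    """Y(t) = W X(t) for one frame, with W as nested lists."""
    return [sum(w * xc for w, xc in zip(row, x)) for row in W]


def staggered_separation(frames, learn, block_len, initial_W):
    """Separate a stream of frames with staggered application.

    learn(block, W_init) is a hypothetical learning routine that returns a
    separating matrix for the charged block, starting from W_init.
    """
    W = initial_W
    block = []
    outputs = []
    for x in frames:
        # Separate the current frame with the matrix from an earlier block.
        outputs.append(apply_matrix(W, x))
        block.append(x)  # charge the observed signal
        if len(block) == block_len:
            # The new matrix becomes available only for subsequent frames.
            W = learn(block, W)
            block = []
    return outputs


# Stub learner (assumption): returns a scaled identity so that the matrix
# used for each frame can be identified in the output.
def stub_learn(block, W_init):
    s = W_init[0][0] + 1
    return [[s, 0], [0, s]]


identity = [[1, 0], [0, 1]]
result = staggered_separation([[1, 1]] * 5, stub_learn, block_len=2,
                              initial_W=identity)
```

With this stub, the first two frames are separated with the initial value, the next two with the matrix learned from the first block, and the fifth frame with the matrix learned from the second block.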
A process of obtaining a separating matrix from overlapped blocks can be executed, for example, on a general PC by a computation according to a predetermined program. The configuration disclosed in Japanese Patent Application No. 2006-331823 proposes a scheme in which a plurality of processing units called threads for finding a separating matrix from overlapped blocks are run in parallel at staggered times. This parallel processing scheme will be described with reference to FIG. 7.
FIG. 7 shows the transitions of processing over time of individual threads serving as the units of processing. FIG. 7 shows six threads, Threads 1 to 6. Each thread repeats three states of A) Charging, B) Learning, and C) Waiting. That is, the thread length corresponds to the total time length of the three processes of A) Charging, B) Learning, and C) Waiting. Time progresses from left to right in FIG. 7.
The “A) Charging” is the segment of dark gray in FIG. 7. When in this state, a thread charges observed signals. The overlapped blocks in FIG. 5 can be expressed by staggering the charging start times between threads. Since the charging start time is staggered by ¼ of the charging time in FIG. 7, assuming that the charging time in one thread is, for example, four seconds, the staggered time between threads equals one second.
Upon charging observed signals for a predetermined time (for example, four seconds), each thread transitions in state to “B) Learning”. The “B) Learning” is the segment of light gray in FIG. 7. When in this state, Equations [4.1] to [4.3] described above are iterated with respect to the charged observed signals.
Once the separating matrix W has sufficiently converged (or simply upon reaching a predetermined number of iterations) by learning (iteration of Equations [4.1] to [4.3]), the learning is ended, and the thread transitions to the “C) Waiting” state (the white segment in FIG. 7). The “Waiting” is provided for keeping the charging start time and the learning start time at a constant interval between threads. As a result, the learning end time (=the time at which the separating matrix is updated) is also kept substantially constant.
The separating matrix W obtained by learning is used for performing separation until learning in the next thread is finished. That is, the separating matrix W is used as a separating matrix 63 shown in FIG. 6. A description will be given of the separating matrix used in each of applied-separating-matrix specifying segments 71 to 73 along the progression of time shown at the bottom of FIG. 7.
In the applied-separating-matrix specifying segment 71 from when the system is started to when the first separating matrix is learned, an initial value (for example, a unit matrix) is used as the separating matrix 63 in FIG. 6. In the segment 72 from when learning in Thread 1 is finished to when learning in Thread 2 is finished, a separating matrix derived from an observed-signal charging segment 74 in Thread 1 is used as the separating matrix 63 shown in FIG. 6. The numeral “1” shown in the segment 72 in FIG. 7 indicates that the separating matrix W used in this period is obtained through processing in Thread 1. The numerals on the right from the applied-separating-matrix specifying segment 72 also each indicate from which thread the corresponding separating matrix is derived.
If a separating matrix obtained in another thread exists at the point of starting learning, that separating matrix is used as the initial value of learning. This will be referred to as “inheritance of a separating matrix”. In the example shown in FIG. 7, at learning start timing 75 at which the first learning is started in Thread 3, the separating matrix derived from Thread 1 (the one applied in the segment 72) is already obtained, so this separating matrix is used as the initial value of learning.
By performing such processing, it is possible to prevent or reduce the occurrence of permutation between threads. Permutation between threads refers to, for example, a problem such that in the separating matrix obtained in the first thread, speech is outputted on the first channel and music is outputted on the second channel, whereas the reverse is true for the separating matrix obtained in the second thread.
As described above with reference to FIG. 7, permutation between threads can be reduced by performing “inheritance of a separating matrix” so that when a separating matrix that has been obtained in another thread exists, the separating matrix is used as the initial value of learning. In addition, even when a separating matrix has not sufficiently converged by learning in Thread 1, the degree of convergence can be improved as the separating matrix is inherited by the next thread.
By running a plurality of threads at staggered times in this way, the separating matrix is updated at an interval substantially equal to a shift between threads, that is, a block shift width 76.
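The thread schedule and the “inheritance of a separating matrix” described above can be illustrated with a minimal Python sketch. The sketch is not taken from the related art: the names (Block, make_schedule, latest_finished) and the timing values (a learning time of 3.5 seconds, a block shift width of 3 seconds, three threads) are assumptions chosen for illustration, and only the four-second charging time comes from the example above. No learning is actually performed; only the timing is modelled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    index: int           # 0-based block number
    thread: int          # 1-based number of the thread that handles the block
    charge_start: float  # start of "A) Charging" [s]
    learn_start: float   # end of charging = start of "B) Learning" [s]
    learn_end: float     # time at which the new separating matrix is available [s]


def make_schedule(n_blocks: int, n_threads: int = 3, charge_len: float = 4.0,
                  learn_time: float = 3.5, shift: float = 3.0) -> List[Block]:
    """Lay out blocks whose start times are staggered by the block shift width."""
    # A thread must finish charging and learning before its next block starts;
    # whatever time is left over is the "C) Waiting" state.
    if charge_len + learn_time > n_threads * shift:
        raise ValueError("a thread would overlap with its own next block")
    blocks = []
    for k in range(n_blocks):
        start = k * shift
        blocks.append(Block(k, k % n_threads + 1, start,
                            start + charge_len,
                            start + charge_len + learn_time))
    return blocks


def latest_finished(blocks: List[Block], t: float) -> Optional[Block]:
    """Block whose separating matrix is the newest one available at time t.

    None means that no learning has finished yet, so the initial value
    (for example, a unit matrix) is still in use.
    """
    done = [b for b in blocks if b.learn_end <= t]
    return max(done, key=lambda b: b.learn_end) if done else None


blocks = make_schedule(n_blocks=4)
for b in blocks:
    parent = latest_finished(blocks, b.learn_start)  # source of inheritance
    origin = "initial value" if parent is None else f"Thread {parent.thread}"
    print(f"Thread {b.thread}: learning {b.learn_start}-{b.learn_end} s, "
          f"initial matrix from {origin}")
```

With these assumed values, the initial value is applied until 7.5 seconds, the learning in Thread 3 inherits the separating matrix from Thread 1 because learning in Thread 2 has not yet finished at that point, and the applied separating matrix changes every 3 seconds, that is, at the block shift width.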
[B. Problems of the Related Art]
When the “A3. Sound Source Direction Estimation Using ICA” and “A4. Real-Time Implementation of ICA” described above are combined, real-time sound source direction estimation adapted to multiple sound sources can be performed.
That is, it is possible to find the sound source direction by applying Equations [5.1] to [5.4] described above to each of separating matrices obtained from individual threads (such as the separating matrices used in the segments 72, 73, and so on shown in FIG. 7).
However, simply combining the two processes will give rise to several problems. Hereinbelow, these problems will be described from the following five viewpoints.
B1. Delay
B2. Tracking Lag
B3. Singularity of Separating Matrix and Zero Elements
B4. Difference in Purpose
B5. Flexibility of Parameter Settings
Hereinbelow, a description will be given with respect to each of the above viewpoints.
[B1. Delay]
Although a delay in sound source separation can be almost entirely eliminated by employing the “staggered application” described above with reference to FIG. 6, a delay in sound source direction estimation still remains as long as the direction is computed from a separating matrix. That is, in FIG. 6, the separated results corresponding to the observed signals X(t) 62 at the current time are obtained by multiplying X(t) by the separating matrix 63. That is, as the separated results 64 at the current time in FIG. 6, the following separated results are obtained.
Y(t)=WX(t)
However, the sound source direction computed from the separating matrix 63 corresponds to the learning data block 61, which is a preceding segment. In other words, the sound source direction corresponding to the observed signals 62 at the current time is not obtained until a new separating matrix has been learned from the block 66 containing the current time.
Assuming that the sound source direction is the same between the segment of the learning data block 61 and the current time, and employing the sound source direction computed from the separating matrix 63 as the sound source direction of Y(t), it appears that no delay occurs (this process will be referred to as “shifting of time point”). However, when the sound sources have changed between the segment of the learning data block 61 and the current time (when the sound sources have moved or started playing suddenly), a mismatch occurs between the separating matrix and the observed signals, and thus an inaccurate sound source direction is given. Moreover, it is not possible to determine if such a mismatch has occurred from the separating matrix alone.
In addition, when the “staggered application” is performed, for the same reason as that for the occurrence of a delay, strict correspondence is no longer established between the sound source direction and the separated results. For example, the separated results strictly corresponding to the sound source direction computed from the separating matrix (63) are generated by multiplying the observed signals in the segment of the learning data block 61 used for learning of the separating matrix, by the separating matrix 63. However, in the case of “staggered application”, multiplication between the separating matrix 63 and the observed signals in the learning data block 61 is not performed. Neither the separated results 64 at the current time, which are obtained as the product of the separating matrix 63 and the observed signals at the current time, nor data in a separated result spectrogram segment 65 having the same start and end times on the separated result side as those of the learning data block 61 strictly corresponds to the sound source direction computed from the separating matrix 63.
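The relation Y(t)=WX(t) under “staggered application”, and the mismatch that arises when the sound sources change after the learning data block, can be illustrated with a minimal NumPy sketch. Everything in it is an assumption for illustration: the function name separate_frame, the array shapes (one separating matrix per frequency bin), the random mixing matrices, and in particular the use of the exact inverse of the mixing matrix as a stand-in for a fully converged separating matrix (actual ICA recovers the sources only up to scaling and permutation).

```python
import numpy as np


def separate_frame(W: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Apply per-bin separating matrices to one frame: Y[f] = W[f] @ X[f].

    W has shape (n_bins, n_ch, n_ch); X has shape (n_bins, n_ch).
    """
    return np.einsum("fij,fj->fi", W, X)


rng = np.random.default_rng(0)
n_bins, n_ch = 4, 2


def random_complex(shape):
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)


# Mixing during the earlier learning data block (hypothetical values).
A_old = random_complex((n_bins, n_ch, n_ch))
# Idealised stand-in for the separating matrix learned from that block.
W = np.linalg.inv(A_old)

# Source spectra of the current frame.
S = random_complex((n_bins, n_ch))

# Case 1: the sources have not changed since the learning data block.
X_same = np.einsum("fij,fj->fi", A_old, S)
err_same = np.max(np.abs(separate_frame(W, X_same) - S))

# Case 2: the sources have changed (different mixing), but W is still the old one.
A_new = random_complex((n_bins, n_ch, n_ch))
X_changed = np.einsum("fij,fj->fi", A_new, S)
err_changed = np.max(np.abs(separate_frame(W, X_changed) - S))

print(f"separation error, unchanged sources: {err_same:.2e}")
print(f"separation error, changed sources:   {err_changed:.2e}")
```

In both cases the same W is applied, so a direction computed from W alone would be identical in the two cases; this is the sense in which the mismatch is not detectable from the separating matrix alone.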
[B2. Tracking Lag]
When “staggered application” is employed, a mismatch occurs temporarily if the sound sources have changed (if the sound sources have moved or started playing suddenly) between the segment used for learning of a separating matrix (for example, the learning data block 61) and the observed signals 62 at the current time. Thereafter, as a separating matrix corresponding to the changed sound sources is obtained by learning, such a mismatch disappears eventually. This phenomenon will be herein referred to as “tracking lag”. Although tracking lag does not necessarily exert an adverse influence on the performance of sound source separation, even in that case, tracking lag sometimes exerts an adverse influence on the performance of direction estimation. Hereinbelow, a description will be given in this regard.
FIG. 8 is a conceptual diagram of tracking lag. Suppose that one sound source starts playing at given time (t) (for example, when someone suddenly starts talking). Other sound sources may either keep playing or remain silent before and after the time at which the sound source starts playing. With respect to the segments before and after the start of play, the degree of convergence of learning is shown in (a) Graph 1.
The degree of mismatch between the observed signals and the separating matrix is shown in (b) Graph 2.
First, (a) Graph 1 will be described. Provided that an inactive segment in which a sound source of interest is inactive (silent) has continued for a while, the separating matrix learned from observed signals in that segment (for example, an observation segment 81a) has converged sufficiently. This degree of convergence corresponds to convergence degree data 85a in (a) Graph 1.
Even when a sound source starts playing at time (t), due to the “staggered application”, a separating matrix learned in the “inactive” segment is used for some time. For example, a separating matrix learned in the “inactive” segment is used in a data segment 85b in (a) Graph 1 as well. The length of the data segment 85b is longer than the learning time (for example, the time length indicated at 77 in FIG. 7) and shorter than or equal to the sum of the block shift width (for example, the time length indicated at 76 in FIG. 7) and the learning time, and is determined by the relationship between the timing at which learning starts and the timing at which the sound starts playing.
Thereafter, a separating matrix learned from an observation segment including the “sound-source active” segment begins to be used. For example, in a data segment 85c of the convergence degree data in (a) Graph 1, the separating matrix learned from the observation segment 81b is used. It should be noted, however, that if the number of learning loops is set small (for example, if learning is aborted after 20 loops to shorten the learning time, even though learning takes 100 loops to converge), the separating matrix has not converged completely. Therefore, the degree of convergence in the data segment 85c in (a) Graph 1 is low.
Thereafter, the separating matrix converges as the “inheritance of a separating matrix” (see “A4. Real-Time Implementation of ICA” described above) is repeated between threads. This occurs in the sequence of data segments 85c, 85d, 85e, and 85f of the convergence degree data in (a) Graph 1. It is assumed that the separating matrix has converged sufficiently in the data segment 85f and thereafter.
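The bounds stated above for the length of the data segment 85b can be checked with a small worked example in Python. The timing model is an assumption made for illustration: learning for block k is taken to start at k times the block shift width, a block is taken to “include” the onset only if its charging ends strictly after the onset, and the function name stale_matrix_duration and the numerical values are hypothetical.

```python
import math


def stale_matrix_duration(onset: float, shift: float, learn_time: float) -> float:
    """Time from the sound onset until a separating matrix learned from a
    block that includes the onset comes into use.

    Assumed model: charging of block k ends (and learning starts) at k * shift,
    and the learned matrix becomes available learn_time later.
    """
    # First block whose charging ends strictly after the onset.
    k = math.floor(onset / shift) + 1
    return k * shift + learn_time - onset


shift, learn_time = 1.0, 0.25  # hypothetical values, in seconds
for onset in (2.0, 2.5, 2.75):
    d = stale_matrix_duration(onset, shift, learn_time)
    print(f"onset {onset} s -> old matrix used for {d} s")
```

Under this model the duration is longest (block shift width plus learning time) when the sound starts just as learning starts, and approaches the learning time alone when the sound starts just before learning starts.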
Next, a description will be given of (b) Graph 2 showing the degree of mismatch between the observed signals and the separating matrix. This is a conceptual graph indicating how much mismatch there is between a separating matrix and observed signals at the current time. The degree of mismatch becomes high in one of the following two cases.
(1) There has been a change in sound source between the segment used for learning of the separating matrix and the current time.
(2) The separating matrix has not completely converged.
Although the separating matrix has converged in a data segment 88b of the mismatch degree data, the degree of mismatch becomes high due to the reason (1). As the separating matrix gradually converges in data segments 88c to 88f, the degree of mismatch also decreases.
Even when a mismatch occurs, this does not necessarily exert an adverse influence on sound source separation. For example, when only one major sound source is playing, that is, when little more than background noise is present in the observation segment 81 that is a sound-source inactive segment, there is no mixing of sounds in the first place, so even if a mismatch occurs in the separating matrix, this does not affect sound source separation. In addition, in cases where observed signals are separated into a target sound and an interference sound and only the target sound is used, if the interference sound is a continuously playing sound and the target sound is an intermittent sound (a sound that starts playing suddenly), there is no influence on the separated results of the target sound. Specifically, the continuous sound keeps being outputted to only one of the channels, irrespective of the presence or absence of another sound. On the other hand, when another sound source suddenly starts playing, although the sound is initially outputted to all output channels (in the segment 88d), as the degree of mismatch becomes lower, the sound comes to be outputted to only one channel, which is different from the channel of the continuous sound. Therefore, if the interference sound is continuous, an output channel in which the interference sound is suppressed exists throughout the time both before and after the target sound plays, and if such a channel is selected, removal of the interference sound succeeds.
On the other hand, with respect to sound source direction estimation, the mismatch between a separating matrix and observed signals exerts an adverse influence. For example, in the data segment 88b shown in FIG. 8, if the “shifting of time point” described in “B1. Delay” is performed, an inaccurate direction (a direction computed not from the sound-source active segment but from the sound-source inactive segment) is given. In addition, since the separating matrix has not converged yet in the data segments 88c and 88d, there is a possibility that a direction computed from the separating matrix in those segments is also inaccurate.
If only those cases are considered in which there is a single sound source, the sound source direction can be computed without tracking lag in the case of “A1. Sound Source Direction Estimation Using Phase Difference Between Microphones” described above. That is, employing “real-time ICA plus sound source direction estimation from a separating matrix” in an environment where there is only a single sound source presents more problems than the scheme of A1.
[B3. Singularity of Separating Matrix and Zero Elements]
In the system disclosed in Japanese Patent No. 3881367 described in “A3. Sound Source Direction Estimation Using ICA”, the inverse matrix of a separating matrix is used. Thus, if the separating matrix is close to singular, that is, the determinant of the separating matrix is a value close to 0, an inverse matrix is not obtained properly, and there is a possibility of divergence. In addition, even if an inverse matrix exists, if any one of the elements of the matrix is 0, it is not possible for the function [angle( )] to return a valid value in Equation [5.4] described above.
For example, suppose that at system start-up, a unit matrix is substituted for the separating matrix (that is, a unit matrix is used in the segment 71 shown in FIG. 7). While an inverse matrix exists for a unit matrix (the unit matrix itself), all elements except the diagonal elements of the inverse matrix are 0. Thus, the sound source direction is not found from Equation [5.4] described above. Therefore, in the case of the scheme that finds the sound source direction from the inverse matrix of a separating matrix, the sound source direction is not found for the segment 71 in which a unit matrix is substituted for the separating matrix.
At times other than immediately after start-up as well, if the separating matrix becomes close to singular or if its inverse matrix contains zero elements, a correct sound source direction can no longer be found. In addition, if the number of channels (the number of microphones) increases, the computational cost of finding the inverse matrix also increases. Thus, it would be desirable if a similar operation could be performed without using an inverse matrix.
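The two failure cases discussed in this section can be illustrated with a minimal NumPy sketch. Equation [5.4] is not reproduced here; the sketch only assumes that the direction is derived from the phase of the ratio between two elements (two microphones) in one column of the inverse of the separating matrix. The function name column_phase_difference and all numerical values are hypothetical.

```python
import numpy as np


def column_phase_difference(W: np.ndarray, source: int, mic_a: int, mic_b: int):
    """Phase of B[mic_a, source] / B[mic_b, source], where B = inv(W).

    Returns None when either element is numerically zero, because the phase
    of such a ratio carries no information about the direction.
    """
    B = np.linalg.inv(W)
    num, den = B[mic_a, source], B[mic_b, source]
    if np.isclose(num, 0) or np.isclose(den, 0):
        return None
    return float(np.angle(num / den))


# Case 1: unit matrix used as the initial separating matrix.
W_unit = np.eye(2, dtype=complex)
print(column_phase_difference(W_unit, source=0, mic_a=1, mic_b=0))  # no valid phase

# Case 2: a well-conditioned separating matrix (inverse of a hypothetical
# mixing matrix with a phase difference of -0.5 rad for source 0).
A = np.array([[1.0, 1.0],
              [np.exp(-0.5j), np.exp(0.3j)]])
W_good = np.linalg.inv(A)
print(column_phase_difference(W_good, source=0, mic_a=1, mic_b=0))

# Case 3: a nearly singular separating matrix; the large condition number
# indicates that its inverse is numerically unreliable.
W_bad = np.array([[1.0, 1.0],
                  [1.0, 1.0 + 1e-12]])
print(np.linalg.cond(W_bad))
```

A guard of this kind only detects the problem; it does not supply a direction for such segments, which is why a scheme that avoids the inverse matrix is desirable.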
[B4. Difference in Purpose]
In Japanese Patent No. 3881367 described in “A3. Sound Source Direction Estimation Using ICA” above, the direction is found from a separating matrix of ICA. The main purpose of this process is to overcome the permutation problem between frequency bins. That is, sound source directions are computed for individual frequency bins, and from those sound source directions, channels having substantially the same sound source direction are grouped between frequency bins, thereby aligning permutation between frequency bins.
After Japanese Patent No. 3881367, for example, in Japanese Unexamined Patent Application Publication No. 2006-238409, a scheme that is relatively permutation-free was introduced. However, combining this scheme with Japanese Patent No. 3881367 does not improve the accuracy of direction estimation.
[B5. Flexibility of Parameter Settings]
When computing the sound source direction from a separating matrix of ICA, the update frequency of the direction and the segment to which it corresponds coincide with the update frequency of the separating matrix and the learning data segment, and separate settings are not allowed. That is, the direction update interval substantially coincides with the block shift width 76 shown in FIG. 7, and the direction thus obtained is an average over the observed signals within the block (for example, the data segment 71 shown in FIG. 7).
It is necessary to determine the block shift width, the charging segment length, and further the number of learning loops (which affect the learning time) and the like by taking the separation performance, CPU power, and the like into account. Thus, what is optimal from the viewpoint of separation may not necessarily be optimal from the viewpoint of direction estimation.
The above-mentioned sound source direction estimation using ICA is a scheme which performs direction estimation after performing sound source separation (that is, ICA). On the other hand, in some schemes according to the related art, the above-mentioned order is reversed, that is, after performing sound source direction estimation, the resulting value is used to perform sound source separation. In the latter method, it is easy to establish correspondence between channels, and the separation itself can be performed with low delay. However, a problem common to schemes that perform “sound source separation after direction estimation” is that the accuracy of direction estimation affects the separation accuracy. Further, even if the direction itself can be estimated correctly, there is a possibility that separation accuracy may decrease due to other factors at the point of sound source separation.
An example in which sound source separation based on frequency masking is performed after direction estimation is disclosed in Japanese Unexamined Patent Application Publication No. 2004-325284 “METHOD OF ESTIMATING SOUND SOURCE DIRECTION AND SYSTEM FOR THE SAME, AND METHOD OF SEPARATING SOUND SOURCES AND SYSTEM FOR THE SAME”. In this scheme, after performing direction estimation on a per-frequency bin basis, only the signals in frequency bins where a target sound is dominant are retained to thereby extract the target sound. However, since this method assumes sparseness in the frequency domain (mixing of a target sound and an interference sound is rare at the same frequency), for mixing of signals for which that assumption does not hold, both the accuracy of direction estimation and the accuracy of sound source separation decrease.
An example in which sound source separation based on beamforming is performed after direction estimation is disclosed in Japanese Unexamined Patent Application Publication No. 2008-64892 “VOICE RECOGNITION METHOD AND VOICE RECOGNITION APPARATUS USING THE SAME”. In this system, after estimating the directions of arrival from sound sources at long distances or the positions of sound sources at short distances by using the MUSIC method (a technique in which signal and noise subspaces are obtained from eigenvalue decomposition of a correlation matrix, and the inverse of the dot product of an arbitrary sound source position vector and the noise subspace is obtained to examine the sound wave arrival direction and position of the sound source), a target sound is extracted by using the minimum variance beamformer (or the Frost beamformer). In the case of the minimum variance beamformer, it is necessary that the transfer function from the sound source of a target sound to each microphone be previously found. In Japanese Unexamined Patent Application Publication No. 2008-64892, the transfer function is estimated from the direction and position of the sound source. However, since a model with no reverberation is assumed, and values such as the sound velocity and microphone position are used in the estimation of the transfer function, the accuracy of separation decreases if such an assumption or values differ from those in the actual environment.
Summarizing the foregoing discussion, although several schemes have existed for sound source direction estimation in the related art, no system exists which combines the following advantage of the “Sound Source Direction Estimation Using Phase Difference Between Microphones”:
Both tracking lag and delay are small,
and the following advantages of the “Sound Source Direction Estimation Using ICA”:
Allows blind separation
Adapted to multiple sound sources
Easy to establish correspondence between the separated results and the sound source direction.
Further, no system exists in which, in the sound source direction estimation using ICA, under the assumption that there is no permutation between frequency bins in the separated results, the relationship between the frequency bins is exploited to improve the accuracy of direction estimation.