1. Field of the Invention
The present invention relates to a signal processing apparatus, a signal processing method, and a program therefor. More specifically, the invention relates to a signal processing apparatus, a signal processing method, and a program that perform a process of separating signals, in which a plurality of signals are mixed, by using the independent component analysis (ICA). In particular, the process is a real-time process, that is, a process of separating observed signals, which are successively input, into independent components with little delay and successively outputting them.
2. Description of the Related Art
First, as a related art of the invention, a description will be given of the independent component analysis (ICA) and a real-time implementation method of the independent component analysis (ICA).
A1. Description of ICA
The ICA is a type of multivariate analysis, and is a technique of separating multidimensional signals by using the statistical properties of the signals. For details on the ICA itself, refer to, for example, “Introduction to the Independent Component Analysis” (Noboru Murata, Tokyo Denki University Press).
Hereinafter, a description will be given of ICA for sound signals, in particular, ICA in the time frequency domain.
As shown in FIG. 1, a situation is considered in which different sounds are being played from N sound sources, and those sounds are observed at n microphones. The sounds (source signals) produced from the sound sources are subject to time delays, reflections, and so on before arriving at the microphones. Therefore, signals observed at a microphone k (observed signals) can be represented as an expression that sums up convolutions between source signals and transfer functions with respect to all sound sources as indicated by Expression [1.1]. Hereinafter, these mixtures will be referred to as “convolutive mixtures”.
In addition, it is assumed that the observed signal of the microphone n is xn(t). The observed signals of the microphone 1 and the microphone 2 are x1(t) and x2(t).
Observed signals for all microphones can be represented by a single expression as in Expression [1.2] below.
Numerical Expression 1

$$x_k(t) = \sum_{j=1}^{N} \sum_{l=0}^{L} a_{kj}(l)\, s_j(t-l) = \sum_{j=1}^{N} \{ a_{kj} * s_j \} \qquad [1.1]$$

$$x(t) = A_{[0]}\, s(t) + \cdots + A_{[L]}\, s(t-L) \qquad [1.2]$$

Here,

$$s(t) = \begin{bmatrix} s_1(t) \\ \vdots \\ s_N(t) \end{bmatrix}, \quad x(t) = \begin{bmatrix} x_1(t) \\ \vdots \\ x_n(t) \end{bmatrix}, \quad A_{[l]} = \begin{bmatrix} a_{11}(l) & \cdots & a_{1N}(l) \\ \vdots & \ddots & \vdots \\ a_{n1}(l) & \cdots & a_{nN}(l) \end{bmatrix} \qquad [1.3]$$
Here, x(t) and s(t) are column vectors having xk(t) and sk(t) as elements, respectively. A[l] is an n×N matrix having akj(l) as elements, where l is the delay (tap) index running from 0 to L. In the following description, it is assumed that n=N.
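The convolutive mixture of Expressions [1.1] and [1.2] can be illustrated with a minimal sketch. The function name `convolutive_mix`, the array layout (taps on the first axis), and the zero initial condition for t < l are assumptions made for this illustration and do not come from the specification.

```python
import numpy as np

def convolutive_mix(s, A):
    """Mix sources as x(t) = A[0] s(t) + ... + A[L] s(t - L).

    s: array of shape (N, T), one row per source signal s_j(t).
    A: array of shape (L + 1, n, N), where A[l][k, j] = a_kj(l).
    Returns x of shape (n, T), one row per microphone.
    Assumes L < T and treats s(t) as zero for t < 0.
    """
    n_taps, n, _ = A.shape
    T = s.shape[1]
    x = np.zeros((n, T))
    for l in range(n_taps):
        # Build s(t - l): shift every source right by l samples.
        delayed = np.zeros_like(s, dtype=float)
        delayed[:, l:] = s[:, :T - l]
        x += A[l] @ delayed
    return x

# Example: two sources, two microphones, four filter taps (L = 3).
rng = np.random.default_rng(0)
sources = rng.standard_normal((2, 50))
filters = rng.standard_normal((4, 2, 2))
observed = convolutive_mix(sources, filters)
```

Each row of `observed` is the sum over sources of a convolution between a source and its transfer function, which is the right-hand form of Expression [1.1].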
It is common knowledge that convolutive mixtures in the time domain are represented as instantaneous mixtures in the time frequency domain. An analysis using this characteristic is ICA in the time frequency domain.
For details on the time frequency domain ICA itself, refer to, for example, “19.2.4 Fourier Transform Methods” of “Explanation of Independent Component Analysis” and Japanese Unexamined Patent Application Publication No. 2006-238409 “Audio Signal Separating Apparatus/Noise Removal Apparatus and Method”.
Hereinafter, features relating to the invention will be mainly described.
Application of a short-time Fourier transform on both sides of Expression [1.2] mentioned above yields Expression [2.1] below.
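Numerical Expression 2

$$X(\omega,t) = A(\omega)\, S(\omega,t) \qquad [2.1]$$

$$X(\omega,t) = \begin{bmatrix} X_1(\omega,t) \\ \vdots \\ X_n(\omega,t) \end{bmatrix} \qquad [2.2]$$

$$A(\omega) = \begin{bmatrix} A_{11}(\omega) & \cdots & A_{1N}(\omega) \\ \vdots & \ddots & \vdots \\ A_{n1}(\omega) & \cdots & A_{nN}(\omega) \end{bmatrix} \qquad [2.3]$$

$$S(\omega,t) = \begin{bmatrix} S_1(\omega,t) \\ \vdots \\ S_N(\omega,t) \end{bmatrix} \qquad [2.4]$$

$$Y(\omega,t) = W(\omega)\, X(\omega,t) \qquad [2.5]$$

$$Y(\omega,t) = \begin{bmatrix} Y_1(\omega,t) \\ \vdots \\ Y_n(\omega,t) \end{bmatrix} \qquad [2.6]$$

$$W(\omega) = \begin{bmatrix} W_{11}(\omega) & \cdots & W_{1n}(\omega) \\ \vdots & \ddots & \vdots \\ W_{n1}(\omega) & \cdots & W_{nn}(\omega) \end{bmatrix} \qquad [2.7]$$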
In Expression [2.1],
ω is the frequency bin index, and
t is the frame index.
If ω is fixed, this expression can be regarded as instantaneous mixtures (mixtures with no time delay). Accordingly, to separate observed signals, Expression [2.5] for calculating the separation results Y is provided, and then a separating matrix W(ω) is determined so that, as the separation results, the individual components of Y(ω,t) are maximally independent.
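The per-bin structure of Expressions [2.1] and [2.5] can be sketched as follows. This is an illustration under stated assumptions, not the separation method itself: the spectrogram S(ω,t) is generated directly as random complex data rather than by a short-time Fourier transform, and the separating matrix is taken as the inverse of the known mixing matrix A(ω), which is available only in a simulation. The helper names `mixing_matrices` and `apply_per_bin` are hypothetical.

```python
import numpy as np

def mixing_matrices(A_time, n_bins):
    """Frequency-domain mixing matrices A(ω) from time-domain filters.

    A_time: real array of shape (L + 1, n, N) holding a_kj(l).
    Returns a complex array of shape (n_bins, n, N), one matrix per bin,
    obtained by a DFT along the tap axis (zero-padded as needed).
    """
    return np.fft.rfft(A_time, n=2 * (n_bins - 1), axis=0)

def apply_per_bin(mats, V):
    """Multiply each bin's matrix by that bin's vectors, for every frame.

    mats: (M, n, n) matrices, V: (M, n, T) vectors. Returns (M, n, T).
    With mats = A(ω) this is Expression [2.1]; with mats = W(ω), [2.5].
    """
    return np.einsum('fkj,fjt->fkt', mats, V)

rng = np.random.default_rng(1)
M, n, T = 5, 2, 30                        # bins, channels, frames
A = mixing_matrices(rng.standard_normal((4, n, n)), n_bins=M)
S = rng.standard_normal((M, n, T)) + 1j * rng.standard_normal((M, n, T))

X = apply_per_bin(A, S)                   # instantaneous mixture per bin
W = np.linalg.inv(A)                      # oracle W(ω); ICA must estimate this
Y = apply_per_bin(W, X)                   # separation results
```

In this oracle setting `Y` reproduces `S`. ICA has no access to A(ω) and instead searches for a W(ω) that makes the components of Y(ω,t) maximally independent.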
In the case of time frequency domain ICA according to the related art, a so-called permutation problem occurs, in which “which component is separated into which channel” differs for each frequency bin. This permutation problem was almost entirely solved by the configuration disclosed in Japanese Unexamined Patent Application Publication No. 2006-238409 “Audio Signal Separating Apparatus/Noise Removal Apparatus and Method”, which is a patent application previously filed by the same inventor as the present application. Since this method is also employed in an embodiment of the invention, a brief description will be given of the technique for solving the permutation problem disclosed in Japanese Unexamined Patent Application Publication No. 2006-238409.
In Japanese Unexamined Patent Application Publication No. 2006-238409, in order to find a separating matrix W(ω), Expressions [3.1] to [3.3] represented as follows are iterated until the separating matrix W(ω) converges (or a certain number of times).
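As a rough guide to how this iteration fits together, the following sketch applies Expressions [3.1] to [3.3] using the score function of Expression [3.9]. It is a minimal sketch and not the implementation of the cited publication: the step size `eta`, the iteration count, the identity initialization, the small constant `eps` that guards against division by zero, and all function names are assumptions, and no convergence test is included.

```python
import numpy as np

def score(Y, gamma=1.0, eps=1e-12):
    """Score function: -gamma * Y_k(ω,t) / ||Y_k(t)||_2.

    Y has shape (M, n, T). The L2 norm is taken over all M frequency
    bins of channel k at frame t, so every bin shares one norm.
    """
    norm = np.sqrt(np.sum(np.abs(Y) ** 2, axis=0, keepdims=True))
    return -gamma * Y / (norm + eps)

def ica_step(W, X, eta=0.1, gamma=1.0):
    """One pass over Expressions [3.1] to [3.3] for all bins at once.

    W: (M, n, n) separating matrices, X: (M, n, T) observed spectrogram.
    """
    _, n, T = X.shape
    Y = np.einsum('fkj,fjt->fkt', W, X)                      # [3.1]
    phi = score(Y, gamma)
    # Frame average of phi(Y(t)) Y(ω,t)^H, computed per bin.
    C = np.einsum('fkt,fjt->fkj', phi, Y.conj()) / T
    dW = np.einsum('fkj,fjl->fkl', np.eye(n) + C, W)         # [3.2]
    return W + eta * dW                                      # [3.3]

def run_ica(X, n_iter=100, eta=0.1):
    """Iterate a fixed number of times from identity matrices."""
    M, n, _ = X.shape
    W = np.tile(np.eye(n, dtype=complex), (M, 1, 1))
    for _ in range(n_iter):
        W = ica_step(W, X, eta)
    return W
```

Because the norm in `score` spans all frequency bins, the update in one bin depends on the separation results in every other bin, which is what ties the output channels together across frequency.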
Numerical Expression 3

$$Y(\omega,t) = W(\omega)\, X(\omega,t) \quad (t = 1, \ldots, T;\ \omega = 1, \ldots, M) \qquad [3.1]$$

$$\Delta W(\omega) = \left\{ I + \left\langle \varphi_\omega(Y(t))\, Y(\omega,t)^H \right\rangle_t \right\} W(\omega) \qquad [3.2]$$

$$W(\omega) \leftarrow W(\omega) + \eta\, \Delta W(\omega) \qquad [3.3]$$

$$Y(t) = \begin{bmatrix} Y_1(1,t) \\ \vdots \\ Y_1(M,t) \\ \vdots \\ Y_n(1,t) \\ \vdots \\ Y_n(M,t) \end{bmatrix} = \begin{bmatrix} Y_1(t) \\ \vdots \\ Y_n(t) \end{bmatrix} \qquad [3.4]$$

$$\varphi_\omega(Y(t)) = \begin{bmatrix} \varphi_\omega(Y_1(t)) \\ \vdots \\ \varphi_\omega(Y_n(t)) \end{bmatrix} \qquad [3.5]$$

$$\varphi_\omega(Y_k(t)) = \frac{\partial}{\partial Y_k(\omega,t)} \log P(Y_k(t)) \qquad [3.6]$$

P(Y_k(t)): probability density function (PDF) of Y_k(t)

$$P(Y_k(t)) \propto \exp\left( -\gamma\, \lVert Y_k(t) \rVert_2 \right) \qquad [3.7]$$

$$\lVert Y_k(t) \rVert_m = \left\{ \sum_{\omega=1}^{M} \lvert Y_k(\omega,t) \rvert^m \right\}^{1/m} \qquad [3.8]$$

$$\varphi_\omega(Y_k(t)) = -\gamma\, \frac{Y_k(\omega,t)}{\lVert Y_k(t) \rVert_2} \qquad [3.9]$$

$$W = \begin{bmatrix} W_{11}(1) & & 0 & & W_{1n}(1) & & 0 \\ & \ddots & & \cdots & & \ddots & \\ 0 & & W_{11}(M) & & 0 & & W_{1n}(M) \\ & \vdots & & & & & \end{bmatrix}$$
                                  ⋱                                                                                                                          ⋮                                                                                                                                                                                        W                                          n                      ⁢                                                                                          ⁢                      1                                                        ⁡                                      (                    1                    )                                                                                                                                                            0                                                                                                                                                                W                    nn                                    ⁡                                      (                    1                    )                                                                                                                                                            0                                                                                                                                                  ⋱                                                                                                                          ⋯                                                                                                                          ⋱                                                                                                                                                  0                                                                                                                                                                W                       
                   n                      ⁢                                                                                          ⁢                      1                                                        ⁡                                      (                    M                    )                                                                                                                                                            0                                                                                                                                                                W                    nn                                    ⁡                                      (                    M                    )                                                                                ]                                    [        3.10        ]                                          X          ⁡                      (            t            )                          =                  [                                                                                          X                    1                                    ⁡                                      (                                          1                      ,                      t                                        )                                                                                                      ⋮                                                                                                          X                    1                                    ⁡                                      (                                          M                      ,                      t                                        )                                                                                                      ⋮                                                                                                     
     X                    n                                    ⁡                                      (                                          1                      ,                      t                                        )                                                                                                      ⋮                                                                                                          X                    n                                    ⁡                                      (                                          M                      ,                      t                                        )                                                                                ]                                    [        3.11        ]                                          Y          ⁡                      (            t            )                          =                  WX          ⁡                      (            t            )                                              [        3.12        ]            
In the following, such iteration will be referred to as “learning”. It should be noted, however, that Expressions [3.1] to [3.3] are performed on all frequency bins, and further, Expression [3.1] is performed on all the frames of accumulated observed signals. In addition, in Expression [3.2], <·>t denotes the mean over all frames. The superscript H attached at the upper right of Y(ω,t) indicates the Hermitian transpose (which takes the transpose of a vector or a matrix, and also transforms its elements into conjugate complex numbers).
The separation results Y(t) are represented by Expression [3.4], which denotes a vector in which the elements of all the channels and all the frequency bins of the separation results are arranged. Also, φω(Y(t)) is a vector represented by Expression [3.5]. Each element φω(Yk(t)) is called a score function, and is the logarithmic derivative of the multidimensional (multivariate) probability density function (PDF) of Yk(t) (Expression [3.6]). As the multidimensional PDF, for example, a function represented by Expression [3.7] can be used, in which case the score function φω(Yk(t)) can be represented as Expression [3.9]. Here, ∥Yk(t)∥2 is the L-2 norm of the vector Yk(t) (obtained by summing the squares of the absolute values of all elements and then taking the square root of the sum). The L-m norm, a generalized form of the L-2 norm, is defined by Expression [3.8]. In Expressions [3.7] and [3.9], γ denotes a term for adjusting the scale of Yk(ω, t), for which an appropriate positive constant, for example, sqrt(M) (the square root of the number of frequency bins), is substituted. In Expression [3.3], η is a small positive value (for example, about 0.1) called a learning rate or learning factor. It is used for gradually reflecting ΔW(ω) calculated in Expression [3.2] in the separating matrix W(ω).
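As an illustration of the score function of Expression [3.9] and the L-m norm of Expression [3.8], a minimal sketch is given below. It assumes that one frame of one channel is held as a plain list of complex frequency-bin values; the function names, the sample values, and the handling of an all-zero frame are assumptions made for this sketch only.

```python
import math


def lm_norm(y_k, m=2):
    """L-m norm of one channel's frame y_k (a list of complex bins), cf. Expression [3.8]."""
    return sum(abs(v) ** m for v in y_k) ** (1.0 / m)


def score_function(y_k, gamma=None):
    """Per-bin score function, cf. Expression [3.9]: -gamma * Y_k(omega, t) / ||Y_k(t)||_2."""
    if gamma is None:
        gamma = math.sqrt(len(y_k))  # sqrt(M), the example value mentioned in the text
    norm = lm_norm(y_k, 2)
    if norm == 0.0:
        # Assumption of this sketch: return zeros for a silent frame to avoid division by zero.
        return [0j for _ in y_k]
    return [-gamma * v / norm for v in y_k]


# One frame of one channel with M = 4 frequency bins (made-up values).
y_k = [3 + 0j, 0 + 4j, 0j, 0j]
phi = score_function(y_k)
```

Because every bin is divided by the same norm taken over all frequency bins, the score of one bin depends on the other bins of the same channel, and the L-2 norm of the resulting vector equals γ for any non-silent frame.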
In addition, while Expression [3.1] represents separation in one frequency bin (refer to FIG. 2A), it is also possible to represent separation in all frequency bins by a single expression (refer to FIG. 2B).
This may be accomplished by using the separation results Y(t) in all frequency bins represented by Expression [3.4] described above, the observed signals X(t) represented by Expression [3.11], and the separating matrix for all frequency bins represented by Expression [3.10]. By using those vectors and matrices, separation can be represented by Expression [3.12]. According to an embodiment of the invention, Expressions [3.1] and [3.12] are used selectively as necessary.
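The equivalence between the per-bin form and the all-bin form can be checked with a small sketch. It assumes n = 2 channels and M = 3 frequency bins with made-up values, and uses 0-based bin indices, whereas the text counts bins from 1 to M; all function and variable names are hypothetical.

```python
def separate_per_bin(W_bins, X_bins):
    """Per-bin form: Y(omega, t) = W(omega) X(omega, t) for each frequency bin.

    W_bins[omega] is an n x n matrix and X_bins[omega] is a length-n vector.
    """
    return [[sum(W[k][j] * x[j] for j in range(len(x))) for k in range(len(W))]
            for W, x in zip(W_bins, X_bins)]


def build_big_W(W_bins):
    """nM x nM matrix whose (k, j) block is diag(W_kj(1), ..., W_kj(M)), cf. Expression [3.10]."""
    M, n = len(W_bins), len(W_bins[0])
    big = [[0j] * (n * M) for _ in range(n * M)]
    for omega in range(M):
        for k in range(n):
            for j in range(n):
                big[k * M + omega][j * M + omega] = W_bins[omega][k][j]
    return big


def stack_channels(V_bins):
    """Channel-major stacking with bins inside each channel, cf. Expression [3.11]."""
    M, n = len(V_bins), len(V_bins[0])
    return [V_bins[omega][k] for k in range(n) for omega in range(M)]


# n = 2 channels, M = 3 bins (made-up values).
W_bins = [[[1 + 1j, 2], [0, 1j]], [[2, 0], [1, 1]], [[0, 1], [1, 0]]]
X_bins = [[1, 1j], [2, -1], [3, 4]]

Y_bins = separate_per_bin(W_bins, X_bins)
big_W = build_big_W(W_bins)
X = stack_channels(X_bins)
Y = [sum(row[c] * X[c] for c in range(len(X))) for row in big_W]
```

In this sketch, Y holds the same values as the per-bin results Y_bins rearranged in the channel-major order, which is why the two forms can be used interchangeably.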
In addition, the diagrams of X1 to Xn and Y1 to Yn shown in FIGS. 2A and 2B are called spectrograms, in which the results of short-time Fourier transform (STFT) are arranged in the frequency bin direction and the frame direction. The vertical direction represents the frequency bin, and the horizontal direction represents the frame. While lower frequencies are noted at the top in Expressions [3.4] and [3.11], lower frequencies are drawn at the bottom in the spectrograms.
In the above description, it is assumed that the number of sound sources N is equal to the number of microphones n. However, even when N<n, the separation is possible. In this case, signals corresponding to the sound sources are respectively output on N channels of the n output channels, but almost-silent signals corresponding to none of the sound sources are output on n-N remaining channels.
A2. Real-Time Implementation of ICA
The learning process described in the section “A1. Description of ICA”, in which Expressions [3.1] to [3.3] are iterated until the separating matrix W(ω) converges (or for a predetermined number of times), is a batch process. That is, as described above, learning refers to iterating Expressions [3.1] to [3.3] after the whole of the observed signals has been accumulated.
This batch process can be applied to real-time (low-delay) sound source separation through some contrivance. As an example of a sound source separation process realizing a real-time processing method, a description will be given of the configuration disclosed in “Japanese Unexamined Patent Application Publication No. 2008-147920: Real-Time Sound Source Separation Apparatus and Method”, which is a patent application previously filed by the same applicant as the present application.
As shown in FIG. 3, in the processing method disclosed in Japanese Unexamined Patent Application Publication No. 2008-147920, observed signal spectrograms are split into a plurality of overlapped blocks 1 to N, and learning is performed for each block, thereby finding a separating matrix. The reason why the blocks are overlapped is to achieve both the accuracy and the frequency of updates of the separating matrix.
In addition, in the case of real-time ICA (blockwise ICA) disclosed prior to Japanese Unexamined Patent Application Publication No. 2008-147920, there is no overlap between the blocks. Therefore, in order to shorten the update interval of the separating matrix, it is necessary to shorten the block length (=the time for which observed signals are accumulated). However, there is a problem in that a shorter block length results in lower separation accuracy.
As described above, the method of applying the batch process to each block of the observed signals is hereinafter referred to as a “blockwise batch process”.
A separating matrix found from each block is applied to subsequent observed signals (not applied to the same block) to generate the separation results. Herein, such a method will be referred to as a “shift application”.
FIG. 4 illustrates the “shift application”. Suppose that at the current time, t-th-frame observed signals X(t) 42 are input. At this point, the separating matrix corresponding to the block containing the observed signals X(t) (for example, an observed signal block 46 containing the current time) has not been obtained yet. Accordingly, instead of the separating matrix of the block 46, the observed signals X(t) are multiplied by the separating matrix learned from a learning data block 41 that is a block preceding the block 46, thereby generating the separation results corresponding to X(t), that is, separation results Y(t) 44 at the current time. In addition, it is assumed that the separating matrix learned from the learning data block 41 is already obtained at the time point of the frame t.
As described above, a separating matrix can be regarded as representing the inverse of the mixing process.
Hence, when the mixing process is the same (for example, when the positional relation between sound sources and microphones has not changed) between the observed signals in the learning data block setting segment 41 and the observed signals 42 at the current time, signal separation can be performed even when a separating matrix learned in a different segment is applied. In such a manner, it is possible to realize separation with little delay.
The configuration disclosed in Japanese Unexamined Patent Application Publication No. 2008-147920 proposes a method in which a plurality of processing units called threads for finding a separating matrix from overlapped blocks are run in parallel per unit time shifts. This parallel processing method will be described with reference to FIG. 5.
FIG. 5 shows the transitions of processing over time of individual threads serving as the units of processing. FIG. 5 shows six threads 1 to 6. Each thread repeats three states of A) Accumulating, B) Learning, and C) Waiting. That is, the thread length corresponds to the total time length of the three processes of A) Accumulating, B) Learning, and C) Waiting. Time progresses from left to right in FIG. 5.
The “A) Accumulating” is the segment of dark gray in FIG. 5. When in this state, a thread accumulates observed signals. The overlapped blocks in FIG. 5 can be expressed by shifting the accumulation start times between threads. Since the accumulation start time is shifted by ¼ of the accumulation time in FIG. 5, assuming that the accumulation time in one thread is, for example, four seconds, the shifted time between threads equals one second.
Upon accumulating observed signals for a predetermined time (for example, four seconds), the state of each thread transitions to “B) Learning”. The “B) Learning” is the segment of light gray in FIG. 5. When in this state, Expressions [3.1] to [3.3] described above are iterated with respect to the accumulated observed signals.
When the separating matrix W has sufficiently converged (or simply upon reaching a predetermined number of iterations) by learning (iteration of Expressions [3.1] to [3.3]), the learning is ended, and the thread transitions to the “C) Waiting” state (the white segment in FIG. 5). The “Waiting” is provided for keeping the accumulation start time and the learning start time at a constant interval between threads. As a result, the learning end time (=the time at which the separating matrix is updated) is also kept at a substantially constant interval.
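The cycle of the three states can be sketched as a function of time. The sketch assumes six threads, an accumulation time of four seconds, and a shift of one second as in the example above; the learning time of 1.5 seconds and the thread length of six seconds (the number of threads multiplied by the shift) are assumptions of this sketch, as are all names.

```python
ACCUMULATE = 4.0   # accumulation time per thread in seconds (example value from the text)
SHIFT = 1.0        # shift of the accumulation start time between threads
LEARN = 1.5        # learning time in seconds (hypothetical)
NUM_THREADS = 6
THREAD_LEN = NUM_THREADS * SHIFT  # assumed so that accumulation starts stay evenly spaced


def thread_state(t, start_offset):
    """State of a thread at time t, given the time at which it first starts accumulating."""
    if t < start_offset:
        return "waiting"  # not started yet; treated as waiting in this sketch
    phase = (t - start_offset) % THREAD_LEN
    if phase < ACCUMULATE:
        return "accumulating"
    if phase < ACCUMULATE + LEARN:
        return "learning"
    return "waiting"


# States of the threads 1 to 6 at t = 4.5 seconds.
states = [thread_state(4.5, k * SHIFT) for k in range(NUM_THREADS)]
```

In this sketch the waiting time absorbs whatever remains of the thread length after accumulating and learning, so that each thread restarts accumulation at a fixed period.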
The separating matrix W obtained by learning is used for performing separation until learning in the next thread is finished. That is, the separating matrix W is used as a separating matrix 43 shown in FIG. 4. A description will be given of the separating matrix used in each applied-separating-matrix specifying segment 51 to 53 along the progression of time shown at the bottom of FIG. 5.
In the applied-separating-matrix specifying segment 51 from when the system is started to when the first separating matrix is learned, an initial value (for example, a unit matrix) is used as the separating matrix 43 in FIG. 4. In the segment 52 from when learning in the thread 1 shown in FIG. 5 is finished to when learning in the thread 2 is finished, a separating matrix derived from an observed-signal accumulating segment 54 in the thread 1 is used as the separating matrix 43 shown in FIG. 4. The numeral “1” shown in the segment 52 in FIG. 5 indicates that the separating matrix W used in this period is obtained through processing in the thread 1. The numerals on the right from the applied-separating-matrix specifying segment 52 also each indicate from which thread the corresponding separating matrix is derived.
In addition, when a separating matrix obtained in another thread exists at the point of starting learning, the separating matrix is used as the initial value of learning. This is referred to as “inheritance of a separating matrix”. In the example shown in FIG. 5, at learning start timing 55 at which the first learning is started in the thread 3, the separating matrix 52 derived from the thread 1 is already obtained, so the separating matrix 52 is used as the initial value of learning.
By performing such processing, it is possible to prevent or reduce the occurrence of permutation between threads. Permutation between threads refers to, for example, a problem such that in the separating matrix obtained in the first thread, voice is output on the first channel and music is output on the second channel, whereas those are reversed in the separating matrix obtained in the third thread.
As described above with reference to FIG. 5, permutation between threads can be reduced by performing “inheritance of a separating matrix” so that the separating matrix is used as the initial value of learning when a separating matrix that has been obtained in another thread exists. In addition, even when a separating matrix has not sufficiently converged by learning in the thread 1, the degree of convergence can be improved as the separating matrix is inherited by the next thread.
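The permutation between threads can be illustrated with a minimal sketch. It assumes a single frequency bin with a real-valued instantaneous 2 x 2 mixture; the mixing matrix A, the source values, and the names W_thread1 and W_thread3 are hypothetical and chosen only for illustration.

```python
def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


A = [[2, 1], [1, 1]]             # hypothetical mixing matrix (determinant 1)
W_thread1 = [[1, -1], [-1, 2]]   # inverse of A: voice on channel 1, music on channel 2
W_thread3 = [[-1, 2], [1, -1]]   # same rows in swapped order: also separates the sources

sources = [10, 3]                # [voice, music] at one instant (made-up values)
observed = matvec(A, sources)

out_thread1 = matvec(W_thread1, observed)  # voice first, music second
out_thread3 = matvec(W_thread3, observed)  # music first, voice second
```

Both matrices separate the sources equally well, so the independence criterion alone does not fix the channel order. Starting the learning from an inherited separating matrix makes it more likely that the channel order is retained.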
By running a plurality of threads per unit time shifts in this way, the separating matrix is updated at an interval substantially equal to a shift between threads, that is, a block shift width 56.
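The relation between the block shift width and the update interval of the separating matrix can be sketched as follows. The sketch assumes a constant learning time of 1.5 seconds (a hypothetical value) together with the four-second accumulation time and one-second shift used as examples above; the function names are assumptions.

```python
def update_times(num_blocks, block_len, block_shift, learn_time):
    """Times at which the matrix learned from each overlapped block becomes available.

    Block i accumulates over [i * block_shift, i * block_shift + block_len)
    and then learns for learn_time (assumed constant in this sketch).
    """
    return [i * block_shift + block_len + learn_time for i in range(num_blocks)]


def matrix_in_use(t, ready_times):
    """Index of the block whose matrix is applied at time t under the shift application.

    Returns None while only the initial value (for example, a unit matrix) is available.
    """
    ready = [i for i, r in enumerate(ready_times) if r <= t]
    return ready[-1] if ready else None


ready = update_times(num_blocks=4, block_len=4.0, block_shift=1.0, learn_time=1.5)
intervals = [b - a for a, b in zip(ready, ready[1:])]
```

Under these assumptions each interval equals the block shift width, and matrix_in_use returns None during the period in which the initial value is applied.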
B. Problems of Related Art
Next, the problems in “A2. Real-Time Implementation of ICA” described above will be studied. With the combination of the “blockwise batch process” and the “shift application” described in “A2. Real-Time Implementation of ICA”, the sound source separation may not be performed accurately. The following two factors can be considered separately as the reasons.
B1. Tracking lag
B2. Residual sound
Hereinafter, the respective reasons why the two factors cause inaccuracy in the sound source separation will be described.
B1. Tracking Lag
When the “shift application” is employed, a mismatch occurs temporarily when the sound sources are changed (when the sound sources are moved or start playing sounds suddenly) between the segment used for learning of a separating matrix (for example, the learning data block 41 shown in FIG. 4) and the observed signals 42 at the current time.
Thereafter, as a new separating matrix is obtained by the learning process which observes the changed sound sources, such a mismatch disappears eventually. However, until the new separating matrix is generated, the mismatch exists. This phenomenon will be herein referred to as a “tracking lag”. The tracking lag may be caused even when the sound starts playing suddenly or the sound stops playing and then starts playing again although the sound sources are not moved. Hereinafter, such a sound is referred to as a “sudden sound”.
FIG. 6 is a diagram illustrating the correspondence between the sudden sound and the observed signal. The example of FIG. 6 assumes the following two sound sources.
(a) Sound source 1
(b) Sound source 2
Time progresses from left to right. The block heights of the (a) sound source 1, the (b) sound source 2, and the (c) observed signal represent volumes thereof.
The (a) sound source 1 plays twice with the silent segment 67 interlaid therebetween. Output segments of the sound source are respectively represented by the sound-source-1 output segments 61 and 62. The sounds are output at the current time at which the current observed signal 66 is being observed.
The (b) sound source 2 plays continuously. That is, the sound source 2 has a sound-source-2 output segment 63.
The (c) observed signal can be represented by the sum of the signals which reach the microphones from the sound sources 1 and 2.
The block 64 of the learning data indicated by the dotted-line area in the (c) observed signal is the same segment as the learning data block 41 shown in FIG. 4. The separating matrix learned from the observed signal in the segment of the learning data block 64 is applied to the observed signal 66 at the current time (t1), thereby performing the separation. The segment 65 (the segment 65 from the block end to the current time) resides between the learning data block 64 and the observed signal 66 at the current time (t1).
The observed signal 66 at the current time (t1) is an observed signal based on the sound source output 69 at the current time.
However, sometimes a mismatch may occur between the learning data and the current observed signal in accordance with the length of the silent segment 67 of the sound source 1 and the length of the learning data block 64 (which is the same as the learning data block 41 shown in FIG. 4).
For example, in the (c) observed signal, the observed signal 66 at the current time (t1) includes both the sound-source-1 output segment 62 derived from the sound source 1 and the sound-source-2 output segment 63 derived from the sound source 2 as an observed signal. In contrast, in the learning data block 64, only the sound-source-2 output segment 63 originated from the sound source 2 was observed.
A situation like that of the observed signal 66 at the current time (t1), in which a sound absent from the learning data block is currently being played, is expressed as “a sudden sound is generated”. In other words, although the sound source 1 played before the block (corresponding to the sound-source-1 output segment 61), the learning data block 64 does not include any observed signal of the sound source 1. Therefore, the sound of the sound source 1 (the sound-source-1 output segment 62) is a sudden sound for the separating matrix learned from the learning data block 64.
FIG. 7 is a diagram illustrating an effect of the sudden sound generation on the separation result, particularly, the tracking lag. FIG. 7 shows the following data.
(a) Observed Signal
(b1) Separation Result 1
(b2) Separation Result 2
(b3) Separation Result 3
Time progresses from left to right in the drawing.
In the example shown in FIG. 7, it is assumed that the ICA (independent component analysis) system has three or more microphones and the number of output channels is also three or more.
The (a) observed signal includes the continuous sound 71 which is continuously played in the range of the time t0 to t5 and the sudden sound 72 which is output only in the range of the time t1 to t4.
The (a) observed signal in FIG. 7 is an observed signal similar to the (c) observed signal in FIG. 6. In addition, for example, the continuous sound 71 corresponds to the (b) sound source 2 in FIG. 6, and the sudden sound 72 corresponds to the (a) sound source 1 in FIG. 6.
Before the start of the output of the sudden sound 72, the separating matrix is sufficiently converged in the segment 73 from t0 to t1 during which only the continuous sound 71 is being played, and then the signal corresponding to the continuous sound 71 is output on only one channel. This is the (b1) separation result 1. Almost silent sound is output on other channels, that is, the (b2) separation result 2 and the (b3) separation result 3.
Here, suppose that the sudden sound 72 occurs. For example, someone who has been silent may suddenly start talking. At this time, the separating matrix applicable to the observed signal is a separating matrix which is generated by learning the data before the generation of the sudden sound 72, that is, only the data of the continuous sound 71 prior to the time t1 as observation data.
As a result, the observed signal containing the sudden sound 72 after the time t1 is separated by applying the separating matrix generated on the basis of the observed signal prior to the time t1, and thus it is difficult to obtain a correct separation result corresponding to the observed signal. The reason is that the separating matrix generated on the basis of the observed signal prior to the time t1 does not take into account the sudden sound 72 included in the observed signal after the time t1. Consequently, in the range of the time t1 to t3, a mismatch occurs between the actual observed signal, that is, the mixture of the continuous sound 71 and the sudden sound 72, and the separation results obtained by applying the separating matrix.
In the time period from when the sudden sound starts playing to when a separating matrix reflecting the sudden sound is learned (the segment 74 from the time t1 to t2), the sudden sound is output on all the channels (the (b1) separation result 1, the (b2) separation result 2, and the (b3) separation result 3). That is, the sudden sound is scarcely subjected to the sound source separation. This time period is at minimum slightly longer than the learning time, and at maximum equal to the sum of the learning time and the block shift width. For example, in a system in which the learning time is 0.3 seconds and the block shift width is 0.2 seconds, the sudden sound is not separated and is output on all the channels for a little over 0.3 seconds at minimum and 0.5 seconds at maximum.
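One way to read the minimum and maximum stated above is sketched below. The sketch assumes that blocks end at integer multiples of the block shift width, that the first block ending after the onset of the sudden sound yields a separating matrix reflecting it, and that the learning time is constant; these are simplifying assumptions, and the function name is hypothetical.

```python
import math


def unseparated_duration(onset, block_shift, learn_time):
    """Time from the onset of a sudden sound until a matrix reflecting it is available.

    The first block that ends strictly after the onset contains part of the sudden
    sound; its matrix becomes available learn_time after that block ends.
    """
    first_block_end = (math.floor(onset / block_shift) + 1) * block_shift
    return (first_block_end - onset) + learn_time


# Learning time 0.3 s and block shift width 0.2 s, as in the example in the text.
worst = unseparated_duration(0.0, 0.2, 0.3)    # onset right after a block end
near_best = unseparated_duration(0.19, 0.2, 0.3)  # onset just before a block end
```

Under these assumptions the waiting time until the next block end lies between just over zero and the block shift width, which gives the range from a little over the learning time to the learning time plus the block shift width.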
Thereafter, each time learning on a new learning block finishes, a new separating matrix is generated and the separating matrix is updated. As the sudden sound is reflected in the separating matrix through these updates, the output of the sudden sound decreases on every channel except one (in FIG. 7, the (b2) separation result 2) (the segment 75 from the time t2 to t3). Eventually, the sudden sound is output only on that one channel (the (b2) separation result 2) (the segment 76 after t3).
In the example shown in FIG. 7, the segment in which the tracking lag occurs is a combined segment of the segment 74 from the time t1 to t2 and the segment 75 from the time t2 to t3, that is, the segment 77 from the time t1 to t3.
The causes of the problem of the tracking lag, which occurs when the sudden sound is generated, are different depending on whether the sudden sound is a target sound or an interference sound. Hereinafter, each case will be described. The target sound means a sound serving as an analysis target.
When the sudden sound is the interference sound, in other words, when the continuous sound 71 continuously played is the target sound, it is preferable to remove the sudden sound. Accordingly, the problem is that the interference sound is not removed and remains in the (b1) separation result 1 shown in FIG. 7.
On the other hand, when the sudden sound is the target sound, it is preferable to retain the sudden sound but remove the continuous sound 71 played continuously as the interference sound. It seems that the (b2) separation result 2 shown in FIG. 7 corresponds to such an output. However, a mismatch occurs between the input and the separating matrix in the segment 77 from the time t1 to t3 in which the tracking lag occurs. Hence, there is a possibility that the output sound is distorted (a possibility that balance between frequencies becomes different from the source signal). That is, when the sudden sound is the target sound, a problem arises in that the output sound may be distorted.
As described above, depending on the characteristics of the sudden sound, it is necessary to perform contrary processes of removing or retaining the sound. Hence, it is difficult to solve the problem by using a single method.
B2. Residual Sound
Next, in the combination of the “blockwise batch process” and the “shift application” described in the “A2. Real-Time Implementation of ICA”, “residual sound” as another factor which causes inaccuracy in the sound source separation will be described.
For example, in the segment 73 from the time t0 to t1, the segment 76 from the time t3 to t4, and the like in FIG. 7, the separating matrix has sufficiently converged, and the observed data is separated by applying a separating matrix based on the preceding learning data. In such segments, accurate separation is possible. However, even in such a segment, one sound source is not perfectly output on one channel alone; other sound sources remain to a certain extent. This is called the “residual sound”. For example, the residual sound 78 shown in FIG. 7 is a sound which should not remain in the (b2) separation result 2. Likewise, the residual sound 79 is a sound which should not be present in the (b3) separation result 3.
The following points are considered as factors which cause the residual sound.
a) The length of the spatial reverberation is longer than the frame length of the short-time Fourier transform (STFT).
b) The number of the sound sources is larger than the number of the microphones.
c) The space between microphones is narrow, and thus the interference sound is not removed at a low frequency.
In a sound source separation system using the real-time ICA, there is a trade-off between reducing the tracking lag and reducing the residual sound. The reason is that shortening the learning time is advantageous for reducing the tracking lag, but, depending on the method used to shorten it, the residual sound increases.
The computational cost of the ICA learning is proportional to the frame length of the short-time Fourier transform (STFT) and to the square of the number of channels (the number of microphones). Accordingly, when these values are set small, the learning time can be shortened even though the number of iterations is the same. Hence, the tracking lag can also be shortened.
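The proportionality stated above can be made concrete with a short sketch. Only the relative cost is computed; the reference configuration of a frame length of 1024 and four microphones is a hypothetical example, as is the function name.

```python
def relative_learning_cost(frame_len, channels, ref_frame_len, ref_channels):
    """Learning cost relative to a reference setting, assuming
    cost is proportional to frame_len * channels ** 2."""
    return (frame_len / ref_frame_len) * (channels / ref_channels) ** 2


# Hypothetical reference: frame length 1024, four microphones.
half_frame = relative_learning_cost(512, 4, 1024, 4)   # frame length halved
half_mics = relative_learning_cost(1024, 2, 1024, 4)   # number of microphones halved
```

Under this proportionality, halving the frame length halves the cost, while halving the number of microphones reduces it to one quarter.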
However, the reduction in the frame length further deteriorates one of the factors causing the residual sound, that is, the factor a).
Further, the reduction in the number of microphones further deteriorates one of the factors causing the residual sound, that is, the factor b).
Accordingly, a process of shortening the frame length of the short-time Fourier transform (STFT) or a process of reducing the number of channels (the number of microphones) contributes to the reduction in the tracking lag, whereas a problem arises in that the residual sound tends to occur.
As described above, reducing the tracking lag and reducing the residual sound are in a relationship in which improving one worsens the other.
The residual sound 78 shown in FIG. 7 is naturally separated as the continuous sound being played, that is, a sound corresponding to the (b1) separation result 1. Hence, when the residual sound occurs, separation performance of components (the sudden sound 72 in the (b1) separation result 1), which are dominantly output on the channel, deteriorates.
On the other hand, when the above-mentioned “tracking lag” is large, the time at which the accurate separation result of the sudden sound is obtained is delayed. Specifically, there is an increase in the time period from the time t1 shown in FIG. 7, at which the sudden sound is generated, to the time t3, at which the sound corresponding to the sudden sound is separated only on the channel corresponding to the sudden sound, that is, only in the (b2) separation result 2.
Which of a plurality of sound sources the sound should be acquired from may differ depending on the purpose. Here, the sound for which the accurate separation result is to be acquired is referred to as a “target sound”.
Depending on whether the “target sound” is the continuous sound being played or the sudden sound, it is preferable to perform a different process and use a different setting.
The remaining one of the factors causing the residual sound is as follows.
c) Since the spaces between microphones are narrow, the interference sound is not removed at a low frequency.
This factor is independent of the real-time process. However, the problem can be solved by the configuration according to the embodiment of the invention, and will thus be described herein. In the ICA in the time frequency domain, when the spaces between the microphones are narrow (for example, about 2 to 3 cm), separation may not be sufficiently performed, particularly at a low frequency. The reason is that it is difficult to obtain a sufficient phase difference between the microphones. The separation accuracy at a low frequency can be improved by increasing the microphone spaces, whereas the separation accuracy at a high frequency is then likely to be lowered by the phenomenon called spatial aliasing. Further, because of physical restrictions, it is sometimes not possible to install the microphones with wide spaces.
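The relationship between the microphone space, the phase difference, and the spatial aliasing can be illustrated by a minimal sketch. The sketch assumes a far-field plane wave, a single pair of microphones, and a speed of sound of 343 m/s, none of which are specified in the description above. The function names are hypothetical. The sketch uses the standard relations that the largest phase difference between two microphones is 2πfd/c and that spatial aliasing begins above c/(2d).

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees C


def max_phase_difference_rad(frequency_hz: float, spacing_m: float) -> float:
    """Largest phase difference between two microphones, in radians.

    Assumes a far-field plane wave arriving along the line that connects
    the microphones, which gives the largest possible time difference.
    """
    return 2.0 * math.pi * frequency_hz * spacing_m / SPEED_OF_SOUND_M_S


def spatial_aliasing_frequency_hz(spacing_m: float) -> float:
    """Frequency above which the phase difference can exceed pi radians."""
    return SPEED_OF_SOUND_M_S / (2.0 * spacing_m)


if __name__ == "__main__":
    for spacing_m in (0.02, 0.03, 0.20):  # 2 cm, 3 cm, and a wider 20 cm
        phase_deg = math.degrees(max_phase_difference_rad(100.0, spacing_m))
        alias_hz = spatial_aliasing_frequency_hz(spacing_m)
        print(f"spacing {spacing_m:.2f} m: "
              f"phase difference at 100 Hz = {phase_deg:.2f} deg, "
              f"aliasing above {alias_hz:.0f} Hz")
```

Under these assumptions, a wider space gives a larger phase difference at a low frequency but lowers the frequency at which spatial aliasing begins, which corresponds to the trade-off described above.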
The above-mentioned problems are summarized as follows.
(A) In the real-time ICA using the “blockwise processing” and the “shift application”, the “tracking lag” or the “residual sound” is caused by the sudden sound, and thus the sound source separation may not be accurately performed.
(B) The methods of coping with the “tracking lag” and the “residual sound” for accurately performing the sound source separation are contrary to each other depending on whether the sudden sound is the target sound or the interference sound. Hence, it is difficult to solve the problem by using a single method.
(C) In the framework of the real-time ICA according to the related art, there may be a trade-off relationship between the reduction in the “tracking lag” and the cancellation of the “residual sound”.