A number of techniques have been disclosed for classifying voiced sound intervals from voice signals collected by a plurality of microphones, one of which is recited, for example, in Patent Literature 1.
To correctly determine the voiced sound interval for each of a plurality of microphones, the technique recited in Patent Literature 1 first classifies, on a sound source basis, each observation signal at each time-frequency point after conversion into the frequency domain, and then determines, for each classified observation signal, whether it belongs to a voiced sound interval or a voiceless sound interval.
Shown in FIG. 5 is a diagram of a structure of a voiced sound interval classification device according to such background art as Patent Literature 1. Common voiced sound interval classification devices according to the background art include an observation signal classification unit 501, a signal separation unit 502 and a voiced sound interval determination unit 503.
Shown in FIG. 8 is a flow chart showing operation of a voiced sound interval classification device having such a structure according to the background art.
The voiced sound interval classification device according to the background art first receives input of a multiple microphone voice signal xm (f, t), obtained by applying time-frequency analysis, for each microphone, to the voice observed by M microphones (here, m denotes a microphone number, f denotes a frequency and t denotes time), and a noise power estimate λm (f) for each frequency of each microphone (Step S801).
Next, the observation signal classification unit 501 classifies a sound source with respect to each time frequency to calculate a classification result C (f, t) (Step S802).
Then, the signal separation unit 502 calculates a separation signal yn (f, t) of each sound source by using the classification result C (f, t) and the multiple microphone voice signal (Step S803).
Then, the voiced sound interval determination unit 503 makes determination of voiced sound or voiceless sound with respect to each sound source based on S/N (signal-noise ratio) by using the separation signal yn (f, t) and the noise power estimate λm (f) (Step S804).
Here, as shown in FIG. 6, the observation signal classification unit 501, which includes a voiceless sound determination unit 602 and a classification unit 601, operates as follows. A flow chart illustrating operation of the observation signal classification unit 501 is shown in FIG. 9.
First, an S/N ratio calculation unit 607 of the voiceless sound determination unit 602 receives input of the multiple microphone voice signal xm (f, t) and the noise power estimate λm (f) to calculate an S/N ratio γm (f, t) for each microphone according to an Expression 1 (Step S901).
$$\gamma_m(f,t) = \frac{|x_m(f,t)|^2}{\lambda_m(f)} \qquad \text{(Expression 1)}$$
Next, a nonlinear conversion unit 608 executes nonlinear conversion with respect to the S/N ratio for each microphone according to the following expression to calculate an S/N ratio Gm (f, t) as of after the nonlinear conversion (Step S902).
$$G_m(f,t) = \gamma_m(f,t) - \ln \gamma_m(f,t) - 1$$
Next, a determination unit 609 compares a predetermined threshold value η′ with the S/N ratio Gm (f, t) of each microphone as of after the nonlinear conversion and, when the S/N ratio Gm (f, t) as of after the nonlinear conversion is not more than the threshold value for every microphone, considers the signal at that time-frequency as noise and outputs C (f, t)=0 (Step S903). The classification result C (f, t) is cluster information which assumes a value from 0 to N.
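Steps S901 to S903 can be sketched as follows. This is a minimal illustration, not the implementation of Patent Literature 1: the function names `converted_snr` and `is_noise` are hypothetical, and rejecting a zero S/N ratio (where ln γ is undefined) is an assumption made only for this sketch.

```python
import math

def converted_snr(x_m, lam_m):
    """Return G_m(f,t) = gamma - ln(gamma) - 1 with gamma = |x_m|^2 / lambda_m."""
    gamma = abs(x_m) ** 2 / lam_m          # Expression 1 (Step S901)
    if gamma <= 0.0:
        # Assumption of this sketch: ln(gamma) is undefined for gamma = 0.
        raise ValueError("S/N ratio must be positive")
    return gamma - math.log(gamma) - 1.0   # nonlinear conversion (Step S902)

def is_noise(x, lam, eta_prime):
    """True when G_m <= eta_prime for every microphone, i.e. C(f,t) = 0 (Step S903)."""
    return all(converted_snr(x_m, lam_m) <= eta_prime
               for x_m, lam_m in zip(x, lam))
```

For example, with a hypothetical threshold of 0.5 and unit noise power on both microphones, `is_noise([1.0, 1.0], [1.0, 1.0], 0.5)` treats the time-frequency point as noise, while `is_noise([1.0, 3.0], [1.0, 1.0], 0.5)` does not.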
Next, a normalization unit 603 of the classification unit 601 receives input of the multiple microphone voice signal xm (f, t) to calculate X′(f, t) according to the Expression 2 in an interval not determined to be noise (Step S904).
$$X'(f,t) = \frac{\begin{bmatrix} |x_1(f,t)| \\ \vdots \\ |x_M(f,t)| \end{bmatrix}}{\left\| \begin{bmatrix} |x_1(f,t)| \\ \vdots \\ |x_M(f,t)| \end{bmatrix} \right\|} \qquad \text{(Expression 2)}$$
X′(f, t) is a vector obtained by normalization by a norm of an M-dimensional vector having amplitude absolute values |xm (f, t)| of signals of M microphones.
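The normalization of Step S904 can be sketched as below. The text only says "a norm"; the Euclidean norm used here is an assumption (it is consistent with the later remark that X′ lies on an arc of radius 1), and the function name `normalize_amplitudes` is hypothetical.

```python
import math

def normalize_amplitudes(x):
    """Return X'(f,t): the amplitude vector [|x_1|, ..., |x_M|] scaled to unit length."""
    amps = [abs(v) for v in x]                   # amplitude absolute values
    norm = math.sqrt(sum(a * a for a in amps))   # Euclidean norm (assumed)
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero observation")
    return [a / norm for a in amps]
```

Because `abs` is applied first, complex time-frequency coefficients are accepted directly, and every component of the result is non-negative.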
Subsequently, a likelihood calculation unit 604 calculates, for each of N speakers, a likelihood pn (X′(f, t)), n=1, . . . , N, by using a sound source model expressed by a Gaussian distribution having a mean vector determined in advance and a covariance matrix (Step S905).
Next, a maximum value determination unit 606 outputs n with which the likelihood pn (X′(f, t)) takes the maximum value as C (f, t)=n (Step S906).
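Steps S905 and S906 amount to a maximum-likelihood choice among Gaussian sound source models. The sketch below assumes a diagonal covariance matrix and works with log-likelihoods for numerical convenience; both choices, and the names `log_gaussian_diag` and `classify`, are assumptions of this illustration rather than details given in the text.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance (assumed form)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - mu) ** 2 / v)
               for xi, mu, v in zip(x, mean, var))

def classify(x_norm, models):
    """Return n in 1..N maximizing p_n(X'); models[n-1] is a (mean, var) pair."""
    scores = [log_gaussian_diag(x_norm, mean, var) for mean, var in models]
    return scores.index(max(scores)) + 1   # sources are numbered from 1

# Hypothetical two-source model with means on the two microphone axes.
models = [([1.0, 0.0], [0.05, 0.05]), ([0.0, 1.0], [0.05, 0.05])]
```

With these hypothetical models, a normalized vector close to the second axis, such as `[0.1, 0.995]`, is assigned to source 2.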
Here, although the number of sound sources N and the number of microphones M may differ, n will take any value of 1, . . . , M because any of the microphones is assumed to be located near each of the N speakers as sound sources.
With a Gaussian distribution having a direction of each of M-dimensional coordinate axes as a mean vector as an initial distribution, a model updating unit 605 updates a sound source model by updating a mean vector and a covariance matrix by the use of a signal which is classified into its sound source model by using a speaker estimation result.
The signal separation unit 502 uses the classification result C (f, t) output by the observation signal classification unit 501 to separate the applied multiple microphone voice signal xm (f, t) into a signal yn (f, t) for each sound source according to an Expression 3.
$$y_n(f,t) = \begin{cases} x_{k(n)}(f,t) & \text{if } C(f,t) = n \\ 0 & \text{otherwise} \end{cases} \qquad \text{(Expression 3)}$$
Here, k (n) represents the number of a microphone closest to a sound source n which is calculated from a coordinate axis to which a Gaussian distribution of a sound source model is close.
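Expression 3 can be read as a per-point selection, sketched below for a single time-frequency point. The function name `separate` and the representation of k (n) as a dictionary of 0-based microphone indices are assumptions of this illustration.

```python
def separate(x, c, k, num_sources):
    """Return [y_1, ..., y_N] at one (f,t) point according to Expression 3.

    x: observed values of the M microphones at this point
    c: classification result C(f,t), 0 for noise or a source number 1..N
    k: maps source number n to the 0-based index of its closest microphone
    """
    return [x[k[n]] if c == n else 0.0 for n in range(1, num_sources + 1)]
```

For instance, `separate([0.2, 1.5], 2, {1: 0, 2: 1}, 2)` keeps the observation of the second microphone for source 2 and sets the signal of source 1 to zero; a point classified as noise (C = 0) yields zero for every source.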
The voiced sound interval determination unit 503 operates in the following manner.
The voiced sound interval determination unit 503 first obtains Gn (t) according to an Expression 4 by using the separation signal yn (f, t) calculated by the signal separation unit 502.
$$\gamma_n(f,t) = \frac{|y_n(f,t)|^2}{\lambda_{k(n)}(f)}, \qquad G_n(t) = \frac{1}{|F|} \sum_{f \in F} \left[ \gamma_n(f,t) - \ln \gamma_n(f,t) - 1 \right] \qquad \text{(Expression 4)}$$
Subsequently, the voiced sound interval determination unit 503 compares the calculated Gn (t) and a predetermined threshold value η and when Gn (t) is larger than the threshold value η, determines that time t is within a speech interval of the sound source n and when Gn (t) is not more than η, determines that time t is within a noise interval.
F represents a set of wave numbers to be taken into consideration and |F| represents the number of elements of the set F.
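The determination by Expression 4 and the threshold η can be sketched as follows for one source n and one time t. The names `frame_score` and `is_voiced` are hypothetical. The text does not say how elements of F with yn (f, t) = 0 are treated, where ln γn is undefined; this sketch simply rejects them, which is an assumption made only for illustration.

```python
import math

def frame_score(y_n, lam_k, freqs):
    """Return G_n(t): the mean of gamma - ln(gamma) - 1 over the set F."""
    total = 0.0
    for f in freqs:
        gamma = abs(y_n[f]) ** 2 / lam_k[f]   # gamma_n(f,t) with noise power of microphone k(n)
        if gamma <= 0.0:
            # Assumption of this sketch: zero-valued elements are not handled.
            raise ValueError("gamma_n(f,t) must be positive")
        total += gamma - math.log(gamma) - 1.0
    return total / len(freqs)                 # division by |F|

def is_voiced(y_n, lam_k, freqs, eta):
    """True when G_n(t) > eta (speech interval), False otherwise (noise interval)."""
    return frame_score(y_n, lam_k, freqs) > eta
```

Here `y_n` and `lam_k` are sequences indexed by the elements of `freqs`, which plays the role of F.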
Patent Literature 1: Japanese Patent Laying-Open No. 2008-158035.
Non-Patent Literature 1: P. Fearnhead, “Particle Filters for Mixture Models with an Unknown Number of Components”, Statistics and Computing, vol 14, pp. 11-21, 2004.
Non-Patent Literature 2: B. A. Olshausen and D. J. Field, “Emergence of simple-cell receptive field properties by learning a sparse code for natural images”, Nature vol. 381, pp 607-609, 1996.
By the technique recited in the Patent Literature 1, for sound source classification executed by the observation signal classification unit 501, calculation is made assuming that a normalization vector X′ (f, t) is in a direction of a coordinate axis of a microphone close to a sound source.
In practice, however, voice power varies constantly when, for example, the sound source is a speaker. As a result, the normalization vector X′ (f, t) lies far away from the coordinate axis direction of a microphone even when the sound source position does not shift at all, so that the sound source of an observation signal cannot be classified with enough precision.
Shown in FIG. 7 is a signal observed by two microphones, for example. Assuming now that a speaker close to a microphone number 2 makes a speech, voice power always varies in a space formed of observation signal absolute values of two microphones even if a sound source position has no change, so that the vector will vary on a bold line in FIG. 7.
Here, λ1 (f) and λ2 (f) each represent noise power whose square root is on the order of a minimum amplitude observed in each microphone.
At this time, the normalization vector X′ (f, t) will be a vector constrained on a circular arc with a radius of 1. However, even when the observed amplitude of the microphone number 1 is approximately as small as the noise level and the observed amplitude of the microphone number 2 is in a region sufficiently larger than the noise level (i.e. γ2 (f, t) exceeds a threshold value η′ so that the interval is considered a voiced sound interval), X′ (f, t) will deviate largely from the coordinate axis of the microphone number 2 (i.e. the sound source direction) and fluctuate on the bold line in FIG. 7. This makes classification of the sound source difficult, results in the voice interval of the microphone number 2 being erroneously determined as a voiceless sound, and deteriorates voice interval detection performance.
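The fluctuation described above can be illustrated numerically for two microphones. The amplitudes below are hypothetical values chosen only for illustration, and `angle_from_axis2_deg` is a hypothetical helper; the sketch keeps the amplitude of the microphone number 1 at a fixed noise level and varies only the amplitude of the microphone number 2.

```python
import math

def angle_from_axis2_deg(a1, a2):
    """Angle in degrees between the normalized vector X' and the microphone-2 axis."""
    norm = math.hypot(a1, a2)        # Euclidean norm of the amplitude vector
    x1, x2 = a1 / norm, a2 / norm    # components of X' on the unit arc
    return math.degrees(math.atan2(x1, x2))

noise_amp_1 = 1.0                    # hypothetical: microphone 1 stays at its noise level
for a2 in (2.0, 5.0, 20.0):          # hypothetical amplitudes at microphone 2
    print(a2, round(angle_from_axis2_deg(noise_amp_1, a2), 1))
```

The angle shrinks as the amplitude at the microphone number 2 grows, so a speaker who does not move is still mapped to different points on the unit arc as voice power changes.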
The technique recited in the Patent Literature 1 has another problem that since the number of sound sources is unknown in the observation signal classification unit 501, it is difficult for the likelihood calculation unit 604 to set a sound source model appropriate for sound source classification, so that a classification result will have an error, and as a result, voice interval detection performance will be deteriorated.
In a case, for example, where there are two microphones and three sound sources (speakers) and the third speaker is located near the middle point between the two microphones, sound sources cannot be appropriately classified by a sound source model close to a microphone axis. In addition, it is difficult to prepare a sound source model at an appropriate position apart from a microphone axis without advance knowledge of the number of speakers, so that classification of a sound source of an observation signal is impossible and, as a result, voice interval detection performance will be deteriorated.
When different kinds of microphones are used together without being calibrated, the amplitude value and the noise level vary from microphone to microphone, which increases the deterioration of observation signal classification performance and results in further deterioration of voice interval detection performance.