Technology known as a voice switch, technology known as a Wiener filter, and the like, are examples of noise suppression technology (see Japanese Patent Application Laid-Open (JP-A) 2006-333215 (Patent Document 1), and Japanese National-Phase Publication 2010-532879 (Patent Document 2)).
A voice switch is technology in which segments (target-sound segments) spoken by a speaker are detected in an input signal using a target-sound segment detection function, any target-sound segments are output unprocessed, and the amplitude is attenuated for any non-target-sound segments. For example, as illustrated in FIG. 12, when an input signal input is received, determination is made as to whether or not the input signal input is a target-sound segment (step S51), a gain VS_GAIN is set to 1.0 if the input signal input is a target-sound segment (step S52), and the gain VS_GAIN is set to a freely chosen positive value α less than 1.0 if the input signal input is a non-target-sound segment (step S53). The product of the input signal input and the gain VS_GAIN is then obtained as an output signal output (step S54).
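The flow of steps S51 to S54 can be sketched in a few lines of Python. This is a minimal illustration only: the function name voice_switch, the is_target flag standing in for the result of step S51, and the default value of alpha are assumptions made for the example and are not taken from the cited documents.

```python
def voice_switch(frame, is_target, alpha=0.1):
    """Apply the voice-switch gain to one segment of samples.

    frame     -- list of samples (the input signal `input`)
    is_target -- True if step S51 judged the segment to be target-sound
    alpha     -- hypothetical attenuation, a positive value less than 1.0
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be a positive value less than 1.0")
    # Steps S52 / S53: unity gain for target-sound, alpha otherwise.
    vs_gain = 1.0 if is_target else alpha
    # Step S54: output = input x VS_GAIN, sample by sample.
    return [vs_gain * x for x in frame]
```

Calling voice_switch(frame, True) returns the samples unchanged, while voice_switch(frame, False, alpha=0.5) halves their amplitude.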
Applying this voice switch technology to audio communications equipment such as a teleconference device or a mobile telephone enables non-target-sound segments (noise) to be suppressed and a desired target-sound to be extracted, thereby enabling an improvement in speech sound quality.
The non-target-sound can be divided into “interfering-sounds” that are human voices not belonging to the speaker, and “background noise” such as office noise or road noise. Although target-sound segments can be accurately determined using ordinary target-sound segment detection functions when the non-target-sound segments are background noise alone, erroneous determination occurs when interfering-sounds are superimposed on background noise, due to the target-sound segment detection function also designating the interfering-sound as target-sound. As a result, interfering-sounds cannot be suppressed by such voice switches, and sufficient speech sound quality is not attained.
This issue is improved by switching a feature value referenced by a target-sound segment detection section from variation in the input signal level employed hitherto, to coherence. Put simply, coherence is a feature value signifying the arrival direction of an input signal. Consider use of a mobile telephone; the speaker's voice (the target-sound) arrives from the front face, and interfering-sounds have a strong tendency to arrive from faces other than the front face, enabling target-sound to be distinguished from interfering-sounds, something that was not hitherto possible, by observing the arrival direction.
FIG. 13 is a block diagram illustrating a configuration of a voice switch when coherence is employed by a target-sound detection function.
A pair of microphones m_1 and m_2 respectively acquire input signals s1(n) and s2(n) through an AD converter, omitted from illustration. Note that n is an index indicating the input sequence of the samples, and is expressed as a positive integer. In the present specification, the lower the value of n, the older the input sample, and the greater the value, the newer the input sample.
An FFT section 10 acquires input signal series s1(n) and s2(n) from the microphones m_1 and m_2, and performs a fast Fourier transform (or a discrete Fourier transform) on the input signals s1 and s2. This enables the input signals s1 and s2 to be expressed in the frequency domain. When performing the fast Fourier transform, analysis frames FRAME 1 (K) and FRAME 2 (K) are formed from a specific number N of samples of the input signals s1(n) and s2(n), and the transform is applied to each frame. An example of configuring the analysis frames FRAME 1 (K) from the input signal s1(n) is represented by Equation (1) below, and similar applies to the analysis frames FRAME 2 (K).
$$
\begin{aligned}
\mathrm{FRAME\,1}(1) &= \{\, s1(1),\ s1(2),\ \dots,\ s1(N) \,\} \\
&\ \ \vdots \\
\mathrm{FRAME\,1}(K) &= \{\, s1(N \times (K-1) + 1),\ s1(N \times (K-1) + 2),\ \dots,\ s1(N \times (K-1) + N) \,\}
\end{aligned}
\tag{1}
$$
Note that K is an index indicating a sequence number for frames, and represents a positive integer. In the present specification, the lower the value of K, the older the analysis frame, and the greater the value, the newer the analysis frame. In the explanation of operation that follows, the index that indicates the latest analysis frame, this being the analysis target, is K unless specifically stated otherwise.
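The frame formation can be illustrated with a short Python sketch, assuming non-overlapping frames of N consecutive samples as described above. The function name make_frames and the choice to return a dictionary keyed by the frame index K are assumptions made for this example. Because Python lists are zero-based, the sample s1(n) is held at list position n−1.

```python
def make_frames(s, n):
    """Split a sample sequence into non-overlapping analysis frames.

    s -- list of samples, where s[0] holds s1(1), s[1] holds s1(2), ...
    n -- number N of samples per analysis frame
    Returns a dict mapping the 1-based frame index K to its N samples.
    Trailing samples that do not fill a whole frame are left out.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    num_frames = len(s) // n
    # FRAME(K) covers s1(N*(K-1)+1) ... s1(N*(K-1)+N),
    # i.e. list positions N*(K-1) ... N*K - 1.
    return {k: s[n * (k - 1): n * k] for k in range(1, num_frames + 1)}
```

With samples numbered 1 to 10 and N = 4, this yields two frames holding samples 1 to 4 and 5 to 8; the remaining two samples wait for the next full frame.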
The FFT section 10 transforms each analysis frame into frequency domain signals X1 (f, K) and X2 (f, K) by performing a fast Fourier transform, and the obtained frequency domain signals X1 (f, K) and X2 (f, K) are provided to a first directionality forming section 11 and to a second directionality forming section 12. Note that f is an index indicating the frequency. Moreover, X1 (f, K) is not a single value, and is composed from plural spectral components of frequencies f1 to fm as expressed by Equation (2). Similar applies to X2 (f, K), and to B1 (f, K) and B2 (f, K), described later.
$$
X1(f, K) = [\, X1(f1, K),\ X1(f2, K),\ \dots,\ X1(fm, K) \,] \tag{2}
$$
In the first directionality forming section 11, a signal B1 (f, K) having strong directionality in a specific direction is formed from the frequency domain signals X1 (f, K) and X2 (f, K). In the second directionality forming section 12, a signal B2 (f, K) having strong directionality in a different specific direction is formed from the frequency domain signals X1 (f, K) and X2 (f, K). An existing method may be applied as the method of forming the signals B1 (f, K) and B2 (f, K) having strong directionality in a specific direction. For example, Equation (3) may be applied to form B1 (f, K) having strong right-direction directionality, and Equation (4) may be applied to form B2 (f, K) having strong left-direction directionality. In Equation (3) and Equation (4), the frame index K has no effect on the computation and is therefore omitted.
$$
B1(f) = X2(f) - X1(f) \times \exp\!\left[ -\frac{i\,2\pi f S}{N}\,\tau \right] \tag{3}
$$
$$
B2(f) = X1(f) - X2(f) \times \exp\!\left[ -\frac{i\,2\pi f S}{N}\,\tau \right] \tag{4}
$$
Wherein:
S: sampling frequency
N: FFT analysis frame length
τ: difference in sound wave arrival time between microphones
i: imaginary unit
f: frequency
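As a rough illustration of Equation (3) and Equation (4), the following Python sketch forms B1 and B2 one frequency component at a time. The function name form_directional, the argument names, and the reading of f as a bin index counted from 0 are assumptions made for this example and do not come from the cited documents.

```python
import cmath


def form_directional(x1, x2, fs, n_fft, tau):
    """Form the directional signals B1(f) and B2(f).

    x1, x2 -- lists of complex spectral components X1(f), X2(f)
    fs     -- sampling frequency S
    n_fft  -- FFT analysis frame length N
    tau    -- arrival time difference between the microphones
    """
    b1, b2 = [], []
    for f in range(len(x1)):
        # Phase rotation equivalent to delaying a signal by tau.
        shift = cmath.exp(-1j * 2.0 * cmath.pi * f * fs * tau / n_fft)
        b1.append(x2[f] - x1[f] * shift)  # Equation (3)
        b2.append(x1[f] - x2[f] * shift)  # Equation (4)
    return b1, b2
```

If the spectrum at m_2 is exactly the spectrum at m_1 delayed by tau, every component of B1 cancels to zero while B2 generally does not, which is the directional null the equations are intended to produce.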
The significance of these equations is explained using FIG. 14A, FIG. 14B, FIG. 15A, and FIG. 15B, using Equation (3) as an example. Consider a sound wave arriving from a direction θ indicated in FIG. 14A and picked up by a pair of microphones m_1 and m_2 positioned a distance l apart. In such an event, a difference arises in the time taken for the sound wave to arrive at the microphones m_1 and m_2. The sound path difference d is d = l × sin θ, and so the arrival time difference τ is given by Equation (5), wherein c is the speed of sound.
$$
\tau = \frac{l \times \sin\theta}{c} \tag{5}
$$
A signal s1 (t−τ), this being the input signal s1 (t) delayed by τ, is identical to the input signal s2 (t). A signal y (t) = s2 (t) − s1 (t−τ), taking the difference between these signals, is accordingly a signal in which sound arriving from the direction θ is eliminated. As a result, the microphone array of m_1 and m_2 has directionality as illustrated in FIG. 14B.
Although a time domain computation is described above, an equivalent computation can be performed in the frequency domain, in which case Equation (3) and Equation (4) above apply. Next, consider as an example an arrival direction θ of ±90°. In this case, the directional signal B1 (f) from the first directionality forming section 11 has strong directionality in the right-direction as illustrated in FIG. 15A, and the directional signal B2 (f) from the second directionality forming section 12 has strong directionality in the left-direction as illustrated in FIG. 15B.
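As a worked example of the arrival time difference, the short Python sketch below evaluates τ from the microphone spacing and the arrival direction. The function name, the units (metres and degrees), and the default speed of sound of 340 m/s are assumptions chosen for illustration.

```python
import math


def arrival_time_difference(spacing_m, theta_deg, c=340.0):
    """Return tau in seconds for a given microphone spacing and direction.

    spacing_m -- distance between microphones m_1 and m_2, in metres
    theta_deg -- arrival direction theta, in degrees
    c         -- speed of sound in m/s (assumed value)
    """
    # Path difference d = spacing x sin(theta); tau = d / c.
    return spacing_m * math.sin(math.radians(theta_deg)) / c
```

Under these assumptions, a spacing of 0.034 m and θ = 90° give a τ of roughly 0.1 ms, whereas θ = 0° (sound from directly in front) gives no arrival time difference at all.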
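The time domain subtraction can be tried out with the following minimal Python sketch. It assumes, purely for simplicity, that the delay τ is a whole number of samples and that samples before the start of the signal are zero; the function name delay_and_subtract is likewise hypothetical.

```python
def delay_and_subtract(s1, s2, delay):
    """Compute y(n) = s2(n) - s1(n - delay) for an integer sample delay.

    s1, s2 -- equal-length lists of samples from microphones m_1 and m_2
    delay  -- tau expressed as a non-negative whole number of samples
    """
    if len(s1) != len(s2):
        raise ValueError("s1 and s2 must have the same length")
    if delay < 0:
        raise ValueError("delay must be non-negative")
    # Samples of s1 before the start of the record are treated as zero.
    return [
        s2[n] - (s1[n - delay] if n >= delay else 0.0)
        for n in range(len(s2))
    ]
```

When s2 is simply s1 delayed by the same number of samples, the output is zero throughout, so sound from that direction is eliminated. When s1 and s2 are identical, as for sound arriving with no time difference, the output is non-zero and the sound passes through.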
The coherence COH is obtained for the directional signals B1 (f) and B2 (f), obtained as described above, by performing a calculation according to Equation (6) and Equation (7) using a coherence calculation section 13. In Equation (6), B2 (f)* is the complex conjugate of B2 (f).
$$
\mathrm{coef}(f) = \frac{\left| B1(f) \cdot B2(f)^{*} \right|}{\frac{1}{2}\left\{ \left| B1(f) \right|^{2} + \left| B2(f) \right|^{2} \right\}} \tag{6}
$$
$$
COH = \sum_{f=0}^{M-1} \mathrm{coef}(f) \,/\, M \tag{7}
$$
In a target-sound segment detection section 14, the coherence COH is compared with a target-sound segment determination threshold value Θ, determination as a target-sound segment is made if the coherence COH is greater than the threshold value Θ, otherwise determination as a non-target-sound segment is made, and the determination results VAD_RES (K) are formed.
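The coherence calculation and the threshold comparison can be sketched together in Python as follows. The function names coherence and detect_target are hypothetical, as is the handling of components where both directional signals are zero. The numerator of Equation (6) is taken as a magnitude so that each coef(f) is a real number, which is an assumed reading, and no particular value of the threshold Θ is implied by the source.

```python
def coherence(b1, b2):
    """Average coef(f) over all M frequency components (Equations (6), (7)).

    b1, b2 -- equal-length lists of complex values B1(f), B2(f)
    """
    if not b1 or len(b1) != len(b2):
        raise ValueError("b1 and b2 must be non-empty and of equal length")
    total = 0.0
    for u, v in zip(b1, b2):
        denom = 0.5 * (abs(u) ** 2 + abs(v) ** 2)
        # Equation (6); a component with no energy contributes 0 (assumption).
        coef = abs(u * v.conjugate()) / denom if denom > 0.0 else 0.0
        total += coef
    return total / len(b1)  # Equation (7), with M = len(b1)


def detect_target(coh, threshold):
    """Return True (target-sound segment) if COH exceeds the threshold."""
    return coh > threshold
```

Identical directional signals give a coherence of 1, the largest possible value, and a frame in which one directional signal is silent gives 0. The per-frame result of detect_target corresponds to VAD_RES (K).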
A brief description follows regarding the reasoning behind detecting target-sound segments using the magnitude of the coherence. The concept of coherence can also be referred to as the correlation between a signal arriving from the right and a signal arriving from the left (Equation (6) above computes correlations for given frequency components, and Equation (7) calculates the average correlation value for all frequency components). It is therefore possible to say that the two directional signals B1 and B2 have little correlation with each other when the coherence COH is small, and, conversely, have high correlation with each other when the coherence COH is large. Input signals having little correlation arise either when the arrival direction is offset greatly to the right or to the left, or when the input is a noise-like signal that has no such offset but clearly has little regularity. Thus it can be said that a segment in which the coherence COH is small is an interfering-sound segment or a background noise segment (a non-target-sound segment). It can also be said that the input signal has arrived from the front face when the coherence COH is large, due to there being no offset in the arrival direction. It is assumed that target-sound will arrive from the front face, meaning that large coherence COH can be said to signify target-sound segments.
A gain controller 15 sets a gain VS_GAIN for target-sound segments to 1.0, and sets a gain VS_GAIN for non-target-sound segments (interfering-sounds, background noise) to a freely selected positive value α less than 1.0. A voice switch gain multiplication section 16 obtains a post-voice switch signal y (n) by multiplying the obtained gain VS_GAIN by an input signal s1 (n).