The present invention relates to a noise segment/speech segment determination apparatus to be used with a speech device, such as a portable cellular phone or a mobile phone, which determines whether a signal of an acquired segment includes only noise or both noise and a speech signal. More particularly, the noise segment/speech segment determination apparatus is constructed so as to be able to determine, with a high level of reliability, whether an acquired segment is a noise segment or a speech segment.
In recent years, apparatuses capable of taking speech as input information have come into use under various circumstances. For this reason, the ability to use such an apparatus under the influence of noise has become important. Portable cellular phones and mobile phones are examples of such an apparatus. Thanks to progress in IC technology, noise suppressors have been adopted which employ fairly high-level digital signal processing techniques by use of a digital signal processor (DSP).
Such a noise suppressor is used in conjunction with a device for determining whether a signal of a captured segment corresponds to a noise-only segment or to a speech signal segment. The accuracy of this device greatly affects the performance of the noise suppressor. A noise segment/speech segment determination device employed in a conventional noise suppressor will be described by reference to the accompanying drawings.
FIG. 21 is a block diagram showing a noise suppressor having a related-art noise segment/speech segment determination device. A noise segment/speech segment determination device 1100 enclosed by dotted lines in FIG. 21 comprises an analog-to-digital conversion section 1101; an extraction section 1102; and a noise segment/speech segment determination section 1103. Further, the noise segment/speech segment determination device 1100 has an input terminal 1 for receiving an analog speech signal including noise, a speech segment determination output terminal 2, and a noise segment determination output terminal 3. The noise suppressor is constructed such that a signal output from the extraction section 1102, a signal output from the speech segment determination output terminal 2, and a signal output from the noise segment determination output terminal 3 are delivered to a noise suppression device 1104.
The noise segment/speech segment determination devices 1100 of the first to third related-art examples, as used with the noise suppression device 1104, are now described by reference to FIG. 21.
An analog speech signal, which has been converted into an electric signal by means of an unillustrated microphone and includes ambient noise, is input to the noise segment/speech segment determination device 1100 via the input terminal 1. The analog speech signal is converted into a digital signal by means of the analog-to-digital conversion section 1101. The extraction section 1102 takes the digital signal into frames of a given interval; e.g., 10 [ms]. The digital signal taken into each frame is input simultaneously to the noise segment/speech segment determination section 1103 and to the noise suppression device 1104.
The noise segment/speech segment determination section 1103 determines whether the input signal corresponds to a noise-only signal segment or a noise-including speech signal segment, and outputs a result of determination to the noise suppression device 1104. On the basis of a determination result signal output from the noise segment/speech segment determination section 1103, the noise suppression device 1104 processes a signal delivered from the extraction section 1102, thereby outputting a noise-suppressed speech signal.
Related-art technologies pertaining to a determination operation to be performed by the noise segment/speech segment determination section 1103 will now be described. A first example of related-art technology will be described. In relation to a speech signal which is input to the noise segment/speech segment determination section 1103 and includes ambient noise, a signal segment which includes no speech signal and only noise should be lower in level than a signal segment including a speech signal. Accordingly, mean power of each frame of an input signal is compared with a predetermined threshold value. If the power exceeds the threshold value, the frame can be determined to be a noise-including speech signal segment. In contrast, if the power does not exceed the threshold value, the frame can be determined to be a noise segment.
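The fixed-threshold decision of the first example can be sketched as follows. This is a minimal illustration, not the related-art implementation itself: the function names, the 80-sample frame length, the sample values, and the threshold value are all assumptions chosen for the example.

```python
def mean_power(frame):
    """Mean power of one frame: the average of the squared sample values."""
    return sum(x * x for x in frame) / len(frame)


def classify_frame(frame, threshold):
    """Return 'speech' if the frame's mean power exceeds the threshold,
    otherwise 'noise'. The threshold is fixed in advance (first example)."""
    return "speech" if mean_power(frame) > threshold else "noise"


# Illustrative frames only; the values are made up for the example.
quiet_frame = [0.01, -0.01] * 40   # low-level, noise-like frame
loud_frame = [0.5, -0.5] * 40      # higher-level frame

print(classify_frame(quiet_frame, threshold=0.01))
print(classify_frame(loud_frame, threshold=0.01))
```

Because the threshold is fixed, any rise in ambient noise above it causes every frame to be labeled as speech, which is the weakness the later examples attempt to address.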
A second example of related-art technology will next be described. The second example is a method of changing the threshold value used for determination so as to follow changes in ambient noise. For instance, one frame takes an interval of 10 [ms], and the mean power of each frame is measured. The measurement is continued over a period of, e.g., five seconds, and the minimum mean power observed in that period is taken as the threshold value for determining a noise segment/speech segment over the next five seconds. In this case, the threshold value for determination can be changed every five seconds. The translated versions of Japanese Patent Publication Nos. H3-500347 and H10-513030 describe a method of changing a threshold value for determining a noise segment and a speech segment so as to follow changes in ambient noise.
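The threshold update of the second example might be sketched as below. The text specifies only that the minimum mean power of one five-second period becomes the threshold for the next; the initial threshold for the very first period and the optional `margin` multiplier are hypothetical parameters added so that the sketch is runnable, and the function name is illustrative.

```python
def adaptive_decisions(frame_powers, frames_per_window=500,
                       initial_threshold=0.0, margin=1.0):
    """Classify frames from their mean powers.

    frames_per_window=500 corresponds to five seconds of 10 ms frames.
    The threshold used in each window is `margin` times the minimum frame
    power seen in the previous window. `initial_threshold` and `margin`
    are assumptions; the source text does not specify them.
    """
    decisions = []
    threshold = initial_threshold
    for start in range(0, len(frame_powers), frames_per_window):
        window = frame_powers[start:start + frames_per_window]
        for power in window:
            decisions.append("speech" if power > threshold else "noise")
        # The minimum of this window becomes the basis for the next one.
        threshold = margin * min(window)
    return decisions


# Tiny illustrative run with three frames per window instead of 500.
powers = [1.0, 2.0, 3.0, 1.5, 4.0, 0.5]
print(adaptive_decisions(powers, frames_per_window=3,
                         initial_threshold=2.5, margin=2.0))
```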
Next will be described a third example of related-art technology; that is, a known technique of using the “number of short-time zero crossings” described in Japanese Patent Publication No. H8-294197. As shown in FIG. 21, a speech signal including ambient noise is converted into a digital signal by means of the analog-to-digital conversion section 1101. The number of times consecutive sample values of the digital signal output change from positive to negative or vice versa is accumulated for a certain period of time. If the sample values include speech, the accumulated value becomes higher than that obtained by counting noise-only sample values. The accumulated value is compared with a predetermined threshold value. If the accumulated value is greater than the threshold value, the corresponding segment can be determined to be a speech signal segment. If the accumulated value is lower than the threshold value, the corresponding segment can be determined to be a noise segment. The first certain period of time at the beginning of communication is deemed to be a period during which a user has not yet uttered speech and only ambient noise is present. The accumulated value of that period is taken as the accumulated value of the noise segment. A subsequent period is taken as a speech period only when its accumulated value is greater than five times the accumulated value of the first period.
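The zero-crossing count and the five-times rule of the third example can be illustrated with a short sketch. The function names are hypothetical, and treating a sample value of exactly zero as positive is an assumption, since the text does not say how zero-valued samples are counted.

```python
def zero_crossings(samples):
    """Count sign changes between consecutive samples.
    A sample equal to zero is treated as positive (an assumption)."""
    count = 0
    for prev, cur in zip(samples, samples[1:]):
        if (prev >= 0) != (cur >= 0):
            count += 1
    return count


def classify_periods(periods, factor=5):
    """The first period is deemed noise-only and gives the reference count.
    A later period is labeled 'speech' only if its count is greater than
    `factor` times the reference count; otherwise it is labeled 'noise'."""
    reference = zero_crossings(periods[0])
    labels = ["noise"]
    for period in periods[1:]:
        is_speech = zero_crossings(period) > factor * reference
        labels.append("speech" if is_speech else "noise")
    return labels


# Made-up periods of 12 samples each.
first = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1]   # 1 crossing
busy = [1, -1] * 6                                      # 11 crossings
calm = [1, 1, -1, -1, 1, 1, -1, -1, 1, 1, -1, -1]       # 5 crossings
print(classify_periods([first, busy, calm]))
```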
A method described in Japanese Patent Publication No. Sho 58-143394 will now be described as a fourth related-art example. The first and second related-art examples utilize the phenomenon that the mean level of a speech segment is greater than that of a noise segment. If ambient noise rises to the same level as that of the speech signal, distinguishing between a speech segment and a noise segment becomes difficult. In contrast, the fourth method enables a distinction to be made between a noise segment and a speech segment regardless of the magnitude of ambient noise. The outline of the method will be described hereinbelow.
First, speech comprises voiced sounds and voiceless sounds. The voiced sounds correspond to ordinary vowel and consonant sounds, and the voiceless sounds correspond to fricative sounds and plosives. The voiced sounds are considered to take, as a sound source, an iterative pulse train of given cycle called a pitch and the voiceless sounds are considered to take, as a sound source, a random pulse train. Further, the pulse trains are considered to be uttered from the mouth as speech via the vocal tract. The method determines an input signal of a certain segment as a voiced sound segment, a voiceless sound segment, or a noise segment regardless of a mean power level of the segment. The method will further be described by reference to FIG. 22.
As shown in FIG. 22, a fourth related-art noise segment/speech segment determination device comprises the analog-to-digital conversion section 1101; the extraction section 1102; an autocorrelation function computation section 1201; a linear prediction section 1202; a normalized residual correlation function computation section 1203; a normalized power rating computation section 1204; and a noise segment/speech segment determination section 1205. The analog-to-digital conversion section 1101 and the extraction section 1102 are the same as those described in connection with FIG. 21. Further, the noise segment/speech segment determination section 1205 has the speech segment determination output terminal 2 and the noise segment determination output terminal 3, in the same manner as described in connection with FIG. 21. Hence, repeated explanations thereof are omitted.
A speech signal input including ambient noise is converted into a digital signal by means of the analog-to-digital conversion section 1101. The extraction section 1102 takes the thus-converted digital signal into a frame having an interval of, e.g., 10 [ms]. Given that the sampling frequency is 8 [kHz], 80 samples are taken. The signal is input to the autocorrelation function computation section 1201, and there is obtained an autocorrelation function up to an analysis order of “p”; that is, R(0), R(1), . . . R(p). In the case of an ordinary speech signal, the analysis order “p” assumes a value of about 10. Provided that a sample value of an input signal is represented as s(n), formula (1) holds, as follows.
$$R(j) = \frac{1}{80}\sum_{n=0}^{79} s(n)\,s(n-j) \tag{1}$$
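Formula (1) can be evaluated with a few lines of code. The following sketch is illustrative only: the names are hypothetical, and samples s(n−j) that fall before the start of the frame are treated as zero, which is an assumption because the text does not say how they are obtained.

```python
SAMPLE_RATE_HZ = 8000
FRAME_MS = 10
# 8 kHz sampling and a 10 ms frame give 80 samples per frame.
FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000


def autocorrelation(frame, order):
    """Return [R(0), R(1), ..., R(order)] following formula (1).

    The sum is divided by the frame length (80 in the text). Samples
    before the start of the frame are treated as zero (an assumption).
    """
    n_samples = len(frame)
    r = []
    for j in range(order + 1):
        total = sum(frame[n] * frame[n - j] for n in range(j, n_samples))
        r.append(total / n_samples)
    return r


# Tiny four-sample frame, used only to keep the arithmetic easy to follow.
print(FRAME_LEN)
print(autocorrelation([1.0, 2.0, 3.0, 4.0], order=2))
```

For an actual frame, `autocorrelation(frame, 10)` would correspond to the analysis order of about 10 mentioned above.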
The autocorrelation function R(0), R(1), . . . R(p) is input to the linear prediction section 1202. The linear prediction section 1202 linearly predicts an input signal in the following manner, through use of values of the autocorrelation function. Since an acquired speech signal has a degree of redundancy, a present sample can be predicted from samples taken in the past. However, perfect prediction of a present sample is impossible, and hence an error remains. A predicted value s′(n) is expressed by the following formula (2).
$$s'(n) = -\sum_{j=1}^{p} a_j\, s(n-j) \tag{2}$$
The prediction uses data up to “p” samples in the past. A prediction error e(n) is expressed by the following formula (3).
$$e(n) = s(n) - s'(n) = \sum_{j=0}^{p} a_j\, s(n-j) \tag{3}$$
where a0=1.
Here, a1, a2, . . . ap are selected such that a root mean square (RMS) of formula (3) is minimized.
To this end, values of a1, a2, . . . ap sought by solution of the following formula (4) are employed.
$$
\begin{bmatrix}
R(0) & R(1) & R(2) & \cdots & R(p-1) \\
R(1) & R(0) & R(1) & \cdots & R(p-2) \\
R(2) & R(1) & R(0) & \cdots & R(p-3) \\
R(3) & R(2) & R(1) & \cdots & R(p-4) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
R(p-1) & R(p-2) & R(p-3) & \cdots & R(0)
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ \vdots \\ a_p \end{bmatrix}
=
\begin{bmatrix} R(1) \\ R(2) \\ R(3) \\ R(4) \\ \vdots \\ R(p) \end{bmatrix}
\tag{4}
$$
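Because the matrix in formula (4) is symmetric with constant diagonals, the system can be solved order by order. The sketch below uses the Levinson-Durbin recursion, which the text does not name and which is assumed here as one common choice. The function name and list layout are illustrative. The sketch solves formula (4) exactly as printed; how the resulting values relate in sign to the coefficients of formula (2) is left open. The per-order intermediate values are returned as well, since such quantities are commonly called partial autocorrelation (reflection) coefficients. R(0) is assumed to be greater than zero.

```python
def levinson_durbin(r, order):
    """Solve formula (4) as printed by the Levinson-Durbin recursion.

    r is [R(0), ..., R(order)] with R(0) > 0 (assumed).
    Returns (coeffs, k): coeffs[j-1] is the j-th unknown of formula (4),
    and k[i-1] is the intermediate coefficient produced at order i.
    """
    coeffs = []      # solution for the current order
    k = []           # per-order intermediate coefficients
    error = r[0]     # remaining prediction error power
    for i in range(1, order + 1):
        acc = r[i] - sum(coeffs[j] * r[i - 1 - j] for j in range(i - 1))
        k_i = acc / error
        # Update the lower-order solution, then append the new last entry.
        updated = [coeffs[j] - k_i * coeffs[i - 2 - j] for j in range(i - 1)]
        updated.append(k_i)
        coeffs = updated
        error *= 1.0 - k_i * k_i
        k.append(k_i)
    return coeffs, k


# Made-up autocorrelation values for a second-order illustration.
solution, partials = levinson_durbin([2.0, 1.0, 0.0], order=2)
print(solution, partials)
```

For this input, substituting the returned values back into the two rows of formula (4) reproduces R(1)=1 and R(2)=0 to within rounding.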
A partial autocorrelation function kj (j=1, 2, . . . p) and a normalized residual signal are obtained during the course of seeking the linear prediction coefficients a1, a2, . . . ap. The partial autocorrelation function kj is expressed by the following formulas (5) and (6).
$$k_1 = \frac{R(1)}{R(0)} \tag{5}$$
$$k_2 = \frac{\dfrac{R(2)}{R(0)} - \left(\dfrac{R(1)}{R(0)}\right)^{2}}{1 - \left(\dfrac{R(1)}{R(0)}\right)^{2}} \tag{6}$$
Formulas for the partial autocorrelation functions k3 and beyond are omitted here; they can likewise be expressed through use of R(0), R(1), . . . R(p). As can be seen from formulas (5) and (6), the value of kj is normalized by R(0), which represents mean power, and is therefore independent of the power of the input signal. A normalized residual signal is expressed by formula (7).
$$e_r(n) = \sum_{j=0}^{p} \frac{a_j\, s(n-j)}{\bigl(R(0)\bigr)^{1/2}} \tag{7}$$
where a0=1.
Here, ai (i=1, 2, . . . p) is a linear prediction coefficient and is to be computed by the linear prediction section 1202. To be more precise, the partial autocorrelation function kj (j=1, 2, . . . p) is sought during the course of seeking the linear prediction coefficient ai (i=1, 2, . . . p). The linear prediction coefficient is input to the normalized residual correlation function computation section 1203. The partial autocorrelation function kj (j=1, 2, . . . p) is input to the normalized power rating computation section 1204, and k1 is input to the noise segment/speech segment determination section 1205. The normalized power rating computation section 1204 computes a normalized power rating according to formula (8), and the thus-computed normalized power rating is input to the noise segment/speech segment determination section 1205.
$$E_N = \sum_{j=1}^{p}\left(1 - k_j^{2}\right) \tag{8}$$
where p is the analysis order.
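Formula (8) can be evaluated directly from the partial autocorrelation coefficients. The sketch below follows the formula as printed; the function name and the sample coefficient values are hypothetical.

```python
def normalized_power_rating(k):
    """E_N following formula (8) as printed: the sum over j = 1..p of
    (1 - k_j squared), where k holds [k_1, ..., k_p]."""
    return sum(1.0 - k_j * k_j for k_j in k)


# Made-up coefficients for a second-order illustration (p = 2).
print(normalized_power_rating([0.5, 0.0]))
```

Since each kj depends only on ratios of autocorrelation values, the result does not change when the input signal is scaled in level.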
The normalized residual correlation function computation section 1203 computes an autocorrelation function of the normalized residual signal, expressed by the following formula (9).
$$\Phi(j) = \frac{1}{80}\sum_{n=0}^{79}\bigl[e_r(n)\,e_r(n-j)\bigr] \tag{9}$$
Next, the maximum value φ of Φ(j) computed by formula (9) is selected, and the thus-selected maximum value φ is input to the noise segment/speech segment determination section 1205. The maximum value φ of Φ(j) is expressed by the following formula (10).
$$\phi = \operatorname{Max}\{\Phi(j)\} = \operatorname{Max}\left\{\frac{1}{80}\left[\sum_{n=0}^{79} e_r(n)\,e_r(n-j)\right]\right\} \tag{10}$$
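Formulas (7), (9), and (10) can be chained as in the following sketch. The text does not state the range of j over which the maximum is taken; since Φ(0) is never smaller than any other Φ(j), the sketch assumes that lag 0 is excluded and exposes the search range as the hypothetical parameters `min_lag` and `max_lag`. Samples before the start of the frame are treated as zero, which is also an assumption, and all function names are illustrative.

```python
import math


def normalized_residual(frame, a, r0):
    """e_r(n) following formula (7). a is [a_0, ..., a_p] with a_0 = 1,
    and r0 is R(0). Samples before the frame start are taken as zero."""
    scale = math.sqrt(r0)
    residual = []
    for n in range(len(frame)):
        acc = sum(a[j] * frame[n - j] for j in range(len(a)) if n - j >= 0)
        residual.append(acc / scale)
    return residual


def residual_autocorrelation(e_r, lag):
    """Phi(lag) following formula (9), divided by the frame length."""
    n_samples = len(e_r)
    return sum(e_r[n] * e_r[n - lag] for n in range(lag, n_samples)) / n_samples


def peak_residual_correlation(e_r, min_lag, max_lag):
    """phi following formula (10), searched over min_lag..max_lag
    (the search range is an assumption, not given in the text)."""
    return max(residual_autocorrelation(e_r, j)
               for j in range(min_lag, max_lag + 1))


# Made-up eight-sample frame with a period of four samples.
frame = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
e_r = normalized_residual(frame, a=[1.0, -0.5], r0=0.25)
print(peak_residual_correlation(e_r, min_lag=1, max_lag=4))
```

In this toy frame the peak occurs at a lag of four samples, matching the period of the made-up pulse train.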
The noise segment/speech segment determination section 1205 determines whether a signal of an acquired segment is a noise segment or a speech segment by using the following three parameters, computed as described above, regardless of the mean power level of the segment.
$$k_1 = \frac{R(1)}{R(0)} \tag{5}$$
$$E_N = \sum_{j=1}^{p}\left(1 - k_j^{2}\right) \tag{8}$$
where p is the analysis order.
$$\phi = \operatorname{Max}\{\Phi(j)\} = \operatorname{Max}\left\{\frac{1}{80}\left[\sum_{n=0}^{79} e_r(n)\,e_r(n-j)\right]\right\} \tag{10}$$
For the significance of formulas (5), (8), and (10), refer, if necessary, to “Speech Sound” by Kazuo NAKATA (Corona Publishing Co. Ltd.), 3.2.5 and 3.2.6, Chapter 3, 1977, or “Computer Speech Processing” by AGUI and NAKAJIMA (Sanpo Publication Inc.), Chapter 2, 1980.
FIG. 23 shows details of the decision. In FIG. 23, the horizontal axis represents EN, and the vertical axis represents k1. Where the combination of the values EN and k1 permits a decision, the corresponding region is determined as a voiced sound, a voiceless sound, or noise. Regions which cannot be determined through use of only EN and k1 are provisionally determined as a voiced sound/voiceless sound or a voiced sound/noise, and the value of φ is then used: when φ assumes a value greater than 0.3, a corresponding region is taken as a voiced sound, and when φ assumes a value lower than 0.3, a corresponding region is taken as a voiceless sound or noise.
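The final step of this decision, in which φ resolves a region left ambiguous by EN and k1, might be sketched as follows. The region boundaries of FIG. 23 are not reproduced in the text, so the sketch takes the provisional region label as an input rather than deriving it. The label strings and the function name are hypothetical, and treating φ equal to exactly 0.3 as "not voiced" is an assumption.

```python
def resolve_region(region, phi, threshold=0.3):
    """Resolve a provisional region label using phi.

    region is one of 'voiced', 'voiceless', 'noise' (already decided from
    E_N and k1) or one of the ambiguous labels 'voiced/voiceless' and
    'voiced/noise'. These label strings are illustrative only.
    """
    if region not in ("voiced/voiceless", "voiced/noise"):
        return region                  # already decided without phi
    if phi > threshold:
        return "voiced"
    return region.split("/")[1]        # 'voiceless' or 'noise'


print(resolve_region("voiced/noise", phi=0.5))
print(resolve_region("voiced/voiceless", phi=0.1))
print(resolve_region("noise", phi=0.9))
```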
The noise segment/speech segment determination devices set forth above suffer from the following problems.
(1) The noise segment/speech segment determination devices relating to the first and second related-art examples cannot determine whether a signal of an acquired segment is a noise segment or a speech segment when the noise rises to the same level as that of the speech signal.
(2) The noise segment/speech segment determination device relating to the third related-art example is intended to enable a determination as to whether a signal of an acquired segment is a noise segment or a speech segment, regardless of the noise level. However, in practice, the determination device is influenced by the signal-to-noise ratio of the speech signal, and hence a determination of sufficient accuracy is difficult to obtain.
(3) The noise segment/speech segment determination device relating to the fourth related-art example is intended to enable a determination as to whether a signal of an acquired segment is a noise segment or a speech segment, regardless of the noise level. However, in practice, the reliability of the determination is insufficient owing to variations, and hence an accurate determination as to whether a signal of an acquired segment is a noise segment or a speech segment cannot be made.