The present invention relates to a voice detecting method and apparatus used for switching the coding method and the decoding method between a voice section and a non-voice section in a coding device and a decoding device that transmit a voice signal at a low bit rate.
In mobile voice communication such as with a mobile phone, noise exists in the background of conversational voice. However, the bit rate necessary to transmit the background noise in a non-voice section is considered to be lower than that necessary for voice. Accordingly, to improve the use efficiency of the circuit, the voice section is often detected, and a low-bit-rate coding method specific to background noise is used in the non-voice section. For example, in the ITU-T standard G.729 voice coding method, a reduced amount of information on the background noise is transmitted intermittently in the non-voice section. In this case, voice detection must operate correctly so that deterioration of voice quality is avoided and the bit rate is effectively reduced. Conventional voice detecting methods are described, for example, in “A Silence Compression Scheme for G.729 Optimized for Terminals Conforming to ITU-T V.70” (ITU-T Recommendation G.729, Annex B) (referred to as “Literature 1”), in paragraph B.3 (a detailed description of a VAD algorithm) of “A Silence Compression Scheme for standard JT-G729 Optimized for ITU-T Recommendation V.70 Terminals” (Telegraph Telephone Technical Committee Standard JT-G729, Annex B) (referred to as “Literature 2”), and in “ITU-T Recommendation G.729 Annex B: A Silence Compression Scheme for Use with G.729 Optimized for V.70 Digital Simultaneous Voice and Data Applications” (IEEE Communications Magazine, pp. 64–73, September 1997) (referred to as “Literature 3”).
FIG. 6 is a block diagram showing an arrangement example of a conventional voice detecting apparatus. It is assumed that voice is input to this voice detecting apparatus in block units (frames) with a period of Tfr msec (for example, 10 msec). The frame length is assumed to be Lfr samples (for example, 80 samples). The number of samples per frame is determined by the sampling frequency (for example, 8 kHz) of the input voice.
Referring to FIG. 5, each constituent element of the conventional voice detecting apparatus will be explained.
Voice is input from an input terminal 10, and a linear predictive coefficient is input from an input terminal 11. Here, the linear predictive coefficient is obtained by applying linear predictive analysis to the above-described input voice vector in a voice coding device in which the voice detecting apparatus is used. For the linear predictive analysis, a well-known method can be used; for example, Chapter 8 “Linear Predictive Coding of Speech” of “Digital Processing of Speech Signals” (Prentice-Hall, 1978) by L. R. Rabiner, et al. (referred to as “Literature 4”) can be referred to. In addition, when the voice detecting apparatus in accordance with the present invention is realized independently of the voice coding device, the above-described linear predictive analysis is performed in this voice detecting apparatus.
An LSF calculating circuit 1011 receives the linear predictive coefficient via the input terminal 11, and calculates a line spectral frequency (LSF) from the above-described linear predictive coefficient, and outputs the above-described LSF to a first change quantity calculating circuit 1031 and a first moving average calculating circuit 1021. Here, with regard to the calculation of the LSF from the linear predictive coefficient, a well-known method, for example, a method and so forth described in Paragraph 3.2.3 of the Literature 1 are used.
A whole band energy calculating circuit 1012 receives voice (input voice) via the input terminal 10, and calculates a whole band energy of the input voice, and outputs the above-described whole band energy to a second change quantity calculating circuit 1032 and a second moving average calculating circuit 1022. Here, the whole band energy Ef is a logarithm of a normalized zero-degree autocorrelation function R(0), and is represented by the following equation:
$$E_f = 10 \cdot \log_{10}\left[\frac{1}{N} R(0)\right]$$
Also, an autocorrelation coefficient is represented by the following equation:
$$R(k) = \sum_{n=k}^{N-1} s_1(n)\, s_1(n-k)$$
Here, N is a length (analysis window length, for example, 240 samples) of a window of the linear predictive analysis for the input voice, and $s_1(n)$ is the input voice multiplied by the above-described window.
When N > Lfr, the voice that was input in past frames is held so that voice covering the above-described analysis window length is available.
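The autocorrelation and whole band energy equations above can be sketched as follows. This is a minimal illustration, not the circuit of the conventional apparatus: the function names and the `floor` guard against taking the logarithm of zero are assumptions, and the input is assumed to be already multiplied by the analysis window.

```python
import numpy as np

def autocorrelation(s1: np.ndarray, k: int) -> float:
    """R(k) = sum_{n=k}^{N-1} s1(n) * s1(n-k) for a windowed signal s1 of length N."""
    n = len(s1)
    return float(np.dot(s1[k:], s1[: n - k]))

def whole_band_energy(s1: np.ndarray, floor: float = 1e-12) -> float:
    """E_f = 10 * log10(R(0) / N).

    `floor` is an assumption (not in the text) that avoids log10(0) on digital silence.
    """
    n = len(s1)
    return float(10.0 * np.log10(max(autocorrelation(s1, 0) / n, floor)))

# Usage with the example window length of 240 samples.
windowed = np.ones(240)
energy_db = whole_band_energy(windowed)
```

A constant unit-amplitude signal gives R(0) = N, so its whole band energy is 0 dB, which is a convenient sanity check.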
A low band energy calculating circuit 1013 receives voice (input voice) via the input terminal 10, calculates a low band energy of the input voice, and outputs the above-described low band energy to a third change quantity calculating circuit 1033 and a third moving average calculating circuit 1023. Here, the low band energy $E_l$ from 0 to $F_l$ Hz is represented by the following equation:
$$E_l = 10 \cdot \log_{10}\left[\frac{1}{N}\, \hat{h}^{T} \hat{R}\, \hat{h}\right]$$
Here, $\hat{h}$ is an impulse response of an FIR filter whose cutoff frequency is $F_l$ Hz, and $\hat{R}$ is a Toeplitz autocorrelation matrix whose diagonal components are the autocorrelation coefficients $R(k)$.
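As a rough sketch of the low band energy calculation, the quadratic form can be evaluated by building the Toeplitz matrix from the autocorrelation lags. The filter design below (a Hamming-windowed sinc with 13 taps) is purely hypothetical, since the text does not specify the FIR filter; the function names and the `floor` guard are likewise assumptions.

```python
import numpy as np

def lowpass_fir(cutoff_hz: float, fs_hz: float = 8000.0, taps: int = 13) -> np.ndarray:
    """Hypothetical windowed-sinc low-pass filter; the text does not give the design."""
    n = np.arange(taps) - (taps - 1) / 2.0
    fc = 2.0 * cutoff_hz / fs_hz                 # cutoff normalised to the Nyquist frequency
    h = fc * np.sinc(fc * n) * np.hamming(taps)  # ideal response shaped by a Hamming window
    return h / h.sum()                           # normalise to unity gain at DC

def low_band_energy(s1: np.ndarray, h: np.ndarray, floor: float = 1e-12) -> float:
    """E_l = 10 * log10((1/N) * h^T R h), with R[i, j] = R(|i - j|)."""
    n = len(s1)
    lags = np.array([np.dot(s1[k:], s1[: n - k]) for k in range(len(h))])
    idx = np.arange(len(h))
    r_hat = lags[np.abs(idx[:, None] - idx[None, :])]  # Toeplitz autocorrelation matrix
    return float(10.0 * np.log10(max(float(h @ r_hat @ h) / n, floor)))

# Usage: a constant signal keeps its energy, an alternating one is suppressed.
h = lowpass_fir(1000.0)
e_dc = low_band_energy(np.ones(240), h)
e_alt = low_band_energy(np.array([1.0, -1.0] * 120), h)
```

The quadratic form equals the energy of the windowed signal after filtering with the impulse response, so no explicit filtering of the samples is needed once the autocorrelation lags are available.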
A zero cross number calculating circuit 1014 receives voice (input voice) via the input terminal 10, and calculates a zero cross number of an input voice vector, and outputs the above-described zero cross number to a fourth change quantity calculating circuit 1034 and a fourth moving average calculating circuit 1024. Here, the zero cross number Zc is represented by the following equation:
$$Z_c = \frac{1}{2 L_{fr}} \sum_{n=0}^{L_{fr}-1} \left| \operatorname{sgn}[s(n)] - \operatorname{sgn}[s(n-1)] \right|$$
Here, $s(n)$ is the input voice, and sgn[x] is a function which is 1 when x is a positive number and which is 0 when it is a negative number.
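A minimal sketch of the zero cross number calculation follows, using the sign function as defined in the text (1 for positive values, 0 for negative values). Treating an exactly zero sample as non-positive and passing the last sample of the previous frame as `prev_sample` are assumptions made for illustration.

```python
import numpy as np

def zero_cross_number(frame: np.ndarray, prev_sample: float = 0.0) -> float:
    """Z_c = (1 / (2 * Lfr)) * sum_n |sgn[s(n)] - sgn[s(n-1)]| over one frame."""
    s = np.concatenate(([prev_sample], frame))  # s(-1), s(0), ..., s(Lfr - 1)
    sign = (s > 0).astype(float)                # 1 if positive, else 0 (zero handling is an assumption)
    return float(np.abs(np.diff(sign)).sum() / (2 * len(frame)))

# Usage: a signal that changes sign at every sample.
z_alt = zero_cross_number(np.array([1.0, -1.0, 1.0, -1.0]), prev_sample=-1.0)
```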
The first moving average calculating circuit 1021 receives the LSF from the LSF calculating circuit 1011, and calculates an average LSF in the current frame (present frame) from the above-described LSF and an average LSF calculated in the past frames, and outputs it to the first change quantity calculating circuit 1031. Here, if an LSF in the m-th frame is assumed to be $\omega_i^{[m]},\ i = 1, \ldots, P$, an average LSF in the m-th frame, $\bar{\omega}_i^{[m]},\ i = 1, \ldots, P$, is represented by the following equation:
$$\bar{\omega}_i^{[m]} = \beta_{LSF} \cdot \bar{\omega}_i^{[m-1]} + (1 - \beta_{LSF}) \cdot \omega_i^{[m]}, \quad i = 1, \ldots, P$$
Here, P is a linear predictive order (for example, 10), and $\beta_{LSF}$ is a certain constant number (for example, 0.7).
The second moving average calculating circuit 1022 receives the whole band energy from the whole band energy calculating circuit 1012, and calculates an average whole band energy in the current frame from the above-described whole band energy and an average whole band energy calculated in the past frames, and outputs it to the second change quantity calculating circuit 1032. Here, assuming that a whole band energy in the m-th frame is $E_f^{[m]}$, an average whole band energy in the m-th frame, $\bar{E}_f^{[m]}$, is represented by the following equation:
$$\bar{E}_f^{[m]} = \beta_{E_f} \cdot \bar{E}_f^{[m-1]} + (1 - \beta_{E_f}) \cdot E_f^{[m]}$$
Here, $\beta_{E_f}$ is a certain constant number (for example, 0.7).
The third moving average calculating circuit 1023 receives the low band energy from the low band energy calculating circuit 1013, and calculates an average low band energy in the current frame from the above-described low band energy and an average low band energy calculated in the past frames, and outputs it to the third change quantity calculating circuit 1033. Here, assuming that a low band energy in the m-th frame is $E_l^{[m]}$, an average low band energy in the m-th frame, $\bar{E}_l^{[m]}$, is represented by the following equation:
$$\bar{E}_l^{[m]} = \beta_{E_l} \cdot \bar{E}_l^{[m-1]} + (1 - \beta_{E_l}) \cdot E_l^{[m]}$$
Here, $\beta_{E_l}$ is a certain constant number (for example, 0.7).
The fourth moving average calculating circuit 1024 receives the zero cross number from the zero cross number calculating circuit 1014, and calculates an average zero cross number in the current frame from the above-described zero cross number and an average zero cross number calculated in the past frames, and outputs it to the fourth change quantity calculating circuit 1034. Here, assuming that a zero cross number in the m-th frame is $Z_c^{[m]}$, an average zero cross number in the m-th frame, $\bar{Z}_c^{[m]}$, is represented by the following equation:
$$\bar{Z}_c^{[m]} = \beta_{Z_c} \cdot \bar{Z}_c^{[m-1]} + (1 - \beta_{Z_c}) \cdot Z_c^{[m]}$$
Here, $\beta_{Z_c}$ is a certain constant number (for example, 0.7).
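The four moving averages share one recursive form, which can be sketched with a single helper. The function name is hypothetical, and seeding the average with the first frame's value is an assumption, since the text does not state how the average is initialised.

```python
import numpy as np

def update_moving_average(prev_avg, current, beta=0.7):
    """Return beta * prev_avg + (1 - beta) * current.

    Works for scalars (energies, zero cross number) and for arrays (the P LSF values).
    """
    prev_avg = np.asarray(prev_avg, dtype=float)
    current = np.asarray(current, dtype=float)
    return beta * prev_avg + (1.0 - beta) * current

# Usage: smooth a short whole band energy track (values in dB are illustrative).
energies = [40.0, 50.0, 50.0]
avg = energies[0]  # assumed initialisation: seed with the first frame
for e in energies[1:]:
    avg = update_moving_average(avg, e)
```

With the example constant of 0.7, the average moves 30 percent of the way toward each new value, so a sudden change in a parameter is reflected only gradually.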
The first change quantity calculating circuit 1031 receives the LSF $\omega_i^{[m]}$ from the LSF calculating circuit 1011, and receives the average LSF $\bar{\omega}_i^{[m]}$ from the first moving average calculating circuit 1021, and calculates spectral change quantities (first change quantities) from the above-described LSF and the above-described average LSF, and outputs the above-described first change quantities to a voice/non-voice determining circuit 1040. Here, the first change quantities $\Delta S^{[m]}$ in the m-th frame are represented by the following equation:
$$\Delta S^{[m]} = \sum_{i=1}^{P} \left( \omega_i^{[m]} - \bar{\omega}_i^{[m]} \right)^2$$
The second change quantity calculating circuit 1032 receives the whole band energy $E_f^{[m]}$ from the whole band energy calculating circuit 1012, and receives the average whole band energy $\bar{E}_f^{[m]}$ from the second moving average calculating circuit 1022, and calculates whole band energy change quantities (second change quantities) from the above-described whole band energy and the above-described average whole band energy, and outputs the above-described second change quantities to the voice/non-voice determining circuit 1040. Here, the second change quantities $\Delta E_f^{[m]}$ in the m-th frame are represented by the following equation:
$$\Delta E_f^{[m]} = \bar{E}_f^{[m]} - E_f^{[m]}$$
The third change quantity calculating circuit 1033 receives the low band energy $E_l^{[m]}$ from the low band energy calculating circuit 1013, and receives the average low band energy $\bar{E}_l^{[m]}$ from the third moving average calculating circuit 1023, and calculates low band energy change quantities (third change quantities) from the above-described low band energy and the above-described average low band energy, and outputs the above-described third change quantities to the voice/non-voice determining circuit 1040. Here, the third change quantities $\Delta E_l^{[m]}$ in the m-th frame are represented by the following equation:
$$\Delta E_l^{[m]} = \bar{E}_l^{[m]} - E_l^{[m]}$$
The fourth change quantity calculating circuit 1034 receives the zero cross number $Z_c^{[m]}$ from the zero cross number calculating circuit 1014, and receives the average zero cross number $\bar{Z}_c^{[m]}$ from the fourth moving average calculating circuit 1024, and calculates zero cross number change quantities (fourth change quantities) from the above-described zero cross number and the above-described average zero cross number, and outputs the above-described fourth change quantities to the voice/non-voice determining circuit 1040. Here, the fourth change quantities $\Delta Z_c^{[m]}$ in the m-th frame are represented by the following equation:
$$\Delta Z_c^{[m]} = \bar{Z}_c^{[m]} - Z_c^{[m]}$$
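The four change quantities can be gathered into the four-dimensional vector that is passed to the determining circuit. The sketch below is only an illustration under the sign convention used above (average minus current value for the energies and the zero cross number); the function name and argument order are hypothetical.

```python
import numpy as np

def change_quantities(lsf, avg_lsf, e_f, avg_e_f, e_l, avg_e_l, z_c, avg_z_c) -> np.ndarray:
    """Return [delta_S, delta_Ef, delta_El, delta_Zc] for one frame."""
    diff = np.asarray(lsf, dtype=float) - np.asarray(avg_lsf, dtype=float)
    delta_s = float(np.sum(diff ** 2))  # squared LSF distance from the average LSF
    return np.array([delta_s, avg_e_f - e_f, avg_e_l - e_l, avg_z_c - z_c])

# Usage with illustrative values (two LSF components only, for brevity).
vec = change_quantities([0.1, 0.2], [0.0, 0.0], 40.0, 50.0, 35.0, 30.0, 0.1, 0.2)
```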
The voice/non-voice determining circuit 1040 receives the first change quantities from the first change quantity calculating circuit 1031, the second change quantities from the second change quantity calculating circuit 1032, the third change quantities from the third change quantity calculating circuit 1033, and the fourth change quantities from the fourth change quantity calculating circuit 1034. The voice/non-voice determining circuit determines that it is a voice section when the four-dimensional vector consisting of the first, second, third and fourth change quantities exists within a voice region in a four-dimensional space, and otherwise determines that it is a non-voice section. It sets a determination flag to 1 in case of the voice section and to 0 in case of the non-voice section, and outputs the determination flag to a determination value correcting circuit 1050. For the determination of the voice and the non-voice (voice/non-voice determination), for example, the 14 kinds of boundary determination described in Paragraph B.3.5 of Literatures 1 and 2 can be used.
The determination value correcting circuit 1050 receives the determination flag from the voice/non-voice determining circuit 1040, receives the whole band energy from the whole band energy calculating circuit 1012, corrects the determination flag in accordance with a predetermined condition equation, and outputs the corrected determination flag via an output terminal. Here, the correction of the determination flag is conducted as follows. If the previous frame is a voice section (in other words, the determination flag is 1), and if the energy of the current frame exceeds a certain threshold value, the determination flag is set to 1. Also, if two consecutive frames including the previous frame are voice sections, and if the absolute value of the difference between the energy of the current frame and the energy of the previous frame is less than a certain threshold value, the determination flag is set to 1. On the other hand, if the past ten frames are non-voice sections (in other words, the determination flag is 0), and if the difference between the energy of the current frame and the energy of the previous frame is less than a certain threshold value, the determination flag is set to 0. For the correction of the determination flag, for example, a condition equation described in Paragraph B.3.6 of Literatures 1 and 2 can be used.
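The three correction rules can be sketched as below. This is an illustration under several assumptions: the text says only "a certain threshold value", so the three threshold defaults are hypothetical placeholders; the rules are assumed to be applied in the order listed; and the history is assumed to hold the corrected flags of earlier frames, most recent last.

```python
from typing import Sequence

def correct_flag(flag: int, past_flags: Sequence[int], energy: float, prev_energy: float,
                 energy_thr: float = 15.0, hold_thr: float = 3.0,
                 release_thr: float = 3.0) -> int:
    """Correct the raw voice/non-voice flag of the current frame (thresholds hypothetical)."""
    past = list(past_flags)
    # Rule 1: previous frame was voice and the current energy exceeds a threshold.
    if past and past[-1] == 1 and energy > energy_thr:
        flag = 1
    # Rule 2: the two preceding frames were voice and the energy changed little.
    if len(past) >= 2 and past[-1] == 1 and past[-2] == 1 and abs(energy - prev_energy) < hold_thr:
        flag = 1
    # Rule 3: the past ten frames were non-voice and the energy did not rise much.
    if len(past) >= 10 and all(f == 0 for f in past[-10:]) and (energy - prev_energy) < release_thr:
        flag = 0
    return flag

# Usage: a raw non-voice decision right after a voice frame is held as voice.
held = correct_flag(0, [1], energy=20.0, prev_energy=20.0)
```

The first two rules extend a voice section across frames that the raw decision would drop, while the third suppresses isolated voice decisions inside a long non-voice run.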
The above-mentioned conventional voice detecting method has a problem in that a detection error in the voice section (a voice section is erroneously detected as a non-voice section) and a detection error in the non-voice section (a non-voice section is erroneously detected as a voice section) can occur.
The reason is that the voice/non-voice determination is conducted by directly using the change quantities of the spectrum, the change quantities of the energy and the change quantities of the zero cross number. Even when the actual input voice is in a voice section, the value of each of the above-described change quantities fluctuates widely, so it does not always fall within the value range predetermined for the voice section. Accordingly, the above-described detection error in the voice section occurs. The same applies to the non-voice section.