1. Field of the Invention
The present invention relates to a speech detection apparatus for detecting speech segments in audio signals appearing in such fields as ATM (asynchronous transfer mode) communication, DSI (digital speech interpolation), packet communication and speech recognition.
2. Description of the Background Art
An example of a conventional speech detection apparatus for detecting speech segments in audio signals is shown in FIG. 1.
This speech detection apparatus of FIG. 1 comprises: an input terminal 100 for inputting audio signals; a parameter calculation unit 101 for acoustically analyzing the input audio signals frame by frame to extract parameters, such as energy, zero-crossing rates, auto-correlation coefficients and spectra; a standard speech pattern memory 102 for storing standard speech patterns prepared in advance; a standard noise pattern memory 103 for storing standard noise patterns prepared in advance; a matching unit 104 for judging whether the input frame is speech or noise by comparing the extracted parameters with each of the standard patterns; and an output terminal 105 for outputting a signal which indicates whether the input frame is speech or noise, according to the judgment made by the matching unit 104.
In the speech detection apparatus of FIG. 1, audio signals from the input terminal 100 are acoustically analyzed by the parameter calculation unit 101, and parameters such as energy, zero-crossing rates, auto-correlation coefficients and spectra are extracted frame by frame. Using these parameters, the matching unit 104 decides whether the input frame is speech or noise. A decision algorithm such as the Bayes Linear Classifier can be used in making this decision. The output terminal 105 then outputs the decision made by the matching unit 104.
Another example of a conventional speech detection apparatus for detecting speech segments in audio signals is shown in FIG. 2.
This speech detection apparatus of FIG. 2 uses only energy as the parameter, and comprises: an input terminal 100 for inputting audio signals; an energy calculation unit 106 for calculating the energy P(n) of each input frame; a threshold comparison unit 108 for judging whether the input frame is speech or noise by comparing the calculated energy P(n) of the input frame with a threshold T(n); a threshold updating unit 107 for updating the threshold T(n) to be used by the threshold comparison unit 108; and an output terminal 105 for outputting a signal which indicates that the input frame is speech or noise, according to the judgment made by the threshold comparison unit 108.
In the speech detection apparatus of FIG. 2, for each input frame from the input terminal 100, the energy P(n) is calculated by the energy calculation unit 106.
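The per-frame energy calculation can be illustrated with a minimal Python sketch. The text does not define how P(n) is computed, so the sum of squared samples over non-overlapping frames is used here as an assumption; the function name `frame_energies` and the parameter `frame_len` are hypothetical.

```python
def frame_energies(samples, frame_len):
    """Return a list of energies P(n), one per complete frame.

    Assumption: energy is the sum of squared sample values, and frames
    do not overlap. Trailing samples that do not fill a frame are dropped.
    """
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame))
    return energies


if __name__ == "__main__":
    # Two complete frames of length 2; the fifth sample is ignored.
    print(frame_energies([1, 2, 3, 4, 5], 2))
```

Other definitions, such as mean energy or log energy, would fit the description equally well.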
Then, the threshold updating unit 107 updates the threshold T(n) to be used by the threshold comparison unit 108, as follows. When the calculated energy P(n) and the current threshold T(n) satisfy the following relation (1):
$$P(n) < T(n) - P(n) \times (\alpha - 1) \tag{1}$$
where $\alpha$ is a constant and n is a sequential frame number, then the threshold T(n) is updated to a new threshold T(n+1) according to the following expression (2):
$$T(n+1) = P(n) \times \alpha \tag{2}$$
On the other hand, when the calculated energy P(n) and the current threshold T(n) satisfy the following relation (3):
$$P(n) \geq T(n) - P(n) \times (\alpha - 1) \tag{3}$$
then the threshold T(n) is updated to a new threshold T(n+1) according to the following expression (4):
$$T(n+1) = T(n) \times \gamma \tag{4}$$
where $\gamma$ is a constant.
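As an illustration only, the update rule of relations (1) to (4) can be sketched in Python as follows. The function name and the numeric values chosen for the constants in the usage lines are hypothetical; the text does not give values for them.

```python
def update_threshold_multiplicative(p, t, alpha, gamma):
    """Return T(n+1) given the frame energy P(n) = p and threshold T(n) = t."""
    if p < t - p * (alpha - 1):
        # Relation (1) holds: apply expression (2).
        return p * alpha
    # Relation (3) holds: apply expression (4).
    return t * gamma


if __name__ == "__main__":
    # Hypothetical constants, chosen only to exercise both branches.
    alpha, gamma = 2.0, 1.01
    print(update_threshold_multiplicative(1.0, 10.0, alpha, gamma))  # branch (2)
    print(update_threshold_multiplicative(8.0, 10.0, alpha, gamma))  # branch (4)
```

Note that relation (1) is algebraically equivalent to P(n) × α < T(n), so the threshold drops to P(n) × α whenever that value lies below the current threshold.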
Alternatively, the threshold updating unit 107 may update the threshold T(n) to be used by the threshold comparison unit 108 as follows. That is, when the calculated energy P(n) and the current threshold T(n) satisfy the following relation (5):
$$P(n) < T(n) - \alpha \tag{5}$$
where $\alpha$ is a constant, then the threshold T(n) is updated to a new threshold T(n+1) according to the following expression (6):
$$T(n+1) = P(n) + \alpha \tag{6}$$
and when the calculated energy P(n) and the current threshold T(n) satisfy the following relation (7):
$$P(n) \geq T(n) - \alpha \tag{7}$$
then the threshold T(n) is updated to a new threshold T(n+1) according to the following expression (8):
$$T(n+1) = T(n) + \gamma \tag{8}$$
where $\gamma$ is a small constant.
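The alternative update rule of relations (5) to (8) can be sketched in the same way. Again, the function name and the numeric values in the usage lines are hypothetical and serve only to exercise both branches.

```python
def update_threshold_additive(p, t, alpha, gamma):
    """Return T(n+1) given the frame energy P(n) = p and threshold T(n) = t."""
    if p < t - alpha:
        # Relation (5) holds: apply expression (6).
        return p + alpha
    # Relation (7) holds: apply expression (8).
    return t + gamma


if __name__ == "__main__":
    # Hypothetical constants: alpha is an offset, gamma a small increment.
    alpha, gamma = 3.0, 0.1
    print(update_threshold_additive(1.0, 10.0, alpha, gamma))  # branch (6)
    print(update_threshold_additive(8.0, 10.0, alpha, gamma))  # branch (8)
```

In this form the threshold drops to P(n) + α when the energy falls more than α below it, and otherwise creeps upward by γ per frame.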
Then, at the threshold comparison unit 108, the input frame is recognized as a speech segment if the energy P(n) is greater than the current threshold T(n). Otherwise, the input frame is recognized as a noise segment. The result of this recognition obtained by the threshold comparison unit 108 is then output from the output terminal 105.
Now, such a conventional speech detection apparatus has the following problems. Namely, under heavy background noise or in a low speech energy environment, the parameters of speech segments are affected by the background noise. In particular, some consonants are severely affected because their energies are lower than the energy of the background noise. Thus, in such circumstances, it is difficult to judge whether the input frame is speech or noise, and discrimination errors frequently occur.