1. Field of the Invention
The present invention relates to a speech detection apparatus for deciding whether an input signal is speech or nonspeech under noisy conditions in a real life environment, in which speech is mixed with various stationary and/or nonstationary noises. More particularly, the present invention relates to a speech detection method and a speech detection apparatus used for detecting a speech period in a video conference system, an audio reproduction system of a television or audio equipment, a speech recognition device, or the like.
2. Description of the Related Art
Recently, digital signal processing techniques have been widely used in various fields of electrical equipment. For example, in the field of data transmission equipment, techniques and devices for transmitting image data together with speech data, thereby achieving communication with a sense of presence, are now under development. Videophone and video conference systems are typical applications of such techniques, and in them a TV screen plays an important role. In particular, a video conference system in which many people may converse requires a technique for correctly responding to the voice of a speaker and properly switching the TV screen so as to display the current speaker.
Furthermore, in the audio reproduction system of a television or audio equipment, techniques are under development for adding reverberation and/or reflection to a reproduced sound so that a listener may enjoy a sense of presence. When a broad-band signal or a stereo signal of musical sound or the like is reproduced, artificial sounds such as a reverberation sound or a reflection sound may be added to the signal to produce a desirable effect. However, when a speech signal or a monaural signal is reproduced, these artificial sounds do not necessarily produce the intended effect. In some cases, the articulation score of the signal may be degraded. Accordingly, in order to perform effective audio reproduction by adding the artificial sounds only to nonspeech signals such as a music signal, it is necessary to determine whether the input audio signal is a speech signal or a nonspeech signal.
Moreover, in a system for performing speech recognition or the like, when a noise which is not speech is input and erroneously judged as speech, an erroneous recognition may result. Accordingly, such a system requires a speech detection apparatus capable of correctly deciding whether or not an input signal is a speech signal.
Conventionally, speech detection is performed mainly based on the power of the input signal; a portion having a power value larger than a predetermined threshold value is judged to be a speech signal. This method is quite commonly used because of the simplicity of its processing. However, in a real life environment with various noises, there is a high probability that a nonspeech sound having a power larger than the threshold will be input. Accordingly, speech detection based on the single feature of power may often result in an erroneous decision.
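The power-based decision described above can be illustrated by the following minimal sketch. It is not taken from any cited reference; the function names, the frame length of 160 samples, and the threshold of -30 dB are assumptions chosen only for illustration.

```python
import math

def frame_power_db(frame):
    """Mean power of one frame in dB, relative to a full-scale amplitude of 1.0."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(mean_square + 1e-12)  # small offset avoids log(0)

def detect_by_power(samples, frame_len=160, threshold_db=-30.0):
    """Label each non-overlapping frame True (speech) when its power
    exceeds the threshold. frame_len and threshold_db are assumed values."""
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        decisions.append(frame_power_db(frame) > threshold_db)
    return decisions

# A quiet background frame followed by a loud frame that is NOT speech
# (a pure tone standing in for a loud nonspeech sound).
quiet = [0.001 * math.sin(0.3 * n) for n in range(160)]
loud = [0.5 * math.sin(0.3 * n) for n in range(160)]
labels = detect_by_power(quiet + loud)
```

In this sketch the loud tone is labeled as speech solely because its power exceeds the threshold, which is the kind of erroneous decision noted above.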
Several methods have been proposed for deciding whether the input signal is speech or nonspeech by using, in addition to the power, a plurality of parameters (characteristic quantities) indicative of speech properties. Such a method is described, e.g., in H. Kobatake, K. Tawa, A. Ishisda, "Speech/Nonspeech Discrimination for Speech Recognition System Under Real Life Noise Environments," Proc. ICASSP, 89, 365-368 (1989). For speech/nonspeech discrimination in a real life environment, this method uses acoustic parameters effective for discriminating between speech sounds and the various nonstationary noises which occur in a laboratory or an office in daily life. Specifically, the speech/nonspeech discrimination is performed by detecting portions considered to be vowels in a large-powered part of the signal, and is based on the occupation ratio of the vowel portions to that large-powered part. Five audio parameters are adopted for this discrimination, i.e., periodicity, pitch frequency, optimum order of linear prediction, distance between five vowels, and sharpness of formants. An upper or lower threshold value is set for each of the parameters. The five parameters are then derived from an input signal, and the speech/nonspeech discrimination is performed based on the relationship between the derived parameters and the respective upper or lower threshold values. However, because the computation for deriving the parameters and comparing each of them with its upper or lower threshold is very complicated, this method is time-consuming and thus has disadvantages as a practical method. Furthermore, this method is strongly affected by the variance of the parameters caused by the addition of a stationary noise or the like.
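The threshold comparison and the occupation ratio described above may be sketched as follows. This is only a hedged illustration, not the implementation of the cited method: the parameter names, all threshold values, the rule that a frame is vowel-like only when every parameter passes its threshold, and the minimum ratio of 0.3 are assumptions. The five parameters are assumed to have been derived already for each large-powered frame.

```python
# Hypothetical one-sided thresholds: ("lower", v) means value must be >= v,
# ("upper", v) means value must be <= v. All values are illustrative only.
THRESHOLDS = {
    "periodicity": ("lower", 0.5),
    "pitch_hz": ("upper", 400.0),
    "lpc_order": ("upper", 12),
    "vowel_distance": ("upper", 1.0),
    "formant_sharpness": ("lower", 0.3),
}

def is_vowel_like(params, thresholds=THRESHOLDS):
    """Assumed rule: a frame is vowel-like only if all five parameters
    satisfy their respective upper or lower threshold."""
    for name, (kind, limit) in thresholds.items():
        value = params[name]
        if kind == "lower" and value < limit:
            return False
        if kind == "upper" and value > limit:
            return False
    return True

def is_speech(large_power_frames, min_vowel_ratio=0.3):
    """Decide speech when the occupation ratio of vowel-like frames among
    the large-powered frames reaches min_vowel_ratio (an assumed value)."""
    if not large_power_frames:
        return False
    vowel_count = sum(1 for p in large_power_frames if is_vowel_like(p))
    return vowel_count / len(large_power_frames) >= min_vowel_ratio

# Made-up parameter sets for one vowel-like frame and one noise-like frame.
vowel = {"periodicity": 0.9, "pitch_hz": 120.0, "lpc_order": 10,
         "vowel_distance": 0.4, "formant_sharpness": 0.8}
noise = {"periodicity": 0.1, "pitch_hz": 900.0, "lpc_order": 20,
         "vowel_distance": 3.0, "formant_sharpness": 0.05}
```

With these made-up values, `is_speech([vowel, vowel, noise])` yields a ratio of 2/3 and is judged as speech, while `is_speech([noise, noise, noise])` is judged as nonspeech.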
In addition, a method for voiced/unvoiced speech decision has been proposed, though it is not a method for speech/nonspeech (noise) discrimination. For example, such a method is described in B. S. Atal, L. R. Rabiner, "A Pattern Recognition Approach to Voiced-Unvoiced-Silence Classification with Applications to Speech Recognition," IEEE Trans. Acoust., Speech, Signal Processing, ASSP-24-3 (1976). In this method, five parameters are used: the log energy of the signal, the zero-crossing rate of the signal, the autocorrelation coefficient at unit sample delay, the first predictor coefficient, and the log energy of the prediction error. A normal distribution is assumed for each of the parameters, and the voiced-unvoiced-silence discrimination is performed by using joint (simultaneous) probabilities. However, although this method is effective for noises whose energy predominates in the high-frequency region, the discrimination is not correctly performed for stationary noises or for noises whose energy predominates in the low-frequency region.
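The classification by normal distributions described above may be sketched as follows. This is a simplified illustration under stated assumptions, not the cited method itself: the parameters are treated as statistically independent (a diagonal covariance), so the joint probability is the product of the per-parameter normal densities; only three of the five parameters are used; and the training vectors are made-up toy values.

```python
import math

def fit_gaussian(vectors):
    """Estimate a per-parameter mean and variance for one class."""
    n = len(vectors)
    dim = len(vectors[0])
    means = [sum(v[i] for v in vectors) / n for i in range(dim)]
    variances = [
        max(sum((v[i] - means[i]) ** 2 for v in vectors) / n, 1e-6)  # floor avoids zero variance
        for i in range(dim)
    ]
    return means, variances

def log_likelihood(x, model):
    """Log of the joint normal density, assuming independent parameters."""
    means, variances = model
    return sum(
        -0.5 * math.log(2.0 * math.pi * var) - (xi - m) ** 2 / (2.0 * var)
        for xi, m, var in zip(x, means, variances)
    )

def classify(x, models):
    """Return the class label whose model gives the highest joint likelihood."""
    return max(models, key=lambda label: log_likelihood(x, models[label]))

# Toy training vectors: (log energy in dB, zero-crossing rate,
# autocorrelation coefficient at unit sample delay). Values are made up.
training = {
    "voiced": [(-10, 0.05, 0.95), (-12, 0.08, 0.90), (-8, 0.06, 0.92)],
    "unvoiced": [(-30, 0.45, 0.10), (-28, 0.50, 0.20), (-32, 0.40, 0.15)],
    "silence": [(-60, 0.25, 0.50), (-62, 0.30, 0.40), (-58, 0.20, 0.60)],
}
models = {label: fit_gaussian(vectors) for label, vectors in training.items()}
```

For example, `classify((-11, 0.07, 0.93), models)` selects the voiced class with these toy models. A fuller treatment would use a full covariance matrix and all five parameters.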