In fields such as speech processing, a technique for detecting speech periods is often required. Detection of speech periods is generally referred to as VAD (Voice Activity Detection), and is also called speech activity detection or speech endpointing.
Typical cases that require VAD include the following two cases.
The first case is a speech communication system. FIG. 1 shows an example of a speech signal transmission/reception procedure in the speech communication system. Basically, a front-end processing unit 11 performs predetermined front-end processing on a speech signal input on the transmitting side, and an encoder 13 encodes the processed signal. The encoded speech is then sent to the receiving side through a communication line 15. On the receiving side, a decoder 16 decodes the encoded speech and outputs speech. In this way, a speech signal is sent to another place through the communication line 15. The communication line 15, however, has some limitations, resulting from, e.g., high usage charges and limited transmission capacity. A VAD 12 is used to cope with such limitations. The use of the VAD 12 makes it possible to give an instruction to suspend communication while the user is not speaking. As a result, usage charges can be reduced, or another user can utilize the communication line during the suspension. Although not always necessary, the front-end processing units that would precede the VAD 12 and the encoder 13 can be integrated into the single front-end processing unit 11 shared by both, as shown in FIG. 1. With the VAD 12, the encoder 13 itself need not distinguish between speech pauses and long periods of silence.
The second case is an Automatic Speech Recognition (ASR) system. FIG. 2 shows a processing example of an ASR system including a VAD. In FIG. 2, a VAD 22 prevents a speech recognition process in an ASR unit 24 from recognizing background noise as speech. In other words, the VAD 22 has a function of preventing the error of converting noise into a word. Additionally, in a general ASR system that consumes many computer resources, the VAD 22 makes it possible to manage the throughput of the entire system more effectively. This enables, for example, speech control of a portable device. More specifically, the VAD distinguishes between a period during which the user is not speaking and a period during which the user issues a command. As a result, the apparatus can concentrate on other functions while speech recognition is not in progress and concentrate on ASR while the user is speaking. In this example as well, a front-end processing unit 21 on the input side of the VAD 22 and the ASR unit 24 can be shared by the VAD 22 and the ASR unit 24, as shown in FIG. 2. In this example, a speech endpoint detection module 23 uses the VAD signal to distinguish the start and end of an utterance from pauses between words. This is because the ASR unit 24 must accept the entire utterance as speech, without any gaps.
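The role of the endpoint detection module can be illustrated with a minimal sketch. It assumes the VAD delivers one binary decision per frame and that a pause between words is any run of non-speech frames no longer than a fixed limit. The function name find_utterances and the parameter max_pause_frames are hypothetical and are not taken from FIG. 2.

```python
def find_utterances(vad_flags, max_pause_frames=3):
    """Group per-frame VAD decisions into (start, end) utterances.

    Runs of up to max_pause_frames non-speech frames are treated as
    pauses between words and stay inside the utterance. The returned
    end index is exclusive. The pause limit is an illustrative assumption.
    """
    utterances = []
    start = None        # first frame of the utterance in progress
    last_speech = None  # most recent frame flagged as speech
    for i, is_speech in enumerate(vad_flags):
        if is_speech:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech > max_pause_frames:
            # Silence has lasted longer than a word pause: close the utterance.
            utterances.append((start, last_speech + 1))
            start = None
    if start is not None:
        utterances.append((start, last_speech + 1))
    return utterances


flags = [0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0]
print(find_utterances(flags, max_pause_frames=3))
```

With a limit of three frames, the two-frame gap is kept inside the first utterance, so the recognizer would receive that utterance without a gap, while the five-frame silence ends it.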
To detect a speech period at high precision, background noise needs to be taken into consideration. Since background noise varies from moment to moment, the variation must be tracked and reflected in the VAD metric. It is, however, difficult to implement high-precision tracking. Various proposals have conventionally been made to address this. Conventional examples will be described briefly below.
Typical examples of conventional VAD methods include one using a time-domain analysis result such as energy or zero-crossing count. However, a parameter obtained from a time-domain process is susceptible to noise. To cope with this, U.S. Pat. No. 5,692,104 discloses a method of detecting a speech period at high precision on the basis of a frequency-domain analysis.
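As a rough illustration of the time-domain parameters mentioned above, the following sketch computes the mean energy and the zero-crossing count of one frame and applies a bare energy threshold. The function names and the threshold value are assumptions chosen for illustration and do not come from any of the cited references.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_count(frame):
    """Number of sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def is_speech_by_energy(frame, energy_threshold=0.01):
    """Naive decision: any frame louder than the threshold counts as speech.

    The threshold is a hypothetical value. A loud noise frame passes this
    test just as speech does, which is the weakness noted in the text.
    """
    return frame_energy(frame) > energy_threshold


frame = [0.5, -0.5, 0.5, -0.5]
print(frame_energy(frame), zero_crossing_count(frame), is_speech_by_energy(frame))
```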
U.S. Pat. No. 5,432,859 and Jin Yang, “Frequency domain noise suppression approaches in mobile telephone systems”, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume II, pp. 363-366, 1993 relate to a technique for detecting speech while suppressing noise. These references describe that a signal-to-noise ratio (SNR) is a useful VAD metric.
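How an SNR can serve as a VAD metric may be sketched as follows, assuming a per-band power spectrum of the current frame and a separately maintained noise power estimate are available. The function names, the 6 dB threshold, and the numerical floor are hypothetical, and the sketch is not the specific procedure of either reference.

```python
import math


def frame_snr_db(power_spectrum, noise_spectrum, floor=1e-12):
    """Ratio of total frame power to total estimated noise power, in dB."""
    signal = max(sum(power_spectrum), floor)
    noise = max(sum(noise_spectrum), floor)
    return 10.0 * math.log10(signal / noise)


def vad_by_snr(power_spectrum, noise_spectrum, threshold_db=6.0):
    """Flag the frame as speech when its SNR exceeds the threshold.

    threshold_db is an illustrative assumption, not a value from the text.
    """
    return frame_snr_db(power_spectrum, noise_spectrum) > threshold_db


print(frame_snr_db([10.0, 10.0], [1.0, 1.0]))
```

The decision is only as good as the noise estimate, which is why the tracking of background noise matters for this family of methods.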
U.S. Pat. Nos. 5,749,067 and 6,061,647 disclose a VAD technique which continuously updates a noise estimate. A noise estimation unit is controlled by a second, auxiliary VAD.
U.S. Pat. No. 5,963,901 discloses a VAD technique using a sub-decision for each spectral band.
Jongseo Sohn and Wonyong Sung, “A Voice Activity Detector employing soft decision based noise spectrum adaptation”, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 365-368, May 1998 discloses a VAD technique based on a likelihood ratio. In the technique, only speech and noise parameters are used.
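A likelihood-ratio test of this kind can be sketched under the usual assumption that each spectral bin is complex Gaussian, with variance equal to the noise variance when speech is absent and to the sum of the speech and noise variances when speech is present. The per-bin log likelihood ratio is then gamma * xi / (1 + xi) - ln(1 + xi), where gamma is the observed power divided by the noise variance and xi is the speech variance divided by the noise variance. The function name and the zero threshold are assumptions, and the soft-decision noise adaptation described in the reference is omitted.

```python
import math


def mean_log_likelihood_ratio(power_spectrum, speech_var, noise_var):
    """Average per-bin log likelihood ratio of speech-present vs speech-absent.

    Assumes a complex Gaussian model per bin; noise variances must be positive.
    """
    total = 0.0
    for p, s, n in zip(power_spectrum, speech_var, noise_var):
        xi = s / n      # a priori SNR (speech variance over noise variance)
        gamma = p / n   # a posteriori SNR (observed power over noise variance)
        total += gamma * xi / (1.0 + xi) - math.log(1.0 + xi)
    return total / len(power_spectrum)


def vad_by_likelihood_ratio(power_spectrum, speech_var, noise_var, threshold=0.0):
    """Speech is declared when the mean log ratio exceeds a hypothetical threshold."""
    return mean_log_likelihood_ratio(power_spectrum, speech_var, noise_var) > threshold


print(vad_by_likelihood_ratio([10.0], [1.0], [1.0]))
```

Only the speech and noise variances parameterize the test, which matches the remark that no other parameters are used.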
The above-mentioned prior-art techniques have the following problems.
(Problem 1)
In the prior-art techniques described above, there is no method of designating a signal-to-noise ratio between a typical speech signal and background noise. For this reason, certain types of noise may be classified as speech by mistake. One characteristic feature of the present invention is to provide a means for setting a signal-to-noise ratio in advance, thereby formulating the detection by the maximum a posteriori (MAP) method. This makes it possible to reduce the speech detection sensitivity for certain types of noise.
(Problem 2)
The typical prior-art techniques make no assumption about the spectral shape of a speech signal. For this reason, loud noise may be classified as speech by mistake. Another characteristic feature of the present invention is the use of a difference spectral metric to distinguish between certain types of noise (whose spectral shape is flat) and speech (whose spectral shape is not flat).
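The distinction between flat and non-flat spectra can be illustrated with a standard measure, spectral flatness, defined as the geometric mean of the power spectrum divided by its arithmetic mean. This is a stand-in chosen for illustration only. It is not the difference spectral metric referred to above, whose definition is not given in this section, and the function name is hypothetical.

```python
import math


def spectral_flatness(power_spectrum, floor=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum.

    Close to 1.0 for a flat (noise-like) spectrum, close to 0.0 for a
    peaky (speech-like) one. Values are floored to keep the log finite.
    """
    values = [max(p, floor) for p in power_spectrum]
    log_mean = sum(math.log(v) for v in values) / len(values)
    arithmetic_mean = sum(values) / len(values)
    return math.exp(log_mean) / arithmetic_mean


print(spectral_flatness([1.0, 1.0, 1.0, 1.0]))
print(spectral_flatness([100.0, 1.0, 1.0, 1.0]))
```

Because the measure depends on the shape of the spectrum and not on its level, a loud but flat noise frame scores the same as a quiet one.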
(Problem 3)
In the prior-art techniques, only periods during which background noise appears are used to update noise tracking. In such periods, the minimum tracking ratio must be used to track only low-frequency variations at high precision. Since no explicit minimum value is given in the prior art, the MAP method may track high-frequency variations as well. Still another characteristic feature of the present invention is a noise tracking method with a minimum tracking ratio.
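One possible reading of a minimum tracking ratio is sketched below. It assumes the noise estimate is a recursive average of the form rho * previous + (1 - rho) * current, updated only in frames judged to be noise, and that the ratio rho is never allowed to fall below a fixed floor. The names update_noise_estimate, rho, and rho_min, and the value 0.9, are hypothetical and are not taken from the text.

```python
def update_noise_estimate(noise_est, frame_power, is_noise, rho, rho_min=0.9):
    """Recursive noise tracking with a lower bound on the smoothing ratio.

    noise_est and frame_power are per-band power lists of equal length.
    Clamping rho to at least rho_min limits how fast the estimate can move,
    so only slow variations of the background noise are followed.
    """
    if not is_noise:
        # Speech frames leave the noise estimate untouched.
        return list(noise_est)
    rho = max(rho, rho_min)
    return [rho * n + (1.0 - rho) * p for n, p in zip(noise_est, frame_power)]


# A requested ratio of 0.5 is raised to the floor of 0.9.
print(update_noise_estimate([1.0], [2.0], is_noise=True, rho=0.5))
```

Without the floor, a small ratio would let a single loud frame pull the estimate most of the way to the new value, that is, the tracker would follow fast variations as well.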