1. Field of the Invention
The invention relates to a voice detector for detecting the presence/absence of a speech element in a voice signal, and more specifically to a detector which is adapted for use with a telephone, a navigation system, voice recognition equipment, a radio device or recording equipment, and which has a function to change a procedure according to the presence/absence of the speech element.
2. Description of the Background Art
A first conventional voice detector calculates a long-term weighted average value and a short-term weighted average value of a voice signal level, and holds a fixed offset, e.g., 6 dB, relative to the calculated long-term weighted average value, which changes smoothly. If the short-term weighted average value exceeds a threshold value equal to the sum of the long-term weighted average value and the offset, the detector identifies the voice signal as a voiced element.
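The decision rule of the first conventional detector can be illustrated with a minimal Python sketch. The exponential form of the weighting, the smoothing constants `alpha_long` and `alpha_short`, and the function name are assumptions made for illustration only; the description above specifies only that two weighted averages and a fixed offset such as 6 dB are used.

```python
def first_conventional_detector(levels_db, offset_db=6.0,
                                alpha_long=0.01, alpha_short=0.5):
    """Return one voiced/unvoiced flag per input signal level (in dB).

    Assumption: both weighted averages are exponential moving averages.
    A small alpha gives a slow, smooth (long-term) average; a large alpha
    gives a fast (short-term) average.
    """
    long_avg = short_avg = levels_db[0]
    flags = []
    for level in levels_db:
        long_avg += alpha_long * (level - long_avg)
        short_avg += alpha_short * (level - short_avg)
        # Threshold = long-term average plus the fixed offset.
        threshold = long_avg + offset_db
        flags.append(short_avg > threshold)
    return flags


# Steady 30 dB background with a short 60 dB burst in the middle.
levels = [30.0] * 50 + [60.0] * 5 + [30.0] * 50
flags = first_conventional_detector(levels)
```

In this example the flags stay false during the steady background and become true at the first frame of the burst, because the short-term average rises much faster than the long-term average that sets the threshold.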
A second conventional voice detector is disclosed in Japanese laid-open patent application 8-202,394. The voice detector detects a power of a voice signal in a predetermined fixed frame, then determines the presence/absence of the speech element. The following is an explanation of the second conventional voice detector described in the Japanese application.
First, a voice power calculator calculates the voice power for a fixed frame of the sampled voice signal. A maximum value detector receives a voice power signal based on the calculation by the voice power calculator, detects the maximum value of the voice power within the fixed frame and the frames immediately before it (front) and after it (rear), and then outputs a maximum value signal based on the detected maximum value to a discriminator. A zero-crossing rate calculator calculates the zero-crossing rate from the voice signal and outputs a resulting signal to the discriminator. Based on the maximum value signal received from the maximum value detector and the resulting signal from the zero-crossing rate calculator for a frame, the discriminator determines whether the frame is a voiced frame or an unvoiced frame by using a threshold value set by a threshold value calculator. The discriminator outputs a frame type signal, e.g., 1 in the case of a voiced frame and 0 in the case of an unvoiced frame, to a hangover generator. When the frame type changes from voiced to unvoiced, the hangover generator changes the frame type signal from the value indicating unvoiced to the value indicating voiced, and outputs the resulting signal for a predetermined number of frames from the frame at which the change occurred. The threshold value calculator watches the change of the voice power within a period defined by the discrimination result output by the discriminator, and renews the threshold value.
In the second conventional detector, the reason why the maximum value detector detects the maximum value of the voice power within the frames, including the front and the rear frames, is as follows. The voice power is usually small just after the start of an utterance and just before the end of the utterance. When the start of the utterance exists at the end of a preceding frame (front) and the end of the utterance exists at the start of a succeeding frame (rear), it is likely that the detector would mistakenly discriminate the current frame (the frame between the preceding and succeeding frames) as an unvoiced frame if the detection considered the voice power within the current frame alone. However, since the detector detects the maximum value of the voice power within the frames by including the front and rear frames as well, it can discriminate the frame correctly.
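The frame power calculation, the maximum over the front and rear frames, and the hangover behavior described above can be sketched in Python as follows. This is only an illustrative sketch: the function names, the mean-square definition of power, and the fixed threshold are assumptions, and the zero-crossing rate is omitted because the description does not state how the discriminator combines it with the maximum value signal.

```python
def frame_powers(samples, frame_len):
    """Mean-square power of each complete frame (assumed power definition)."""
    count = len(samples) // frame_len
    return [
        sum(s * s for s in samples[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(count)
    ]


def neighbour_max(powers):
    """Maximum power over each frame and its front and rear frames."""
    result = []
    for i in range(len(powers)):
        first = max(i - 1, 0)
        last = min(i + 1, len(powers) - 1)
        result.append(max(powers[first:last + 1]))
    return result


def discriminate(values, threshold):
    """Frame type signal: 1 for a voiced frame, 0 for an unvoiced frame."""
    return [1 if v > threshold else 0 for v in values]


def apply_hangover(flags, hangover_frames):
    """Keep reporting voiced for hangover_frames frames after voice ends."""
    out = []
    remaining = 0
    for flag in flags:
        if flag:
            remaining = hangover_frames
            out.append(1)
        elif remaining > 0:
            remaining -= 1
            out.append(1)
        else:
            out.append(0)
    return out


# Hypothetical per-frame powers: weak onset, strong middle, weak ending.
powers = [0.0, 0.1, 9.0, 0.2, 0.0, 0.0, 0.0]
own_only = discriminate(powers, threshold=1.0)
with_neighbours = discriminate(neighbour_max(powers), threshold=1.0)
with_hangover = apply_hangover(with_neighbours, hangover_frames=2)
```

With these hypothetical values, only the strong middle frame is classified as voiced when each frame is judged on its own power, whereas the weak onset and ending frames are also classified as voiced once the front and rear frames are included.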
However, in the first conventional voice detector, the threshold value is set based only on the long-term weighted average value, whereas the short-term weighted average value changes rapidly. Therefore, the short-term weighted average value alternately rises above and falls below the threshold value, and as a result the detector often discriminates voiced/unvoiced frames incorrectly. Also, since the short-term weighted average value changes rapidly when the noise changes rapidly, the short-term weighted average value again repeatedly crosses the threshold value, and the detector similarly discriminates the voiced/unvoiced frames incorrectly.
Also, the above-described second conventional voice detector has various problems left unsolved. For example, since the maximum value detector detects the maximum power value in the preceding frame and the discriminator discriminates the voiced/unvoiced frames based on that power value, the detector misdiscriminates rapid changes of noise within a frame as a voice element.
In the second conventional voice detector, the detector monitors the voice power signal during a predetermined period in a frame and watches the change of the power in the frame. If the change of the power is smaller than the threshold value during the predetermined period, the detector discriminates the frame as background noise, estimates the power of the background noise inputted during the period and also determines the threshold value. Therefore, when the background noise rapidly becomes small, the detector mistakenly discriminates the change of the noise level as a change in voice; in other words, it discriminates the frame as a voiced frame. The detector thus takes the estimated level of the background noise to be greater than the actual level, and identifies a signal which should be identified as voiced as a signal within the background noise level. Such incorrect identification occurs especially often at the beginning of an utterance and at the end of an utterance. In other words, the beginnings and endings of utterances that occur during frames that follow rapid changes in background noise are often mistakenly identified as unvoiced.
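The stale noise estimate described above can be illustrated with a small Python sketch. All names and constants here are hypothetical: the sketch assumes that the noise estimate is the mean power over a window, that it is renewed only when the power varies by less than `max_change_db` within the window, and that the threshold is the estimate plus a fixed margin. The description states only that the estimate and threshold are determined when the change of power is small.

```python
def update_noise_estimate(window_db, estimate_db, max_change_db=3.0):
    """Renew the noise estimate only if the power was steady over the window."""
    if max(window_db) - min(window_db) < max_change_db:
        return sum(window_db) / len(window_db)
    # Power changed too much: the window is not treated as background noise,
    # so the previous estimate is kept.
    return estimate_db


def track_thresholds(powers_db, window_len, initial_estimate_db, margin_db=6.0):
    """Threshold in force after each complete window (estimate + assumed margin)."""
    estimate = initial_estimate_db
    thresholds = []
    for start in range(0, len(powers_db) - window_len + 1, window_len):
        window = powers_db[start:start + window_len]
        estimate = update_noise_estimate(window, estimate)
        thresholds.append(estimate + margin_db)
    return thresholds


# Background noise at 50 dB drops abruptly to 30 dB inside the second window.
powers = [50.0] * 4 + [50.0, 30.0, 30.0, 30.0] + [30.0] * 4
thresholds = track_thresholds(powers, window_len=4, initial_estimate_db=50.0)
```

In this example the threshold remains at the level derived from the old 50 dB noise throughout the window that contains the drop, so an utterance beginning at, say, 40 dB during that window would fall below the threshold even though it is well above the actual 30 dB noise.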