This invention relates to an utterance boundary detecting apparatus for use in a speech recognition system.
In speech recognition systems, utterance boundaries are detected during pre-processing. Utterance boundary detection extracts utterance boundaries from a continuous speech signal. Such boundaries can be detected relatively readily when the signal-to-noise (S/N) ratio is high (for example, when the speech sound handled has an energy S/N ratio above 30 dB) and the background noise level does not vary much.
A conventional utterance boundary detecting system picks up a speech sound (corresponding to uttered words) through a broad-band microphone and calculates the short-time energy and zero-crossing rate of the input speech signal. The utterance boundary is detected by determining the period in which the short-time energy and zero-crossing rate continuously exceed their respective fixed threshold values for a predetermined time period.
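The fixed-threshold scheme described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular conventional system: the function names, the use of non-overlapping frames, and all numeric values are assumptions introduced here.

```python
def frame_features(samples, frame_len):
    """Return per-frame (energies, zero_crossing_rates) for non-overlapping frames."""
    energies, zcrs = [], []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Short-time energy: mean squared amplitude of the frame.
        energies.append(sum(x * x for x in frame) / frame_len)
        # Zero-crossing rate: fraction of adjacent sample pairs that change sign.
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        zcrs.append(crossings / (frame_len - 1))
    return energies, zcrs


def detect_fixed(energies, zcrs, energy_thr, zcr_thr, min_frames):
    """Return (first_frame, last_frame) of the first run in which both features
    exceed their fixed thresholds for at least min_frames frames, else None."""
    n = min(len(energies), len(zcrs))
    run_start = None
    for i in range(n):
        if energies[i] > energy_thr and zcrs[i] > zcr_thr:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_frames:
                return run_start, i - 1
            run_start = None
    if run_start is not None and n - run_start >= min_frames:
        return run_start, n - 1
    return None


# Hypothetical signal: two silent frames, three "speech" frames, two silent frames.
signal = [0.0] * 8 + [1.0, -1.0] * 6 + [0.0] * 8
energies, zcrs = frame_features(signal, frame_len=4)
boundary = detect_fixed(energies, zcrs, energy_thr=0.5, zcr_thr=0.5, min_frames=2)
```

Here `min_frames` plays the role of the predetermined time period: a run of frames shorter than this is not accepted as an utterance.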
In a detecting system using such fixed threshold values, the following problem arises if the background noise level varies with time to some extent. If the fixed threshold value is set at a lower level, the background noise level exceeds the threshold value whenever it rises somewhat, with the disadvantage that the noise is taken as a part of an utterance boundary. If, on the other hand, the fixed threshold value is set at a higher level, a lower-level speech signal within an utterance boundary cannot be extracted. In order to solve this problem, a system is known which detects an utterance boundary by determining a threshold value corresponding to the background noise level. That is, this system calculates the average value of each of the short-time energy and the zero-crossing rate of the input speech signal during a time interval regarded as a silent interval preceding the utterance, determines a threshold value by adding a predetermined fixed bias value to the respective average value, and detects the utterance boundary using that threshold value.
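The noise-dependent threshold just described amounts to the average of a feature over the leading silent interval plus a fixed bias. The sketch below illustrates that arithmetic under stated assumptions: the function names, the number of silent frames, and the bias value are hypothetical, and the same computation would be applied separately to the zero-crossing rate.

```python
def adaptive_threshold(values, silent_frames, bias):
    """Threshold = mean of the feature over the assumed silent interval + fixed bias."""
    lead = values[:silent_frames]
    return sum(lead) / len(lead) + bias


def frames_above(values, threshold):
    """Flag each frame whose feature value exceeds the threshold."""
    return [v > threshold for v in values]


# Hypothetical per-frame short-time energies: three silent frames, then speech.
frame_energies = [0.1, 0.3, 0.2, 2.0, 2.5]
energy_thr = adaptive_threshold(frame_energies, silent_frames=3, bias=0.5)
flags = frames_above(frame_energies, energy_thr)
```

With these values the silent-interval average is about 0.2, so the threshold is about 0.7 and only the last two frames are flagged.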
Even in this case, if the background noise level varies more widely, the utterance boundary cannot be detected accurately on the basis of the threshold value so obtained. Suppose that the fixed bias value is set at a lower level. In this case, the short-time energy and zero-crossing rate of the noise exceed their threshold values and, as a result, noise intervals are often detected. That is, a noise interval may be detected as a part of the utterance boundary, and/or a noise interval alone may be detected as the utterance boundary, causing a seriously erroneous operation. If, on the other hand, the fixed bias value is set at a higher level, a portion or the whole of the utterance boundary is dropped, causing an erroneous operation.
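The two failure modes can be reproduced with a small numeric example. The energies, labels, and bias values below are invented purely for illustration; the point is only that a bias fixed in advance does not track noise that rises after the silent interval.

```python
def flag_frames(energies, silent_frames, bias):
    """Flag frames whose energy exceeds (silent-interval mean + fixed bias)."""
    lead = energies[:silent_frames]
    threshold = sum(lead) / len(lead) + bias
    return [e > threshold for e in energies]


# Hypothetical frames: quiet noise (0-2), louder noise (3-4), weak speech (5-6).
energies = [1.0, 1.0, 1.0, 3.0, 3.0, 3.5, 3.5]
labels = ["noise"] * 5 + ["speech"] * 2

low = flag_frames(energies, silent_frames=3, bias=0.5)   # threshold 1.5
high = flag_frames(energies, silent_frames=3, bias=3.0)  # threshold 4.0

# Noise frames wrongly accepted with the low bias.
false_alarms = [i for i, f in enumerate(low) if f and labels[i] == "noise"]
# Speech frames wrongly rejected with the high bias.
missed = [i for i, f in enumerate(high) if not f and labels[i] == "speech"]
```

With the low bias the louder noise frames are taken as part of the utterance, and with the high bias the whole utterance is dropped.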