Speech processing often requires a technique for detecting the periods in which speech is present. Detection of speech periods is generally referred to as VAD (Voice Activity Detection). In speech recognition in particular, it is critical to detect both the beginning point and the ending point of a significant unit of speech, such as a word or phrase; this is referred to as endpoint detection.
FIG. 1 shows an example of a conventional Automatic Speech Recognition (ASR) system including a VAD and an endpoint detection unit. In FIG. 1, a VAD 22 prevents the speech recognition process in an ASR unit 24 from recognizing background noise as speech. In other words, the VAD 22 prevents the error of converting noise into a word. Additionally, in a general ASR system that uses many computer resources, the VAD 22 makes it possible to manage the throughput of the entire system more efficiently. For example, it enables control of a portable device by speech. More specifically, the VAD distinguishes between periods during which the user does not speak and periods during which the user issues a command. As a result, the apparatus can concentrate on other functions while speech recognition is not in progress and concentrate on ASR while the user is speaking.
In this example, a front-end processing unit 21 at the input of the VAD 22 and the speech recognition unit 24 can be shared by both, as shown in FIG. 1. An endpoint detection unit 23 uses the VAD signal to distinguish the periods between the beginning and ending points of utterances from the pauses between words. This is because the speech recognition unit 24 must accept the entire utterance as speech, without any gaps.
There exists a large body of prior art in the field of VAD and endpoint detection. The following discussion is limited to the most representative or most recent examples.
U.S. Pat. No. 4,696,039 discloses one approach to endpoint detection using a counter to determine the transition from speech to silence. Silence is hence detected after a predetermined time. In contrast, the present invention does not use such a predetermined period to determine state transitions.
U.S. Pat. No. 6,249,757 discloses another approach to endpoint detection using two filters. However, these filters run on the speech signal itself, not on a VAD metric or thresholded signal.
Much prior art uses state machines driven by counting fixed periods: U.S. Pat. No. 6,453,285 discloses a VAD arrangement including a state machine. The machine changes state depending upon several factors, many of which are fixed periods of time. U.S. Pat. No. 4,281,218 is an early example of a state machine effected by counting frames. U.S. Pat. No. 5,579,431 also discloses a state machine driven by a VAD. The transitions again depend upon counting time periods. U.S. Pat. No. 6,480,823 recently disclosed a system containing many thresholds, but the thresholds are on an energy signal.
A state machine and a sequence of thresholds are also described in “Robust endpoint detection and energy normalization for real-time speech and speaker recognition”, by Li, Zheng, Tsai and Zhou, IEEE Transactions on Speech and Audio Processing, Vol. 10, No. 3, March 2002. The state machine, however, still depends upon fixed time periods.
The prior art describes state-machine-based endpointers that rely on counting frames to determine the starting point and the ending point of speech. For this reason, these endpointers suffer from the following drawbacks:
First, bursts of noise (perhaps caused by wind blowing across a microphone, or by footsteps) typically have high energy and are hence determined by the VAD metric to be speech. Such noises, however, yield a Boolean (speech or non-speech) decision that rapidly oscillates between speech and non-speech. An actual speech signal tends to yield a Boolean decision that indicates speech for a small number of contiguous frames, followed by silence for a small number of contiguous frames. Conventional frame counting techniques cannot in general distinguish these two cases.
Second, when counting silence frames to determine the end of a speech period, a single isolated speech decision can cause the counter to reset. This in turn delays the acknowledgement of the speech-to-silence transition.
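The second drawback can be illustrated with a minimal sketch of a frame-counting endpointer. This is not taken from any of the cited patents; the function name `silence_end_frame`, the parameter `hangover` (the number of consecutive silence frames required) and the example decision sequences are all hypothetical, chosen only to show how one isolated speech decision resets the counter.

```python
from typing import List


def silence_end_frame(decisions: List[bool], hangover: int) -> int:
    """Return the index of the frame at which a simple frame-counting
    endpointer declares the end of speech, or -1 if it never does.

    decisions: per-frame Boolean VAD output (True = speech).
    hangover:  hypothetical fixed count of consecutive silence frames
               required before the transition is acknowledged.
    """
    silent_run = 0
    for i, is_speech in enumerate(decisions):
        if is_speech:
            # Any speech decision, even an isolated one, resets the counter.
            silent_run = 0
        else:
            silent_run += 1
            if silent_run >= hangover:
                return i
    return -1


# Illustrative input: 5 speech frames followed by 10 silence frames.
clean = [True] * 5 + [False] * 10

# Same input with one spurious speech decision at frame 8.
glitch = list(clean)
glitch[8] = True

print(silence_end_frame(clean, hangover=5))   # 9
print(silence_end_frame(glitch, hangover=5))  # 13
```

In this sketch the single spurious decision at frame 8 postpones the acknowledged end of speech from frame 9 to frame 13, although the underlying signal is silent throughout that interval.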