Our invention relates to automatic speech recognition, and more particularly, to arrangements for detecting the endpoints or boundaries of the speech portion of an input signal.
An automatic speech recognizer identifies an unknown spoken utterance by matching an input signal that corresponds to the unknown utterance against reference template signals that correspond to known utterances. The reference template which matches best is selected as the identity of the unknown utterance. The reference templates typically include only information-bearing or speech portions. On the other hand, in many commercially important environments, the input signal often includes both speech and nonspeech sounds. An input signal from the switched telephone network, for example, may have clicks, pops, tones and other background noise.
Whereas human listeners are comparatively tolerant of noise and distortion, current machine recognizers generally are not. Accurate location of the beginning and ending, or "endpoints", of spoken words and phrases is thus important for reliable and robust automatic speech recognition. The endpoint detection problem is relatively less complex for high level speech signals in a low level, stationary noise environment, for example, where the signal-to-noise ratio is greater than about 30 dB. The problem is considerably more difficult, however, if the speech signal level is low relative to the background noise, or if the level and spectral content of the background noise are nonstationary. Such conditions may be encountered in the switched telephone network, especially in the long distance network, due to transmission line characteristics and transients in line signal generators.
In a prior endpoint detector, disclosed in U.S. Pat. No. 4,370,521, issued Jan. 25, 1983 to Johnston et al. and assigned to the present assignee, an input signal interval which contains speech is divided into a sequence of time frames. The energy level of the signal in each time frame is computed. Responsive to the energy levels, one or more energy pulses are identified over the signal interval. Each energy pulse consists of a group of contiguous time frames which correspond to a potential speech portion of the input signal. For example, an input signal interval containing the spoken words "one eight" ideally yields three distinct energy pulses: the first corresponding to the voiced portion "one"; the second corresponding to the voiced portion "eigh"; and the third corresponding to the unvoiced portion "t".
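By way of illustration only, the framing and per-frame energy computation described above may be sketched as follows. The function name `frame_energies`, the use of non-overlapping frames, and the use of a sum of squared samples as the energy level are assumptions made for this sketch; the text above states only that an energy level is computed for each time frame.

```python
def frame_energies(samples, frame_len):
    """Split samples into consecutive frames and return one energy per frame.

    Assumptions (not taken from the cited patent): frames do not overlap,
    a trailing partial frame is discarded, and energy is the sum of squares.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    n_frames = len(samples) // frame_len
    energies = []
    for f in range(n_frames):
        frame = samples[f * frame_len:(f + 1) * frame_len]
        energies.append(sum(x * x for x in frame))
    return energies
```

For example, `frame_energies([1, 2, 3, 4, 5], 2)` yields two frame energies, 5 and 25, and the unpaired fifth sample is dropped.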
Next, certain of the raw energy pulses are "combined", that is, the constituent frames of two or more adjacent energy pulses are grouped together to form a longer energy pulse. In the above example, the second and third energy pulses may be combined to form a single energy pulse corresponding to "eight". Finally, the endpoints of the energy pulses remaining after the combining steps are passed to a speech recognizer.
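The combining step may be illustrated by the following sketch. The criterion for deciding which adjacent energy pulses to combine is not stated above, so the rule used here, merging two pulses when the gap between them is no more than a hypothetical `max_gap` frames, is purely an assumption, as is the function name `combine_pulses`. Each pulse is represented as a (beginning frame, ending frame) pair.

```python
def combine_pulses(pulses, max_gap):
    """Merge adjacent energy pulses separated by at most max_gap frames.

    The gap-based rule is an illustrative assumption; the actual combining
    criteria of the prior detector are not described in the text above.
    """
    combined = []
    for start, end in sorted(pulses):
        if combined and start - combined[-1][1] <= max_gap:
            # Close enough to the previous pulse: extend it.
            prev_start, prev_end = combined[-1]
            combined[-1] = (prev_start, max(prev_end, end))
        else:
            combined.append((start, end))
    return combined
```

With pulses at frames (0, 10), (30, 45) and (48, 52), standing in for "one", "eigh" and "t", and `max_gap` set to 5, the second and third pulses are merged into (30, 52) while the first is left unchanged.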
In more detail, the identification of the raw energy pulses according to Johnston proceeds as follows. The energy levels are considered frame by frame in temporal sequence. If the energy level rises above a first threshold, and then above a second threshold before falling below the first threshold, the frame in which the energy level first rose above the first threshold is designated as the beginning frame of an energy pulse. Subsequently, the first frame in which the energy level falls below a third threshold is designated as the ending frame of the energy pulse. This process is repeated over the remainder of the input signal interval whereby a plurality of energy pulses may be detected.
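The three-threshold procedure just described may be sketched as follows. This is an illustrative reading of the description rather than the implementation of the cited patent. The names `find_energy_pulses`, `t1`, `t2` and `t3` are hypothetical, and two details are assumptions: a pulse still open at the end of the interval is closed at the last frame, and the search for the next pulse resumes at the frame following an ending frame.

```python
def find_energy_pulses(energies, t1, t2, t3):
    """Return (begin_frame, end_frame) pairs found by the three-threshold rule."""
    pulses = []
    n = len(energies)
    i = 0
    while i < n:
        if energies[i] <= t1:
            i += 1
            continue
        begin = i  # energy first rose above the first threshold here
        j = i
        confirmed = False
        # Candidate pulse: must exceed t2 before falling below t1.
        while j < n and energies[j] >= t1:
            if energies[j] > t2:
                confirmed = True
                break
            j += 1
        if not confirmed:
            i = j + 1  # candidate rejected; resume after the drop
            continue
        # Ending frame: first frame whose energy falls below t3.
        k = j + 1
        while k < n and energies[k] >= t3:
            k += 1
        end = min(k, n - 1)  # assumption: close an open pulse at the last frame
        pulses.append((begin, end))
        i = k + 1
    return pulses
```

For the energy sequence `[1, 1, 5, 12, 15, 9, 2, 1, 4, 3, 1, 6, 11, 1]` with `t1 = 3`, `t2 = 10` and `t3 = 3`, the sketch reports pulses at frames (2, 6) and (11, 13). The rise at frames 8 and 9 is rejected because the energy falls below the first threshold without ever exceeding the second.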
The Johnston arrangement attempts to find endpoints based on the energy of speech rising above the energy of the background noise. This may be conveniently characterized as a "bottom-up" approach. The bottom-up endpoint detector works well where the background noise is stationary. Where the level and spectral content of the background noise fluctuate, however, the bottom-up detector may be less effective.
It is thus an object of the invention to provide an endpoint detector which improves the accuracy of a speech recognizer where the input signal includes nonstationary noise.