Our invention relates to automatic speech recognition and, more particularly, to arrangements for detecting the endpoints or boundaries of the speech portion of an utterance.
Automatic speech recognition is the focus of vigorous research toward enabling voice communication between man and machine. Isolated word recognition systems have been developed which require a pause between utterances. Typically, such systems have a reference vocabulary of words stored as digital templates. An input utterance is converted to digital form and compared to the reference templates for identification. In order to efficiently process the matching of an utterance to a reference template, it is first necessary to distinguish speech sounds from non-speech sounds in the input utterance. Outside a carefully controlled laboratory environment, however, it is difficult to accurately locate the endpoints of the speech sounds. Background noise, such as found on telephone lines, may be confused with speech sounds of low amplitude. In the word "three", for example, the "th" fricative is unvoiced and is of low amplitude. On the other hand, higher amplitude non-speech sounds must not be identified as speech. Clicks and pops in the transmission system and comparable speaker induced artifacts may have a higher amplitude than some fricatives, but contain no information useful for speech processing. Similarly, it may be difficult to distinguish artifacts from stop consonant releases. In the word "eight", for example, the voiced phonetic sound "eigh" is followed by a slight pause before the consonant sound "t" is released.
A prior endpoint detector, disclosed in U.S. Pat. No. 3,909,532, issued Sept. 30, 1975 to Rabiner et al. and assigned to the same assignee, uses an energy measurement of digitally encoded speech. The beginning of the speech portion of an utterance is detected when the energy exceeds a predetermined threshold value for a fixed interval of time. Likewise, the end of the speech portion is detected when the energy drops below the threshold for another fixed interval of time. This endpoint detector may, however, omit weak speech sounds whose energy falls below the threshold.
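The single-threshold scheme described above can be illustrated with a minimal sketch. It is not the circuit of the cited patent; it assumes the energy is available as one value per frame, and the names `detect_endpoints`, `min_frames_on` and `min_frames_off` are hypothetical stand-ins for the two fixed intervals.

```python
def detect_endpoints(energy, threshold, min_frames_on, min_frames_off):
    """Return (start, end) frame indices of the speech portion, or None.

    Assumption: `energy` is a sequence of per-frame energy values, and the
    two fixed time intervals are expressed as frame counts.
    """
    # Beginning: first run of min_frames_on consecutive frames above threshold.
    start = None
    run = 0
    for i, e in enumerate(energy):
        run = run + 1 if e > threshold else 0
        if run == min_frames_on:
            start = i - min_frames_on + 1
            break
    if start is None:
        return None

    # End: first run of min_frames_off consecutive frames at or below threshold.
    run = 0
    for i in range(start + min_frames_on, len(energy)):
        run = run + 1 if energy[i] <= threshold else 0
        if run == min_frames_off:
            return start, i - min_frames_off  # last frame before the quiet run
    return start, len(energy) - 1


# A one-frame click (index 2) is rejected because it is shorter than the interval.
click = [0, 0, 5, 0, 0, 6, 7, 8, 7, 1, 6, 7, 0, 0, 0, 0]
print(detect_endpoints(click, 3, 3, 3))   # (5, 11)

# A weak leading sound (indices 1-3) stays under the threshold and is omitted.
weak = [0, 2, 2, 2, 6, 7, 8, 0, 0, 0]
print(detect_endpoints(weak, 3, 3, 3))    # (4, 6)
```

The second call shows the shortcoming noted above: the low-amplitude frames that precede the louder portion never cross the threshold, so they are left outside the detected endpoints.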
The article by L. R. Rabiner and M. R. Sambur entitled "An Algorithm for Determining the Endpoints of Isolated Utterances", appearing in the Bell System Technical Journal, Vol. 54, page 297, 1975, describes an improved endpoint detector for isolated word recognition. The beginning of the speech portion of an utterance is defined as the point where the energy first exceeds a lower threshold, provided that it then exceeds an upper threshold before falling below the lower threshold. The end of the speech portion is detected at the point where the energy drops below the lower threshold. The endpoints are then adjusted using a zero crossing measurement for detecting unvoiced speech. This improved endpoint detector may not, however, accurately discriminate against non-speech sounds which exceed the upper threshold.
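The two-threshold rule and the zero crossing adjustment summarized above can be sketched as follows. This is a simplified illustration and not the published algorithm: the per-frame inputs, the search `window`, the `min_hits` count and all function names are assumptions made for the example.

```python
def dual_threshold_endpoints(energy, lower, upper):
    """Return (start, end) for the first excursion above `lower` that also
    exceeds `upper` before falling back to or below `lower`; None otherwise."""
    n = len(energy)
    i = 0
    while i < n:
        if energy[i] > lower:
            candidate = i
            j = i
            while j < n and energy[j] > lower:
                if energy[j] > upper:
                    # Start confirmed; the end is the last frame above `lower`.
                    end = j
                    while end + 1 < n and energy[end + 1] > lower:
                        end += 1
                    return candidate, end
                j += 1
            i = j  # fell below `lower` without reaching `upper`: discard
        else:
            i += 1
    return None


def extend_with_zero_crossings(start, end, zcr, zc_threshold, window, min_hits=3):
    """Widen the endpoints where nearby frames show a high zero crossing rate,
    taken here as a sign of unvoiced speech (assumed rule, for illustration)."""
    before = [k for k in range(max(0, start - window), start) if zcr[k] > zc_threshold]
    if len(before) >= min_hits:
        start = before[0]
    after = [k for k in range(end + 1, min(len(zcr), end + 1 + window)) if zcr[k] > zc_threshold]
    if len(after) >= min_hits:
        end = after[-1]
    return start, end


energy = [0, 0, 3, 1, 0, 0, 3, 4, 9, 8, 4, 1, 0, 0]
zcr = [1, 1, 1, 9, 9, 9, 2, 2, 2, 2, 2, 1, 1, 1]
start, end = dual_threshold_endpoints(energy, lower=2, upper=6)
print(start, end)                                                # 6 10
print(extend_with_zero_crossings(start, end, zcr, 5, window=4))  # (3, 10)
```

The brief excursion at frame 2 is discarded because it never reaches the upper threshold. Any non-speech sound that does exceed the upper threshold would be accepted by the same rule, which is the limitation noted above.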
In U.S. Pat. No. 4,032,710, issued June 28, 1977 to Martin et al., an endpoint detector extracts three feature signals from isolated word input. Each feature signal comprises selected spectral components of the input speech. The first feature signal sets the starting point of the speech portion where the energy of the selected components exceeds a predetermined threshold. The ending point is set where the energy falls below the threshold. The first feature signal persists for a lag time to account for stop gaps within words. The second and third feature signals, which have spectral components found in voiced and unvoiced speech, but not in breath noise, are used to adjust the endpoint estimates obtained from the first feature signal. The feature signal endpoint detector is not, however, adapted to accurately determine the endpoints when an artifact exceeds the predetermined energy threshold within the lag time of the first feature signal.
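The effect of the lag time on the first feature signal can be illustrated with a short sketch. It models only the threshold-and-lag behavior described above, not the spectral feature extraction of the cited patent; the per-frame `energy` input and the names `endpoints_with_lag` and `lag` are assumptions.

```python
def endpoints_with_lag(energy, threshold, lag):
    """Return (start, end) frame indices, or None if no frame exceeds threshold.

    Assumption: the detected segment is kept open across gaps of up to `lag`
    frames, so a short stop gap inside a word does not end the segment.
    """
    start = None
    end = None
    for i, e in enumerate(energy):
        if e > threshold:
            if start is None:
                start = i
            end = i
        elif start is not None and i - end > lag:
            break  # quiet for longer than the lag time: segment is closed
    return None if start is None else (start, end)


# "eight": voiced portion, a two-frame stop gap, then the "t" release at frame 6.
eight = [0, 7, 8, 7, 0, 0, 5, 0, 0, 0, 0, 0]
print(endpoints_with_lag(eight, 3, lag=3))       # (1, 6)

# A click at frame 6 after a word that really ended at frame 3.
click = [0, 7, 8, 7, 0, 0, 5, 0, 0, 0, 0, 0]
print(endpoints_with_lag(click, 3, lag=3))       # (1, 6)

# The same click arriving after the lag time has expired is ignored.
late_click = [0, 7, 8, 7, 0, 0, 0, 0, 5, 0]
print(endpoints_with_lag(late_click, 3, lag=3))  # (1, 3)
```

The first two inputs are deliberately identical: on energy alone, a stop consonant release and an artifact that falls within the lag time produce the same endpoints, which is the shortcoming noted above.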
It is thus an object of the invention to provide an improved arrangement for determining the endpoints of the speech portion of an utterance containing artifacts and background noise comparable to the energy levels of weak speech sounds.