The present invention relates generally to speech processing and speech recognition systems. More particularly, the invention relates to a detection system for detecting the beginning and ending of speech within an input signal.
Automated speech processing, for speech recognition and for other purposes, is currently one of the most challenging tasks a computer can perform. Speech recognition, for example, employs a highly complex pattern-matching technology that can be very sensitive to variability. In consumer applications, recognition systems need to handle a diverse range of speakers and need to operate under widely varying environmental conditions. The presence of extraneous signals and noise can greatly degrade recognition quality and speech-processing performance.
Most automated speech recognition systems work by first modeling patterns of sound and then using those patterns to identify phonemes, letters, and ultimately words. For accurate recognition, it is very important to exclude any extraneous sounds (noise) that precede or follow the actual speech. Known techniques attempt to detect the beginning and ending of speech, but considerable room for improvement remains.
The present invention divides the incoming signal into frequency bands, each band representing a different range of frequencies. The short-term energy within each band is then compared with a plurality of thresholds and the results of the comparison are used to drive a state machine that switches from a “speech absent” state to a “speech present” state when the band-limited signal energy of at least one of the bands is above at least one of its associated thresholds. The state machine similarly switches from a “speech present” state to a “speech absent” state when the band-limited signal energy of at least one of the bands is below at least one of its associated thresholds. The system also includes a partial speech detection mechanism based on an assumed “silence segment” prior to the actual beginning of speech.
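The two-state switching rule described above can be illustrated with a minimal sketch. It assumes the per-band short-term energies have already been computed for each frame, and it simplifies the plurality of thresholds to one begin threshold and one end threshold per band. The names `State`, `step`, and `track`, and all numeric values, are hypothetical and chosen only for illustration; they are not taken from the specification.

```python
from enum import Enum


class State(Enum):
    SPEECH_ABSENT = 0
    SPEECH_PRESENT = 1


def step(state, band_energies, begin_thresholds, end_thresholds):
    """Advance the detector by one frame.

    band_energies: short-term energy of each frequency band for this frame.
    begin_thresholds / end_thresholds: one assumed threshold per band.
    """
    if state is State.SPEECH_ABSENT:
        # Enter "speech present" when at least one band exceeds its threshold.
        if any(e > t for e, t in zip(band_energies, begin_thresholds)):
            return State.SPEECH_PRESENT
    else:
        # Return to "speech absent" when at least one band falls below its threshold.
        if any(e < t for e, t in zip(band_energies, end_thresholds)):
            return State.SPEECH_ABSENT
    return state


def track(frames, begin_thresholds, end_thresholds):
    """Run the state machine over a sequence of frames, starting in SPEECH_ABSENT."""
    state = State.SPEECH_ABSENT
    states = []
    for band_energies in frames:
        state = step(state, band_energies, begin_thresholds, end_thresholds)
        states.append(state)
    return states


# Illustrative two-band input: quiet, then speech, then quiet again.
frames = [(1.0, 1.0), (5.0, 1.0), (6.0, 4.0), (6.0, 5.0), (1.0, 0.5), (1.0, 1.0)]
states = track(frames, begin_thresholds=(4.0, 4.0), end_thresholds=(2.0, 2.0))
```

In this sketch the begin thresholds sit above the end thresholds, which gives the detector some hysteresis; whether the actual system arranges its thresholds this way is an assumption.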
A histogram data structure accumulates long-term data concerning the mean and variance of energy within the frequency bands, and this information is used to adjust adaptive thresholds. The frequency bands are allocated based on noise characteristics. The histogram representation affords strong discrimination among speech signal, silence, and noise. Within the speech signal itself, the silence part (with only background noise) typically dominates, and it is reflected strongly on the histogram. Background noise, being comparatively constant, shows up as noticeable spikes on the histogram.
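One way such a histogram might feed an adaptive threshold is sketched below for a single band. The sketch assumes that frame energies are expressed in decibels, that the tallest histogram bin (the "spike") marks the background noise level, and that the threshold is placed a fixed margin above it. The class `EnergyHistogram`, the bin width, and the 6 dB margin are hypothetical choices for illustration, not the method defined by the specification.

```python
from collections import Counter


class EnergyHistogram:
    """Long-term histogram of one band's frame energies (assumed to be in dB)."""

    def __init__(self, bin_width=1.0):
        self.bin_width = bin_width
        self.counts = Counter()

    def add(self, energy_db):
        # Floor division maps each energy value to the index of its bin.
        self.counts[int(energy_db // self.bin_width)] += 1

    def noise_floor(self):
        """Return the centre of the most populated bin (the histogram spike)."""
        if not self.counts:
            raise ValueError("histogram is empty")
        # On a tie in count, prefer the lower-energy bin.
        bin_index, _ = max(self.counts.items(), key=lambda kv: (kv[1], -kv[0]))
        return (bin_index + 0.5) * self.bin_width

    def threshold(self, margin_db=6.0):
        """Adaptive threshold: estimated noise floor plus an assumed margin."""
        return self.noise_floor() + margin_db


# Illustrative data: mostly background noise near 11 dB, with two louder frames.
hist = EnergyHistogram(bin_width=2.0)
for energy_db in [10.1, 10.5, 11.2, 10.9, 30.0, 32.5, 11.9]:
    hist.add(energy_db)
```

Because silence frames dominate the long-term data, the peak bin tracks the noise level and the threshold follows it as conditions change.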
The system is well adapted to detecting speech in noisy conditions. It detects both the beginning and the end of speech, and it handles situations where the beginning of speech may have been lost through truncation.
For a more complete understanding of the invention, its objects and advantages, reference may be had to the following specification and to the accompanying drawings.