1. Technical Field
The present invention relates to speech detection and, more particularly, relates to improved approaches to efficiently detect speech presence in a noisy environment by way of frequency and temporal considerations.
2. Description of the Related Art
In some applications, automatic speech recognition needs to be activated by uttering a particular word sequence such as keywords. For example, if a desktop personal computer has a speech recognizer for dictation or command control, it is desirable to activate the recognizer in the middle of the conversations in his or her office by uttering a keyword. This process of recognizing the keyword from continuous speech waveform is called keyword scanning. This would require the recognizer constantly recognizing the incoming speech and spotting those keywords. Nevertheless, the recognizer cannot be used to constantly monitor the incoming speech because it takes huge computational resources. Some other techniques that demand much less computations and memories have to be utilized to reduce the burden of speech recognizer. It is known that speech detection techniques are ways of eliminating silence segments from speech utterances so that speech recognizer can be speed up and do not wasting a lot of time on those silences or even misrecognize silence as speech. Speech detection techniques are often based on the speech waveform and utilize features such as short-time energy, zero crossing and etc. The same can be used to hypothesize keyword if some other features such as pitch, duration and voicing can be used in junction with word end-pointing techniques. Although the keyword hypothesis will be over generated, it still can reduce a large proportion of computations since the recognizer will only process these hypotheses.
Most speech recognition applications today face the challenging task of segmenting speech based on voice, unvoice & silence detection. A conventional approach is detecting short-term energy and zero crossings of a speech signal. These approaches are not reliable for noisy telephone speech signals due, in part, to the greater noise in a background environment of most telephone conversations. For example, stationary noise such as motor or wind noise and non-stationary noise such as door openings, closing or respiratory exhalation are present in telephone speech.
Accurate speech presence detection also conserves power and processing time for portable electronic devices such as cellular telephones. When reliable speech detection approaches are used, a speech recognition algorithm must find the utterances to determine if they are in fact language. This places a burden on computational complexity of processors and is a resource drain on portable electronic devices. A speech detection approach having computational efficiency as well as accuracy is needed.