Sounds typically contain a mix of music, noise, and/or human speech. The ability to detect human speech in sounds has important applications in many fields such as digital audio signal processing, analysis and coding. For example, specialized codecs (compression/decompression algorithms) have been developed for more efficient compression of pure sounds containing either music or speech, but not both. Most digital audio signal applications, therefore, use some form of speech detection prior to application of a specialized codec to achieve more compact representation of an audio signal for storage, retrieval, processing or transmission.
However, accurate detection of human speech by a computer in an audio signal produced by sounds containing a mix of music, noise and speech is not an easy task. Most existing speech detection methods use spectral and statistical analyses of the waveform patterns produced by the audio signal. The challenge is to identify features of the waveform patterns that reliably distinguish the pure-speech signals from the non-speech or mixed-speech signals.
For example, some existing methods of speech detection take advantage of a particular feature known as the zero-crossing rate (ZCR). See J. Saunders, "Real-time Discrimination of Broadcast Speech/Music", Proc. ICASSP'96, pp. 993-996, 1996. The ZCR is the rate at which the waveform changes sign, and it provides a weighted average of the spectral energy distribution in the waveform. Human speech typically produces audio signals having a high ZCR, whereas other sounds, such as noise or music, do not. However, this feature is not always reliable: highly percussive music or structured noise, for instance, can produce audio signals whose ZCRs are indistinguishable from those of human speech.
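As an illustration of the ZCR feature described above, the following is a minimal sketch in Python. It is not the method of the cited reference. The function names (zero_crossing_rate, frame_zcr, sine), the frame length, the hop size and the 8000 Hz sample rate are all assumptions chosen for the example. The sketch counts sign changes between adjacent samples in a frame and normalizes by the number of adjacent pairs.

```python
import math


def zero_crossing_rate(frame):
    """Return the fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def frame_zcr(samples, frame_len, hop):
    """Compute the ZCR of each full frame, advancing by `hop` samples."""
    return [
        zero_crossing_rate(samples[start:start + frame_len])
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]


def sine(freq_hz, sample_rate, n):
    """Generate n samples of a unit-amplitude sine tone (test signal only)."""
    return [
        math.sin(2 * math.pi * freq_hz * k / sample_rate) for k in range(n)
    ]


if __name__ == "__main__":
    rate = 8000  # assumed sample rate in Hz
    low = zero_crossing_rate(sine(100, rate, rate))
    high = zero_crossing_rate(sine(1000, rate, rate))
    print("100 Hz tone ZCR:", low)
    print("1000 Hz tone ZCR:", high)
```

A pure tone crosses zero about twice per period, so the higher-frequency tone yields the larger ZCR. This is why the ZCR serves as a rough indicator of where the spectral energy lies, and also why a single threshold on it can be fooled by non-speech sounds with similar spectral content.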
Other existing methods employ several features, including the ZCR feature, in conjunction with elaborate statistical feature analysis, in an attempt to improve the accuracy of speech detection. See J. D. Hoyt and H. Wechsler, "Detection of Human Speech in Structured Noise", Proc. ICASSP'94, Vol. II, pp. 237-240, 1994; E. Scheirer and M. Slaney, "Construction and Evaluation of A Robust Multifeature Speech/Music Discriminator", Proc. ICASSP'97, 1997.
While a great deal of research has focused on human speech detection, all of these existing methods fail to satisfy one or more of the following desirable characteristics of a speech detection system for modern multimedia applications: high precision, robustness, short time delay and low complexity.
High precision is desirable in digital audio signal applications because it is important to determine almost exactly when the speech starts and stops, that is, to locate the speech boundaries to within less than a second. Robustness is desirable so that the speech detection system can process, without human intervention, audio signals containing a mixture of sounds including noise, music, song, conversation, commercials, etc., all of which may be sampled at different rates. Moreover, most digital audio signal applications are real-time applications. Thus, it is advantageous if the speech detection method employed provides results within a few seconds and with as little complexity as possible, so that it can be implemented in real time at a reasonable cost.