Detecting the presence of music in an audio stream is a desirable feature in several applications. Examples include automatically switching sound effects (equalizer, virtual surround, bass boost, bandwidth extension, etc.) on or off in audio players and automatically sorting databases. Many approaches to automatically discriminating speech from music have been developed, but they have had limited success. In general, high computational cost and low robustness have prevented the use of such systems in real-world applications.
Many existing approaches to speech-music discrimination use the zero-crossing rate as a discriminating feature. The zero-crossing rate provides a good time-domain measure of spectral distribution and is a useful feature for capturing peculiarities of speech signals, such as the succession of voiced and unvoiced speech. One approach, described in Saunders, J., “Real-time discrimination of broadcast speech/music,” Proc. ICASSP 1996, pp. 993-996, uses the average zero-crossing rate as the main discriminating feature. However, the zero-crossing rate is not very effective for audio streams that include speech mixed with background music or high levels of noise. Other approaches therefore use the zero-crossing rate in conjunction with other features to perform speech-music discrimination. Examples of such approaches are found in Scheirer, E. and Slaney, M., “Construction and evaluation of a robust multifeature speech/music discriminator,” Proc. ICASSP 1997, pp. 1331-1334, and Carey, M. J., Parris, E. S., and Lloyd-Thomas, H., “A comparison of features for speech, music discrimination,” Proc. ICASSP 1999, pp. 149-152. These multi-feature approaches tend to be computationally expensive and thus impractical for many applications.
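By way of illustration only, the zero-crossing rate discussed above can be sketched as the fraction of adjacent sample pairs whose signs differ, computed over successive frames. The sketch below is not the method of any of the cited references. The function names, the 8 kHz sample rate, the frame length, and the two synthetic tones (a 100 Hz tone standing in for low-frequency, voiced-like content and a 2 kHz tone standing in for higher-frequency content) are all assumptions chosen for the example.

```python
import math


def zero_crossing_rate(frame):
    """Return the fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def framewise_zcr(signal, frame_len):
    """Compute the zero-crossing rate over non-overlapping frames."""
    return [
        zero_crossing_rate(signal[i:i + frame_len])
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def sine(freq_hz, sample_rate, n):
    """Generate n samples of a sine tone (hypothetical test signal)."""
    # A small phase offset keeps samples away from exact zeros.
    return [
        math.sin(2 * math.pi * freq_hz * t / sample_rate + 0.25)
        for t in range(n)
    ]


SAMPLE_RATE = 8000  # assumed sample rate, in Hz
FRAME_LEN = 800     # assumed frame length (100 ms at 8 kHz)

low_tone = sine(100, SAMPLE_RATE, FRAME_LEN)    # low-frequency content
high_tone = sine(2000, SAMPLE_RATE, FRAME_LEN)  # high-frequency content

# A signal that alternates between the two tones yields a frame-by-frame
# zero-crossing rate that swings between a low and a high value.
rates = framewise_zcr(low_tone + high_tone + low_tone, FRAME_LEN)
print(rates)
```

In this sketch, the frame containing the higher-frequency tone produces a markedly higher zero-crossing rate than the frames containing the lower-frequency tone, which is the kind of frame-to-frame variation that the succession of voiced and unvoiced speech tends to produce. A signal mixing both kinds of content within each frame would not show such a clear swing, consistent with the limitation noted above for speech over background music or noise.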