The ability to subdivide an audio stream into segments containing samples from a source having constant acoustic characteristic, such as from a particular human speaker, a type of background noise, or a type of music, and then to classify each homogeneous segment into one of a number of categories lends itself to many applications. Such applications include listing and indexing of audio libraries in order to assist in effective searching and retrieval, speech and silence detection in telephony and other modes of audio transmission, and automatic processing of video in which some level of understanding of the content of the video is aided by identification of the audio content contained in the video.
Past work in this area has focused on indexing audio databases, where performance and memory constraints are relaxed. Real-time methods are most commonly specific to speech detection and speech recognition, and are not designed to work with arbitrary audio models. Model-based segmentation methods, such as those using Hidden Markov Models (HMMs), efficiently segment and classify audio, but have difficulties dealing with audio that does not match any predefined model. In addition, segmentation boundaries are limited to boundaries between regions of different classification. It is desirable to separate segmentation and classification, but doing so using known methods results in an unacceptable delay in reporting classifications for detected segments.
A successful approach for segmenting audio that has been used is the Bayesian Information Criterion (BIC). The BIC is a classical statistical approach for assessing the suitability of a distribution for a set of sample data. When applied to audio segmentation, the BIC is used to determine whether a segment of audio is better described by one distribution or two (different) distributions, hence allowing a segmentation decision to be made. It is possible to perform a second BIC-based segmentation pass over the resulting segmentation in order to eliminate segment boundaries that are not deemed statistically significant. A disadvantage of such an approach is that the second BIC-based segmentation pass needs the original data on which the first segmentation was based, requiring storage for data of indefinite length.