There is an increasing demand for automated computer systems that extract meaningful information from large amounts of data. One such application is the extraction of information from continuous streams of audio. Such continuous audio streams may include speech from, for example, a news broadcast or a telephone conversation, or non-speech, such as music or background noise.
Before a system can extract information from a continuous audio stream, it is typically first required to segment the stream into homogeneous segments, each segment including audio from only one speaker or one other constant acoustic condition. Once the segment boundaries have been located, each segment may be processed individually, for example to classify the information it contains.
Whilst a number of techniques have been proposed, often in a somewhat ad hoc manner, for segmenting audio in specific applications, one of the most successful approaches is based on the Bayesian Information Criterion (BIC). The BIC is a model selection criterion known in the statistical literature. In segmentation, it is used to locate segment boundaries by finding the most likely positions at which the signal characteristics change. When applied to audio segmentation, the BIC is used to determine whether a section of audio is better described by one statistical model or by two different statistical models, one on either side of a candidate boundary, hence allowing a segmentation decision to be made. The BIC also provides a criterion for deciding whether the change at a candidate boundary is significant.
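The one-model versus two-model comparison described above can be illustrated with a minimal sketch. It assumes that each model is a single full-covariance Gaussian and that the audio has already been converted to a matrix of feature frames (one row per frame). It uses the commonly cited difference form of the BIC, in which a positive value favours two models and hence a boundary. The names `delta_bic`, `find_change_point`, `penalty_weight` and `margin` are hypothetical and chosen only for illustration. They are not taken from any particular system.

```python
import numpy as np


def delta_bic(frames, split, penalty_weight=1.0):
    """BIC difference between a two-Gaussian and a one-Gaussian description.

    frames: array of shape (n_frames, n_dims); split: candidate boundary index.
    A positive result favours two models, i.e. a change at `split`.
    Assumption: full-covariance Gaussians with maximum-likelihood covariances.
    """
    frames = np.asarray(frames, dtype=float)
    if frames.ndim == 1:
        frames = frames[:, None]  # treat a 1-D signal as one-dimensional features
    n, d = frames.shape
    left, right = frames[:split], frames[split:]

    def log_det_cov(x):
        # bias=True gives the maximum-likelihood (divide by N) covariance estimate.
        cov = np.atleast_2d(np.cov(x, rowvar=False, bias=True))
        _, logdet = np.linalg.slogdet(cov)
        return logdet

    # Log-likelihood gain from modelling the two sides separately.
    gain = 0.5 * (
        n * log_det_cov(frames)
        - len(left) * log_det_cov(left)
        - len(right) * log_det_cov(right)
    )
    # Penalty for the extra parameters of the second Gaussian:
    # d mean values plus d(d+1)/2 distinct covariance entries.
    n_params = d + d * (d + 1) / 2
    penalty = 0.5 * penalty_weight * n_params * np.log(n)
    return gain - penalty


def find_change_point(frames, margin=20, penalty_weight=1.0):
    """Return the split with the largest positive delta BIC, or None.

    `margin` (an illustrative choice) keeps enough frames on each side
    for the covariance estimates to be usable.
    """
    n = len(frames)
    candidates = range(margin, n - margin)
    scores = [delta_bic(frames, s, penalty_weight) for s in candidates]
    if not scores:
        return None
    best = int(np.argmax(scores))
    return candidates[best] if scores[best] > 0 else None
```

In this sketch the sign of the best score serves as the significance test: the penalty term grows with the number of model parameters and with the logarithm of the number of frames, so a boundary is reported only when the likelihood gain outweighs the extra model complexity.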
Previous systems performing audio segmentation with the BIC have made the assumption that the statistical model characterising each audio segment is a Gaussian process. However, the Gaussian model tends not to hold very well when only a small amount of audio data is available between segment changes. Segmentation with the Gaussian BIC therefore performs very poorly under these conditions.
Another major drawback of BIC-based segmentation systems is the computation time required to segment large audio streams. This is because previous BIC systems have used multi-dimensional features, such as mel-cepstral vectors or linear predictive coefficients, to describe important characteristics within the audio stream.
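One way to see why feature dimensionality matters is to count the free parameters of the model fitted to each segment. The sketch below assumes a single full-covariance Gaussian per segment, which is an assumption made here for illustration. The function name `gaussian_free_parameters` and the example dimensions 12 and 39 are hypothetical and are not figures from any particular system.

```python
def gaussian_free_parameters(dim):
    """Free parameters of one full-covariance Gaussian of dimension `dim`:
    `dim` mean values plus dim * (dim + 1) / 2 distinct covariance entries."""
    return dim + dim * (dim + 1) // 2


# Illustrative dimensions only: a scalar feature versus two vector features.
for dim in (1, 12, 39):
    print(dim, gaussian_free_parameters(dim))
```

The count grows quadratically with the feature dimension, and the covariance estimates and determinants must be recomputed for every candidate boundary, so the cost of each evaluation rises quickly as the dimension increases.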