1. Field of the Invention
The present invention relates generally to the field of audio data processing systems and methods, and, more particularly, to a novel system and method for performing blind change detection audio segmentation.
2. Discussion of the Prior Art
Many audio resources, such as broadcast news, contain different kinds of audio signals (speech, music, noise) recorded under different environmental and channel conditions. The performance of many applications that rely on these streams, such as speech recognition and audio indexing, degrades significantly when irrelevant portions of the audio stream are present. Segmenting the data into homogeneous portions according to type (speech, noise, music, etc.), speaker identity, environmental conditions, and channel conditions has therefore become an important preprocessing step.
Previous approaches to automatic segmentation of audio data can be classified into two categories: informed and blind. Informed approaches include both decoder-based and model-based algorithms. In decoder-based approaches, the input audio stream is first decoded using speech and silence models; the desired segments are then produced from the silence locations generated by the decoder. In model-based approaches, models are built to represent the different acoustic classes expected in the stream; the input audio stream is classified by maximum likelihood selection, and the locations where the acoustic class changes are identified as segment boundaries. In both cases, models trained on data representing all acoustic classes of interest are used in the automatic segmentation. Informed automatic segmentation is therefore limited to applications in which sufficient training data is available for building the acoustic models, and it cannot generalize to acoustic conditions not seen in the training data. In addition, approaches based solely on speech and silence models mainly detect silence locations, which do not necessarily correspond to boundaries between different acoustic segments. We will focus on blind automatic segmentation techniques, which do not suffer from these limitations and therefore serve a wider range of applications.
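As a rough illustration of the model-based approach described above, the following Python sketch classifies each frame by maximum likelihood selection and reports a boundary wherever the selected class changes. It assumes one scalar feature per frame and one univariate Gaussian per acoustic class; the class names, the model parameters, and the function names are hypothetical and serve only to make the example self-contained.

```python
import math

def gaussian_log_likelihood(x, mean, var):
    """Log density of a univariate Gaussian (assumed class model)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def classify_frames(frames, class_models):
    """Label each frame with the class whose model gives the highest likelihood."""
    labels = []
    for x in frames:
        best = max(
            class_models,
            key=lambda name: gaussian_log_likelihood(x, *class_models[name]),
        )
        labels.append(best)
    return labels

def segment_boundaries(labels):
    """Return the frame indices at which the selected class changes."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]

# Hypothetical pre-trained models: class name -> (mean, variance).
models = {"speech": (0.0, 1.0), "music": (5.0, 1.0)}
frames = [0.1, -0.3, 0.2, 4.8, 5.2, 5.1]
labels = classify_frames(frames, models)
boundaries = segment_boundaries(labels)
```

The sketch also shows the limitation noted above: a frame from an acoustic class that has no trained model is still forced into one of the known classes.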
Blind change detection avoids the requirements of the informed approach by building models of the observations in a neighborhood of a candidate point under the two hypotheses of change and no change, and by using a criterion based on the log likelihood ratio of these two models to segment the acoustic data automatically. Most previous approaches were intended to provide input to a speech recognition or speaker adaptation system. They were therefore evaluated by comparing the word error rates achieved with automatic and manual segmentation, not by the accuracy of the boundaries generated by the automatic segmentation. Exceptions to this trend include work whose main focus is data indexing.
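The log likelihood ratio test described above can be sketched as follows. This is a minimal example, assuming scalar features and maximum likelihood univariate Gaussian models for each hypothesis; practical systems typically use multivariate features, and the function names and the variance floor are assumptions made for illustration. Under these assumptions the maximized log likelihood of n samples depends only on n and the estimated variance, so the ratio reduces to a difference of log variances.

```python
import math

def ml_variance(samples, floor=1e-8):
    """Maximum likelihood variance estimate, floored to keep the log finite."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return max(var, floor)

def log_likelihood_ratio(window, split):
    """Log likelihood ratio of 'change at split' versus 'no change'.

    No change: a single Gaussian models the whole window.
    Change: separate Gaussians model window[:split] and window[split:].
    """
    left, right = window[:split], window[split:]
    if len(left) < 2 or len(right) < 2:
        raise ValueError("each side of the candidate point needs at least 2 samples")
    n, n1, n2 = len(window), len(left), len(right)
    return 0.5 * (
        n * math.log(ml_variance(window))
        - n1 * math.log(ml_variance(left))
        - n2 * math.log(ml_variance(right))
    )

# The mean shifts after the fifth sample in the first window only.
changed = [0.0, 0.1, -0.1, 0.05, -0.05, 5.0, 5.1, 4.9, 5.05, 4.95]
steady = [0.0, 0.1, -0.1, 0.05, -0.05, 0.0, 0.1, -0.1, 0.05, -0.05]
llr_changed = log_likelihood_ratio(changed, 5)
llr_steady = log_likelihood_ratio(steady, 5)
```

A detector built on this statistic would evaluate it at each candidate point and compare it, or a penalized version of it, to a threshold.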
In many applications, such as on-line audio indexing and information retrieval, the goal of the automatic segmentation algorithm is to detect the changes in the input audio stream while keeping the number of false alarms as low as possible. Unfortunately, all of the current techniques for automatic blind segmentation, such as those using the Kullback-Leibler distance, the generalized likelihood ratio distance, or the Bayesian Information Criterion (BIC), try to optimize an objective function that is not directly related to minimizing the missing probability for a given false alarm rate. If the missing probability is defined as the probability of not detecting a change within a reasonable period of time of a valid change in the stream, then minimizing the missing probability is equivalent to minimizing the duration between the detected change and the actual change, namely the detection time.
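The quantities discussed above (missed changes, false alarms, and detection time) can be measured for a given segmentation output along the following lines. The sketch assumes an on-line setting in which a detection counts only if it occurs at or after the true change and within a tolerance window; the tolerance value, the time unit, and the function name are illustrative assumptions.

```python
def score_detections(true_changes, detections, tolerance):
    """Return (miss_rate, false_alarms, mean_detection_time).

    A true change is detected if an unused detection d satisfies
    0 <= d - t <= tolerance. Unmatched detections are false alarms.
    """
    unused = sorted(detections)
    delays = []
    misses = 0
    for t in sorted(true_changes):
        match = next((d for d in unused if 0 <= d - t <= tolerance), None)
        if match is None:
            misses += 1
        else:
            unused.remove(match)
            delays.append(match - t)
    miss_rate = misses / len(true_changes) if true_changes else 0.0
    mean_delay = sum(delays) / len(delays) if delays else None
    return miss_rate, len(unused), mean_delay

# Hypothetical change times and detector output, in seconds.
miss_rate, false_alarms, mean_delay = score_detections(
    true_changes=[10.0, 30.0, 50.0],
    detections=[10.5, 22.0, 51.0],
    tolerance=2.0,
)
```

In this example the change at 30.0 is missed, the detection at 22.0 is a false alarm, and the two matched changes have detection times of 0.5 and 1.0.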
Known solutions to this problem, such as those using the BIC, are not accurate enough and have robustness problems because they compare a single criterion to a threshold, and that criterion is not directly related to minimizing the missing probability for a given false alarm rate.
Thus, it would be highly desirable to provide a novel approach for solving the automatic audio segmentation problems described herein with respect to the prior art.
It would be highly desirable to provide a novel approach for solving the automatic audio segmentation problem that combines the results of several segmentation algorithms to achieve better and more robust segmentation.