1. Field of the Invention
The present invention relates to a method for segmenting an audio data stream that is broadcast or recorded on some medium, where the audio data stream is a sequence of digital samples or can be transformed into such a sequence. The goal of the segmentation is to divide the audio data stream into segments that correspond to different physical sources of the audio signal. When one or more foreground sources and one or more background sources emit audio, the parameters of the background source(s) remain essentially unchanged within a single segment.
2. Description of the Related Art
Audio and video recordings have become commonplace with the advent of consumer-grade recording equipment. Unfortunately, both audio and video streams provide few clues to assist in accessing a desired section of a recording. In books, indexing is provided by the table of contents at the front and the index at the end, which readers can browse to locate authors and references to authors. A similar indexing scheme would be useful in an audio stream, for example to help locate the sections where a specific speaker is talking. The limited amount of metadata associated with most audio recordings does not provide enough information for confident and easy access to desired points of interest, so the user has to peruse the contents of a recording in sequential order to retrieve the desired information.
One solution to this problem is an automatic system for indexing audio events in the audio data stream. The indexing process consists of two sequential stages: segmentation and classification. Segmentation divides the audio stream into segments that are homogeneous in some sense; classification then labels these segments with appropriate annotations. Segmentation is thus the first and a very important stage of the indexing process, and it is the main focus of the present invention.
The basic audio events conventionally considered in an audio stream are speech, music, and noise (i.e., non-speech and non-music). Most attention in the field has been given to the detection, segmentation, and indexing of speech in audio streams such as broadcast news.
Broadcast news data arrive as long, unsegmented audio streams that not only contain speech from various speakers, backgrounds, and channels, but also contain a large amount of non-speech audio. It is therefore necessary to chop the long stream into smaller segments. It is also important to make these smaller segments homogeneous (each segment containing data from one source only), so that the non-speech information can be discarded, and so that segments from the same or similar sources can be clustered for speaker normalization and adaptation.
Zhan et al., “Dragon Systems' 1997 Mandarin Broadcast News System”, Proceedings of the Broadcast News transcription and Understanding Workshop, Lansdowne, Va., pp. 25-27, 1998, produced the segments by looking for sufficiently long silence regions in the output of a coarse recognition pass. This method generated considerable multi-speaker segments, and no speaker change information was used in the segmentation.
In subsequent work, Wegmann et al., "Progress in Broadcast News Transcription at Dragon Systems", Proceedings of ICASSP'99, Phoenix, Ariz., March, 1999, used speaker change detection in the segmentation pass. Their automatic segmentation procedure was as follows:
An amplitude-based detector was used to break the input into chunks that are 20 to 30 seconds long.
These chunks were further chopped into segments 2 to 30 seconds long, based on silences produced by a fast word recognizer.
These segments were further refined using a speaker change detector.
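The first of these passes can be sketched in code. The following is a minimal illustration of amplitude-based chunking, not Dragon Systems' implementation: the frame length, silence threshold, and cut policy are all assumed values chosen for the example.

```python
import numpy as np

def amplitude_chunks(samples, rate, frame_len=0.02, min_s=20, max_s=30,
                     silence_db=-40.0):
    """Break a long audio stream at low-amplitude frames so that chunks
    fall roughly in the 20-30 second range (pass 1 of a Wegmann-style
    segmenter).  All thresholds here are illustrative assumptions."""
    hop = int(frame_len * rate)
    n_frames = len(samples) // hop
    frames = samples[:n_frames * hop].reshape(n_frames, hop)
    # Frame energy in dB relative to the loudest frame.
    rms = np.sqrt(np.mean(frames.astype(float) ** 2, axis=1)) + 1e-12
    db = 20.0 * np.log10(rms / np.max(rms))
    silent = db < silence_db

    chunks, start = [], 0
    for i in range(n_frames):
        dur = (i - start) * frame_len
        # Cut at a silent frame once the chunk is long enough,
        # or force a cut at the maximum chunk length.
        if (silent[i] and dur >= min_s) or dur >= max_s:
            chunks.append((start * hop, i * hop))
            start = i
    if start < n_frames:
        chunks.append((start * hop, n_frames * hop))
    return chunks
```

The later passes would then re-cut these chunks at recognizer-detected silences and at hypothesized speaker changes, each pass refining the boundaries of the previous one.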
Balasubramanian et al., U.S. Pat. No. 5,606,643, enables retrieval based on indexing an audio stream of a recording according to the speaker. In particular, the audio stream may be segmented into speaker events, and each segment labeled with the type of event, or speaker identity. When speech from individuals is intermixed, for example in conversational situations, the audio stream may be segregated into events according to speaker difference, with segments created by the same speaker identified or marked.
Creating an index in an audio stream, either in real time or in post-processing, may enable a user to locate particular segments of the audio data. For example, this may enable a user to browse a recording to select audio segments corresponding to a specific speaker, or “fast-forward” through a recording to the next speaker. In addition, knowing the ordering of speakers can also provide content clues about the conversation, or about the context of the conversation.
The ultimate goal of the segmentation is to produce a sequence of discrete segments with particular characteristics remaining constant within each one. The characteristics of choice depend on the overall structure of the indexation system.
Saunders, “Real-Time Discrimination of Broadcast Speech/Music”, Proc. ICASSP 1996, pp. 993-996, has described a speech/music discriminator based on zero-crossings. Its application is discrimination between advertisements and programs in radio broadcasts. Since it is intended to be incorporated in consumer radios, it is designed to be low-cost and simple. It is mainly designed to detect the characteristics of speech, described as: limited bandwidth, alternating voiced and unvoiced sections, a limited range of pitch, syllabic duration of vowels, and energy variations between high and low levels. The method indirectly uses the amplitude, pitch, and periodicity of the waveform to carry out the detection, since zero-crossings give an estimate of the dominant frequency in the waveform.
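A zero-crossing front end of the kind such a discriminator builds on can be sketched as follows. The frame length is an assumed value, and the skewness statistic stands in for the full set of trained ZCR statistics that an actual Saunders-style classifier would threshold.

```python
import numpy as np

def zcr_features(samples, rate, frame_len=0.02):
    """Per-frame zero-crossing rate: the fraction of adjacent sample
    pairs whose signs differ.  Speech tends to show a strongly skewed
    ZCR distribution (alternating voiced/unvoiced frames); music and
    steady tones tend not to."""
    hop = int(frame_len * rate)
    n = len(samples) // hop
    frames = samples[:n * hop].reshape(n, hop)
    signs = np.sign(frames)
    signs[signs == 0] = 1          # treat exact zeros as positive
    # Each sign change produces |diff| == 2, hence the division by 2.
    return np.mean(np.abs(np.diff(signs, axis=1)) / 2.0, axis=1)

def zcr_skewness(zcr):
    """Third standardized moment of the ZCR sequence, one plausible
    summary statistic; the decision threshold itself would have to be
    trained and is not reproduced here."""
    mu, sigma = np.mean(zcr), np.std(zcr) + 1e-12
    return float(np.mean(((zcr - mu) / sigma) ** 3))
```

Because the per-frame ZCR tracks the dominant frequency, a low-pitched periodic signal yields a low, stable ZCR while wideband noise yields a ZCR near 0.5.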
Zue and Spina, “Automatic Transcription of General Audio Data: Preliminary Analyses”, Proc. ICSLP 1996, pp. 594-597, use an average of the cepstral coefficients over a series of frames. This is shown to work well in distinguishing between speech and music when the speech is band-limited to 4 kHz and the music occupies 16 kHz, but less well when both signals occupy a 16 kHz bandwidth.
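Averaging cepstra over frames can be sketched as follows. This uses a plain FFT-based real cepstrum as a simplified stand-in for the recognizer-style cepstral coefficients Zue and Spina used; the frame length and number of coefficients are assumed values.

```python
import numpy as np

def mean_cepstra(samples, rate, frame_len=0.032, n_ceps=13):
    """Average the first n_ceps real-cepstral coefficients over all
    frames, producing the kind of long-term statistic compared between
    speech and music.  Real cepstrum: inverse FFT of the log magnitude
    spectrum of each windowed frame."""
    hop = int(frame_len * rate)
    n = len(samples) // hop
    frames = samples[:n * hop].reshape(n, hop) * np.hanning(hop)
    spec = np.abs(np.fft.rfft(frames, axis=1)) + 1e-12  # avoid log(0)
    ceps = np.fft.irfft(np.log(spec), axis=1)
    return ceps[:, :n_ceps].mean(axis=0)
```

The band-limitation effect reported above is intuitive in this formulation: a 4 kHz-limited signal has a log spectrum that falls off sharply above 4 kHz, which shifts the low-order cepstral coefficients away from those of a full-band signal.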
Scheirer and Slaney, “Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator”, Proc. ICASSP 1997, pp. 1331-1334, use a variety of features. These are: 4 Hz modulation energy, low energy, roll-off of the spectrum, variance of the roll-off of the spectrum, the spectral centroid, variance of the spectral centroid, the spectral flux, variance of the spectral flux, the zero-crossing rate, variance of the zero-crossing rate, the cepstral residual, variance of the cepstral residual, and a pulse metric. The first two features are amplitude-related. The next six features are derived from the fine spectrum of the input signal and are therefore related to the techniques described in the previous reference.
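Two of these spectral features, the centroid and the flux, can be sketched as follows. The window and frame length are illustrative choices, not those of Scheirer and Slaney.

```python
import numpy as np

def centroid_and_flux(samples, rate, frame_len=0.02):
    """Per-frame spectral centroid (magnitude-weighted mean frequency)
    and spectral flux (frame-to-frame change of the normalized
    magnitude spectrum), two of the Scheirer-Slaney features."""
    hop = int(frame_len * rate)
    n = len(samples) // hop
    frames = samples[:n * hop].reshape(n, hop) * np.hanning(hop)
    mag = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(hop, d=1.0 / rate)
    centroid = (mag * freqs).sum(axis=1) / (mag.sum(axis=1) + 1e-12)
    # Flux: Euclidean distance between successive normalized spectra.
    norm = mag / (np.linalg.norm(mag, axis=1, keepdims=True) + 1e-12)
    flux = np.linalg.norm(np.diff(norm, axis=0), axis=1)
    return centroid, flux
```

A stationary tone gives a constant centroid at its frequency and near-zero flux, while speech, with its rapidly alternating voiced and unvoiced frames, gives a fluctuating centroid and high flux variance, which is why the variances of these features carry much of the discriminative power.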
Carey et al., “A Comparison of Features for Speech, Music Discrimination”, Proc. ICASSP 1999, pp. 149-152, use a variety of features. These are: cepstral coefficients, delta cepstral coefficients, amplitude, delta amplitude, pitch, delta pitch, zero-crossing rate, and delta zero-crossing rate. The pitch and the cepstral coefficients encompass the fine and broad spectral features, respectively. The zero-crossing parameters and the amplitude were considered worth investigating as a computationally inexpensive alternative to the other features.
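The delta features can be computed with the standard linear-regression formula over neighboring frames; the ±2-frame window below is an assumed value, as the exact width used by Carey et al. is not reproduced here.

```python
import numpy as np

def deltas(feats, width=2):
    """First-order delta coefficients of a (frames x dims) feature
    matrix: a linear regression over +/-width neighboring frames,
    with edge frames replicated at the boundaries."""
    feats = np.asarray(feats, dtype=float)
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, width + 1))
    d = np.zeros_like(feats)
    for k in range(1, width + 1):
        d += k * (padded[width + k:width + k + len(feats)]
                  - padded[width - k:width - k + len(feats)])
    return d / denom
```

The same routine serves for delta cepstra, delta amplitude, delta pitch, and delta zero-crossing rate: each is just this regression applied to the corresponding static feature track.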