The present invention relates to computer-based automatic segmentation of audio streams like speech or music into semantic or syntactic units like sentence, section and topic units, and more specifically to a method and a system for such a segmentation wherein the audio stream is provided in a digitized format.
Prior to or during production of audiovisual media like movies or broadcast news, a huge amount of raw audio material is recorded. This material is almost never used as recorded but is subjected to editing, i.e. segments of the raw material relevant to the production are selected and assembled into a new sequence.
Today this editing of the raw material is a laborious and time-consuming process involving many different steps, most of which require human intervention. A crucial manual step during the preprocessing of the raw material is the selection of cut points, i.e. the selection of possible segment boundaries that may be used to switch between different segments of the raw material during the subsequent editing steps. Currently these cut points are selected interactively in a process that requires a human to listen for potential audio cut points like speaker changes or the end of a sentence, to determine the exact time at which the cut point occurs, and to add this time to an edit-decision list (EDL).
For the above reasons there is a substantial need to automate part or even all of the above preprocessing steps.
Many audio or audio-visual media sources like real media streams or recordings of interviews, conversations or news broadcasts are available only in audio or audio-visual form. They lack corresponding textual transcripts and thus typographic cues such as headers, paragraphs, sentence punctuation, and capitalization that would allow for segmentation of those media streams merely by linking transcript information to audio (or video) information as proposed in U.S. patent application Ser. No. 09/447,871 filed by the present assignee. Those cues, however, are absent or hidden in speech output.
Therefore a crucial step for the automatic segmentation of such media streams is the automatic determination of semantic or syntactic boundaries like topic, sentence and phrase boundaries.
There exist approaches that use prosody for the segmentation process. Prosodic features are those features that influence not only single media stream segments (called phonemes) but extend over a number of segments. Exemplary prosodic features are information extracted from the timing and melody of speech, like pausing, changes in pitch range or amplitude, global pitch declination, or melody and boundary tone distribution.
As disclosed in a recent article by E. Shriberg, A. Stolcke et al. entitled “Prosody-Based Automatic Segmentation of Speech into Sentences and Topics”, published as pre-print to appear in Speech Communication 32(1-2) in September 2000, evaluation of prosodic features is usually combined with statistical language models mainly based on Hidden Markov Theory and thus presumes words already decoded by a speech recognizer. The advantage of using prosodic indicators for segmentation is that prosodic features are relatively unaffected by word identity and thus improve the robustness of the entire segmentation process.
A further article by A. Stolcke, E. Shriberg et al. entitled “Automatic Detection of Sentence Boundaries and Disfluencies Based on Recognized Words”, published as Proceedings of the International Conference on Spoken Language Processing, Sydney, 1998, also concerns segmentation of audio streams using a prosodic model and is accordingly based on speech already transcribed by an automatic speech recognizer.
In the first-cited article above, prosodic modeling is mainly based on only very local features, whereby for each inter-word boundary the prosodic features of the words immediately preceding and following the boundary, or alternatively those within an empirically optimized window of 20 frames before and after the boundary, are analyzed. In particular, prosodic features are extracted that reflect pause durations, phone durations, pitch information, and voice quality information. Pause features are extracted at the inter-word boundaries. Pause duration, fundamental frequency (F0), and voice quality features are extracted mainly from the word and window preceding the boundary. In addition, pitch-related features reflecting the difference in pitch across the boundary are included in the analysis.
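The window-based extraction described above can be illustrated with a minimal Python sketch. It is not the cited authors' implementation: the function names, the 10 ms frame length, the convention that unvoiced frames carry an F0 of zero, and the use of a simple mean over voiced frames are all assumptions made for illustration. Only the window of 20 frames on each side of the boundary comes from the text.

```python
from statistics import mean
from typing import Dict, List, Optional

WINDOW = 20        # frames analyzed on each side of the boundary (from the text)
FRAME_SEC = 0.01   # ASSUMPTION: 10 ms per frame; the text gives no frame length


def mean_voiced_f0(f0: List[float], start: int, end: int) -> Optional[float]:
    """Mean F0 over the voiced frames in f0[start:end]; None if all unvoiced.

    ASSUMPTION: unvoiced frames are encoded as 0.0 in the F0 track.
    """
    voiced = [v for v in f0[max(start, 0):max(end, 0)] if v > 0]
    return mean(voiced) if voiced else None


def boundary_features(f0: List[float], prev_word_end: int,
                      next_word_start: int) -> Dict[str, Optional[float]]:
    """Pause duration and F0 difference across one inter-word boundary.

    prev_word_end and next_word_start are frame indices (hypothetical inputs
    that would normally come from a word-level time alignment).
    """
    before = mean_voiced_f0(f0, prev_word_end - WINDOW, prev_word_end)
    after = mean_voiced_f0(f0, next_word_start, next_word_start + WINDOW)
    diff = None if before is None or after is None else after - before
    return {
        "pause_sec": (next_word_start - prev_word_end) * FRAME_SEC,
        "f0_before": before,
        "f0_after": after,
        "f0_diff": diff,
    }


# Synthetic F0 track: 30 voiced frames, a 40-frame pause, 30 voiced frames.
f0_track = [200.0] * 30 + [0.0] * 40 + [120.0] * 30
features = boundary_features(f0_track, prev_word_end=30, next_word_start=70)
print(features)
```

Note that the sketch still needs the word boundary positions as input, which in the cited approach are obtained from the output of a speech recognizer.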
Chapter 2.1.2.3 of the above article by E. Shriberg et al. generally refers to a mechanism for determining the F0 signal. A similar mechanism according to the invention will be discussed in more detail later referring to FIG. 3. The further details of the F0 processing are of no relevance for the understanding of the present invention.
As mentioned above, the segmentation process disclosed in the pre-cited article is based on language modeling to capture information about segment boundaries contained in the word sequences. The approach described there is to model the joint distribution of boundary types and words in a Hidden Markov Model (HMM), the hidden variable being the word boundaries. The segmentation into sentences therefore relies on a hidden-event N-gram language model where the states of the HMM consist of the end-of-sentence status of each word, i.e. boundary or no-boundary, plus any preceding words and possible boundary tags to fill up the N-gram context. As commonly known in the related art, transition probabilities are given by N-gram probabilities that in this case are estimated from annotated, boundary-tagged user-specific training data.
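To make the hidden-event idea concrete, the following Python sketch reduces it to a bigram (N = 2) model. It is a simplified illustration under stated assumptions, not the model of the cited article: the boundary tag `<s>`, the add-k smoothing, the toy training data, and all function names are hypothetical. With a bigram context the score of each inter-word gap depends only on the two neighboring words, so the boundary decisions decouple and no dynamic programming is needed; a higher-order model would require Viterbi or forward-backward decoding over the HMM states.

```python
from collections import Counter
from typing import List, Set, Tuple

BOUNDARY = "<s>"  # ASSUMPTION: hypothetical tag marking a sentence boundary

Model = Tuple[Counter, Counter, Set[str]]


def train_bigram(tagged_tokens: List[str]) -> Model:
    """Count unigrams and bigrams in a boundary-tagged token stream."""
    unigrams = Counter(tagged_tokens[:-1])          # counts of bigram histories
    bigrams = Counter(zip(tagged_tokens, tagged_tokens[1:]))
    return unigrams, bigrams, set(tagged_tokens)


def prob(model: Model, prev: str, tok: str, k: float = 0.1) -> float:
    """Add-k smoothed estimate of P(tok | prev); k is an illustrative choice."""
    unigrams, bigrams, vocab = model
    return (bigrams[(prev, tok)] + k) / (unigrams[prev] + k * len(vocab))


def tag_boundaries(model: Model, words: List[str]) -> List[bool]:
    """For each gap between words[i] and words[i+1], decide boundary or not.

    Compares the two ways the hidden event can fill the gap:
      boundary:    P(<s> | w_i) * P(w_{i+1} | <s>)
      no boundary: P(w_{i+1} | w_i)
    """
    decisions = []
    for prev, nxt in zip(words, words[1:]):
        p_boundary = prob(model, prev, BOUNDARY) * prob(model, BOUNDARY, nxt)
        p_none = prob(model, prev, nxt)
        decisions.append(p_boundary > p_none)
    return decisions


# Toy boundary-tagged training data (invented for illustration only).
training = "<s> i see <s> you know <s> i know <s> you see <s>".split()
model = train_bigram(training)
decisions = tag_boundaries(model, "i know you see".split())
print(decisions)
```

The sketch takes an untagged word sequence as input, which reflects the dependence of this kind of model on words already decoded by a speech recognizer.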
Concerning segmentation, the authors of that article further propose the use of a decision tree that relies on a number of prosodic features falling into different groups. To designate the relative importance of these features in the decision tree, a measure called “feature usage” is utilized that is computed as the relative frequency with which that feature or feature class is queried in the decision tree. The prosodic features used are pause duration at a given boundary, turn/no turn at the boundary, F0 difference across the boundary, and rhyme duration.
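One possible reading of the “feature usage” measure is sketched below in Python: each time a sample is classified, every feature queried on its path through the tree is counted, and the counts are normalized to relative frequencies. The tree structure, thresholds, feature names and sample values are invented for illustration, and the exact counting scheme used in the cited article may differ.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Leaf:
    label: str  # "boundary" or "no-boundary"


@dataclass
class Node:
    feature: str       # name of the feature queried at this node
    threshold: float   # samples with value < threshold go left, others right
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]


def classify(tree: Union[Node, Leaf], sample: Dict[str, float],
             queries: Counter) -> str:
    """Walk the tree for one sample, recording every feature that is queried."""
    node = tree
    while isinstance(node, Node):
        queries[node.feature] += 1
        node = node.left if sample[node.feature] < node.threshold else node.right
    return node.label


def feature_usage(tree: Union[Node, Leaf],
                  samples: List[Dict[str, float]]) -> Dict[str, float]:
    """Relative frequency with which each feature is queried over all samples."""
    queries: Counter = Counter()
    for sample in samples:
        classify(tree, sample, queries)
    total = sum(queries.values())
    return {f: n / total for f, n in queries.items()} if total else {}


# Hypothetical two-level tree: short pauses are never boundaries; long pauses
# are boundaries only if the F0 difference across the boundary is large enough.
tree = Node("pause_duration", 0.3,
            Leaf("no-boundary"),
            Node("f0_difference", 20.0, Leaf("no-boundary"), Leaf("boundary")))

samples = [
    {"pause_duration": 0.1, "f0_difference": 5.0},
    {"pause_duration": 0.2, "f0_difference": 40.0},
    {"pause_duration": 0.5, "f0_difference": 10.0},
    {"pause_duration": 0.6, "f0_difference": 35.0},
]
usage = feature_usage(tree, samples)
print(usage)
```

In this toy setup the pause feature is queried for every sample while the F0 feature is queried only for samples with a long pause, so the pause feature receives the larger usage value.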
The above-cited prior art approaches have the drawback that they either necessarily use a speech recognizer or require multiple processing steps. The entire known segmentation process is error-prone, since it has to rely on speech recognizer output, and is time-consuming. In addition, most of the known approaches use complex mechanisms and technologies, and thus their technical realization is rather cost-intensive.