1. Field of the Invention
The present invention relates generally to audio streams, including audio streams extracted from video, and more particularly to systems and methods for classifying and indexing audio streams to support subsequent retrieval, gisting, summarizing, skimming, and general searching of the audio streams.
2. Description of the Related Art
Accompanying the burgeoning growth of computer use in general and multimedia computer applications in particular, a large amount of audio continues to be produced from, e.g., audio-video applications, and the audio is then electronically stored. As recognized by the present invention, as the number of audio files grows, it becomes increasingly difficult to use stored audio streams quickly and efficiently using only existing audio file directories or other existing means for access. For example, it might be desirable to access an audio stream derived from, e.g., video, based on a user query to retrieve information, or to present a summary of audio streams, or to enable a user to skim or gist an audio stream. Accordingly, the present invention recognizes a growing need to efficiently search for particular audio streams to which access is desired by a user but which might very well be stored with thousands of other audio streams.
Conventional information retrieval techniques are based on the assumption that the source text, whether derived from audio or not, is free of noise and errors. When a source text is derived from audio, however, the above assumption is a poor one to make. This is because speech recognition engines are used to convert an audio stream to computer-stored text, and given the inexact nature and inherent difficulty of the task, such conversion is virtually impossible to accomplish without errors and without introducing noise in the text. For example, certain words in an audio stream may not be recognized correctly (e.g., spoken "land" might be translated to "lamb") or at all, thereby diminishing the recall capability and precision of an information retrieval system. By "precision" is meant the capability of a system to retrieve only "correct" documents, whereas "recall" refers to a system's capability in retrieving as many correct documents as possible. Fortunately, we have recognized that it is possible to account for limitations of speech recognition engines in converting audio streams to text, and that, by accounting for these limitations, it is possible to improve the precision and recall of an information retrieval system.
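The definitions of precision and recall given above can be illustrated with a minimal sketch. It uses the conventional set-based formulation (precision as the fraction of retrieved documents that are correct, recall as the fraction of correct documents that are retrieved), which is one common reading of those definitions rather than a formula stated in this description. The function name `precision_recall` and the document identifiers are hypothetical and chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers the retrieval system returned (hypothetical data).
    relevant:  identifiers that are actually "correct" for the query.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # correct documents that were retrieved

    # Precision: share of retrieved documents that are correct.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of correct documents that were retrieved.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query for "land": four transcripts truly contain the word,
# but a recognition error ("land" -> "lamb") hides two of them, and one
# unrelated transcript is returned by mistake.
relevant = {"doc1", "doc2", "doc3", "doc4"}
retrieved = {"doc1", "doc2", "doc5"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this illustrative case a misrecognized word lowers recall (two correct transcripts are missed), while a spurious match lowers precision.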
In addition to the above considerations, the present invention recognizes that in many instances, a user might want to recall a digitally stored audio stream to listen to it, but the user might not wish to listen to or access information from an entire audio stream, but only from particular portions of it. Indeed, a user might wish only to hear a summary of an audio stream or streams, or to gain an understanding of the gist of an audio stream. For example, a user might wish only to hear portions of audio streams having to do with particular topics, or spoken by particular people, or, in the case of recorded programming, a user might prefer to listen only to non-commercial portions of the programming. Similarly, a user might want to "fast forward" through audio. For example, a user might want to speed up "less interesting" portions of an audio stream (e.g., commercials) while keeping "interesting" portions at a comprehensible speed.
Past efforts in audio content analysis, however, such as those disclosed in Japanese patent publications 8063184 and 10049189 and European patent publication 702351, have largely focused not on the above considerations, but rather simply on improving the accuracy of speech recognition computer input devices, or on improving the quality of digitally-processed speech. While perhaps effective for their intended purposes, these past efforts do not seem to consider and consequently do not address indexing audio streams based on audio events in the streams, to support subsequent searching, gisting, and summarization of computer-stored audio streams.
U.S. Pat. No. 5,199,077 discloses wordspotting for voice editing and indexing. This method works for keyword indexing of single speaker audio or video recordings. The above-mentioned Japanese patent publications 8063184 and 10049189 refer to audio content analysis as a step towards improving speech recognition accuracy. Also, Japanese patent publication 8087292A uses audio analysis for improving the speed of speech recognition systems. The above-mentioned European patent publication EP702351A involves identifying and recording audio events in order to assist with the recognition of unknown phrases and speech. U.S. Pat. No. 5,655,058 describes a method for segmenting audio data based on speaker identification, while European patent publication EP780777A describes the processing of an audio file by speech recognition systems to extract the words spoken in order to index the audio.
The methods disclosed in these systems target improving the accuracy and performance of speech recognition. The indexing and retrieval systems disclosed are based on speaker identification, or on direct application of speech recognition to the audio track and the use of words as search terms. The present system, in contrast, is directed towards indexing, classification, and summarization of real world audio which, as understood herein, seldom consists of single speaker, clear audio consisting of speech segments alone. Recognizing these considerations, the present invention improves on prior word spotting techniques using the system and method fully set forth below, in which music and noise are segmented from the speech segments, speech recognition is applied to the clear speech segments, and an advanced retrieval system is built which takes the results of audio analysis into account.
Other techniques have been described for analyzing the content of audio, including the method disclosed by Erling et al. in an article entitled "Content-Based Classification, Search, and Retrieval of Audio", published in IEEE Multimedia, 1996 (hereinafter "Musclefish"). The method by which Musclefish classifies sounds, however, is not driven by heuristically determined rules, but rather by statistical analysis. As recognized by the present invention, heuristically determined rules are more robust than statistical analyses for classifying sounds, and a rule-based classification method can more accurately classify sound than can a statistics-based system. Furthermore, the Musclefish system is intended to be used only on short audio streams (less than 15 seconds). This renders it inappropriate for information retrieval from longer streams.
Still other methods have been disclosed for indexing audio, including the method disclosed by Pfeiffer et al. in an article entitled "Automatic Audio Content Analysis", published in ACM Multimedia 96 (1996) (hereinafter "MoCA"). Like many similar methods, however, the MoCA method is domain specific, i.e., it seeks to identify audio that is related to particular types of video events, such as violence. The present invention recognizes that many audio and multimedia applications would benefit from a more generalized ability to segment, classify, and search for audio based on the content thereof, and more specifically based on one or more predetermined audio events therein.