The present invention relates generally to information retrieval systems and, more particularly, to methods and apparatus for retrieving multimedia information, such as audio and video information, satisfying user-specified criteria from a database of multimedia files.
Information retrieval systems have focused primarily on retrieving text documents from large collections of text. The basic principles of text retrieval are well established and have been well documented. See, for example, G. Salton, Automatic Text Processing, Addison-Wesley, 1989. An index is a mechanism that matches descriptions of documents with descriptions of queries. The indexing phase describes documents as a list of words or phrases, and the retrieval phase describes the query as a list of words or phrases. A document (or a portion thereof) is retrieved when the document description matches the description of the query.
Data retrieval models required for multimedia objects, such as audio and video files, are quite different from those required for text documents. There is little consensus on a standard set of features for indexing such multimedia information. One approach for indexing an audio database is to use certain audio cues, such as applause, music or speech. Similarly, an approach for indexing video information is to use key frames, or shot changes. For audio and video information that is predominantly speech, such as audio and video information derived from broadcast sources, the corresponding text may be generated using a speech recognition system and the transcribed text can be used for indexing the associated audio (and video).
Currently, audio information retrieval systems consist of two components, namely, a speech recognition system to transcribe the audio information into text for indexing, and a text-based information retrieval system. Speech recognition systems are typically guided by three components, namely, a vocabulary, a language model and a set of pronunciations for each word in the vocabulary. A vocabulary is a set of words that is used by the speech recognizer to translate speech to text. As part of the decoding process, the recognizer matches the acoustics from the speech input to words in the vocabulary. Therefore, the vocabulary defines the words that can be transcribed. If a word that is not in the vocabulary is to be recognized, the unrecognized word must first be added to the vocabulary.
A language model is a domain-specific database of sequences of words in the vocabulary, together with a set of probabilities of the words occurring in a specific order. When the language model is operative, the output of the speech recognizer is biased towards the high-probability word sequences. Thus, correct decoding is a function of whether the user speaks a sequence of words that has a high probability within the language model. Conversely, when the user speaks an unusual sequence of words, the decoder performance will degrade. Recognition of a word is based entirely on its pronunciation, i.e., the phonetic representation of the word. For best accuracy, domain-specific language models must be used. The creation of such a language model requires explicit transcripts of the text along with the audio.
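By way of illustration only, the bias of a language model towards high-probability word sequences can be sketched with a simple bigram model estimated by relative frequency. The bigram form, the function names, the toy corpus and the `floor` value assigned to unseen word pairs are assumptions made for this sketch and are not taken from any particular speech recognition system.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Estimate P(next_word | word) from tokenized sentences by relative frequency."""
    pair_counts = defaultdict(Counter)
    for words in sentences:
        for prev, nxt in zip(words, words[1:]):
            pair_counts[prev][nxt] += 1
    model = {}
    for prev, counts in pair_counts.items():
        total = sum(counts.values())
        model[prev] = {word: count / total for word, count in counts.items()}
    return model

def sequence_probability(model, words, floor=1e-6):
    """Multiply bigram probabilities; unseen pairs receive a small floor value (assumed)."""
    prob = 1.0
    for prev, nxt in zip(words, words[1:]):
        prob *= model.get(prev, {}).get(nxt, floor)
    return prob

# Toy domain-specific training text (hypothetical).
corpus = [
    ["stock", "prices", "rose", "today"],
    ["stock", "prices", "fell", "today"],
    ["stock", "markets", "rose", "today"],
]
model = train_bigram_model(corpus)
likely = sequence_probability(model, ["stock", "prices", "rose"])
unusual = sequence_probability(model, ["prices", "stock", "rose"])
```

In this sketch the word order seen in the training text scores far higher than the unusual order, which is the effect that degrades decoding when a speaker departs from the sequences the model expects.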
Text-based information retrieval systems typically work in two phases. The first phase is an off-line indexing phase, where relevant statistics about the textual documents are gathered to build an index. The second phase is an on-line searching and retrieval phase, where the index is used to perform query-document matching followed by the return of relevant documents (and additional information) to the user. During the indexing phase, the text output from the speech recognition system is processed to derive a document description that is used in the retrieval phase for rapid searching.
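The two phases described above may be sketched, under simplifying assumptions, as an off-line step that builds an inverted index and an on-line step that matches a query against it. The whitespace tokenization, the function names and the ranking by number of matching query words are illustrative choices only.

```python
from collections import defaultdict

def build_index(documents):
    """Off-line phase: map each word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """On-line phase: rank documents by how many distinct query words they contain."""
    hits = defaultdict(int)
    for word in set(query.lower().split()):
        for doc_id in index.get(word, ()):
            hits[doc_id] += 1
    # Highest match count first; ties broken by document identifier.
    return sorted(hits, key=lambda doc_id: (-hits[doc_id], doc_id))

# Hypothetical transcribed documents.
documents = {
    "d1": "senate votes on budget",
    "d2": "budget talks stall",
    "d3": "weather report",
}
index = build_index(documents)
results = search(index, "senate budget")
```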
During the indexing process, the following operations are generally performed, in sequence: (i) tokenization, (ii) part-of-speech tagging, (iii) morphological analysis, and (iv) stop-word removal using a standard stop-word list. Tokenization detects sentence boundaries. Morphological analysis is a form of linguistic signal processing that decomposes nouns into their roots, along with a tag to indicate the plural form. Likewise, verbs are decomposed into units designating person, tense and mood, along with the root of the verb. For a general discussion of the indexing process, see, for example, S. Dharanipragada et al., “Audio-Indexing for Broadcast News,” in Proc. SDR97, 1997, incorporated by reference herein.
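A heavily simplified sketch of these indexing operations follows. It is an assumption-laden illustration: part-of-speech tagging is omitted, the morphological step is a crude stand-in that only strips a plural "s", and the short stop-word list is hypothetical rather than a standard list.

```python
import re

# Illustrative stop-word list only; a real system would use a standard list.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "are", "to"}

def tokenize(text):
    """Split text at sentence boundaries, then into lowercase word tokens."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [re.findall(r"[a-z']+", s.lower()) for s in sentences if s]

def analyze(word):
    """Crude stand-in for morphological analysis: strip a plural 's' and tag it."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1], "PLURAL"
    return word, None

def describe_document(text):
    """Return (root, tag) pairs for every non-stop word, in document order."""
    terms = []
    for sentence in tokenize(text):
        for word in sentence:
            if word in STOP_WORDS:
                continue
            terms.append(analyze(word))
    return terms

description = describe_document("The markets are closed. Prices rose in the afternoon.")
```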
While such content-based audio information retrieval systems allow a user to retrieve audio files containing one or more key words specified in a user-defined query, they are limited by the accuracy of the transcription process. Generally, the transcription process provides the best word sequence and rejects all others. Thus, if the transcription process improperly identifies a word or phrase in a given document, the document will be overlooked (and not returned to the user) during the query-document matching phase.
Generally, an audio retrieval system and method are disclosed for augmenting the transcription of an audio file with one or more alternate word or phrase choices, such as next-best guesses for each word or phrase, in addition to the best word sequence identified by the transcription process. The audio retrieval system can utilize a primary index file containing the best identified words and/or phrases for each portion of the input audio stream and a supplemental index file containing one or more alternative choices for each word or phrase in the transcript. The present invention allows words that are incorrectly transcribed during speech recognition to nonetheless be identified in response to a textual query by searching the supplemental index files.
During an indexing process, the list of alternative word or phrase choices provided by the speech recognition system is collected to produce a set of supplemental index files. During a retrieval process, the user-specified textual query is matched against the primary and supplemental indexes derived from the transcribed audio to identify relevant documents. An objective ranking function scales matches found in the supplemental index file(s) using either a predefined scaling factor or a value reflecting the confidence of the corresponding alternative choice as identified by the speech recognition system.
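One possible arrangement of the primary and supplemental indexes, and of a ranking function that scales supplemental matches, is sketched below. The data layout, the function names, the unit weight given to primary matches and the additive scoring are assumptions made for illustration and are not the only way the ranking function could be realized.

```python
from collections import defaultdict

def build_indexes(transcripts):
    """transcripts: {doc_id: [(best_word, [(alt_word, confidence), ...]), ...]}"""
    primary = defaultdict(lambda: defaultdict(int))
    supplemental = defaultdict(lambda: defaultdict(list))
    for doc_id, slots in transcripts.items():
        for best, alternatives in slots:
            primary[best][doc_id] += 1
            for alt, confidence in alternatives:
                supplemental[alt][doc_id].append(confidence)
    return primary, supplemental

def rank(query_terms, primary, supplemental, scale=None):
    """Score documents against both indexes.

    A primary match counts as 1. A supplemental match counts as `scale`
    when a fixed factor is given, otherwise as the recognizer confidence.
    """
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, count in primary.get(term, {}).items():
            scores[doc_id] += count
        for doc_id, confidences in supplemental.get(term, {}).items():
            for confidence in confidences:
                scores[doc_id] += scale if scale is not None else confidence
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical recognizer output: in clip_b the spoken word "votes" was
# transcribed as "boats", with "votes" kept as an alternative choice.
transcripts = {
    "clip_a": [("senate", []), ("votes", [("boats", 0.3)])],
    "clip_b": [("senate", []), ("boats", [("votes", 0.4)])],
}
primary, supplemental = build_indexes(transcripts)
by_confidence = rank(["votes"], primary, supplemental)
by_fixed_scale = rank(["votes"], primary, supplemental, scale=0.5)
```

In this example a search of the primary index alone would return only clip_a for the query "votes"; consulting the supplemental index also returns clip_b, ranked lower because its match is scaled down.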
A more complete understanding of the present invention, as well as further features and advantages of the present invention, will be obtained by reference to the following detailed description and drawings.