1. Field of the Invention
The present invention generally relates to speech technology applications and audio information retrieval. More specifically, the present invention relates to the integration of multiple speech recognition technologies with audio-specific information retrieval algorithms for rapid multimedia indexing and retrieval.
2. Related Art
The ability to accurately detect and retrieve information from audio files is plagued by numerous problems. The desire to detect and retrieve information from multimedia files exacerbates these difficulties, since multimedia files containing audio information are not readily searchable by conventional text-based methods.
To search an audio file according to specified criteria to uncover relevant information, an analyst typically uses an automatic speech recognizer (ASR) to transcribe an audio file. The analyst then conducts a text-based keyword search of the ASR transcription output. This method enables the analyst to examine the contents of the audio file and search the audio file according to user-defined search parameters.
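The conventional two-step workflow described above (transcribe, then search the text) can be illustrated with a minimal sketch. The transcript layout (a list of word, start time, end time tuples) and the function name `keyword_search` are assumptions made for illustration only; they do not correspond to the output format of any particular ASR system.

```python
from typing import List, Tuple

# Hypothetical ASR output: (word, start_seconds, end_seconds) per recognized word.
Transcript = List[Tuple[str, float, float]]


def keyword_search(transcript: Transcript, keyword: str) -> List[Tuple[float, float]]:
    """Return the (start, end) time span of every exact, case-insensitive match."""
    target = keyword.lower()
    return [(start, end) for word, start, end in transcript if word.lower() == target]


# Illustrative transcript; the words and times are made up for this example.
example_transcript: Transcript = [
    ("the", 0.0, 0.2),
    ("shipment", 0.2, 0.8),
    ("arrives", 0.8, 1.3),
    ("Tuesday", 1.3, 1.9),
    ("shipment", 4.0, 4.6),
]

hits = keyword_search(example_transcript, "Shipment")
```

Because the search operates only on the transcribed text, any word the recognizer misses or misrecognizes is invisible to it.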
To ensure accurate speech detection and retrieval, ASRs and other speech search engines employ search algorithms that attempt to minimize false positives and false negatives. A false positive occurs when a specified search parameter is identified by a search engine as being present in an audio sample when, in fact, the specified search parameter is not present in the audio sample being analyzed. A false negative occurs when a specified search parameter is not identified by a search engine as being present in an audio sample when, in fact, the specified search parameter is present in the audio sample being analyzed. Essentially, a false positive is a detection when no detection should be made, and a false negative is no detection when a detection should be made.
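The two error types defined above can be expressed as simple set differences. The following is a minimal sketch that assumes, purely for illustration, that an audio sample is divided into numbered segments and that a search parameter is either present or absent in each segment. The function name `score_detections` and the segment numbering are hypothetical.

```python
from typing import Dict, Iterable, Set


def score_detections(detected: Iterable[int], actual: Iterable[int]) -> Dict[str, Set[int]]:
    """Compare segments flagged by a search engine against segments that truly match.

    detected: segment ids where the engine reported the search parameter.
    actual:   segment ids where the search parameter is really present.
    """
    detected_set, actual_set = set(detected), set(actual)
    return {
        # Reported and really present.
        "true_positives": detected_set & actual_set,
        # Reported, but not actually present: a detection that should not have been made.
        "false_positives": detected_set - actual_set,
        # Present, but not reported: a detection that should have been made.
        "false_negatives": actual_set - detected_set,
    }


# Illustrative segment ids only.
result = score_detections(detected=[1, 4, 7], actual=[1, 5, 7, 9])
```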
False negatives and false positives are often caused by ASR transcription errors. Two significant causes of transcription errors are the variability of the acoustic channel and the presence of harsh noise conditions. Other contributing factors include speaker variance and the language model used by the ASR system. The result is a speech transcript that is often replete with deletion, substitution and insertion errors, thereby decreasing the reliability and usefulness of the ASR system.
Deletion errors occur when the speech transcript fails to report a word as spoken at a specified time in the audio file when, in fact, the word was spoken at the specified time. Insertion errors can occur when the speech transcript reports a word as spoken at a specified time in the audio file when, in fact, no word was spoken at the specified time. Alternatively, insertion errors can occur when multiple words are reported at a specified time when, in fact, a single word was spoken. Substitution errors occur when the speech transcript fails to properly recognize a word as spoken at a specified time in the audio file and consequently reports a different, and therefore incorrect, word as spoken at the specified time. Deletion errors in a speech transcript can prevent an analyst from finding the information being sought within an audio file. Similarly, insertion and substitution errors can be misleading and confusing to an analyst who is attempting to gauge the contents, context, and importance of a reviewed audio file.
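The three error types can be counted by aligning a reference transcript against the ASR output. The sketch below uses a standard word-level edit-distance alignment, which is an assumption made here for illustration: it aligns words by sequence rather than by the time stamps referred to above, and when several minimal alignments exist it reports only one of them. The function name `count_errors` is hypothetical.

```python
from typing import Dict


def count_errors(reference: str, hypothesis: str) -> Dict[str, int]:
    """Count substitutions, deletions and insertions in one minimal word alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)

    # cost[i][j] = minimum number of edits turning ref[:i] into hyp[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # every reference word deleted
    for j in range(1, m + 1):
        cost[0][j] = j  # every hypothesis word inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = int(ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # match or substitution
                cost[i - 1][j] + 1,             # deletion
                cost[i][j - 1] + 1,             # insertion
            )

    # Walk back from the end of both sequences to classify each edit.
    counts = {"substitutions": 0, "deletions": 0, "insertions": 0}
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = int(i > 0 and j > 0 and ref[i - 1] != hyp[j - 1])
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            counts["substitutions"] += mismatch
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            counts["deletions"] += 1
            i -= 1
        else:
            counts["insertions"] += 1
            j -= 1
    return counts
```

For example, comparing the reference "send the report today" with the hypothesis "send report today" yields one deletion, since the word "the" is missing from the hypothesis.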
Current search and retrieval techniques do not offer relief from the enormous amounts of information the analyst must examine to uncover relevant information. With recent advances in storage technology, storing large numbers of multimedia files in various formats is no longer a problem. However, the critical bottleneck that has emerged for the analyst is the amount of time required to examine all of the stored information. The amount of time that the analyst can devote to any particular file is limited. That is, the analyst simply does not have the time to listen to each audio file in its entirety, watch each multimedia file in its entirety, or read each transcription output in its entirety to determine whether it contains any relevant information. Current search and retrieval techniques fail to quickly and reliably direct the analyst to those particular audio segments that have a high probability of containing the items being sought.
An additional limitation of current search and retrieval techniques is the inability to provide an indication of an audio file's context. Merely identifying the presence of a keyword in an audio file is often insufficient. In the absence of context, such “hits” are often not very useful because there are simply too many of them. Current search and retrieval techniques also fail to identify and fully exploit non-lexical features of audio files that could give the analyst a quick and accurate understanding of the contents and context of an audio file. Non-lexical features such as background noises, manner of speaking, tension in the voice of the speaker, speaker identity and other parameters can permit the analyst to refine a search according to context or prosodic cues and thereby shorten review time. Prosodic features generally relate either to the rhythmic aspect of language or to suprasegmental speech information sources, such as pitch, stress, juncture, nasalization and voicing.
Given the limitations of current search and retrieval techniques, a need exists for a system that leverages, enhances, and integrates multiple speech technologies with audio-specific information detection to provide rapid indexing and retrieval. That is, a system is needed that identifies audio segments of interest based on multiple lexical and non-lexical user-specified parameters, in order to reduce examination time, provide enhanced information retrieval parameters, and better meet the needs of real-world search and retrieval applications.