With the recent emergence of rich media creation tools, rich media content is being created and archived at a rapid pace. Rich media content can generally refer to a time-synchronized ensemble of audio content and/or visual content (text, images, graphics, video, etc.) captured from a presentation, lecture, speech, debate, television broadcast, board meeting, video, etc. Metadata content may also be associated with the rich media content. Each of the audio content, visual content, and metadata content types can contain valuable information which may be unique to that content type. For example, a slide shown during a presentation may contain information that the presenter never referred to verbally. As a result, locating relevant information within rich media content requires the ability to efficiently analyze and search each type of content within the rich media content.
Unfortunately, traditional rich media content search engines are unable to effectively implement multi-type (or multi-modal) searching. In most cases, rich media content search engines are only capable of searching through a single rich media content type. For example, some rich media content search engines utilize a single textual content search engine to search for relevant information within rich media content. The textual content search engine can be used to search through rich media content metadata such as content title, content date, content presenter, etc. Other rich media content search engines utilize a single audio content search engine to locate relevant information. Audio content search engines generally use automatic speech recognition (ASR) to analyze and index audio content such that the audio content can be searched using a standard text-based search engine. These single-mode search engines are limited by their inability to locate relevant information in more than a single rich media content type.
More recent rich media content search engines have attempted to combine aspects of textual metadata content search engines, audio content search engines, and/or visual content search techniques to improve rich media content searching. However, these search engines are limited in their ability to effectively combine the search results obtained from the different search engines. In addition, the audio content search engines on which they rely are often unable to produce reliable search results. Current audio content search techniques utilize either ASR or phonetic matching to generate an audio content transcript which is capable of being searched by a standard textual content search engine.
Automatic speech recognition typically uses a pre-determined vocabulary of words and attempts to identify words within the audio content in order to obtain an audio content transcript. Audio content transcripts generated by ASR are limited because the ASR vocabulary used may not include proper names, uncommon words, and industry-specific terms. The ASR audio content transcripts also often contain errors due to a speaker's pronunciation variance, voice fluctuation, articulation, and/or accent. Error rates are usually higher when the ASR system has not been specifically trained for a particular speaker. In many instances, training the ASR system for a particular speaker in advance is simply not possible or practical, and the ASR system is therefore required to perform speaker-independent recognition. Variances in recording characteristics and environmental noise further increase the likelihood of errors in an ASR system.
Phonetic matching can refer to a technique for locating occurrences of a search phrase within audio content by comparing sub-word units of sound called phonemes. Phonetic matching has several advantages over ASR, including the ability to compensate for spelling mistakes in a search query, the ability to find words which are not in a pre-defined vocabulary, and greater flexibility in finding partial matches between the search query and the audio content. However, as with ASR, results may contain errors due to speaker pronunciation variances and other factors. Thus, there are many cases in which neither ASR nor phonetic matching, used alone, is capable of producing accurate and reliable audio content search results. Current audio content search engines are further limited by their inability to effectively take advantage of other synchronized content types of rich media content, such as visual content which is presented in temporal proximity to spoken words.
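By way of illustration only, the phoneme comparison described above can be sketched as follows. The sketch is a simplified assumption and not the method of any particular search engine: it supposes that the audio content has already been reduced to a sequence of phoneme symbols, and it slides the phonemes of the search query across that sequence, reporting positions where the edit distance is small. The names `TOY_LEXICON`, `edit_distance`, `phonetic_matches`, and the `max_errors` parameter are hypothetical and introduced solely for this example.

```python
from typing import List, Tuple

# Hypothetical toy pronunciation table (ARPAbet-like symbols). A real system
# would obtain phonemes from a pronunciation lexicon or a
# grapheme-to-phoneme model; these entries are assumptions for illustration.
TOY_LEXICON = {
    "search": ["S", "ER", "CH"],
    "serch": ["S", "ER", "CH"],  # a misspelled query can map to the same sounds
}


def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def phonetic_matches(query: List[str], audio: List[str],
                     max_errors: int = 1) -> List[Tuple[int, int]]:
    """Return (start_index, distance) for every window of the audio phoneme
    stream, of the same length as the query, within max_errors of the query."""
    n = len(query)
    hits = []
    for start in range(len(audio) - n + 1):
        d = edit_distance(query, audio[start:start + n])
        if d <= max_errors:
            hits.append((start, d))
    return hits


# Phoneme stream assumed to have been extracted from the audio content.
audio_phonemes = ["R", "IH", "CH", "M", "IY", "D", "IY", "AH", "S", "ER", "CH"]
print(phonetic_matches(TOY_LEXICON["serch"], audio_phonemes))
```

Because the comparison operates on sounds rather than on vocabulary words, a misspelled query or a word absent from an ASR vocabulary can still be located. A tolerance such as `max_errors` admits partial matches, at the cost of admitting false positives when pronunciation varies.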
Thus, there is a need for a multi-type rich media content search system which effectively combines the results of a visual content search, an audio content search, and a textual metadata content search. Further, there is a need for an audio content search system which utilizes both automatic speech recognition and phonetic matching to enhance the accuracy and reliability of audio content search results. Further, there is a need for an audio content search system which utilizes correlated, time-stamped textual content to enhance the accuracy of audio content search results.