There has been a dramatic increase in the audio content available in the enterprise environment. Audio content includes both audio documents (streaming audio and audio recordings) and the audio portion of video documents (streaming video and video recordings). Examples of audio content include online lecture videos, archived meetings, archived conference calls, and voicemail.
Because there is often a great deal of audio content in enterprise environments, it is frequently desirable to be able to search the audio content of documents. However, unlike Internet audio and video content, the enterprise setting offers little meta-data such as anchor text, surrounding text, and closed captions. Thus, indexing such meta-data, while successful for Internet content, results in poor search accuracy in the enterprise context.
Another way to search enterprise audio content is by using text indexing. Typical text indexing uses speech-to-text (STT) algorithms and indexes the words that are output from these algorithms. However, for typical enterprise audio content, state-of-the-art speech recognition software achieves speech-to-text word accuracies of only about 50-60%. Thus, this direct speech recognition approach results in suboptimal search accuracy.
One way to substantially improve the search accuracy of speech-recognition-based text indexing is by indexing “word lattices” instead of just single words. Word lattices are representations of the alternative recognition candidates for a word that the speech recognizer also considered, but that did not turn out to be the top-scoring candidate. A word lattice is thus a form of speech recognition result, but one that contains more information. In particular, each word lattice contains at least three types of information: (1) a possible replacement for the query word (or candidates for replacement); (2) time boundary information of the query word (a start time and an end time); and (3) a confidence level or score for the query word.
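The three kinds of information listed above can be pictured as a small record per candidate. The following is a minimal sketch only: the names `LatticeEntry` and `WordLattice`, the method names, and the sample words, times, and scores are all hypothetical illustrations and are not taken from any particular speech recognizer.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class LatticeEntry:
    """One recognition candidate (hypothetical layout)."""
    word: str          # (1) the candidate word
    start: float       # (2) start time, in seconds
    end: float         # (2) end time, in seconds
    confidence: float  # (3) confidence score in [0, 1]


@dataclass
class WordLattice:
    """A flat collection of candidates; a real lattice would also store links."""
    entries: List[LatticeEntry] = field(default_factory=list)

    def add(self, word: str, start: float, end: float, confidence: float) -> None:
        self.entries.append(LatticeEntry(word, start, end, confidence))

    def overlapping(self, start: float, end: float) -> List[LatticeEntry]:
        """Return every candidate whose time span overlaps [start, end)."""
        return [e for e in self.entries if e.start < end and e.end > start]


# Illustrative data: one top-scoring word and a three-word alternate
# covering the same stretch of audio.
lattice = WordLattice()
lattice.add("recognize", 1.20, 1.85, 0.55)
lattice.add("wreck", 1.20, 1.45, 0.25)
lattice.add("a", 1.45, 1.52, 0.20)
lattice.add("nice", 1.52, 1.85, 0.20)

candidates = lattice.overlapping(1.20, 1.85)
best = max(candidates, key=lambda e: e.confidence)
```

In this sketch the alternates remain available to a search even though only `best` would appear in a plain transcript.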
The use of word lattices improves accuracy in two ways. First, there are fewer false positives because word lattices provide confidence scores that can be used to suppress low-confidence matches. Second, there are fewer false negatives. This is because word lattices allow sub-phrase and AND matches to be found even where the individual words are of low confidence: the fact that the words are queried together allows the inference that they still may be correct. Using the lattice approach instead of only using speech recognition improves the accuracy of the audio content search by 60% to 140%. Thus, this lattice approach works well for indexing audio content.
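Both effects can be illustrated with a toy search over lattice candidates. This is a sketch under stated assumptions, not the method of any actual system: the two thresholds, the `MAX_GAP` adjacency rule, the function names, and the sample data are all hypothetical choices made for illustration.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    word: str
    start: float       # seconds
    end: float         # seconds
    confidence: float  # score in [0, 1]


# Hypothetical tuning values, chosen only for this example.
SINGLE_WORD_THRESHOLD = 0.5  # strict: suppresses low-confidence single-word matches
PHRASE_WORD_THRESHOLD = 0.2  # looser: co-occurrence supplies extra evidence
MAX_GAP = 0.1                # max silence (seconds) between adjacent phrase words


def search_word(hits: List[Hit], word: str) -> List[Hit]:
    """Single-word query: keep only confident matches (fewer false positives)."""
    return [h for h in hits
            if h.word == word and h.confidence >= SINGLE_WORD_THRESHOLD]


def search_phrase(hits: List[Hit], first: str, second: str) -> List[Tuple[Hit, Hit]]:
    """Two-word phrase query: accept weaker words if they are adjacent in time."""
    matches = []
    for a in hits:
        if a.word != first or a.confidence < PHRASE_WORD_THRESHOLD:
            continue
        for b in hits:
            if (b.word == second
                    and b.confidence >= PHRASE_WORD_THRESHOLD
                    and 0 <= b.start - a.end <= MAX_GAP):
                matches.append((a, b))
    return matches


hits = [
    Hit("word", 3.0, 3.4, 0.35),
    Hit("lattice", 3.4, 3.9, 0.30),
    Hit("lettuce", 3.4, 3.9, 0.45),
]

single = search_word(hits, "lattice")             # rejected: 0.30 is below 0.5
phrase = search_phrase(hits, "word", "lattice")   # accepted: adjacent in time
```

Here the single-word query for "lattice" is suppressed, while the phrase query recovers the same low-confidence word because it occurs directly after "word".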
There are problems, however, when trying to use the lattice approach in a real-world application. It is desirable to use existing text indexers to index the word lattices. However, text indexers are able to index only simple words and phrases, whereas lattice structures are quite complicated and contain additional information. For example, a Structured Query Language (SQL) full-text engine has no field in which to store a confidence level. Moreover, in SQL, word positions are not sufficient because word alternates may not be aligned. For example, an alternate phrase may span two words. Further, the original text-ingestion plug-in interface (or IFilter) of the SQL full-text engine does not allow the output of more than one word for each word position. Thus, word lattices cannot be indexed by traditional text indexers.