In many contexts, users of large collections of recorded audio (audio information) value the ability to quickly perform searches for words or phrases in the audio. For example, in the context of corporate contact centers (e.g., call-in centers), recorded conversations between customers and customer service representatives (or agents) can be searched and analyzed to identify trends in customer satisfaction or customer issues, to monitor the performance of various support agents, and to locate calls relating to particular issues. As another example, searchable recordings of classroom lectures would allow students to search for and replay discussions of topics of particular interest. Searchable voicemail messages would also allow users to quickly find audio messages containing particular words. As another example, searchable recordings of complex medical procedures (e.g., surgery) can be used to locate recordings of procedures involving uses of particular devices, choices of approaches during the procedure, and various complications.
Generally, Automatic Speech Recognition (ASR) systems, and Large Vocabulary Continuous Speech Recognition (LVCSR) transcription engines in particular, include three components: a set of Language Models (LM), a set of Acoustic Models (AM), and a decoder. The LM and AM are often trained by supplying audio files and their transcriptions (e.g., known, correct transcriptions) to a learning module. Generally, the LM is a Statistical LM (SLM). The training process uses a dictionary (or “vocabulary”) which maps recognized written words into sequences of sub-words (e.g., phonemes or syllables). During recognition of speech, the decoder analyzes an audio clip (e.g., an audio file) and outputs a sequence of recognized words.
A collection of audio files (e.g., calls in a call center or set of lectures in a class) can be made searchable by processing each audio file using an LVCSR engine to generate a text transcript file in which each written word in the transcript (generally) corresponds to a spoken word in the audio file. The resulting text can then be indexed by a traditional text-based search engine such as Apache Lucene™. A user can then query the resulting index (e.g., a search index database) to search the transcripts.
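The indexing and query steps described above can be sketched as a small inverted index. This is only an illustrative stand-in for a full text search engine, not how any particular engine is implemented; the file names, the sample transcripts, and the `build_index` and `search` helpers are all hypothetical.

```python
from collections import defaultdict

PUNCTUATION = ".,?!"

def build_index(transcripts):
    """Map each lowercased word to the set of audio file ids whose transcript contains it."""
    index = defaultdict(set)
    for file_id, text in transcripts.items():
        for word in text.lower().split():
            index[word.strip(PUNCTUATION)].add(file_id)
    return index

def search(index, query):
    """Return the ids of files whose transcripts contain every word of the query."""
    words = [w.strip(PUNCTUATION) for w in query.lower().split()]
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical LVCSR output for two recorded calls.
transcripts = {
    "call_001.wav": "I would like to cancel my subscription",
    "call_002.wav": "my flight to the ambassador was delayed",
}
index = build_index(transcripts)
matches = search(index, "cancel subscription")
```

Because the index is built only from the words the engine emitted, a query can match only words that actually appear in the transcripts.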
Generally, the recognized words in the output of an LVCSR engine are selected from (e.g., constrained to) the words contained in the dictionary (or “vocabulary”) of the ASR system. A word that is not in the vocabulary (an “out-of-vocabulary” or “OOV” word) may be recognized (e.g., with low confidence) as a word that is in the vocabulary. For example, if the word “Amarillo” is not in the vocabulary, the LVCSR engine may transcribe the word as “ambassador” in the output. As such, when using such ASR systems, it may be impossible for an end user to search the index for any instances of words that are not in the vocabulary.
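A toy sketch of this vocabulary constraint follows. It assumes, purely for illustration, that spelling similarity (via Python's standard `difflib`) stands in for the acoustic and language model scoring a real decoder would use; the `VOCABULARY` list and the `constrained_decode` helper are hypothetical.

```python
import difflib

# Hypothetical, very small recognizer vocabulary.
VOCABULARY = ["ambassador", "flight", "delayed", "subscription"]

def constrained_decode(spoken_word, vocabulary):
    """Return a vocabulary word for the spoken word.

    In-vocabulary words are returned unchanged. An out-of-vocabulary word is
    replaced by the most similar in-vocabulary word, because the output is
    restricted to the vocabulary. Spelling similarity is an assumption used
    here in place of real acoustic scoring.
    """
    word = spoken_word.lower()
    if word in vocabulary:
        return word
    # cutoff=0.0 forces a result even for poor matches, mirroring a
    # low-confidence recognition.
    return difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.0)[0]

recognized = constrained_decode("Amarillo", VOCABULARY)
```

Whatever word is returned, it is never the OOV word itself, so a later text search for that word finds nothing.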
One way to overcome this problem is to add the OOV word to the dictionary (i.e., to add the word to the vocabulary), generate a new LM (which can be an SLM or a constrained grammar), and then reprocess the audio files. However, such an approach would increase the delay in generating the search results due to the need to reprocess the audio corpus.
In other ASR systems, the output data is sub-word level recognition data, such as a phonetic transcription of the audio, rather than an LVCSR output or a similar word-based transcript. Such ASR systems typically do not include a word vocabulary. Instead, these engines provide a way to search for any sequence of characters. In this case, the search is performed by mapping the search phrase into a sequence of phonemes and searching for the given phonetic sequence in the phonetic transcription index. These engines are generally considered to be less accurate than LVCSR-based engines because the notion of words is not inherent to the recognition process, and the use of words (e.g., the meanings of the words) is generally useful for improving the accuracy of the speech recognition.
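The phonetic search described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `TOY_G2P` table is a hypothetical stand-in for a real grapheme-to-phoneme converter, the phoneme symbols are illustrative, and exact sequence matching is used for simplicity, whereas a practical engine would likely need approximate matching to tolerate recognition errors.

```python
# Hypothetical grapheme-to-phoneme table; a real system would use a
# pronunciation lexicon or a trained letter-to-sound model.
TOY_G2P = {
    "amarillo": ["AE", "M", "ER", "IH", "L", "OW"],
    "texas": ["T", "EH", "K", "S", "AH", "S"],
}

def phrase_to_phonemes(phrase):
    """Map a search phrase to a flat phoneme sequence using the toy table."""
    phonemes = []
    for word in phrase.lower().split():
        phonemes.extend(TOY_G2P[word])
    return phonemes

def find_phonetic_matches(query_phonemes, transcript_phonemes):
    """Return every start position where the query sequence occurs exactly."""
    n = len(query_phonemes)
    if n == 0:
        return []
    return [
        i
        for i in range(len(transcript_phonemes) - n + 1)
        if transcript_phonemes[i:i + n] == query_phonemes
    ]

# Hypothetical phonetic transcription of a short utterance.
phonetic_transcript = [
    "F", "L", "AY", "T", "T", "UW",
    "AE", "M", "ER", "IH", "L", "OW",
]
positions = find_phonetic_matches(
    phrase_to_phonemes("Amarillo"), phonetic_transcript
)
```

No word vocabulary is consulted during the search itself, which is why any phrase that can be converted to phonemes can be queried.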
Generally, combining word-level and phoneme-level automatic speech recognition does not solve the accuracy problems of phonetic-based methods, because the accuracy limitations of purely phonetic methods would still apply to any query that includes at least one OOV word.