The rapidly increasing amount of spoken data calls for solutions to index and search this data. The classical approach consists of converting the speech to word transcripts using large vocabulary continuous speech recognition (LVCSR) tools. In the past decade, most of the research efforts on spoken data retrieval have focused on extending classical information retrieval (IR) techniques to word transcripts.
However, a significant drawback of such approaches is that a search for a query containing out-of-vocabulary (OOV) terms returns no results. OOV terms are words missing from the vocabulary of the automatic speech recognition (ASR) system. In the output transcript, the recognizer replaces those words with in-vocabulary alternatives that are probable given its acoustic model and language model. It has been experimentally observed that over 10% of user queries can contain OOV terms, as queries often relate to named entities, which typically have poor coverage in the ASR vocabulary.
In many applications, the OOV rate may get worse over time unless the recognizer's vocabulary is periodically updated.
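As a rough illustration of how an OOV query rate of this kind could be measured, the following minimal sketch counts the queries that contain at least one term absent from the recognizer vocabulary. The vocabulary, the queries, and the function names are hypothetical and chosen only for illustration; tokenization is simplified to whitespace splitting.

```python
def oov_terms(query, vocabulary):
    """Return the query words that are missing from the vocabulary."""
    return [word for word in query.lower().split() if word not in vocabulary]


def oov_query_rate(queries, vocabulary):
    """Fraction of queries that contain at least one OOV term."""
    if not queries:
        return 0.0
    with_oov = sum(1 for query in queries if oov_terms(query, vocabulary))
    return with_oov / len(queries)


# Hypothetical toy vocabulary and query log (assumptions, not real data).
vocabulary = {"weather", "report", "in", "the", "news", "today"}
queries = [
    "weather report",
    "news today",
    "obama in the news",
    "weather in reykjavik",
]

print(oov_terms("obama in the news", vocabulary))  # ['obama']
print(oov_query_rate(queries, vocabulary))         # 0.5
```

Re-running such a measurement on recent query logs is one simple way to notice that the vocabulary has drifted and needs updating.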
One approach to solving the OOV issue consists of converting the speech to phonetic transcripts and representing the query as a sequence of phones. Such transcripts can be generated by expanding the word transcripts into phones using the pronunciation dictionary of the ASR system. This kind of transcript is adequate for finding OOV terms that are phonetically close to in-vocabulary (IV) terms.
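The expansion of a word transcript into phones can be sketched as follows. This is a minimal illustration under stated assumptions: the pronunciation dictionary is a hypothetical toy lexicon with ARPAbet-like symbols, the phone sequence of the OOV query is assumed to be supplied by some external grapheme-to-phoneme step, and matching is exact.

```python
# Hypothetical toy pronunciation dictionary (not taken from a real ASR system).
PRONUNCIATIONS = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "a": ["AH"],
    "log": ["L", "AO", "G"],
}


def expand_to_phones(words, lexicon):
    """Concatenate the dictionary pronunciation of each transcript word."""
    phones = []
    for word in words:
        phones.extend(lexicon[word])
    return phones


def find_phone_sequence(transcript_phones, query_phones):
    """Return every start index at which the query phones occur exactly."""
    n, m = len(transcript_phones), len(query_phones)
    return [
        start
        for start in range(n - m + 1)
        if transcript_phones[start:start + m] == query_phones
    ]


# Suppose the OOV word "catalog" was spoken and recognized as "cat a log".
word_transcript = ["the", "cat", "a", "log"]
transcript_phones = expand_to_phones(word_transcript, PRONUNCIATIONS)

# Assumed phone sequence for the OOV query "catalog".
query_phones = ["K", "AE", "T", "AH", "L", "AO", "G"]

print(find_phone_sequence(transcript_phones, query_phones))  # [2]
```

The query is found only because the recognizer substituted IV words that sound like the OOV term; when the substituted words are phonetically distant, the exact match fails.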
Another way would be to use a language model based on sub-words (phones, syllables, or word fragments). Retrieval then consists of searching the sub-word transcripts for the sequence of sub-words that represents the query. The main drawback of this approach is the inherently high error rate of the transcripts. For this reason, sub-word approaches cannot replace word transcripts for searching IV query terms.
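To illustrate why the error rate of sub-word transcripts matters, the following sketch computes the smallest edit distance between a query sub-word sequence and any contiguous span of a transcript. Approximate matching of this kind is an assumption introduced here for illustration, not a method described above; the phone sequences and the function name are hypothetical.

```python
def min_edit_distance_in(transcript, query):
    """Smallest edit distance between the query and any contiguous span
    of the transcript (approximate substring matching)."""
    # prev[j]: best cost of aligning the query prefix processed so far
    # with some span that ends at transcript position j.
    prev = [0] * (len(transcript) + 1)  # an empty query matches anywhere
    for i, q in enumerate(query, start=1):
        curr = [i] + [0] * len(transcript)  # i query symbols, empty span
        for j, t in enumerate(transcript, start=1):
            curr[j] = min(
                prev[j - 1] + (q != t),  # match or substitution
                prev[j] + 1,             # query symbol missing in transcript
                curr[j - 1] + 1,         # extra symbol in transcript
            )
        prev = curr
    return min(prev)


query = ["K", "AE", "T", "AH", "L", "AO", "G"]

# Error-free phone transcript versus one with a single recognition error
# ("T" recognized as "D"); both are hypothetical.
clean = ["DH", "AH", "K", "AE", "T", "AH", "L", "AO", "G"]
noisy = ["DH", "AH", "K", "AE", "D", "AH", "L", "AO", "G"]

print(min_edit_distance_in(clean, query))  # 0
print(min_edit_distance_in(noisy, query))  # 1
```

An exact search would miss the query in the noisy transcript. Accepting spans within a small distance recovers it, but a looser threshold also admits more false matches, which reflects the accuracy problem of sub-word transcripts.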
Many techniques can be used to generate transcripts. The sub-word-based and word-based approaches described above have both been used for IR on speech data; the former suffers from low accuracy and the latter from the limited vocabulary of the recognition system.