One of the fundamental difficulties in automatic speech processing is finding a spoken or written term in a collection of audio recordings. Given the vast amount of existing spoken information, with more being produced every day, there is an increasing need for small indices and fast searches.
Typically, known spoken term detection (STD) systems work in two phases: (1) transforming the speech into text format using an automatic speech recognition (ASR) system; and (2) building an index from the text. A relatively simple textual format is the 1-best hypothesis from an ASR system. This approach can result in good STD performance if the speech recognition system has a low word error rate.
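The two phases can be illustrated with a minimal sketch. It assumes the ASR phase has already produced a time-stamped 1-best transcript per utterance; the data layout, the utterance ids, and the function names `build_word_index` and `search` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical 1-best ASR output: (utterance id, [(word, start time in seconds), ...]).
ONE_BEST = [
    ("utt1", [("speech", 0.0), ("recognition", 0.4), ("system", 1.1)]),
    ("utt2", [("spoken", 0.0), ("term", 0.5), ("detection", 0.8), ("system", 1.4)]),
]


def build_word_index(transcripts):
    """Map each word to the list of (utterance id, start time) where it occurs."""
    index = defaultdict(list)
    for utt_id, words in transcripts:
        for word, start in words:
            index[word].append((utt_id, start))
    return dict(index)


def search(index, term):
    """Return all occurrences of a single-word term, or an empty list if unseen."""
    return index.get(term.lower(), [])


word_index = build_word_index(ONE_BEST)
print(search(word_index, "system"))   # found in both utterances
print(search(word_index, "lattice"))  # never recognized, so nothing is returned
```

A term that the recognizer never output, whether because of a recognition error or because the term is missing from its dictionary, cannot be retrieved from such an index.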
Many known STD systems benefit from a richer ASR output representation. Several retrieval methods that handle multiple hypotheses from an ASR system have been proposed, with lattices and confusion networks being used for building STD indices. Many known STD systems index speech recognition lattices and search this index for queries. However, a word-level index of this kind cannot find terms that are not in the dictionary of the speech recognizer. When the keywords are not in the recognition vocabulary, that is, when they are out-of-vocabulary (OOV), the word indices are not sufficient. In this case, both the OOV queries and the word lattices can be expanded to a phone level using the ASR lexicon.
Approaches based on sub-word units (e.g., phone, graphone, syllable, morph) have been used to solve the OOV issue. For example, retrieval includes searching for a sequence of sub-words representing an OOV term in a sub-word index. Some known approaches are based on searches in sub-word decoding output or searches on the sub-word representation of the word decoding. For example, in order to be able to find OOV terms, speech recognition is performed using sub-word (e.g., morph, fragment, phone) units, or using words, which are mapped to sub-words before a keyword search.
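As a rough sketch of searching the sub-word representation of a word decoding, the following maps decoded words to phones through a lexicon and then looks for the phone sequence of an OOV term. The lexicon entries, the phone symbols, and the query pronunciation are hypothetical; in practice the pronunciation of an OOV query would have to come from some separate source, which is assumed here to be given.

```python
# Hypothetical pronunciation lexicon (word -> phones); entries are illustrative only.
LEXICON = {
    "the": ["dh", "ah"],
    "cat": ["k", "ae", "t"],
    "a": ["ah"],
    "log": ["l", "ao", "g"],
}


def words_to_phones(words, lexicon):
    """Flatten a decoded word sequence into one phone sequence using the lexicon."""
    phones = []
    for word in words:
        phones.extend(lexicon[word])
    return phones


def find_phone_sequence(phones, query):
    """Return every start position at which `query` occurs contiguously in `phones`."""
    n = len(query)
    return [i for i in range(len(phones) - n + 1) if phones[i:i + n] == query]


# Assumed word decoding of an utterance that contained the OOV term "catalog".
decoded_phones = words_to_phones(["the", "cat", "a", "log"], LEXICON)

# Assumed pronunciation of the OOV query "catalog".
query_phones = ["k", "ae", "t", "ah", "l", "ao", "g"]

print(find_phone_sequence(decoded_phones, query_phones))
```

The word "catalog" never appears in the word decoding, yet its phone sequence is found once the in-vocabulary words that replaced it are mapped to phones.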
To compensate for errors made by an ASR system, a query term can be expanded using a sub-word confusability model. Since sub-word-based indices generally yield a lower precision for in-vocabulary (IV) queries than word-based indices, the word and sub-word indices are either used separately for IV and OOV searches, respectively, or combined into one index.
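The following sketch illustrates both ideas under simplifying assumptions. The confusion table is hypothetical and unweighted, and only single-phone substitutions are generated; an actual confusability model would typically assign a cost or probability to each substitution and may also cover insertions and deletions. The names `expand_query` and `choose_index` are likewise illustrative.

```python
# Hypothetical phone confusion table: phone -> phones the recognizer may output instead.
CONFUSIONS = {"ae": ["eh"], "t": ["d"]}


def expand_query(phones, confusions):
    """Return the original phone sequence plus every single-substitution variant."""
    phones = list(phones)
    variants = [phones]
    for i, phone in enumerate(phones):
        for alt in confusions.get(phone, []):
            variants.append(phones[:i] + [alt] + phones[i + 1:])
    return variants


def choose_index(term, vocabulary):
    """Route IV terms to the word index and OOV terms to the sub-word index."""
    return "word" if term in vocabulary else "sub-word"


vocabulary = {"the", "cat", "a", "log"}

print(expand_query(["k", "ae", "t"], CONFUSIONS))
print(choose_index("cat", vocabulary))
print(choose_index("catalog", vocabulary))
```

Each expanded variant would then be searched in the sub-word index, while routing by vocabulary membership corresponds to the case where the two indices are used separately rather than combined.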