The rapidly increasing amount of spoken data calls for solutions to index and search speech data based on its content. It would be beneficial to be able to use keyword and phrase queries when searching the speech data. It would also be beneficial to be able to search for terms that are not part of the common vocabulary (names, technical terms, etc.) without degrading the performance of the search.
A known spoken document retrieval approach uses a large-vocabulary continuous automatic speech recognition (ASR) system to transcribe speech to one-best path word transcripts. The transcripts are indexed as clean text. For each occurrence of a word, the index stores its document, its word offset and, optionally, some additional information. A generic information retrieval (IR) system over the text is used for word spotting and search. Other known systems use richer transcripts, such as word lattices and word confusion networks, which have been developed in order to improve the effectiveness of the retrieval.
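By way of illustration only, the word index described above may be sketched as a simple inverted index over one-best transcripts. The function names `build_index` and `find_phrase` and the sample transcripts are assumptions made for this sketch and are not taken from any particular system.

```python
from collections import defaultdict

def build_index(transcripts):
    """Map each word to a list of (document id, word offset) postings."""
    index = defaultdict(list)
    for doc_id, text in transcripts.items():
        for offset, word in enumerate(text.lower().split()):
            index[word].append((doc_id, offset))
    return index

def find_phrase(index, phrase):
    """Return postings at which the words of `phrase` occur consecutively."""
    words = phrase.lower().split()
    if not words:
        return []
    hits = []
    for doc_id, offset in index.get(words[0], []):
        # Every following word must appear at the next word offset.
        if all((doc_id, offset + i) in index.get(word, [])
               for i, word in enumerate(words[1:], start=1)):
            hits.append((doc_id, offset))
    return hits

# Hypothetical one-best transcripts, invented for the example.
transcripts = {
    "doc1": "the prime minister spoke today",
    "doc2": "the minister of finance",
}
index = build_index(transcripts)
```

In this sketch, a phrase query is answered by checking that successive words occur at successive word offsets in the same document, and a word that never appears in any transcript yields an empty result.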
However, a significant drawback of such approaches is that a search on queries containing out-of-vocabulary (OOV) terms will not return any results. OOV terms are words missing from the word vocabulary of the ASR system. These words are replaced in the output word transcript by alternatives that are probable, given the recognition acoustic model, the language model and the word vocabulary. It has been experimentally observed that over 10% of user queries contain OOV terms, as queries often relate to named entities that typically have poor coverage in the ASR vocabulary. Moreover, in many applications the OOV rate may get worse over time unless the ASR vocabulary is periodically updated. The problem of searching for OOV terms applies to different types of speech data, such as broadcast news, conversational telephony speech, call center data and conference meetings.
A known approach which has been developed in order to handle OOV queries consists of converting the speech data to sub-word transcripts. The sub-words are typically phones, morphemes, syllables, or sequences of phones. The sub-word transcripts are indexed in the same manner as words, using classical text retrieval techniques, but during query processing the query is represented as a sequence of sub-words. The retrieval is based on searching for the string of phones representing the query in the phonetic transcript. Some other systems using richer transcripts, such as phonetic lattices, have also been developed.
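As a minimal sketch of the phone-string search described above, assuming the query has already been converted to phones, the retrieval may be expressed as an exact sub-sequence match. The phone symbols, the sample transcript and the name `find_phone_sequence` are illustrative assumptions only.

```python
def find_phone_sequence(transcript_phones, query_phones):
    """Return every phone offset at which the query phones occur in order."""
    n = len(query_phones)
    if n == 0:
        return []
    return [
        i
        for i in range(len(transcript_phones) - n + 1)
        if transcript_phones[i:i + n] == query_phones
    ]

# Hypothetical phonetic transcript of the utterance "call obama now".
transcript_phones = ["k", "ao", "l", "ow", "b", "aa", "m", "ah", "n", "aw"]

# Hypothetical phone representation of the query term "obama".
query_phones = ["ow", "b", "aa", "m", "ah"]

offsets = find_phone_sequence(transcript_phones, query_phones)
```

Because the match is made on phones rather than on words, the query term does not need to be present in the ASR word vocabulary. The sketch uses exact matching and therefore does not model recognition errors in the phonetic transcript.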
The main drawback of these approaches is the inherently high error rate of the sub-word transcripts. Consequently, such approaches should be used only for OOV search and not for in-vocabulary (IV) terms. For queries containing both OOV and IV terms, this technique degrades the performance of the retrieval in comparison to the word based approach.
To summarize, there are two different approaches for speech retrieval: the word based approach, which cannot handle queries containing OOV terms, and the sub-word based approach, in which the overall performance of the retrieval is degraded.
An improvement in word searching accuracy has been shown using a combination of word and sub-word (e.g., phone) transcripts. Three different retrieval strategies have been proposed:
1. Search both the word and the sub-word indices and unify the two different sets of results;
2. Search the word index for IV queries, search the sub-word index for OOV queries; or
3. Search the word index; if no result is returned, search the sub-word index.
However, no known strategy can handle searches for phrases containing both IV and OOV words.
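The three retrieval strategies listed above may be sketched as follows. The two search callables, the toy result tables and the rule that a query counts as OOV when any of its terms is missing from the vocabulary are assumptions made for this illustration.

```python
def strategy_union(query, word_search, phone_search):
    """Strategy 1: search both indices and unify the two result sets."""
    return sorted(set(word_search(query)) | set(phone_search(query)))

def strategy_by_vocabulary(query, vocabulary, word_search, phone_search):
    """Strategy 2: word index for IV queries, sub-word index for OOV queries."""
    # Assumption: a query is treated as OOV if any of its terms is OOV.
    is_iv = all(term in vocabulary for term in query.split())
    return word_search(query) if is_iv else phone_search(query)

def strategy_fallback(query, word_search, phone_search):
    """Strategy 3: word index first; sub-word index only if nothing is found."""
    results = word_search(query)
    return results if results else phone_search(query)

# Hypothetical stand-ins for the two indices, returning document ids.
word_hits = {"minister": ["doc1", "doc2"]}
phone_hits = {"minister": ["doc1"], "obama": ["doc3"]}
vocabulary = {"minister"}

def word_search(query):
    return word_hits.get(query, [])

def phone_search(query):
    return phone_hits.get(query, [])
```

In each strategy a query is answered wholly from one index, or the results of the two indices are unified only at the level of complete results. None of them combines postings for individual terms within a single phrase.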
An aim of the present invention is to be able to process all kinds of queries, including hybrid phrases combining both IV and OOV terms, without degrading the performance of the retrieval. A solution is to merge the two different approaches presented above: the word based approach for IV terms and the sub-word based approach for OOV terms. However, for hybrid query processing, the posting lists retrieved from a word index and a sub-word index would need to be combined. In conventional IR, for each occurrence of a term, the index stores its sentence number and word offset. Consequently, it is not possible to merge posting lists retrieved from a sub-word index with those retrieved from a word index, since the sentence numbers and offsets of the occurrences retrieved from the two different indices are not comparable.
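The incomparability of offsets noted above may be illustrated with a small sketch. The postings, the utterance "call obama now" and the function `naive_adjacent` are hypothetical and serve only to show the mismatch of units.

```python
# Posting from a word index: (document id, sentence number, WORD offset).
# "call" is the first word of the sentence.
word_posting = ("doc1", 0, 0)

# Posting from a phone index: (document id, sentence number, PHONE offset).
# "obama" starts after the three phones of "call", i.e. at phone offset 3.
phone_posting = ("doc1", 0, 3)

def naive_adjacent(first, second):
    """True if `second` starts one position after `first` in the same sentence."""
    same_sentence = first[:2] == second[:2]
    return same_sentence and second[2] == first[2] + 1

adjacent = naive_adjacent(word_posting, phone_posting)
```

Although the two terms are adjacent in the speech, the naive test does not report them as adjacent, because one offset counts words and the other counts phones.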