Information retrieval (IR) systems typically include a large list of items, such as geographic points of interest (POI), or music album titles. The list is accessed by an index. Input to the index is a query supplied by a user. In response to the query, the IR system retrieves a result list that best matches the query. The result list can be rank ordered according to various factors. The input list of items, index, query and result list are typically represented by words. The input list of items, query and result list originate from text or speech.
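As a minimal sketch of the word-based arrangement described above, the following Python fragment builds an inverted index over a short item list and ranks items by the number of query words they share. The function names `build_index` and `search`, the sample items, and the overlap-count ranking are assumptions chosen for illustration; a practical IR system would use more elaborate ranking factors.

```python
from collections import Counter, defaultdict

def build_index(items):
    """Map each word to the set of item ids whose title contains it."""
    index = defaultdict(set)
    for item_id, title in enumerate(items):
        for word in title.upper().split():
            index[word].add(item_id)
    return index

def search(index, items, query):
    """Return items sharing at least one word with the query, best match first."""
    hits = Counter()
    for word in query.upper().split():
        # .get avoids creating empty entries for unknown query words
        for item_id in index.get(word, ()):
            hits[item_id] += 1
    # Rank by number of matching words (descending), then by item id
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    return [items[item_id] for item_id, _ in ranked]

# Hypothetical item list, for illustration only
items = ["MJ'S RESTAURANT", "AASHIANI RESTAURANT", "TACO BELL"]
index = build_index(items)
results = search(index, items, "aashiani restaurant")
```

Here `results` lists "AASHIANI RESTAURANT" first because it matches both query words, followed by "MJ'S RESTAURANT", which matches only one.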
Spoken queries are used in environments where a user cannot use a keyboard, e.g., while driving, or where the user interface includes a microphone. Spoken document retrieval is used when the items to be retrieved are audio items, such as radio or TV shows. In those environments, an automatic speech recognizer (ASR) is used to convert speech to words.
The ASR uses two basic data structures: a pronunciation dictionary of words and a language model of the words. Usually, the IR system represents the words phonetically as phonemes, e.g., RESTAURANT is represented as “R EH S T R AA N T.” Phonemes refer to the basic units of sound in a particular language. The phonemes can include stress marks, syllable boundaries, and other notation indicative of how the words are pronounced.
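A pronunciation dictionary of this kind can be sketched as a simple mapping from words to phoneme sequences. In the Python fragment below, the entry for RESTAURANT follows the example above, while the entries for TACO and BELL, the name `PRONUNCIATIONS`, and the helper `to_phonemes` are assumptions added for illustration. Stress marks and syllable boundaries are omitted.

```python
# Toy pronunciation dictionary: word -> list of phonemes.
# Only the RESTAURANT entry comes from the text; the others are illustrative.
PRONUNCIATIONS = {
    "RESTAURANT": ["R", "EH", "S", "T", "R", "AA", "N", "T"],
    "TACO": ["T", "AA", "K", "OW"],
    "BELL": ["B", "EH", "L"],
}

def to_phonemes(text, lexicon=PRONUNCIATIONS):
    """Convert a word string to a flat phoneme list using the lexicon."""
    phonemes = []
    for word in text.upper().split():
        if word not in lexicon:
            # Words missing from the dictionary cannot be represented
            raise KeyError(f"no pronunciation for {word!r}")
        phonemes.extend(lexicon[word])
    return phonemes

taco_bell = to_phonemes("taco bell")
```

A word that is absent from the dictionary raises `KeyError` in this sketch, which mirrors the difficulty a word-based system has with vocabulary it has not seen.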
The language model describes the probabilities of word orderings, and is used by the ASR to constrain the search for the correct word hypotheses. The language model can be an n-gram model. If the n-grams are bigrams, then the model lists probabilities such as P(“BELL”|“TACO”), which is the probability that the word “BELL” follows the word “TACO.” The language model can also be a finite state grammar, where each state in the grammar represents the words that can appear at that state, and the transitions between states represent the probability of going from one state to another state.
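To make the bigram case concrete, the following Python sketch estimates P(next word | previous word) from counts, using plain maximum-likelihood estimation without smoothing. The function name `train_bigram` and the toy corpus are assumptions for illustration only; practical language models are trained on far larger data and typically apply smoothing.

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate P(next | prev) by maximum likelihood from tokenized sentences."""
    pair_counts = Counter()   # count of (prev, next) word pairs
    prev_counts = Counter()   # count of prev words that have a successor
    for words in sentences:
        for prev, nxt in zip(words, words[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
    return {pair: n / prev_counts[pair[0]] for pair, n in pair_counts.items()}

# Hypothetical toy corpus, for illustration only
corpus = [
    ["TACO", "BELL"],
    ["TACO", "BELL"],
    ["TACO", "STAND"],
    ["BELL", "TOWER"],
]
bigram = train_bigram(corpus)
p_bell_given_taco = bigram[("TACO", "BELL")]
```

In this toy corpus “TACO” is followed by “BELL” in two of its three occurrences, so the estimate for P(“BELL”|“TACO”) is 2/3. A word pair that never occurs in the corpus has no entry at all, which illustrates why statistics for infrequent words are hard to obtain.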
There are two main problems with word-based IR.
First, important words for the IR are typically infrequent identifier words. For example, in the POI item “MJ'S RESTAURANT”, the important identifier word is “MJ'S.” Frequently, these identifier words are proper nouns from other languages. For example, the word “AASHIANI” in the item “AASHIANI RESTAURANT” is from the Hindi language. Another way these identifier words emerge is through combination, as with “GREENHOUSE.” Modifying the roots of words also increases the size of the vocabulary. In general, the number of infrequent but important identifier words is very large.
In addition, important identifier words are often mispronounced or poorly represented by the language model. Accurate statistics for the n-grams also are generally unavailable. Hence, the probability of recognizing important infrequent words is low, and the word sequences are often incorrect. This leads to poor recall performance by the IR system.
Second, the computational load for word-based IR systems increases with the size of the list and index, and the performance of the system becomes unacceptable for real-time retrieval.