Information retrieval systems that retrieve documents pertinent to a text query are common. Documents are typically a collection of words indexed either directly by the words in the collection, or through linear transformations of word-count vectors, often referred to as document vectors. Queries can also be represented as sets of words that are used to retrieve documents from the index, or as word-count vectors that are compared to the document vectors to identify the documents that are most relevant to the query. Relevant documents that are returned to a user are often called a result set.
The increasing availability of automatic speech recognition (ASR) systems has permitted the extension of text-based information retrieval systems to systems where either the documents or the queries are spoken.
Spoken document retrieval systems can index audio recordings of broadcast news programs, podcasts, recordings of meetings, lectures, presentations, and the like. Typically, the spoken documents are first transcribed into text, either manually or using ASR systems. The resulting words in the text are stored in an index to a database. Queries are matched to the word index, and either the textual transcription or the audio recording is returned to the user.
Spoken query systems use speech to query the document retrieval system. Once again, the query is converted to word form using the ASR system and matched to the index for retrieval.
In all of the above, the basic units used by the indexing system are words. In purely text-based systems, where both the documents and the queries are text, documents are indexed by the words in the documents, and the words in the queries are matched to those in the index. When the documents or the queries are spoken, the speech is first converted to word sequences or word lattices, which are then used to construct the word index or to match the query against the word index.
Word-based indexing schemes have a basic restriction, particularly when the queries or documents are spoken. ASR systems have limited vocabularies. The vocabulary of words that the systems can recognize must first be specified. This also means that the vocabulary of the recognizer must be updated whenever a document that includes a word that is not currently in the recognizer's vocabulary is added to the index.
In the case of spoken documents, this presents a problem because the vocabulary of a new document cannot be completely known a priori. For spoken queries, this would imply that the system that is used to input the query must be updated whenever the document index is updated. This is an unrealistic requirement in many applications. Even when both the documents and the queries are purely text-based, word-based indexing faces the problem of misspelling. Users often spell words in the query differently from the way they are spelled in the documents, particularly when the words are novel or complicated. Clearly, retrieval is adversely affected when word spellings in the documents and the queries do not match.
Document retrieval systems usually return one or more documents from a database that are deemed to be relevant to the words in a query by a user. The interpretation of the term “document” can be quite general. For instance, retrieval of documents from the web, as well as retrieval of files from a personal computer, or retrieval of music from a collection of songs described by metadata can all be regarded as instances of ‘document’ retrieval.
Obviously, not all the information in documents lends itself well to a tree-structured dialog that can be traversed by menus. The information has to be retrieved using techniques, commonly referred to as “information retrieval” (IR), that do not depend on the structure of the information in the document.
Documents are not always text-based. Documents can also include recordings of spoken data, such as broadcast news programs, seminars and lectures, public addresses, meetings, etc. Similarly, the queries that are used to retrieve documents from a database need not necessarily be textual. The queries can also be spoken.
Text-Based Retrieval
As shown in FIG. 1, in conventional text-based systems, documents 101 and queries 102 are both in text form. The set of words or word patterns extracted 103 from all documents is used to construct a document index 104. Words or word patterns are also extracted 105 from the query. The index either has each word pointing to every document in which the word appears, or the index has a word count vector for each document. The word count vector has the number of times each word occurs in the document.
The queries are then processed in a manner consistent with the structure of the index, and a result set of documents 107 is scored and ranked 106, and returned to the user.
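The two index structures described above can be illustrated with a minimal Python sketch. It is an illustration only and not the method of any particular system: the function names, the whitespace tokenizer, the three toy documents, and the scoring rule (the sum of the counts of the query words in each document) are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately simple word extractor."""
    return text.lower().split()


def build_inverted_index(documents):
    """Map each word to the set of ids of the documents in which it appears."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def build_count_vectors(documents):
    """Map each document id to its word-count vector (word -> occurrences)."""
    return {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}


def retrieve(query, index, count_vectors):
    """Return ids of documents sharing a word with the query, best score first."""
    query_words = tokenize(query)
    candidates = set()
    for word in query_words:
        candidates |= index.get(word, set())
    # Assumed scoring rule: total occurrences of the query words in the document.
    scores = {d: sum(count_vectors[d][w] for w in query_words) for d in candidates}
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical toy collection.
documents = {
    "d1": "speech recognition converts speech to text",
    "d2": "text retrieval ranks text documents",
    "d3": "music collections are described by metadata",
}
index = build_inverted_index(documents)
count_vectors = build_count_vectors(documents)
result_set = retrieve("speech text", index, count_vectors)
print(result_set)
```

In this sketch the inverted index narrows the search to documents that share at least one word with the query, and the word-count vectors are used only to rank those candidates. Practical systems typically apply more elaborate weighting than raw counts.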
Spoken Document Retrieval
As shown in FIG. 2, spoken documents 201 include audio recordings of speech, such as those described above. The speech is recognized 202. It is sometimes desired to index and retrieve such documents in response to the queries 102.
The conventional approach to retrieval of spoken documents has been to convert the documents to sequences of words using the ASR system. The converted documents are then indexed and retrieved in the same manner as text documents.
It is well known that ASR systems are inaccurate by nature. The recognized words for any document can therefore contain several errors that will result in retrieval of incorrect documents in response to a query. To account for this, documents are often represented in terms of the word lattice that is considered by the recognizer when decoding the documents. Alternatively, the documents can be represented by the N-best list, i.e., the top N recognition hypotheses that the recognizer generated for the document. The document is then indexed by words (or word-count vectors) derived from the word lattice or the N-best list. The rest of the indexing and retrieval processes are the same as for text documents.
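As a rough illustration of how a word-count vector might be derived from a list of recognition hypotheses, the following Python sketch weights the words of each hypothesis by that hypothesis's normalized score. The weighting scheme, the function name, and the example hypotheses and scores are assumptions made for this sketch; the description above does not prescribe how the counts are derived.

```python
from collections import Counter


def expected_counts(n_best):
    """Combine hypotheses into one word-count vector.

    n_best is a list of (hypothesis_text, score) pairs. Each word contributes
    its hypothesis's share of the total score (an assumed weighting).
    """
    total = sum(score for _, score in n_best)
    counts = Counter()
    for words, score in n_best:
        weight = score / total
        for word in words.split():
            counts[word] += weight
    return counts


# Hypothetical recognizer output for one short spoken document.
n_best = [
    ("recognize speech", 0.5),
    ("wreck a nice beach", 0.25),
    ("recognize beach", 0.25),
]
vector = expected_counts(n_best)
print(vector["recognize"], vector["speech"], vector["beach"])
```

Under this weighting, a word that appears only in lower-ranked hypotheses still receives a nonzero count, so a document can be retrieved even when the top hypothesis is wrong.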
As shown in FIG. 3, an alternative approach converts 301 the spoken documents into sequences or lattices 302 of phonemes, or syllables of the words. Documents are represented in their entirety in terms of these lattices. Words in the query are then matched to the sequences or lattices in the documents to identify candidate documents that contain sequences that can match the words in the query.
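The matching step can be sketched in Python for the simplest case, in which each document is a single linear phoneme sequence rather than a lattice. The pronunciation dictionary, the phoneme symbols, the document transcriptions, and the function names are hypothetical and serve only to illustrate the idea.

```python
def contains_sequence(doc_phonemes, query_phonemes):
    """Return True if query_phonemes occurs as a contiguous run in doc_phonemes."""
    n, m = len(doc_phonemes), len(query_phonemes)
    return any(doc_phonemes[i:i + m] == query_phonemes for i in range(n - m + 1))


# Hypothetical pronunciation dictionary mapping query words to phonemes.
PRONUNCIATIONS = {
    "speech": ["S", "P", "IY", "CH"],
    "beach": ["B", "IY", "CH"],
}


def candidate_documents(query_word, phoneme_docs):
    """Scan every document for the phoneme sequence of the query word."""
    query_phonemes = PRONUNCIATIONS[query_word]
    return sorted(
        doc_id
        for doc_id, phonemes in phoneme_docs.items()
        if contains_sequence(phonemes, query_phonemes)
    )


# Hypothetical phoneme transcriptions of two spoken documents.
phoneme_docs = {
    "a1": ["DH", "AH", "S", "P", "IY", "CH", "W", "AA", "Z"],  # "the speech was"
    "a2": ["AH", "N", "AY", "S", "B", "IY", "CH"],             # "a nice beach"
}
print(candidate_documents("speech", phoneme_docs))
print(candidate_documents("beach", phoneme_docs))
```

Note that this sketch has no index: every query word is compared against the full phoneme sequence of every document, and matching against a lattice instead of a single sequence would require searching its alternative paths as well.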
Retrieval from Spoken Queries
It is not always convenient to type text in queries, e.g., when using small handheld devices, or while driving a vehicle or operating a machine. Text entry may be inconvenient, or even impossible. In such situations, users can speak their queries. Spoken query systems attempt to retrieve documents using the words in the spoken queries.
As in the case of spoken document retrieval, spoken queries are first converted to words by the ASR system. Once again, the spoken queries can be converted to a linear sequence or a lattice of words. The words in the text form of the query are used to retrieve documents from the index, e.g., see U.S. Pat. No. 6,877,001 issued to Wolf, et al., Apr. 5, 2005, “Method and system for retrieving documents with spoken queries,” incorporated herein by reference.
Other systems can combine both textual and spoken documents in their index, and permit both spoken and text-based queries. In all cases, the basic unit that is used to match documents to queries is the word.
Drawbacks of Word-Based Matching
Retrieval of text documents using text queries is probably the most reliable of all forms of document retrieval. Nevertheless, it has its restrictions. The key words in documents that distinguish the documents from others are often novel words with unusual spelling. Users who attempt to retrieve these documents will frequently be unsure of the precise spelling of these terms and misspell the words. Any word-based mechanism for retrieval will not be able to match the misspelled words to the corresponding documents. To counter this, many word-based systems use various spelling-correction mechanisms that alert the user to potential misspelling, but even these will not suffice when the user is basically unsure of the spelling.
Spoken documents must first be converted to words using the ASR system. ASR systems have limited vocabularies, even if they are very large. Even extremely large vocabulary systems typically include the most commonly used tens of thousands of words, or, in extreme cases, hundreds of thousands of words, in their recognition vocabulary. This immediately gives rise to several problems. Firstly, the key distinguishing terms in any document are, by nature, unusual, or they would not distinguish the document from others. As a result, these very words are the least likely to actually be present in the recognizer's vocabulary, and are thus unlikely to be recognized. To counter this, the key words in the documents must be dynamically added to the recognizer's vocabulary prior to recognition. A natural problem arises here: in a novel document, the key words to be found cannot be known a priori.
Secondly, ASR systems are statistical machines that are biased a priori to recognize frequent words more accurately than rare words. As a result, even when the key words in any document have actually been included in the ASR system's vocabulary, the key words are highly likely to be misrecognized, thereby voiding the rationale for including them in the system's vocabulary. As a compensating factor, key words in documents are typically repeated multiple times in the spoken document, and the probability that the recognizer will miss all instances of the word is considerably lower than that the recognizer will miss any single instance. Thus, spoken document retrieval systems are able to function reasonably, even when the accuracy of the recognizer is relatively low.
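The compensating effect of repetition can be made concrete with a short Python sketch. It assumes, purely for illustration, that the recognizer misses each spoken instance of a key word independently with the same probability; neither the independence assumption nor the example values come from the description above.

```python
def prob_miss_all(p_miss, repetitions):
    """Probability that every one of `repetitions` instances is missed,
    assuming independent misses with the same per-instance probability."""
    return p_miss ** repetitions


# Hypothetical values: each instance is missed half the time.
single = prob_miss_all(0.5, 1)
repeated = prob_miss_all(0.5, 4)
print(single, repeated)
```

Under these assumptions, a key word that is missed half the time when spoken once is missed entirely only about six percent of the time when it is spoken four times.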
Even when the spoken documents are actually transcribed into lattices in order to reduce the effect of out-of-vocabulary terms, queries are nevertheless whole words that must be matched to the document, and will suffer from the misspelling problem described above. More importantly, this will require each word in the query to be matched against the entire lattice for each document in order to score the documents, making the entire process extremely inefficient.
Spoken query systems are, perhaps, the least reliable of all document retrieval systems. Queries are usually converted to sequences or lattices of words by the ASR system as described above. Queries are typically short. Clearly, the cost of a single misrecognition is extremely high.
To be recognized, the key words that the user expects to find in the document must be included in the recognizer's vocabulary. This means that as documents are added to the index, the key words in the documents must first be included in the vocabulary of the recognizer that processes the queries. This can be particularly burdensome on systems where the query is initially processed by a remote client. Each update to the index must promptly be communicated to every client that aims to use the index. This operation can become very time consuming.
Even if the query processing is performed on servers that are collocated with the index, time restrictions are a problem. Users require the response to a query to be prompt. The speed with which the ASR system operates depends on the vocabulary, and each update of the document index that results in an increase of the recognition vocabulary will slow down the ASR system and increase the latency of the retrieval. The amount of memory used by the ASR system will also increase non-linearly with increasing vocabulary, restricting the number of queries that can be processed concurrently.