1. Field of the Invention
This invention relates generally to methods and systems for document expansion. More specifically, the invention relates to methods and systems for performing document expansion for speech retrieval.
2. Description of the Related Art
Increasing amounts of spoken communications are stored in digital form for archival purposes (e.g., broadcasts), or as a byproduct of modern communications technology (e.g., voice mail). Multimedia documents and databases are also becoming increasingly popular, e.g., on the World-Wide-Web (www). There has thus been an interest in developing tools for searching spoken information that complement existing methods for searching textual information.
With advances in automatic speech recognition (ASR) technology, it is now possible to automatically transcribe speech with reasonable accuracy. Once the contents of a speech database or the audio portions of a multimedia database are transcribed using a speech recognition system, traditional information retrieval techniques can be used to search the database. However, inaccuracies in automatic transcriptions pose several new problems for information retrieval (IR) technology in speech retrieval. For poor automatic transcriptions, retrieval effectiveness is much worse than effectiveness for human transcriptions. Due to various factors, including background non-speech sounds (noise, music), poor recording conditions, and disfluent or non-native speech, it is often not possible to get good automatic transcriptions even with the best ASR systems.
Even though IR techniques have been successfully used in retrieving corrupted text generated by optical character recognition (OCR) systems, the kinds of errors in automatic speech transcriptions are very different from those in OCR transcriptions. Since OCR systems usually operate on single characters, errors in character recognition usually produce illegal words, which do not affect the retrieval process substantially. In contrast, current high-performance, large-vocabulary speech recognizers rely on word-pronunciation dictionaries, and their output consists only of legitimate words drawn from the dictionary. Recognition errors are then deletions, insertions, or substitutions of legitimate words, and are therefore not easily discarded.
One of the main problems in performing word- and phrase-based speech retrieval with current methods arises from poor index term assignments for automatic speech transcriptions. From its early days, the field of IR has wrestled with the question of which index terms should be assigned to a given document. The question of which concepts a document is about (“aboutness” in subject indexing) has been revisited several times over the history of IR. Experimentation has shown that automatically derived, uncontrolled index terms are competitive with carefully crafted manual index terms. Most modern IR systems use automatically derived words and phrases as index terms for documents. However, any indexing system, including word- and phrase-based automatic indexing, is imperfect and may thus fail to index the relevant documents under the query terms even though the documents are about those terms. This has often been called the “vocabulary mismatch” problem. This problem is made worse by speech recognition errors, since the automatic transcription of a document may not contain all the terms that were actually spoken, or may contain terms that were not spoken.
A secondary problem in index term assignment is deciding, for an index term assigned to a given document, the “degree” to which that document is about that term. Modern IR systems use sophisticated term-weighting methods to define the degree of aboutness of documents for different terms. When documents are corrupted, as is the case in speech retrieval, term-weighting schemes assign misleading weights to terms. This might also cause some loss in retrieval effectiveness.
Many devices and methods have been proposed over the years to attack the vocabulary mismatch problem, most notably the use of thesauri to enhance the set of index terms assigned to documents or to queries. However, obtaining a reliable thesaurus for any subject area is quite expensive. Attempts have been made to harness word-to-word associations for automatic thesaurus construction, but these attempts have been disappointing. More recently, however, it has been shown that enhancing queries with terms related to the entire concept of the query (often referred to as “query expansion”), and not just with words related to individual query words, reduces the problem of vocabulary mismatch considerably and consistently yields large improvements in retrieval effectiveness, especially for short queries.
Correspondingly, document expansion can be used to enhance the index term assignment for documents. Many studies have utilized enhanced document representations using bibliographic citations and references. Research on the use of spreading activation models in IR also aims at crediting documents based on activation of related documents. However, both of these techniques need some human supervision (in the form of human-generated citations, or the semantic net used) to be made operational.
Document clustering, which does not require any human supervision, can also be interpreted as a form of document expansion. When similar documents are clustered and a cluster representative is used in the search process, the cluster representative usually contains terms from all the documents in the cluster, in effect allowing a match between a document and a query (via the cluster representative) even when individual query terms are missing from the document but are present in other documents in the cluster. Extensive studies on document clustering have given mixed results at best. Work on Latent Semantic Indexing (LSI) has produced similarly mixed results. LSI allows a match between queries and documents that might not share any terms in word-space, but do share some concepts in the LSI space.
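By way of illustration only, the matching effect of a cluster representative described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular prior-art system: the function names `cluster_representative` and `overlap_score`, the two sample documents, and the simple term-overlap score are all hypothetical and chosen for brevity.

```python
from collections import Counter


def cluster_representative(docs):
    """Sum term counts over all documents in a cluster.

    Assumption: a plain bag-of-words sum stands in for whatever
    representative (e.g., a centroid) a real system would use.
    """
    rep = Counter()
    for doc in docs:
        rep.update(doc.lower().split())
    return rep


def overlap_score(query, term_counts):
    """Count how many distinct query terms appear in a bag of terms."""
    return sum(1 for term in set(query.lower().split()) if term in term_counts)


# Hypothetical cluster of two similar documents.
cluster = [
    "senate votes on budget bill",
    "congress debates the budget proposal",
]
query = "congress budget"

# Score the first document on its own, then via the cluster representative.
doc_score = overlap_score(query, Counter(cluster[0].split()))
rep_score = overlap_score(query, cluster_representative(cluster))
```

In this sketch the first document alone matches only one of the two query terms, whereas the cluster representative matches both, because the missing term is contributed by the other document in the cluster.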
An alternative to word-based approaches is to recognize sub-word units (for instance, phonemes) and to use sequences of these sub-word units as index terms. However, it is unclear if the results from this approach are competitive with word-based approaches now that very-large vocabulary recognition systems are available. It is also possible to simultaneously use as index terms words from the best word transcription and phonetic n-grams from phoneme lattices.
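By way of illustration only, index terms built from sequences of sub-word units might be formed as in the following minimal sketch. It assumes a single best phoneme sequence rather than a phoneme lattice, and the function name `phoneme_ngrams`, the underscore-joined term format, and the sample phoneme symbols are hypothetical.

```python
def phoneme_ngrams(phonemes, n=3):
    """Return overlapping n-grams of phonemes, each joined into one index term.

    Assumption: the input is one flat phoneme sequence; a lattice-based
    system would enumerate n-grams over many alternative paths instead.
    """
    return ["_".join(phonemes[i:i + n]) for i in range(len(phonemes) - n + 1)]


# Hypothetical phoneme sequence for a short utterance.
phones = ["s", "p", "iy", "ch"]
terms = phoneme_ngrams(phones, n=3)
```

Each resulting n-gram can then be treated like a word by a conventional indexing scheme, and the same function applied to a phonetic rendering of the query.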
There thus exists a long-felt but unresolved need in the art for document expansion for speech retrieval systems. The methods and systems to perform this task should be versatile and efficient, performing speech retrieval in short periods of time. These results have not heretofore been achieved in the art.