The present invention relates to the generation of an index. In particular, the present invention relates to producing an index for speech.
Searching through vast collections of documents for a particular document of interest has become commonplace in computing environments. In particular, a large number of search services perform searches on web pages found on the Internet.
To perform these text-based searches, search services typically construct an inverted index that has a separate entry for each word found in the documents covered by the search service. Each entry lists all of the documents and the positions within the documents where the word can be found. Many of these search services use the position information to determine if a document contains words in a particular order and/or within a particular distance of each other. This order and distance information can then be used to rank the documents against an input query, with documents that contain the words of the query in the same order as the query being ranked higher than other documents. Without the position information, such document ranking based on word order is not possible.
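As an illustration only, a positional inverted index of the kind described above can be sketched in a few lines of Python. The function names `build_index` and `phrase_docs`, the whitespace tokenization, and the sample documents are assumptions made for this sketch and do not describe any particular search service.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        # Assumption: lowercase, whitespace-separated tokens.
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def phrase_docs(index, phrase):
    """Return ids of documents containing the words of `phrase`
    consecutively and in the same order."""
    words = phrase.lower().split()
    if not words:
        return set()
    hits = set()
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # Word i of the phrase must sit at position start + i.
            if all(start + i in index.get(w, {}).get(doc_id, ())
                   for i, w in enumerate(words[1:], start=1)):
                hits.add(doc_id)
                break
    return hits

# Hypothetical two-document collection.
docs = {
    "d1": "speech recognition is not perfect",
    "d2": "recognition of speech is hard",
}
index = build_index(docs)
ordered_hits = phrase_docs(index, "speech recognition")
```

Both sample documents contain the words "speech" and "recognition", but only the stored positions show that `d1` has them adjacent and in query order. An index that kept document ids alone could not make that distinction.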
Attempts have been made to construct indices for spoken documents, where a spoken document is a speech signal or multiple speech signals that have been grouped together as a single entity. For example, speech signals associated with a particular meeting or a lecture could be grouped as a single spoken document. Also, a multimedia document such as a movie or an animation can be considered a spoken document.
In order to index a spoken document, the speech signals must first be converted into text. This is done by decoding the speech signal using a speech recognition system. Such speech recognition systems use acoustic models and language models to score possible word sequences that could be represented by the speech signal. In many systems, a lattice of possible word strings is constructed based on the speech signal and the path through the lattice that has the highest score is identified as the single word string represented by the speech signal.
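The selection of the highest-scoring path can be sketched as follows. This is a minimal illustration, not the decoder of any particular recognizer: the edge format `(src, dst, word, log_score)`, the requirement that node ids increase along every edge, the function name `best_path`, and the example scores are all assumptions made for this sketch. In a real system each score would combine acoustic model and language model contributions.

```python
import math

def best_path(edges, start, end):
    """Return (log_score, words) for the highest-scoring path from
    `start` to `end` in a word lattice, or None if no path exists.

    Assumption: every edge satisfies src < dst, so integer node ids
    are already in topological order.
    """
    best = {start: (0.0, [])}
    # Visiting edges by increasing source node means a node's best
    # score is final before any edge leaving that node is expanded.
    for src, dst, word, log_score in sorted(edges, key=lambda e: e[0]):
        if src not in best:
            continue  # node not reachable from start
        score = best[src][0] + log_score
        if dst not in best or score > best[dst][0]:
            best[dst] = (score, best[src][1] + [word])
    return best.get(end)

# Hypothetical lattice with two competing words on each segment.
edges = [
    (0, 1, "recognize", math.log(0.6)),
    (0, 1, "wreck", math.log(0.4)),
    (1, 2, "speech", math.log(0.7)),
    (1, 2, "beach", math.log(0.3)),
]
score, words = best_path(edges, 0, 2)
```

In this example the selected word string is "recognize speech". The alternatives "wreck" and "beach" remain in the lattice with their scores, but they do not appear in the single output string.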
In speech indexing systems of the past, this single best estimate of the text from the speech signal is used to create the index for the spoken document. Using a single string output from the speech recognizer provides the ability to mark the position of particular words relative to each other in the spoken document. Thus, the same ranking systems that have been developed for textual indexing can be applied to these spoken document indexing systems.
Unfortunately, speech recognition is not perfect. As a result, the recognized text contains errors. This produces an index with errors, making the systems unreliable during search.
Thus, it is desirable to build a speech indexing system that does not suffer from the errors created by selecting a single best speech recognition result.