The Internet provides worldwide access to a vast number of databases storing publicly available multimedia content and documents. Much of this content is in the form of audio and video recordings. Typically, browsers and search engines executing on desktop systems are used to retrieve the stored documents by having the user specify textual queries or follow links. The typed queries generally include keywords or phrases, and the output is also text or images.
Portable communications devices, such as cellular telephones and personal digital assistants (PDAs), can also be used to access the Internet. However, such devices have limited textual input and output capabilities. For example, the keypads of cellular telephones are not well suited for typing input queries, and many PDAs do not have character keys at all. The displays of these devices are also small and difficult to read. Such devices are better suited for speech input and output, particularly when the document includes an audio signal, such as speech or music. Therefore, spoken queries are sometimes used.
Prior art document retrieval systems for spoken queries typically use a speech recognition engine to convert the spoken query to a text transcript. The query is then treated as text, and conventional information retrieval processes can be used to retrieve pertinent documents that match the query.
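By way of illustration, such a conventional pipeline can be sketched as follows. This is a minimal sketch under assumed components: the recognize function stands in for an unspecified speech recognition engine, and the inverted index is a simplified placeholder for a full text retrieval system; none of these names come from the prior art systems themselves.

```python
# Minimal sketch of a conventional spoken-query retrieval pipeline.
# All names here are illustrative placeholders, not prior art components.

def recognize(audio: bytes) -> str:
    """Hypothetical speech recognition engine: returns only the single
    best (1-best) text transcript of the spoken query."""
    raise NotImplementedError  # stand-in for a real ASR engine


def retrieve(transcript: str, index: dict[str, set[int]]) -> set[int]:
    """Conventional text retrieval: match transcript keywords against
    an inverted index mapping each word to a set of document ids."""
    words = transcript.lower().split()
    matches = [index.get(w, set()) for w in words]
    return set.union(*matches) if matches else set()


def spoken_query_search(audio: bytes, index: dict[str, set[int]]) -> set[int]:
    # The audio signal is reduced to plain text at this step; any
    # recognition errors or alternative hypotheses are discarded.
    transcript = recognize(audio)
    return retrieve(transcript, index)
```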
However, that approach discards valuable information that could be used to improve the performance of the retrieval system. Most significantly, the entire audio signal that constitutes the spoken query is discarded; all that remains is the raw text content, which is often misrecognized.
When either the documents or the query are specified by speech, new techniques must be provided to optimize the performance of the system. Techniques used in conventional information retrieval systems that retrieve documents using text queries perform poorly on spoken queries and spoken documents because the text output of a speech recognition engine often contains errors. The spoken query often contains ambiguities that could be interpreted in many different ways. The converted text can even contain words that are entirely inconsistent within the context of the spoken query, including mistakes that would be obvious to any listener. Simple text output from the speech recognition engine throws away much valuable information, such as what other words might have been said, or what the query sounded like. The audio signal is usually rich and contains many features, such as variations in volume and pitch, as well as harder-to-distinguish features such as stress or emphasis. All of this information is lost.
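To make the discarded information concrete, consider that a recognizer internally entertains several alternative hypotheses, each with an associated probability, before emitting its single best transcript. The following sketch contrasts the one transcript kept by conventional systems with a hypothetical N-best list; the data structure and the example hypotheses (including the classic "recognize speech" versus "wreck a nice beach" confusion) are illustrative assumptions, not the output of any particular engine.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """One alternative recognition of the spoken query, with the
    recognizer's estimated probability that it is what was said."""
    words: list[str]
    probability: float

# Hypothetical N-best output for an ambiguous spoken query.
n_best = [
    Hypothesis(["recognize", "speech"], 0.48),
    Hypothesis(["wreck", "a", "nice", "beach"], 0.31),
    Hypothesis(["recognize", "peach"], 0.21),
]

# A conventional 1-best system keeps only this string...
best_only = " ".join(n_best[0].words)

# ...and discards the alternatives and their probabilities, the very
# certainty information that could guide retrieval.
discarded = n_best[1:]
```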
Therefore, it is desired to improve information retrieval systems that use spoken queries. Moreover, it is desired to retain certainty information about spoken queries while searching for documents that could match the spoken query. In particular, document retrieval would be improved if the probability of what was said, or not said, were known while searching multimedia databases.