1. Field of the Invention
The present invention relates to speech retrieval, and more particularly relates to speech retrieval using related documents.
2. Description of the Related Art
In recent years, research and development on speech retrieval systems has attracted increasing attention.
In general, users want to search for audio files of interest by entering characters (i.e. text); however, since the character format and the audio format are entirely different, this kind of search cannot be carried out directly.
In most conventional speech retrieval systems, the characters (i.e. the search terms) and the search targets (i.e. the audio files) are converted into the same format. For example, the search terms are converted into the audio format, the search targets are converted into the text format, or both are converted into the same third format. However, since speech is highly variable, these kinds of conversions may cause severe loss of information.
More particularly, the following speech retrieval methods are commonly used.
The first and most commonly used speech retrieval method converts speech into text by automatic speech recognition and then searches the text with a text retrieval system. This is the method used in the speech retrieval systems of Google™ and SpeechBot™. It has the advantage that users can understand the contents of audio files by reading the text. However, it has several drawbacks. First, the accuracy of automatic speech recognition is low; the recognized text contains many errors, which lowers the accuracy of the search results. Second, much of the information contained in the audio files themselves, such as context information and the speaker's emotion, speaking speed, and rhythm, is lost. Third, for some special pronunciations, for example those of Chinese-style English, this method cannot work properly at all unless a large amount of training data is available from which an appropriate acoustic model can be obtained.
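The first drawback can be illustrated by the following minimal sketch, which assumes that the recognized text is already available and indexes it with a simple inverted index. The file names, the transcripts, and the functions `build_index` and `search` are hypothetical and serve only as an illustration; they are not taken from any of the systems named above.

```python
from collections import defaultdict


def build_index(transcripts):
    """Map each lowercase word to the set of audio file IDs whose transcript contains it."""
    index = defaultdict(set)
    for file_id, text in transcripts.items():
        for word in text.lower().split():
            index[word].add(file_id)
    return index


def search(index, query):
    """Return the IDs of the files whose transcript contains every query word."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical recognizer output. The speaker in "b.wav" is assumed to have
# said "speech retrieval", but the recognizer produced "peach retrieval".
asr_transcripts = {
    "a.wav": "speech retrieval using related documents",
    "b.wav": "peach retrieval is an active research topic",
}

index = build_index(asr_transcripts)
print(sorted(search(index, "speech retrieval")))  # b.wav is missed
```

In this sketch a single misrecognized word is enough to remove a relevant audio file from the search results, because the text retrieval step sees only the recognized text and not the speech itself.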
The second speech retrieval method translates (i.e. converts) both speech and text into the same third format, such as a phonemic code format, a syllable format, a sub-word format, or a word format, and then uses the translated text to search the translated speech. However, this method also has several drawbacks. First, the accuracy of the translation is not high. Second, the method often causes confusion. For example, in a case where both are converted into the phonemic code format, if a user searches for “information”, the search results may include “attention”, “detection”, etc., because these words share the pronunciation “-tion”. In addition to these two drawbacks, this method also has the above-mentioned drawbacks of the first speech retrieval method.
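The confusion described above can be reproduced with the following minimal sketch, which assumes that matching in the phonemic code format is performed on overlapping phoneme trigrams. The small lexicon is hand-written in an ARPAbet-like notation for illustration only, and the functions `phoneme_ngrams` and `confusable` are hypothetical.

```python
def phoneme_ngrams(phonemes, n=3):
    """Return the set of consecutive phoneme n-grams of a phoneme sequence."""
    return {tuple(phonemes[i:i + n]) for i in range(len(phonemes) - n + 1)}


# Hand-written, illustrative pronunciations (assumption, not a real dictionary).
LEXICON = {
    "information": "IH N F ER M EY SH AH N".split(),
    "attention": "AH T EH N SH AH N".split(),
    "detection": "D IH T EH K SH AH N".split(),
    "retrieval": "R IH T R IY V AH L".split(),
}


def confusable(query, lexicon, n=3):
    """Return the other words that share at least one phoneme n-gram with the query."""
    query_grams = phoneme_ngrams(lexicon[query], n)
    return sorted(
        word
        for word, phonemes in lexicon.items()
        if word != query and query_grams & phoneme_ngrams(phonemes, n)
    )


print(confusable("information", LEXICON))
```

Under these assumptions, “attention” and “detection” are returned for the query “information” because all three words end in the phoneme sequence SH AH N, which corresponds to “-tion”, whereas “retrieval” is not returned.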
The third speech retrieval method uses only the related documents of speech to make a general search for information. This method is usually used in searching for music. Since the related documents of speech generally include less information than the speech itself does, and the contents of the speech itself are difficult to use in this method, the amount of information used in this method is very small.
Cited reference No. 1: U.S. Pat. No. 7,526,425B2
Cited reference No. 2: U.S. Pat. No. 7,542,996B2