1. Field of the Invention
The present invention relates to a speech retrieval apparatus and a speech retrieval method, and more particularly relates to a speech retrieval apparatus and a speech retrieval method based on a holo-speech search (HSS) for searching, in a speech database, for an audio file matching an input search term with high precision and high recall.
2. Description of the Related Art
In recent years, with the further popularization of audio applications, audio files such as those used for broadcasting, TV, podcasting, audio learning, and voice mailboxes can be found everywhere on computers, on networks, and in everyday life. As the amount of speech information increases, it is becoming more difficult for a user to find and locate a desired audio file.
In a conventional text search method, an index file is created for the original data so that the appearance position of a search term can be located rapidly. The mainstream method at present is to create an inverted file table in units of words. Each file is formed by a sequence of words, and a search condition input by a user is generally formed by a few words. As a result, if the appearance positions of these words are recorded in advance, the files containing these words can be found as soon as the words are found in the index file.
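The word-level inverted file table described above can be illustrated with a minimal sketch. The function names `build_inverted_index` and `search`, the whitespace tokenization, and the sample files are assumptions made for illustration only; they are not taken from any particular conventional system.

```python
from collections import defaultdict


def build_inverted_index(files):
    """Map each word to {file_id: [positions where the word appears]}."""
    index = defaultdict(dict)
    for file_id, text in files.items():
        # Assumption: words are separated by whitespace and compared case-insensitively.
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(file_id, []).append(position)
    return index


def search(index, query):
    """Return the ids of the files that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], {}))
    for word in words[1:]:
        result &= set(index.get(word, {}))
    return result


# Hypothetical sample data.
files = {
    "doc1": "speech retrieval with an inverted index",
    "doc2": "text retrieval with a word index",
}
index = build_inverted_index(files)
print(search(index, "speech index"))
```

Because the appearance positions are recorded in advance, a query only touches the index entries of its own words and never scans the original files.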
In a conventional speech retrieval system, a speech search is carried out by using a speech recognition result and the corresponding lattice information, or by using the lattice information alone. In order to increase retrieval speed, the text search method is also used in some conventional speech retrieval systems. However, systems of this kind can generally deal only with a text search term. The significance of the lattice information is as follows. In the field of speech search, speech recognition usually yields only the single most preferred result. With the lattice information, however, it is possible to obtain plural possible speech recognition results within a certain range of confidence, so there are more choices. A search can then cover these additional choices, so that the problems of recognition errors, out-of-vocabulary (OOV) words, etc., can be alleviated to some degree.
OOV means exceeding the scope of a dictionary. An acoustic model and a language model are normally used in speech recognition; they are mathematical models obtained by training on manually annotated real language data. If a pronunciation or a word does not appear in the real language data at all, it cannot be recognized during speech recognition. This causes an OOV problem. OOV problems are mainly concentrated in words such as geographical names and personal names.
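The benefit of searching plural recognition results can be sketched as follows. As a simplifying assumption, the lattice is flattened here into a list of (hypothesis text, confidence) pairs; a real lattice is a graph of word or sub-word arcs. The function name `search_hypotheses`, the confidence values, and the sample hypotheses are hypothetical.

```python
def search_hypotheses(hypotheses, term, min_confidence=0.0):
    """Return the highest confidence among hypotheses containing term, or None."""
    best = None
    for text, confidence in hypotheses:
        # Only hypotheses within the accepted range of confidence are searched.
        if confidence >= min_confidence and term in text.split():
            if best is None or confidence > best:
                best = confidence
    return best


# Hypothetical recognition output for one utterance, best hypothesis first.
hypotheses = [
    ("wreck a nice beach", 0.55),
    ("recognize speech", 0.45),
]

one_best = hypotheses[:1]  # what a search on the single best result would see
print(search_hypotheses(one_best, "speech"))    # the term is missed
print(search_hypotheses(hypotheses, "speech"))  # the term is found in a lower-ranked hypothesis
```

In this sketch a recognition error in the top hypothesis hides the search term, while the search over all hypotheses still finds it, at the cost of admitting lower-confidence matches.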
Some features of the audio file, such as a phonemic code, a sub-word unit, and a speech recognition result, may be considered for the speech search. In general, the corresponding lattice information of the phonemic code, the sub-word unit, and the speech recognition result can be obtained in a recognition process.
A phonemic code denotes a phoneme, the smallest segmental unit of sound used to form meaningful contrasts between utterances in a language or dialect. A phoneme is a concretely existing physical phenomenon. The International Phonetic Alphabet (IPA) is a meaningful text assembly whose symbols can be used to represent the sounds of any language. Compared with speech search using the sub-word unit described below, speech retrieval using the phonemic code can alleviate the problems of OOV words, insufficient training data, and recognition errors more effectively; however, it may introduce some noise into the retrieval result.
A sub-word unit is a statistically meaningful combination of phonemic codes; it is a meaningful text assembly and coincides with the regular pronunciation habits of human beings. Speech search using the sub-word unit can alleviate the problems of OOV words and insufficient training data to some degree. With respect to recognition errors, it is better than speech search using the speech recognition result described below, but worse than speech search using the phonemic code. Using this feature can reduce the noise. Its retrieval precision is higher than that of the phonemic code, but lower than that of the speech recognition result.
A speech recognition result is the character result of the audio file that has real linguistic meaning; thus it is human-readable information. Speech search using the speech recognition result may suffer from the problems of OOV words, non-native language, insufficient training data, recognition errors, etc., and it is often difficult to solve these problems by using this feature alone. When none of these problems arises, retrieval precision is high. But if they occur, there may be no retrieval result at all, or a retrieval error may occur.
Some concepts in the field of speech search are briefly introduced as follows.
(1) Precision and Recall
Precision can be seen as a measure of exactness or fidelity, whereas recall is a measure of completeness. In an information retrieval scenario, precision is defined as the number of relevant objects retrieved by a search divided by the total number of objects retrieved by that search, and recall is defined as the number of relevant objects retrieved by a search divided by the total number of existing relevant objects which should have been retrieved.
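The two definitions above can be written out as a short sketch. The function name `precision_recall`, the convention of returning 0.0 when a denominator is zero, and the sample file names are assumptions made for illustration.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one search.

    retrieved: objects returned by the search
    relevant:  all existing relevant objects that should have been retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant objects actually retrieved
    # Assumption: an empty denominator yields 0.0 rather than an error.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: 4 files returned, 2 of them among the 3 relevant files.
precision, recall = precision_recall(
    retrieved=["a.wav", "b.wav", "c.wav", "d.wav"],
    relevant=["a.wav", "b.wav", "e.wav"],
)
print(precision, recall)
```

Here precision is 2/4 and recall is 2/3: half of what was returned is relevant, and two thirds of what should have been returned was found.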
(2) Ranking
A retrieval system may return corresponding files only according to a logical relationship between a search term and the files. To further express a deeper relationship between the results and the search term, so that the results that best meet the user's demand are shown first, it is also necessary to rank the results by using various data. Currently there are two mainstream techniques, used for ranking, for analyzing the correlation between retrieval results and a search term: link analysis and content-based calculation.
(3) Speech Division
Speech division means dividing an audio file into segments that can be indexed.
(4) Speech Data
At the bottom layer, the data of both a speech search term and an audio file in a speech database are characters. If the character segments of the search term are the same as the character segments of the audio file, the search term and the audio file are considered matching. Matching is based on a division; the sub-word units formed after the division are the character segments. If a character segment, for example, “ABCD” in the sub-word unit dimension of a search term, and a character segment, for example, “ABCD” in the sub-word unit dimension of an audio file, entirely match, the search term and the audio file are considered to match entirely in the sub-word unit dimension. Besides entire matching, there is fuzzy matching. Fuzzy matching accepts matches that may be less than 100% perfect when finding a correspondence between two segments. For example, for “ABCD” and “AECD”, or “ABCD” and “ABCE”, 75% of the characters are the same, so the segments can be considered matching. Matching in the other dimensions (for example, the phonemic code or the speech recognition result) works in the same way; either entire matching or fuzzy matching can be used.
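The entire and fuzzy matching of character segments can be sketched as follows. The position-by-position comparison, the handling of segments of unequal length (dividing by the longer length), and the function names `match_ratio` and `is_match` are assumptions made for illustration; only the 75% threshold and the sample segments come from the example above.

```python
def match_ratio(segment_a, segment_b):
    """Fraction of positions at which the two character segments agree."""
    if not segment_a and not segment_b:
        return 1.0
    same = sum(1 for x, y in zip(segment_a, segment_b) if x == y)
    # Assumption: unequal lengths are penalized by dividing by the longer length.
    return same / max(len(segment_a), len(segment_b))


def is_match(segment_a, segment_b, threshold=0.75):
    """Entire matching is threshold=1.0; fuzzy matching uses a lower threshold."""
    return match_ratio(segment_a, segment_b) >= threshold


print(is_match("ABCD", "ABCD", threshold=1.0))  # entire matching
print(is_match("ABCD", "AECD"))                 # fuzzy matching, 3 of 4 characters agree
print(is_match("ABCD", "AECF"))                 # only 2 of 4 characters agree
```

The same comparison can be applied in any dimension, since phonemic codes, sub-word units, and speech recognition results are all represented as character segments.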
In addition, U.S. Pat. No. 7,542,966 discloses a speech retrieval system in which the phonemic code, the sub-word unit and the corresponding lattice information are used. However, the speech recognition result is not used, and only the speech search term can be dealt with.
None of the conventional speech retrieval techniques comprehensively uses the various features of speech for making a search. Therefore the problems of OOV words, numerous recognition errors, non-native language, insufficient training data, etc., cannot be overcome; retrieval precision, retrieval speed, and error robustness cannot be improved; and a text search term and a speech search term cannot be dealt with at the same time.