1. Field of the Invention
The present invention relates to calculation of a confidence measure that represents a degree of correctness of a target word or a target word string, and more particularly to a method and apparatus for calculating the confidence measure based on a degree of matching between the target word string that undergoes confidence measure calculation and an adjacent context in a recognition result.
2. Description of the Related Art
In recent years, automatic speech recognition (ASR) systems have come to be widely used, for example, to input text data and commands directly to computer systems by speech. However, even the most advanced speech recognition system cannot produce a speech recognition result that is free of recognition errors. It is therefore important to calculate a confidence measure of a recognition result so that recognition errors can be detected automatically. The confidence measure, which represents the degree of correctness of a recognition result, is calculated so that the greater the confidence measure is, the higher the probability that the recognition result is correct, whereas the smaller the confidence measure is, the higher the probability that the recognition result is wrong. For example, in spoken document retrieval, which is one of the applications based on speech recognition results, the accuracy of the retrieval is improved either by eliminating recognition results having a confidence measure smaller than or equal to a certain value from a retrieval index list, or by weighting the count of words used in retrieval according to their confidence measures.
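The two uses of the confidence measure in spoken document retrieval mentioned above can be sketched as follows. This is a minimal illustration only, not the implementation of any particular retrieval system: the function names `build_index` and `weighted_counts`, the `(doc_id, word, confidence)` tuple layout, and the choice of letting each occurrence contribute its own confidence to the count are all assumptions made for the example.

```python
from collections import defaultdict

def build_index(recognized, threshold):
    """Build a word -> document-id index, dropping low-confidence words.

    `recognized` is an iterable of (doc_id, word, confidence) tuples
    (a hypothetical layout). Results whose confidence is smaller than
    or equal to `threshold` are eliminated from the index.
    """
    index = defaultdict(set)
    for doc_id, word, confidence in recognized:
        if confidence > threshold:
            index[word].add(doc_id)
    return dict(index)

def weighted_counts(recognized):
    """Count words per document, weighting each occurrence by its confidence.

    Assumption: one occurrence contributes its confidence value instead of 1.
    """
    counts = defaultdict(float)
    for doc_id, word, confidence in recognized:
        counts[(doc_id, word)] += confidence
    return dict(counts)

# Hypothetical recognition results with made-up confidence values.
recognized = [
    ("d1", "gorillas", 0.9),
    ("d1", "guerrillas", 0.2),
    ("d2", "gorillas", 0.5),
]
index = build_index(recognized, threshold=0.5)
counts = weighted_counts(recognized)
```

With the threshold set to 0.5, only the first entry survives in `index`, because the third entry has a confidence equal to the threshold and is therefore eliminated.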
An example of a method for calculating the confidence measure of a word in a speech recognition result is proposed in S. Cox and S. Dasmahapatra, “High-level approaches to confidence estimation in speech recognition,” IEEE Trans. Speech and Audio Processing, vol. 10, no. 7, pp. 460-471, 2002, which is herein incorporated by reference. The method proposed by Cox et al. is based on the idea that a correctly recognized word has large semantic relatedness with each adjacent word, whereas a wrongly recognized word has small semantic relatedness with each adjacent word.
The method proposed by Cox et al. will be described with reference to FIG. 1, which illustrates a configuration of a related-art confidence measure calculation apparatus that implements the method of Cox et al. Speech of a user is supplied to speech input unit 301, and the supplied speech is then sent to speech recognition system 302 such as the ASR system. The recognition result, i.e., recognized text, is supplied to confidence measure calculation target specifier 303 and adjacent word extractor 304. Text data for training is stored in training text data storage 311.
In the apparatus shown in FIG. 1, the training text data stored in training text data storage 311 is used to calculate in advance semantic relatedness between any two arbitrary words in a manner described below. The calculation of semantic relatedness is performed by semantic relatedness calculator 306 and the result of the calculation is stored in semantic relatedness storage 312. When a speech recognition result is provided from speech recognition system 302, confidence measure calculation target specifier 303 specifies a target word for the confidence measure calculation from the recognition result, and adjacent word extractor 304 then extracts words adjacent to the target word from the recognition result. Finally, confidence measure calculator 305 refers to values stored in semantic relatedness storage 312 to calculate semantic relatedness between the target word and each of the extracted adjacent words and averages the resultant semantic relatedness values. The average is used as the confidence measure of the target word and stored in calculation result storage 313.
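The final step described above, averaging the semantic relatedness between the target word and its adjacent words, can be sketched as follows. This is a simplified illustration under stated assumptions: the relatedness values are held in a plain dictionary keyed by unordered word pairs, unknown pairs default to 0.0, and "adjacent" is taken to mean a fixed window of `window` words on each side. The names `relatedness`, `adjacent_words`, and `confidence` are hypothetical, and the text does not specify how the adjacent words are selected.

```python
from statistics import mean

def relatedness(table, a, b):
    """Look up precomputed relatedness for an unordered word pair.

    Assumption: pairs missing from the table have relatedness 0.0.
    """
    return table.get(frozenset((a, b)), 0.0)

def adjacent_words(words, index, window):
    """Return up to `window` words on each side of position `index`."""
    start = max(0, index - window)
    return words[start:index] + words[index + 1:index + 1 + window]

def confidence(words, index, table, window=2):
    """Average relatedness between the target word and its adjacent words."""
    neighbors = adjacent_words(words, index, window)
    if not neighbors:
        return 0.0
    return mean(relatedness(table, words[index], w) for w in neighbors)

# Hypothetical recognition result and made-up relatedness values.
words = ["gorillas", "live", "in", "parks"]
table = {
    frozenset(("gorillas", "live")): 0.75,
    frozenset(("gorillas", "in")): 0.25,
}
score = confidence(words, 0, table, window=2)
```

Here the target word at position 0 has two adjacent words inside the window, so `score` is the mean of 0.75 and 0.25.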
The method of Cox et al. uses latent semantic analysis (LSA) to calculate semantic relatedness between any two words by using training text data. LSA is a method for determining a degree of co-occurrence between any two words in training data. A large degree of co-occurrence between two words means that the two words are likely to be used together in the training data. Since two words that are often used together are considered to be semantically related to each other to a large extent, the degree of co-occurrence between two words calculated by using LSA is regarded as the semantic relatedness between the two words.
A specific method for calculating semantic relatedness based on LSA is as follows: Training data is first divided into a plurality of documents. When the training data is, for example, taken from newspapers, one newspaper article may be used as one document. A term-document matrix whose elements represent the weights of the words in each document is then created. Frequently used weights of a word are the term frequency (TF) and the term frequency-inverse document frequency (TF-IDF). Each row vector in the term-document matrix represents a distribution showing how often the corresponding word appears in each document. Singular value decomposition (SVD) is then performed on the term-document matrix so that each word is expressed as a lower-dimensional row vector. Since the similarity structure between row vectors is maintained through SVD, calculating the cosine similarity between two resultant row vectors provides the semantic relatedness between the corresponding two words.
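The LSA procedure above can be sketched with NumPy as follows. This is a minimal illustration, not the implementation used by Cox et al.: the toy term-document matrix holds made-up raw term frequencies, the reduced dimension `k` is chosen arbitrarily, and scaling the left singular vectors by the singular values to obtain word vectors is one common convention that the text does not specify. The names `lsa_word_vectors` and `cosine` are hypothetical.

```python
import numpy as np

def lsa_word_vectors(term_doc, k):
    """Reduce each row (word) of a term-document matrix to k dimensions.

    Uses a truncated SVD; each word vector is a row of U_k scaled by the
    k largest singular values (an assumed convention).
    """
    u, s, _vt = np.linalg.svd(term_doc, full_matrices=False)
    return u[:, :k] * s[:k]

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Toy term-document matrix: rows are words, columns are documents,
# entries are made-up term frequencies (TF).
words = ["gorillas", "parks", "election", "vote"]
term_doc = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
])

vectors = lsa_word_vectors(term_doc, k=2)
related = cosine(vectors[0], vectors[1])    # "gorillas" vs. "parks"
unrelated = cosine(vectors[0], vectors[2])  # "gorillas" vs. "election"
```

In this toy data, "gorillas" and "parks" occur in the same documents and never with "election" or "vote", so `related` comes out close to 1 and `unrelated` close to 0.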
As described above, when the semantic relatedness between a target word and each adjacent word is small, it is believed that the target word is likely wrong. In the method for calculating a confidence measure proposed by Cox et al., when the semantic relatedness between a target word and each adjacent word is small, the confidence measure of the target word is also small, whereby a recognition error can be detected based on the calculated confidence measure.
The technology described above, however, is problematic in that even when a target word of confidence measure calculation is a recognition error, a large confidence measure may be obtained in some cases. In that case, the confidence measure of the target word is highly likely to be larger than a predetermined threshold and the target word will be wrongly judged as correct.
The reason for the problem described above is that even when a target word is a recognition error, the semantic relatedness between the target word and an adjacent word may be large in some cases. FIG. 2 shows an example of such a case. FIG. 2 specifically shows a speech recognition result of spoken English news. Assume now that “guerrillas” is selected as a target word for confidence measure calculation from speech recognition result 320, and that “guerrillas” is a recognition error and “gorillas” is correct. It is therefore expected that the semantic relatedness between “guerrillas” and each adjacent word is small. However, when the semantic relatedness between “guerrillas” and each adjacent word in the recognition result was actually calculated using LSA with English training text data, “guerrillas” had large semantic relatedness with, for example, “parks,” “protected,” “boundaries,” and “tourism” (i.e., the words shown in bold italics in FIG. 2). The reason “guerrillas” has large semantic relatedness with “tourism” is that the training data in fact contains articles about former guerrillas working to rebuild their communities through tourism. As a result, although “guerrillas” was a recognition error, the confidence measure calculated by using the method of Cox et al. became large, contrary to the original intention. In general, since any single word is often related to many other words, the method proposed by Cox et al. does not always yield small relatedness between a target word and an adjacent word when the target word is a recognition error.
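The failure case described above can be made concrete with a small numerical sketch. The relatedness values and the threshold below are invented purely for illustration and are not taken from FIG. 2 or from any experiment; the helper `mean_relatedness` is likewise hypothetical.

```python
def mean_relatedness(target, neighbors, table):
    """Average the relatedness between `target` and each adjacent word.

    Assumption: pairs missing from the table have relatedness 0.0.
    """
    scores = [table.get((target, w), 0.0) for w in neighbors]
    return sum(scores) / len(scores) if scores else 0.0

# Made-up relatedness values for the misrecognized word "guerrillas".
table = {
    ("guerrillas", "parks"): 0.7,
    ("guerrillas", "protected"): 0.6,
    ("guerrillas", "boundaries"): 0.6,
    ("guerrillas", "tourism"): 0.8,
}
neighbors = ["parks", "protected", "boundaries", "tourism"]
threshold = 0.5  # hypothetical decision threshold

score = mean_relatedness("guerrillas", neighbors, table)
accepted = score > threshold  # True: the recognition error is judged correct
```

Because every adjacent word happens to be related to the wrong word, the averaged confidence exceeds the threshold and the error goes undetected.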