1. Technical Field
The present invention relates to a method and apparatus for speech recognition and, more specifically, to a method and apparatus that recognize natural human speech as text and prepare text data after automatically screening out meaningless words called disfluencies.
2. Description of the Related Art
Statistical methods for recognizing speech using acoustic models and language models have been known in the art. Examples of such methods are described in papers such as “A Maximum Likelihood Approach to Continuous Speech Recognition” (L. R. Bahl et al., IEEE Trans. Vol. PAMI-5, No. 2, March 1983) and “Word-based approach to large-vocabulary continuous speech recognition for Japanese” (Nishimura et al., Information Processing Society of Japan, Vol. 40, No. 4, April 1999). Briefly, in those methods a text or word sequence, which can be referred to as W, is generated and spoken. The speech can be processed by an acoustic processor into a series of signals, from which a feature of the speech, which can be referred to as X, can be extracted. A recognition result, which can be referred to as W′, can be determined and output as the most suitable result based on the feature X and the text W according to the following expression (Expression 1):

$$W' = \underset{W}{\operatorname{argmax}}\; P(X \mid W)\, P(W)$$

Namely, the probability P(X|W) of the feature X when a word sequence W is spoken can be multiplied by the probability P(W) of W itself. The word sequence W′ which makes this product the largest (argmax) can be selected as the recognition result. Thus the text can be constructed.
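By way of illustration only, the selection rule described above (choosing the word sequence that maximizes the product of P(X|W) and P(W)) can be sketched as follows. The candidate sentences and all probability values are hypothetical and are not taken from the cited papers; log probabilities are summed rather than multiplying raw probabilities, which is a common way to avoid numerical underflow.

```python
import math

def decode(candidates, acoustic_logprob, language_logprob):
    """Return the candidate W that maximizes log P(X|W) + log P(W)."""
    return max(candidates,
               key=lambda w: acoustic_logprob[w] + language_logprob[w])

# Hypothetical scores for two candidate transcriptions of one utterance X.
acoustic = {  # log P(X|W), as an acoustic model might supply
    "recognize speech": math.log(0.020),
    "wreck a nice beach": math.log(0.025),
}
language = {  # log P(W), as a language model might supply
    "recognize speech": math.log(0.010),
    "wreck a nice beach": math.log(0.001),
}

best = decode(list(acoustic), acoustic, language)
```

In this sketch the second candidate has the slightly higher acoustic score, but the first candidate is selected because its language model probability is ten times larger.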
Acoustic models can be used for calculating the former probability P(X|W), and the words which make this probability large enough can be selected as candidates for the recognition results. On the other hand, language models, more specifically N-gram models, can often be used for approximating the latter probability P(W). An N-gram model approximates the appearance probability of an entire text or word sequence W based on the probabilities of groups of N consecutive words, N being an integer. The method can be expressed in the form of the following expression (Expression 2):

$$P(W) = P(w_0)\, P(w_1 \mid w_0)\, P(w_2 \mid w_0 w_1) \times \cdots \times P(w_n \mid w_0, w_1, \ldots, w_{n-1}) \cong P(w_0)\, P(w_1 \mid w_0) \prod_{i=2}^{n} P(w_i \mid w_{i-2}, w_{i-1})$$

It is supposed in this expression that the probability of a word w[n] depends only on the N−1 words immediately preceding the word in question. The value of N can be varied, but N=3 is often used as a trade-off between the effectiveness of the model and the size of the data required for learning. Expression 2 shows the case of N=3, as well.
For example, if the n-th word of a text W is hereafter expressed as w[n], then the probability of the word sequence W can be calculated as the product of the appearance probabilities of each word w[n] under the condition of the preceding N−1 (namely 2) words, that is, P(w[n]|w[n−2],w[n−1]). Here, w[n] at the left of “|” indicates the object word of recognition, and (w[n−2],w[n−1]) at the right of “|” indicates the 2 words immediately prior to the object word, which constitute the condition for predicting the word w[n]. The conditional probability P(w[n]|w[n−2],w[n−1]) for each of various words w[n] can be learned from text data prepared separately and stored as a database in the form of a dictionary. For example, the probability of the word “word” appearing at the beginning of a text can be 0.0021, the probability of the word “search” coming immediately after the word “word” can be 0.001, and so on.
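By way of illustration only, the learning of trigram probabilities from text data and their use for scoring a word sequence can be sketched as follows. The sketch assumes plain relative-frequency estimates without smoothing, and it pads each text with two hypothetical start symbols "<s>" so that the first two words are also scored by trigram lookups; the function names and the tiny corpus are likewise hypothetical.

```python
from collections import Counter

def train_trigram(sentences):
    """Count trigrams and their two-word histories in tokenized sentences."""
    tri, hist = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + words  # assumed start-of-text padding
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            hist[tuple(padded[i - 2:i])] += 1
    return tri, hist

def trigram_prob(tri, hist, w, h2, h1):
    """Relative-frequency estimate of P(w | h2, h1); 0.0 if history unseen."""
    denom = hist[(h2, h1)]
    return tri[(h2, h1, w)] / denom if denom else 0.0

def sequence_prob(tri, hist, words):
    """Product of trigram probabilities over the whole word sequence."""
    p = 1.0
    padded = ["<s>", "<s>"] + words
    for i in range(2, len(padded)):
        p *= trigram_prob(tri, hist, padded[i], padded[i - 2], padded[i - 1])
    return p

# Hypothetical training text.
corpus = [
    ["word", "search", "works"],
    ["word", "search", "fails"],
    ["word", "count", "works"],
]
tri, hist = train_trigram(corpus)
p_search = trigram_prob(tri, hist, "search", "<s>", "word")
p_sentence = sequence_prob(tri, hist, ["word", "search", "works"])
```

A practical system would additionally smooth these estimates so that word sequences absent from the training text do not receive a probability of zero.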
The above N-gram model can be good enough for the recognition of speech read from a prepared text, but written copies are rarely available in areas where speech recognition is applied. More important than the recognition of speech read from prepared texts, therefore, can be the application of the technique to the recognition of spontaneous speech. In such cases, in addition to normal words having semantic content or meaning, interjectory expressions such as “well” and “you know” and meaningless words such as “um” and “er” can be pronounced. These words can be called unnecessary words, disfluencies, or disfluency words. Accordingly, an N-gram model capable of dealing with disfluencies so that they can be screened out automatically can be beneficial to a speech recognition system.
Conventional extensions of the N-gram model proposed for the above purpose have utilized a concept referred to as a “transparent word.” Some of those proposed extensions are described in reports such as “Dealing with Out-of-vocabulary Words and Filled Pauses in Word N-gram Based Speech Recognition System” (Kai et al., Information Processing Society of Japan, Vol. 40, No. 4, April 1999) and “A Study on Broadcast News Transcription” (Nishimura, Ito, Proceeding of the Fall Meeting of the Acoustical Society of Japan, 1998). In the extension models described in the former report, for example, probability calculations can be made ignoring the existence of disfluencies, either during learning, which can be referred to as training, or during recognition. The calculations rest on the assumption that disfluencies appear comparatively freely between phrases, so that N-grams, which are a constraint on co-occurrence, cannot be expected to work effectively for them. For example, when a word w[n−1] is a disfluency, rather than calculating the probability of w[n] as P(w[n]|w[n−2],w[n−1]), the probability of the word w[n] can be estimated as P(w[n]|w[n−3],w[n−2]), ignoring w[n−1]. In this case, the disfluency, which is the word ignored or skipped, is called a “transparent word.” Probabilities can be calculated in this model on the assumption that disfluencies appear between non-disfluency words (normal words) with an equal probability.
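By way of illustration only, the selection of the conditioning words under the transparent word model can be sketched as follows. The disfluency list and the function name are hypothetical, and the sketch assumes that every disfluency in the history is skipped, which generalizes the single case (w[n−1] being a disfluency) given in the text above.

```python
DISFLUENCIES = {"um", "er"}  # hypothetical disfluency vocabulary

def transparent_history(words, n, disfluencies=DISFLUENCIES, order=3):
    """Return the (order - 1) nearest non-disfluency words preceding w[n]."""
    history = []
    i = n - 1
    while i >= 0 and len(history) < order - 1:
        if words[i] not in disfluencies:  # disfluencies are "transparent"
            history.append(words[i])
        i -= 1
    return tuple(reversed(history))

words = ["please", "search", "um", "the", "word"]

# w[3] = "the" follows the disfluency "um", so the condition becomes
# (w[0], w[1]) instead of (w[1], w[2]).
history_after_disfluency = transparent_history(words, 3)

# w[4] = "word": the skipped "um" is also passed over further back.
history_later = transparent_history(words, 4)
```

The returned pair can then be used as the condition of an ordinary trigram lookup, so that the disfluency itself never appears in the conditioning context.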
Some reports say, however, that the assumption that disfluencies carry no information and appear freely between normal words does not hold in the English language. For example, the paper titled “Statistical Language Modeling for Speech Disfluencies” (A. Stolcke, E. Shriberg, Proc. of ICASSP96) states that applying a common N-gram to disfluencies improved the accuracy of predicting a word succeeding a disfluency as compared with the transparent word model. Nevertheless, since the nature of a disfluency can be empirically different from that of a normal word, as is clear from the above explanation of the transparent word, other solutions can yield more accurate results than modeling word sequences including disfluencies as a simple sequential series.
Conventional speech recognition systems commonly used for dictation purposes, on the other hand, often employ a method of interpolation between two or more different language models. This technique can be used when a general purpose model serving as a base model cannot effectively deal with texts peculiar to a specific field of activities, such as computers or sports. In such a case, a language model of the specific field, having been trained on texts peculiar to the field in question, can be employed in combination with the language model for general purposes. Using this approach, the probability calculation can be performed as follows:
 Pr(w[n]|w[n−2],w[n−1])=λP1(w[n]|w[n−2],w[n−1])+(1−λ)P2(w[n]|w[n−2],w[n−1])
where P1 indicates a general purpose language model, P2 indicates a language model of a specific field, and λ is an interpolation coefficient, which can be set at an optimum value through experimentation.
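By way of illustration only, the interpolation of the two language models and one possible way of choosing the interpolation coefficient can be sketched as follows. The function names, the candidate grid, and the held-out probability values are hypothetical; selecting the coefficient that maximizes the log-likelihood of held-out data is only one example of setting it through experimentation.

```python
import math

def interpolate(p_general, p_domain, lam):
    """Weighted sum lam * P1 + (1 - lam) * P2 of two model probabilities."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("interpolation coefficient must lie in [0, 1]")
    return lam * p_general + (1.0 - lam) * p_domain

def best_coefficient(heldout, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Pick the grid value that maximizes held-out log-likelihood.

    heldout: list of (P1, P2) pairs, one per observed held-out word.
    """
    def loglik(lam):
        return sum(math.log(interpolate(p1, p2, lam)) for p1, p2 in heldout)
    return max(grid, key=loglik)

# Hypothetical probabilities assigned by the two models to held-out words
# from a specific field; the field-specific model fits them better.
heldout = [(0.001, 0.02), (0.002, 0.03)]

p_mixed = interpolate(0.001, 0.02, 0.7)
chosen = best_coefficient(heldout)
```

Because the field-specific model assigns higher probabilities to every held-out word in this example, the smallest candidate weight for the general purpose model is selected.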