1. Field of the Invention
The present invention relates to a pattern recognition system capable of improving recognition accuracy by combining posterior probabilities obtained from similarity values (or differences between reference patterns and input patterns) of input acoustic units or input characters in pattern recognition such as speech recognition or character string recognition and, more particularly, to a pattern recognition system in which an a priori probability based on contents of a lexicon is reflected in a posterior probability.
2. Description of the Related Art
Known conventional pattern recognition systems recognize continuously input utterances or characters in units of word or character sequences. As one example of such systems, a connected digit speech recognition algorithm based on the multiple similarity (MS) method is described below.
Continuously uttered input utterances are divided into frames of a predetermined duration. For example, an input utterance interval [1, m] having 1st to m-th frames as shown in FIG. 1 will be described. In preprocessing of speech recognition, a spectral change is extracted each time one frame of an utterance is input, and word boundary candidates are obtained in accordance with the magnitude of the spectral change. That is, a large spectral change is regarded as an indication of a word boundary. Here, the term "word" means a unit of an utterance to be recognized. The speech recognition system referred to here is composed of a hierarchy of lower to higher recognition levels, e.g., a phoneme level, a syllable level, a word level, and a sentence level. The "words" as units of utterances to be recognized correspond to a phoneme, a syllable, a word, and a sentence at the corresponding levels. Word recognition processing is executed whenever a word boundary candidate is obtained.
In the word sequence recognition processing, the interval [1, m] is divided into two partial intervals, i.e., intervals [1, ki] and [ki, m], where ki indicates the frame number of the i-th word boundary candidate. The interval [1, ki] is an utterance interval corresponding to a word sequence wi, and the interval [ki, m] is a word utterance interval corresponding to a single word wi. A word sequence Wi is represented by:
$$W_i = w_i + w_i \tag{1}$$
where the first term on the right-hand side is the word sequence for the interval [1, ki] and the second term is the single word for the interval [ki, m]. Wi is a recognition word sequence candidate of the utterance interval [1, m] divided at the i-th word boundary candidate. The recognition word sequence candidates Wi are obtained for all the word boundary candidates ki (i=1, 2, . . . , l). Of the candidates thus obtained, the word sequence W having the maximum similarity value (a value representing the similarity of the pattern to a reference pattern) is adopted as the recognition word sequence of the utterance interval [1, m]. Note that l represents the number of recognition word sequence candidates corresponding to partial intervals to be stored upon word sequence recognition and is a parameter set in the system. By applying this algorithm while sequentially increasing m, recognition word sequences corresponding to all the utterance intervals can be obtained.
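The procedure described above can be illustrated with a minimal Python sketch. It is not the implementation used by the system described; the function name `recognize_sequence`, the `score_word` callback, and the rule that a sequence score is the plain sum of word similarities are all assumptions made for illustration, and the limit l on stored candidates is omitted.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical scorer: returns (best word label, similarity) for the
# frame interval [start, end]. The real word recognizer is not shown.
WordScorer = Callable[[int, int], Tuple[str, float]]


def recognize_sequence(
    boundaries: List[int], score_word: WordScorer
) -> Tuple[List[str], float]:
    """Return the best word sequence ending at the last boundary candidate.

    boundaries: sorted frame numbers of word boundary candidates. The first
    entry is the start of the utterance; each later entry plays the role
    of m in turn.
    Assumption: a sequence score is the sum of its word similarities.
    """
    # best[k] holds the best word sequence (and its score) for [start, k].
    best: Dict[int, Tuple[List[str], float]] = {boundaries[0]: ([], 0.0)}
    for j in range(1, len(boundaries)):
        m = boundaries[j]
        candidates = []
        # Split [start, m] at every earlier boundary candidate k:
        # best sequence for [start, k] plus a single word for [k, m].
        for k in boundaries[:j]:
            prefix_words, prefix_score = best[k]
            word, similarity = score_word(k, m)
            candidates.append((prefix_words + [word], prefix_score + similarity))
        # Adopt the candidate with the maximum combined similarity.
        best[m] = max(candidates, key=lambda c: c[1])
    return best[boundaries[-1]]
```

Calling `recognize_sequence([1, 10, 20], scorer)` with a toy `scorer` evaluates both the one-word reading of frames [1, 20] and the two-word reading split at frame 10, and returns whichever has the larger combined score.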
In the above continuous speech recognition method, the number of input words is unknown. Therefore, in order to correctly recognize an input utterance pattern L as a word sequence W, whether each detected interval correctly corresponds to an uttered word must be considered. Even if this is considered, it is difficult to obtain a high recognition rate in the word sequence recognition as long as the similarity values are merely combined. This is because the similarity is not a probabilistic measure.
Therefore, some conventional systems transform an obtained similarity value into a posterior probability and use this posterior probability as the similarity measure, in order to achieve higher accuracy than is possible with the similarity value itself.
Assume that speech recognition is to be performed for an input word sequence W = w1 w2 . . . wn including n words belonging to word set C = {c1, c2, . . . , cN} so as to satisfy the following two conditions:
(1) A word boundary is correctly recognized.
(2) The word category of each utterance interval is correctly recognized.
In this case, as shown in FIG. 2, assume that each word wi corresponds to a pattern li in each partial utterance interval so that the following relation is satisfied:
$$L = l_1 l_2 \cdots l_n$$
In this case, if the word sequence W has no grammatical structure, wi and wj (i ≠ j) can be considered independent events. Hence the probability that every utterance interval is correctly recognized as its corresponding word is represented by the following equation:
$$P(W \mid L) = \prod_{i=1}^{n} P(w_i \mid l_i) \tag{2}$$
In this equation, P(W|L) is called the likelihood. When P(W|L) is calculated, logarithms of both sides of equation (2) are often taken to avoid repeated multiplication, giving the logarithmic likelihood:
$$\log P(W \mid L) = \sum_{i=1}^{n} \log P(w_i \mid l_i) \tag{3}$$
In this equation, P(wi|li) is the conditional probability that an interval li corresponds to wi and is the posterior probability to be obtained.
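As a small illustration of why the logarithmic form is convenient, the following Python sketch sums the logarithms of per-word posterior probabilities instead of multiplying the probabilities. The function name `log_likelihood` and the sample values are hypothetical and are not taken from the system described.

```python
import math
from typing import Sequence


def log_likelihood(posteriors: Sequence[float]) -> float:
    """Sum of log P(wi|li) over all words.

    Under the independence assumption this equals the logarithm of the
    product of the per-word posterior probabilities.
    """
    return sum(math.log(p) for p in posteriors)


# Hypothetical posterior probabilities for a three-word sequence.
sample = [0.9, 0.8, 0.95]
product = math.prod(sample)           # direct multiplication
log_sum = log_likelihood(sample)      # addition of logarithms
```

Both forms rank candidate word sequences identically because the logarithm is monotonic. In floating-point arithmetic the sum also stays finite for long sequences of small probabilities, where the direct product can underflow to zero.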
Therefore, by transforming an obtained similarity value into a posterior probability by a table, a high recognition rate can be obtained.
Since it is practically difficult to obtain the posterior probability P(wi|li), however, a similarity value is normally used instead of a probability value, with a suitable bias applied so that it approximates a probability value. For example, Ukita et al. performed approximation by an exponential function as shown in FIG. 3 ("A Speaker Independent Recognition Algorithm for Connected Word Boundary Hypothesizer," Proc. ICASSP, Tokyo, April, 1986):
$$P(w_i \mid l_i) \approx A \cdot B^{S} \tag{4}$$
The logarithm of equation (4) is calculated and the relation A·B^Smax = 1.0 is utilized to obtain the following equation:
$$\log P(w_i \mid l_i) \approx (S - S_{\max}) \log B \tag{5}$$
By subtracting a fixed bias Smax from the similarity S, a similarity value is thus transformed into a probability value. When this measure is used in connected digit speech recognition, the bias Smax is set to 0.96.
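The fixed-bias transformation can be sketched in Python as follows, assuming the exponential form P ≈ A·B^S together with the relation A·B^Smax = 1.0 quoted above. The function names are hypothetical, and the base B is a free parameter here because the text gives no value for it.

```python
import math

S_MAX = 0.96  # bias quoted in the text for connected digit recognition


def log_posterior(similarity: float, base: float, s_max: float = S_MAX) -> float:
    """Approximate log posterior as (S - Smax) * log B.

    Follows from log(A * B**S) with A * B**Smax = 1, i.e. log A = -Smax * log B.
    `base` (B) is an assumed parameter; the text does not state its value.
    """
    return (similarity - s_max) * math.log(base)


def posterior(similarity: float, base: float, s_max: float = S_MAX) -> float:
    """Approximate posterior A * B**S, with A fixed by A * B**Smax = 1."""
    a = base ** (-s_max)
    # No clamping is applied: similarities above s_max would give values
    # greater than 1, a case the text does not address.
    return a * base ** similarity
```

With any base greater than 1, a similarity equal to Smax maps to a log posterior of 0 (probability 1), and lower similarities map to increasingly negative values. The curve has the same shape regardless of the lexicon, since Smax and B are fixed.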
A posterior probability curve, however, is generally not fixed but varies with the size of the lexicon and with its contents (e.g., whether it contains many similar words). Therefore, the conventional method of transforming a similarity value into a posterior probability on the basis of only one fixed curve, as described above, cannot perform recognition with high accuracy across many applications.
As described above, in the conventional pattern recognition system that evaluates similarity by transforming it into a posterior probability, the transformation curve for obtaining the posterior probability is approximated by a fixed curve because it is difficult to obtain a curve corresponding to the contents of a lexicon or the number of words. Therefore, recognition cannot be performed with high accuracy.