The invention relates to a method of determining an acoustic model for a word as a sequence of reference values which are required for the later recognition of words from a speech signal.
For the recognition of words in a speech signal, test signals are temporally successively derived therefrom and compared with reference values; sequences of reference values then represent different words and the reference values which are most similar to the test signals indicate the word recognized with the highest probability. During a preceding training phase, the reference values are derived from the speech signal of a known text wherefrom test signals are derived in the same manner; these test signals are first combined so as to form characteristic values which represent acoustic states in the known words. Each phoneme in the word has a sequence of acoustic states and the characteristic values observed for the individual acoustic states are combined for the same phonemes, that is in dependence on the preceding and the subsequent phoneme which constitute a triphone in conjunction with the central phoneme. In conformity with the representation of the sequence of the acoustic states from left to right on the time axis, the phonemes of a triphone are also referred to as left-hand phoneme, central phoneme and right-hand phoneme. When the characteristic values for the individual acoustic states in the triphones of the known words have been determined, the characteristic values of the same acoustic states of different triphones which satisfy predetermined criteria are combined so as to form step-wise larger groups. Such a criterion is notably the distance between characteristic values in the characteristic space; more specifically, groups of characteristic values in which the distance between the characteristic values which are situated furthest apart is less than a predetermined distance are combined so as to form a larger group. The number of observations in each characteristic value represents a further criterion. If this number is too small, such a characteristic value is combined with the nearest group so as to form a larger group.
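The distance criterion described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each characteristic value is a point (a tuple of floats) and that distance is Euclidean. The names `diameter`, `merge_groups` and `max_diameter` are hypothetical, and the sketch leaves out the second criterion, the minimum number of observations.

```python
from itertools import combinations
import math


def diameter(points):
    """Largest pairwise Euclidean distance among the points (0.0 for one point)."""
    return max((math.dist(a, b) for a, b in combinations(points, 2)), default=0.0)


def merge_groups(groups, max_diameter):
    """Step-wise merging of groups of characteristic values.

    Assumption: two groups may be merged only if the two values situated
    furthest apart in the merged group are closer than max_diameter.
    """
    groups = [list(g) for g in groups]
    while len(groups) > 1:
        # Pick the pair of groups whose union has the smallest diameter.
        i, j = min(combinations(range(len(groups)), 2),
                   key=lambda ij: diameter(groups[ij[0]] + groups[ij[1]]))
        if diameter(groups[i] + groups[j]) >= max_diameter:
            break  # no remaining pair satisfies the distance criterion
        groups[i] += groups[j]
        del groups[j]  # j > i, so index i stays valid
    return groups
```

With three single-value groups at 0.0, 0.5 and 5.0 on one axis and a threshold of 1.0, the first two are merged and the third stays separate.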
When all groups have been formed in this manner, the word models for the recognition must be generated. Each word model consists of a sequence of reference values which describe the word. It is to be noted that in known speech recognition methods, notably for a large vocabulary, a reference value does not represent a single value but a distribution function. The consideration of triphones means that for the central phoneme in each triphone the left-hand and the right-hand interrelationship are taken into account, because these interrelationships have a pronounced effect on the pronunciation of the phoneme.
During the training phase a text of limited length, in which not all triphones occurring in a language are contained, is spoken. This is inter alia because of the fact that not all words of the vocabulary are spoken in the text of limited length. However, such words must nevertheless be modeled in order to ensure that in the subsequent recognition phase they can also be recognized when they occur in the speech signal. It may be assumed that all phonemes have occurred during the training phase. However, if a phoneme occurs as the central part of a triphone in a word which has not been spoken during the training phase, and this triphone has not been spoken during the training phase, this triphone or this phoneme cannot be simply modeled in the framework of this triphone. One possibility of eliminating this difficulty consists in replacing a phoneme in a non-trained triphone by a reference value which is derived from the possibly weighted mean value of all triphones containing this phoneme which have occurred in the training phase. However, this yields poor modeling and gives rise to an increased recognition error rate.
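The known substitute mentioned above, a weighted mean over all trained triphones containing the phoneme, might look as follows. This is only a sketch under assumptions: each trained triphone is represented by a mean vector and an observation count, the count serves as the weight, and the name `context_independent_mean` is hypothetical.

```python
def context_independent_mean(observed, phoneme):
    """Count-weighted mean over all trained triphones with this central phoneme.

    observed maps (left, center, right) -> (mean_vector, observation_count).
    """
    total = 0
    acc = None
    for (left, center, right), (vec, count) in observed.items():
        if center != phoneme:
            continue
        if acc is None:
            acc = [0.0] * len(vec)
        for k, x in enumerate(vec):
            acc[k] += count * x  # weight each triphone by its count
        total += count
    if total == 0:
        raise KeyError(f"phoneme {phoneme!r} was not observed in training")
    return [a / total for a in acc]
```

Because every context of the phoneme is averaged together, the result ignores the left-hand and right-hand phonemes entirely, which is consistent with the poor modeling noted above.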
Therefore, it is an object of the invention to provide a method which also offers suitably exact modeling of words which contain at least one triphone which has not occurred during the training phase.
This object is achieved essentially in that for such a non-observed triphone there is selected a group which is associated with the same phoneme in interrelationship with the same left-hand or the same right-hand phoneme. Thus, the similarity to a triphone with a given left-hand interrelationship and a given right-hand interrelationship is determined on the basis of each interrelationship taken on its own, since in most cases at least one of the two interrelationships does occur even when a text of limited length is used during the training phase. This association with groups is performed separately for each acoustic state in the triphone to be modeled or, more accurately speaking, for each acoustic state of the central phoneme within the relevant triphone, there only being selected groups which belong to the same acoustic state in the triphones observed. The interrelationship of a phoneme within a triphone can thus be determined substantially accurately, so that a quite appropriate sequence of reference values can also be formed for words which have not occurred during the training phase.
The same phoneme with the same left-hand or the same right-hand phoneme and a different phoneme at the respective other side may very well occur in different groups. Because in that case a group cannot be unambiguously selected directly, the group is selected which contains the largest number of left-hand or right-hand interrelationships corresponding to the triphone to be modeled. As a result, the probability that correct modeling will be found becomes very high.
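The selection rule of the two preceding paragraphs can be sketched as a simple count. The sketch assumes that the groups for one acoustic state are given as lists of observed triphones `(left, center, right)` and that each distinct triphone counts once; the text does not say whether observation counts should be used instead. The name `select_group` is hypothetical.

```python
def select_group(groups, target):
    """Pick the group with the most context matches for an unseen triphone.

    groups maps group_id -> list of observed (left, center, right) triphones,
    all belonging to the same acoustic state as the state being modeled.
    A triphone matches if it has the same central phoneme and the same
    left-hand or the same right-hand phoneme as target.
    """
    l_t, c_t, r_t = target

    def score(members):
        return sum(1 for (l, c, r) in members
                   if c == c_t and (l == l_t or r == r_t))

    best_id = max(groups, key=lambda g: score(groups[g]))
    # Return None when no group shares any context with the target.
    return best_id if score(groups[best_id]) > 0 else None
```

A return value of `None` corresponds to the case where no group with a matching interrelationship exists, so that the search rules have to be relaxed.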
As has already been stated, for each state of the non-observed triphone to be modeled there is only selected a group which is associated with the same state. For the first states in the triphone to be modeled, being situated to the left of the central phoneme in the representation on the time axis, the effect of the left-hand phoneme will be stronger than that of the right-hand phoneme in the triphone, whereas for the last states at the right-hand side the situation will be reversed. Therefore, in a further embodiment of the invention, the number of interrelationships with the same central phoneme and the same left-hand phoneme is advantageously increased by a fixed value for the first states in the triphone to be modeled, in order to accentuate their effect. The same holds for the last states, for which the number of interrelationships with the same central phoneme and the same right-hand phoneme is increased by a fixed value.
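A sketch of this state-dependent accentuation is given below. Several points are assumptions that the text leaves open: only state index 0 is treated as a "first" state and only the highest index as a "last" state, the fixed value (`bonus`) is added only when at least one matching interrelationship is present in the group, and the name `context_score` is hypothetical.

```python
def context_score(members, target, state, n_states, bonus=1):
    """Count context matches in one group, accentuated according to the state.

    members is a list of observed (left, center, right) triphones.
    Assumption: the bonus applies only to a non-zero count, since adding it
    to every group alike would not change which group scores highest.
    """
    l_t, c_t, r_t = target
    left = sum(1 for (l, c, r) in members if c == c_t and l == l_t)
    right = sum(1 for (l, c, r) in members if c == c_t and r == r_t)
    if state == 0 and left > 0:
        left += bonus   # first state: left-hand phoneme dominates
    if state == n_states - 1 and right > 0:
        right += bonus  # last state: right-hand phoneme dominates
    return left + right
```

For example, a group with one left-hand match can overtake a group with two right-hand matches in the first state when `bonus=2`, while the ranking is unchanged in a middle state.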
However, it may also occur that for given triphones there is no group which is associated with the same central phoneme in conjunction with the same left-hand or right-hand phoneme as the triphone to be modeled. In this case the rules for searching for a group must be changed. One possibility consists in searching, for a given state of the triphone to be modeled, for a group which is associated not with the same state but with a different state. In other words, the state can be completely ignored while searching for an appropriate group. A further possibility, which may be used in addition, consists in examining all groups which contain the left-hand or the right-hand phoneme in interrelationship with an arbitrary central phoneme, the numbers of such triphones being weighted. This modeling is not as good as the previously described modeling, but still yields usable reference values for words with non-trained triphones, resulting in a low recognition error rate.
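The relaxed search rules can be sketched as a chain of fallbacks. The order of the fallbacks, the `weight` callable (the text does not specify how the numbers are weighted) and the names `best_match` and `find_group` are assumptions made for illustration only.

```python
def best_match(candidates, target, same_center=True, weight=lambda tri: 1.0):
    """Return the key of the candidate group with the highest weighted number
    of context matches, or None if no group matches at all."""
    l_t, c_t, r_t = target

    def score(members):
        return sum(weight(t) for t in members
                   if (t[1] == c_t or not same_center)
                   and (t[0] == l_t or t[2] == r_t))

    scored = {key: score(members) for key, members in candidates.items()}
    key = max(scored, key=scored.get, default=None)
    return key if key is not None and scored[key] > 0 else None


def find_group(groups_by_state, target, state, weight=lambda tri: 1.0):
    """groups_by_state maps state -> {group_id: [(left, center, right), ...]}."""
    # 1. Regular rule: same state, same central phoneme.
    key = best_match(groups_by_state.get(state, {}), target)
    if key is not None:
        return state, key, "same_state"
    # 2. Ignore the state: search the groups of all states.
    flat = {(s, g): m for s, gs in groups_by_state.items() for g, m in gs.items()}
    key = best_match(flat, target)
    if key is not None:
        return key[0], key[1], "any_state"
    # 3. Allow an arbitrary central phoneme, with weighted counts.
    key = best_match(flat, target, same_center=False, weight=weight)
    if key is not None:
        return key[0], key[1], "any_center"
    return None
```

The third element of the result records which rule produced the match, so that a caller can tell the less exact fallbacks apart from the regular selection.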
The problem addressed by the invention and its solution can be expressed in mathematical terms as follows:

$$\hat{K} = \arg\max_{K} \sum_{tr \in K} P(tr \mid tr') = \arg\max_{K} \sum_{tr \in K} P(tr, tr')$$
This means that the group $\hat{K}$ is to be found which, from among all groups K, has the highest probability of containing the triphone $tr'$ to be modeled, given the triphones tr in this group. This probability can be rewritten approximately as follows:
$$P(tr, tr') \approx P_{c,s}(r, r') \cdot P_{c,s}(l, l')$$
This leads to the described solution of searching for the group which exhibits the same left-hand phoneme or the same right-hand phoneme, l and r, respectively, for the same central phoneme c and the same state.
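The link between the factorized probability and the counting rule can be made concrete with a toy sketch. It assumes, for illustration only, that each pairwise term takes one of two values: a high value `p_same` when the two phonemes are identical and a low value `p_diff` otherwise. These values and the names `pair_prob`, `group_score` and `argmax_group` are hypothetical and are not taken from the text.

```python
def pair_prob(x, x_prime, p_same=0.9, p_diff=0.1):
    """Toy stand-in for the pairwise terms P(l, l') and P(r, r')."""
    return p_same if x == x_prime else p_diff


def group_score(members, target):
    """Sum over tr in K of P(r, r') * P(l, l'), restricted to triphones with
    the same central phoneme c; members all belong to the same state s."""
    l_t, c_t, r_t = target
    return sum(pair_prob(r, r_t) * pair_prob(l, l_t)
               for (l, c, r) in members if c == c_t)


def argmax_group(groups, target):
    """Return the id of the group K that maximizes the summed score."""
    return max(groups, key=lambda k: group_score(groups[k], target))
```

Under this assumption a triphone sharing one context contributes far more to the sum than a triphone sharing none, so the maximization is dominated by the number of matching left-hand or right-hand phonemes in each group.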