The invention relates to a method of automatically specifying a regression class tree structure for automatic speech recognizers, with tree leaves representing word subclusters and with tree nodes combining the word subclusters in dependence on a measure of the distances of the word subclusters in the acoustic space.
The regression class tree structure can be used for speech adaptation in automatic speech recognition systems, for example, dictation systems. A further possibility of use exists in the formation of the acoustic models in speech recognition systems.
Speaker adaptation in a priori speaker-independent speech recognizers serves to adapt the recognizer to a new speaker who did not belong to the speakers used for the (speaker-independent) training of the speech recognizer. Speaker adaptation may reduce the error rate of the speech recognizer, which is often unsatisfactory due to the limited amount of training speech data. The more adaptation speech data are available, the better the speech recognizer can be adapted to the respective speaker and the further its error rate diminishes. But even when only a small amount of adaptation data is available, the speech recognizer is noticeably adapted to the respective speaker, i.e. it shows a recognizably reduced error rate.
From M. J. F. Gales, "The generation and use of regression class trees for MLLR adaptation", August 1996, Cambridge University (England), ftp address: svr-ftp.eng.cam.ac.uk (hereinafter referenced as [1]), it is known to use such regression class tree structures for speaker adaptation of speech recognizers which are a priori speaker-independent. The acoustic models of speech recognizers based on Hidden-Markov-Models (HMM) are then adapted by means of a linear transformation with which the HMM probability distributions are adapted. The transformation matrix used therefor is computed from the adaptation data by means of a Maximum Likelihood estimate, i.e. by means of probability maximization. For the described adaptation technique it is decisive to suitably combine the word subclusters of the basic speech corpus (referenced as components in [1]) and the associated Hidden-Markov-Models into clusters, each of which is assigned to exactly one transformation matrix. By means of the tree structure, regression classes are determined that represent clusters of word subclusters. The tree leaves represent word subclusters, which are to be considered basic regression classes. The tree nodes (which represent clusters of word subclusters) combine the more word subclusters or regression classes, the closer they lie to the tree root. The regression classes used for the adaptation to a speaker are determined by the amount of available adaptation data: the more adaptation data are available, the closer the regression classes used for the speaker adaptation lie to the tree leaves and the more remote they are from the tree root.
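The selection of regression classes in dependence on the amount of adaptation data can be sketched as follows. This is a hypothetical illustration, not the implementation of [1]: the `Node` structure, the per-node frame counts and the `min_frames` threshold are assumptions introduced here for clarity.

```python
# Hypothetical sketch of regression-class selection for adaptation:
# walk the tree from the root and descend only where every child cluster
# has observed enough adaptation frames to estimate its own transformation
# matrix. Node layout and threshold are illustrative assumptions.

class Node:
    def __init__(self, children=None, frames=0):
        self.children = children or []   # empty list => tree leaf
        self.frames = frames             # adaptation frames observed in this cluster

def select_regression_classes(node, min_frames):
    """Return the nodes whose clusters each receive one transformation matrix."""
    # Descend only if every child can support its own transform;
    # otherwise keep the coarser class at this node.
    if node.children and all(c.frames >= min_frames for c in node.children):
        classes = []
        for child in node.children:
            classes.extend(select_regression_classes(child, min_frames))
        return classes
    return [node]
```

With more adaptation data the per-node frame counts rise, so the recursion descends further and the selected classes move from the tree root toward the tree leaves, as described above.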
For the construction of the regression class tree structure, two approaches are described in [1]. The first approach implies the use of expert knowledge about the phonetic structure of the language used. Such knowledge is, however, not readily available for all languages or language corpora. It is suggested, for example, to combine nasal sounds in one regression class. At a level lying further below, i.e. further away from the tree root, a subdivision into phones could then be made, for example. In the second approach, the combination of word subclusters into regression classes is made dependent on their nearness to each other in the acoustic space, irrespective of the phones they belong to. With this data-controlled approach, which constructs the regression class tree structure automatically, no expert knowledge is necessary. However, the clusters found can then no longer be assigned to phonetic classes (for example, nasals), i.e. an intuitive interpretation of the classes is no longer possible. Both approaches are described in [1] as not necessarily leading to optimum results. In each case, the aim of the construction is to maximize the likelihood of the adaptation data. A globally optimum tree structure can normally not be determined; however, a local optimization with respect to the determination of the individual tree nodes can be achieved.
It is an object of the invention to provide a data-driven approach which is linked to an automatic construction of the regression class tree structure and leads to a satisfactory error rate of the speech recognizer.
The object is achieved in that the combination of regression classes into a regression class that lies closer to the tree root is made on the basis of a correlation parameter.
This approach led to speech recognizer error rates which were very close to the error rates obtained with a regression class tree structure whose construction was not effected automatically but was based exclusively on expert knowledge.
A preferred embodiment of the method according to the invention comprises that, when the tree structure is initially determined, each word subcluster forms a basic regression class; subsequently, in each step the pair of regression classes having the largest correlation parameter is combined into a new regression class, which is taken into account in the following steps of the tree-structure formation instead of the two combined regression classes, until a regression class representing the tree root is formed. The tree structure is thus determined recursively, starting from the basic regression classes/word subclusters.
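The recursive construction described above can be sketched as a bottom-up clustering loop. This is a minimal illustration under assumptions introduced here: classes are represented as nested tuples, and `correlation` stands for any function returning the correlation parameter between two regression classes.

```python
# Minimal sketch of the bottom-up tree construction: repeatedly merge the
# pair of remaining regression classes with the largest correlation
# parameter until only the tree root remains.

def build_regression_tree(leaves, correlation):
    """Leaves are basic regression classes (e.g. phoneme labels);
    each merge is represented as a nested two-element tuple."""
    classes = list(leaves)
    while len(classes) > 1:
        # find the pair of remaining classes with the largest correlation
        i, j = max(
            ((a, b) for a in range(len(classes)) for b in range(a + 1, len(classes))),
            key=lambda ab: correlation(classes[ab[0]], classes[ab[1]]),
        )
        merged = (classes[i], classes[j])
        # the new class replaces the two combined ones in further steps
        classes = [c for k, c in enumerate(classes) if k not in (i, j)]
        classes.append(merged)
    return classes[0]
```

Each inner tuple of the result corresponds to one tree node; the outermost tuple is the tree root.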
More specifically, there is provided that, for determining the correlation parameter between two word subclusters, a correlation coefficient is formed in accordance with

\[
\rho_{ij} = \frac{R_{ij}}{\sqrt{R_{ii}\,R_{jj}}}
\quad\text{with}\quad
R_{ij} = \frac{1}{M}\sum_{m=1}^{M}
\left(\mu_i^{(m)} - \frac{1}{M}\sum_{m=1}^{M}\mu_i^{(m)}\right)^{T}
\left(\mu_j^{(m)} - \frac{1}{M}\sum_{m=1}^{M}\mu_j^{(m)}\right)
\]
with
i and j as indices for the two word subclusters which are still considered for a combination to a new regression class;
M as the number of speakers during the training of the speech recognizer;
μ_i^(m) as the mean value vector for the i-th word subcluster and μ_j^(m) as the mean value vector for the j-th word subcluster for the m-th speaker, the components of the mean value vectors describing the mean values of output distributions of the Hidden-Markov-Models used for describing the word subclusters. When two word subclusters described by Hidden-Markov-Models are combined into a new regression class, an associated mean value vector is formed for this new regression class by a linear combination of the mean value vectors assigned to the two word subclusters; this vector is used for the calculation of further correlation coefficients relating to the new regression class and one or more other regression classes.
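The correlation coefficient defined above can be sketched as follows. This is an illustrative implementation under the assumption that the per-speaker mean value vectors of each word subcluster are stacked into an array of shape (M, D), with M speakers and D vector components.

```python
import numpy as np

# Sketch of the correlation coefficient rho_ij for two word subclusters.
# mu_i, mu_j: arrays of shape (M, D) holding the mean value vectors of
# word subclusters i and j for each of the M training speakers.

def cross_covariance(mu_a, mu_b):
    """R_ab = (1/M) * sum_m (mu_a^(m) - mean_a)^T (mu_b^(m) - mean_b)."""
    da = mu_a - mu_a.mean(axis=0)        # deviations from the speaker mean
    db = mu_b - mu_b.mean(axis=0)
    # dot product per speaker, averaged over the M speakers
    return np.sum(da * db) / mu_a.shape[0]

def correlation_coefficient(mu_i, mu_j):
    """rho_ij = R_ij / sqrt(R_ii * R_jj)."""
    r_ij = cross_covariance(mu_i, mu_j)
    r_ii = cross_covariance(mu_i, mu_i)
    r_jj = cross_covariance(mu_j, mu_j)
    return r_ij / np.sqrt(r_ii * r_jj)
```

Two subclusters whose mean value vectors vary in the same way across speakers yield a coefficient near 1 and are therefore combined early; after a merge, the stacked mean vectors of the new class can be formed as a linear combination (for example, a weighted average, an assumption for illustration) of the two inputs.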
Preferably, phonemes are provided as word subclusters. As basic regression classes, these phonemes lead to tree structures which are particularly suitable for the speaker adaptation of speech recognizers. A further refinement of the tree structure is usually not necessary because of the generally limited amount of adaptation data.
A first preferred application of the regression class tree structure constructed with the method according to the invention provides that the regression class tree structure is used for speaker adaptation of a priori speaker-independent automatic speech recognizers, the regression classes, each of which combines Hidden-Markov-Models of word subclusters to be adapted on the basis of the same adaptation data, being selected in dependence on the amount of available speaker adaptation data.
A second use of the regression class tree structure constructed by the method according to the invention provides that, in dependence on the tree structure, context-dependent word subclusters are assigned to acoustic models, the context categories on which the assignment is based being determined by means of the tree structure. Context-dependent word subclusters may be understood to be, for example, triphones. A context category then combines context phonemes which are assumed to have the same or substantially the same influence on the pronunciation of a certain core phoneme. Such context categories are, for example, the groups of vowels, plosives, fricatives, . . . In K. Beulen, H. Ney, "Automatic question generation for decision tree based state tying", ICASSP 1998 proceedings, pp. 805-808 (hereinafter referenced as [2]), such context categories are assigned to phonetic questions by means of which triphone HMM states are assigned to the acoustic models incorporated in the speech recognizer. These phonetic questions can now easily be determined by means of a regression class tree structure constructed with the method according to the invention.
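Deriving context categories from the tree can be sketched as follows. This is an illustrative assumption, not the procedure of [2]: the tree is taken to be given as nested tuples with phoneme labels at the leaves, and every inner node is read off as one candidate phonetic question.

```python
# Hypothetical sketch: each inner node of the regression class tree covers
# a set of phonemes that behaves similarly in the acoustic space; that set
# can serve as one context category, i.e. one phonetic question for
# decision-tree based state tying.

def phonetic_questions(tree):
    """Collect, for every inner node, the set of phonemes below it."""
    questions = []

    def leaves(node):
        if isinstance(node, tuple):       # inner node: union of child sets
            covered = set()
            for child in node:
                covered |= leaves(child)
            questions.append(frozenset(covered))
            return covered
        return {node}                     # leaf: a single phoneme

    leaves(tree)
    return questions
```

For a tree such as `(('n', 'm'), ('a', ('e', 'i')))`, the nasals `{'n', 'm'}` and the vowel groups `{'e', 'i'}` and `{'a', 'e', 'i'}` emerge as categories without any manually supplied expert knowledge.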
The invention also relates to a speech recognition system whose speech recognition procedures use a regression tree structure constructed with the method according to the invention, more particularly in the framework of either of the two applications indicated.