1. Field of the Invention
The present invention relates to a method and an apparatus for both speaker clustering and speaker adaptation based on HMM model variation information. In particular, the present invention includes a method and an apparatus that improve the performance of automatic speech recognition by utilizing the average of model variation information over speakers. In addition, the present invention analyzes not only the quantity variation amount of the model variation but also the directional variation amount.
2. Description of the Related Art
A speech recognition system is based on the correlation between speech and its characterization in an acoustic space for the speech. The characterization is typically obtained from training data.
A Speaker-Independent (SI) system is trained using a large amount of data acquired from a plurality of speakers, and its acoustic model parameters are obtained as averages over speaker differences, which limits the modeling accuracy for each individual speaker. On the other hand, a Speaker-Dependent (SD) system is trained with an adequate amount of speaker-specific data and performs better than the SI system. However, the SD system has the drawback that collecting enough data from each single speaker to properly train the acoustic models is time-consuming and unacceptable in many cases. As a compromise, a Speaker Adaptation (SA) system attempts to tune the available recognition system to a specific speaker to improve recognition performance while requiring only a small amount of speaker-specific data.
FIG. 1 illustrates a general speaker adaptation method that utilizes a Maximum Likelihood Linear Regression (MLLR) technique, by which speaker adaptation may be achieved using a small amount of data.
If a speaker says, “It is said that it is going to rain today” (S101), the utterance is converted into a series of feature vectors, and the feature vectors are then aligned with HMM states using Viterbi alignment (S103). Then, a class tree configured using characteristics of models in an acoustic model space is used (S105), and a model transformation matrix is estimated to transform the canonical model into a model suitable for the specific speaker (S107).
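The estimation step S107 can be sketched for illustration as follows. This is a minimal sketch and not the procedure of FIG. 1 itself: it assumes scalar features, a single regression class, equal variances for all states, and a hard state alignment already produced by S103. Under those assumptions the maximum-likelihood affine transform of the means reduces to an ordinary least-squares fit. The names `estimate_mean_transform` and `adapt_means`, and all numeric values, are hypothetical.

```python
def estimate_mean_transform(aligned_frames, means):
    """Fit mu' = a * mu + b from frames already aligned to HMM states.

    aligned_frames: list of (state, observation) pairs (assumed output of S103).
    means: dict mapping each state to its canonical (speaker-independent) mean.
    Assumes scalar features and equal state variances, so the ML estimate
    is the least-squares regression of observations on canonical means.
    """
    xs = [means[state] for state, _ in aligned_frames]
    ys = [obs for _, obs in aligned_frames]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        # Only one distinct canonical mean was observed: estimate a bias only.
        return 1.0, mean_y - mean_x
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def adapt_means(means, a, b):
    """Apply the shared transform to every canonical mean in the class."""
    return {state: a * mu + b for state, mu in means.items()}


# Hypothetical canonical means and aligned adaptation frames.
canonical = {"s1": 0.0, "s2": 2.0, "s3": 4.0}
frames = [("s1", 0.5), ("s2", 3.5), ("s3", 6.5)]
a, b = estimate_mean_transform(frames, canonical)
adapted = adapt_means(canonical, a, b)
```

Because one transform is shared by every state in the class, states that never appear in the adaptation utterance are still moved, which is why a small amount of data may suffice.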
Herein, the basic unit of each model is a subword. In the class tree, the base classes C1, C2, C3 and C4 are connected to upper nodes C5 and C6 according to their phonological or aggregative characteristics in the acoustic model space. Accordingly, even when a small number of utterances leaves a node such as C1 with insufficient data to estimate its own transformation matrix, the models of cluster C1 may be transformed using the transformation matrix estimated at the upper node C5, so that speaker adaptation may be achieved with a small amount of data.
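The back-off from a base class to its upper node can be sketched as follows. The sketch assumes that C1 and C2 hang under C5 and that C3 and C4 hang under C6; the text states only that C1 is below C5, so the rest of the grouping, the frame counts, and the threshold `min_frames` are hypothetical values chosen for illustration.

```python
# Assumed tree shape: only the C1 -> C5 link is stated in the text.
PARENT = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6", "C5": None, "C6": None}
CHILDREN = {"C5": ["C1", "C2"], "C6": ["C3", "C4"]}


def select_transform_node(base_class, base_counts, min_frames):
    """Return the lowest node with enough adaptation frames, or None.

    base_counts: number of aligned frames observed for each base class.
    An upper node pools the frames of all base classes beneath it.
    """
    counts = dict(base_counts)
    for upper, kids in CHILDREN.items():
        counts[upper] = sum(base_counts.get(kid, 0) for kid in kids)
    node = base_class
    while node is not None:
        if counts.get(node, 0) >= min_frames:
            return node
        node = PARENT.get(node)
    return None


# Hypothetical frame counts after aligning a short adaptation utterance.
frame_counts = {"C1": 3, "C2": 40, "C3": 60, "C4": 5}
chosen = {c: select_transform_node(c, frame_counts, min_frames=30)
          for c in ("C1", "C2", "C3", "C4")}
```

With these counts, C1 and C4 fall back to their upper nodes while C2 and C3 keep their own transforms.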
A class configuration method using a phonological knowledge base and aggregative characteristics of acoustic model space is suggested in C. J. Leggetter, “Improved Acoustic Modeling for HMMs using Linear Transform” Ph. D thesis, Cambridge University, 1996 “Regression Class Generation based on Phonetic Knowledge and Acoustic Space”. Such a method, however, lacks a mathematical basis and logic to support the hypothesis that phonemes of similar speech methods are located in a similar region in the acoustic model space. Additionally, the cluster to which a model belongs may differ before and after speaker adaptation, but the method ignores this cluster difference. In other words, when clustering is performed using only the dispersion of models in the acoustic model space of a speaker-independent model before speaker adaptation, models belonging to an arbitrary cluster may shift to other clusters after adapting to a speaker. Since an identical parameter is applied to all models of an identical cluster, speaker adaptation is consequently performed on such shifted models with an erroneous transformation matrix.
Meanwhile, the performance of a speaker adaptation system may be enhanced using a speaker clustering method that constitutes acoustic models separately for each group of speakers having a similar model dispersion in the acoustic model space.
U.S. Pat. No. 5,787,394, “State-dependent speaker clustering for speaker adaptation” discloses a speaker adaptation method that uses speaker clustering. According to the method of U.S. Pat. No. 5,787,394, the likelihood of all speaker models is analyzed when the speaker model cluster that is most similar to a test speaker is selected. Thus, when a model similar to the test speaker model is not found in the selected speaker model cluster, a new prediction should be performed using another speaker cluster model. Accordingly, the amount of calculation is significant, and processing is slow. In addition, according to the method of U.S. Pat. No. 5,787,394, when the speaker model cluster that is most similar to a maximum likelihood (hereinafter, referred to as ML) model of a test speaker is selected, only the quantity variation amount between the compared models is analyzed, and the directional variation amount is disregarded. Thus, even if the directional variation amounts differ from each other, the models may be bound in the same cluster if their quantity variation amounts are identical.
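The difference between the two kinds of variation can be illustrated with a small numeric sketch. It assumes that a model variation is represented as the vector from a speaker-independent mean to a speaker's adapted mean, that the quantity variation amount is the length of that vector, and that the directional variation amount can be compared through the cosine of the angle between two such vectors. These representations and the values below are illustrative assumptions and are not taken from the cited patent.

```python
import math


def variation(si_mean, adapted_mean):
    """Variation vector from the speaker-independent mean to the adapted mean."""
    return [a - s for s, a in zip(si_mean, adapted_mean)]


def magnitude(vec):
    """Quantity variation amount, taken here as the Euclidean length."""
    return math.sqrt(sum(x * x for x in vec))


def cosine(u, v):
    """Directional agreement of two variation vectors, in [-1, 1]."""
    return sum(x * y for x, y in zip(u, v)) / (magnitude(u) * magnitude(v))


# Hypothetical 2-D means: two speakers move the same model in opposite directions.
si_mean = [0.0, 0.0]
var_a = variation(si_mean, [3.0, 4.0])
var_b = variation(si_mean, [-3.0, -4.0])

same_quantity = magnitude(var_a) == magnitude(var_b)  # both lengths are 5.0
direction = cosine(var_a, var_b)                      # -1.0, opposite directions
```

A comparison based on the quantity alone would treat the two speakers as identical, whereas the cosine shows that their models move in opposite directions.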