1. Field of the Invention
The present invention relates to a speaker adaptation device which selects one of a plurality of prepared standard patterns on the basis of the speech characteristics of a speaker, as well as to a speech recognition device which recognizes speech through use of the thus-selected speaker-dependent standard pattern.
2. Description of the Related Art
As shown in FIG. 7, in a speaker adaptation device described in, for example, Kosaka et al., "Structured Speaker Clustering for Speaker Adaptation" (Technical Report by Electronic Information Communications Association, SP93-110, 1993), voice feature quantity extraction means 1 subjects a separately input speaker's voice 101 to an acoustic feature quantity analysis, thereby extracting feature vector time-series data Ou=[ou(1), ou(2), . . . , ou(Tu)] (where Tu represents the total number of frames of the speaker's voice). Speaker-dependent standard pattern selection means 6a reads the reference speaker-dependent standard patterns from reference speaker-dependent standard pattern storage means 9, subjects each of them to hidden Markov model (HMM) probability computation through use of the feature vector time-series data extracted by the voice feature quantity extraction means 1, and selects and outputs, as a speaker-dependent standard pattern 104, the pattern which has the maximum probability of matching the speaker's voice 101. Reference speaker-dependent standard pattern learning means 7 generates reference speaker-dependent standard patterns λs(1) to λs(M) for reference speaker numbers 1 to M, through use of a reference speaker speech data feature vector 102 and an initial standard pattern 103, which are prepared separately.
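The selection step performed by the selection means 6a can be illustrated with a minimal sketch. It assumes, for brevity, HMMs with a single diagonal-covariance Gaussian per state rather than the mixture densities of the cited reference; the dictionary layout and the function names (`hmm_log_likelihood`, `select_standard_pattern`, `make_hmm`) are hypothetical and do not come from the source.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log values."""
    top = max(values)
    if top == float("-inf"):
        return top
    return top + math.log(sum(math.exp(v - top) for v in values))

def hmm_log_likelihood(frames, hmm):
    """Forward algorithm in the log domain: log P(frames | hmm)."""
    n = len(hmm["means"])
    alpha = [hmm["log_init"][j] + log_gauss(frames[0], hmm["means"][j], hmm["vars"][j])
             for j in range(n)]
    for x in frames[1:]:
        alpha = [logsumexp([alpha[i] + hmm["log_trans"][i][j] for i in range(n)])
                 + log_gauss(x, hmm["means"][j], hmm["vars"][j])
                 for j in range(n)]
    return logsumexp(alpha)

def select_standard_pattern(frames, patterns):
    """Return the index of the stored pattern with the highest log-likelihood."""
    scores = [hmm_log_likelihood(frames, p) for p in patterns]
    return max(range(len(patterns)), key=scores.__getitem__)

def make_hmm(means):
    """Toy 2-state HMM with uniform transitions and unit variances (illustrative only)."""
    half = math.log(0.5)
    return {"log_init": [half, half],
            "log_trans": [[half, half], [half, half]],
            "means": means,
            "vars": [[1.0], [1.0]]}

# Two toy stored patterns and a short one-dimensional feature sequence.
patterns = [make_hmm([[0.0], [1.0]]), make_hmm([[5.0], [6.0]])]
frames = [[5.2], [5.9], [6.1]]
best = select_standard_pattern(frames, patterns)
```

In this toy setup the frames lie near the means of the second pattern, so that pattern is the one selected. A practical device would evaluate every stored reference pattern in the same way, which is why the number of stored patterns matters.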
For each of the reference speaker-dependent standard patterns λs(1) to λs(M), an adaptive mean vector μal(j,k) is estimated and learned from the speech data of a reference speaker "l," with regard to the k-th HMM mean vector μI(j,k) in state "j" of the initial standard pattern 103, by means of a transfer-vector-field smoothing speaker adaptation method (for further information about the method, see Okura et al., "Speaker Adaptation Based on Transfer Vector Field Smoothing Model with Continuous Mixture Density HMMs", Technical Report by Electronic Information Communications Association, SP92-16, 1992). Reference speaker-group-dependent standard pattern learning means 8 defines the distance among the reference speaker-dependent standard patterns λs(1) to λs(M) produced by the reference speaker-dependent standard pattern learning means 7 by means of a Bhattacharyya distance, clusters the patterns by means of, e.g., the K-means algorithm (for further information about the algorithm, see L. Rabiner et al., "Fundamentals of Speech Recognition," translated by Kei FURUI, NTT Advanced Technology Company Ltd., 1995), and produces reference speaker-group-dependent standard patterns λg(1) to λg(N) for reference speaker group numbers 1 to N through use of the reference speaker-dependent standard patterns thus grouped. Reference speaker-dependent standard pattern storage means 9 stores the reference speaker-dependent standard patterns λs(1) to λs(M) produced by the reference speaker-dependent standard pattern learning means 7 and the reference speaker-group-dependent standard patterns λg(1) to λg(N) produced by the reference speaker-group-dependent standard pattern learning means 8.
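As an illustration of the distance used for clustering, the following sketch computes the Bhattacharyya distance between two Gaussians under the assumption of diagonal covariance matrices. Summing the per-Gaussian distances to obtain a distance between two whole patterns is likewise an assumption made here for illustration; the source does not state how the cited reference combines them, and the function names are hypothetical.

```python
import math

def bhattacharyya_diag(mean1, var1, mean2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    dist = 0.0
    for m1, v1, m2, v2 in zip(mean1, var1, mean2, var2):
        s = 0.5 * (v1 + v2)                              # averaged variance
        dist += 0.125 * (m1 - m2) ** 2 / s               # mean-separation term
        dist += 0.5 * math.log(s / math.sqrt(v1 * v2))   # variance-mismatch term
    return dist

def pattern_distance(pattern_a, pattern_b):
    """Assumed pattern-level distance: sum over Gaussians paired by position.

    Each pattern is a list of (mean, var) tuples in the same Gaussian order.
    """
    return sum(bhattacharyya_diag(ma, va, mb, vb)
               for (ma, va), (mb, vb) in zip(pattern_a, pattern_b))

# Two toy patterns, each with two one-dimensional Gaussians.
pattern_a = [([0.0], [1.0]), ([3.0], [1.0])]
pattern_b = [([2.0], [1.0]), ([3.0], [1.0])]
d_ab = pattern_distance(pattern_a, pattern_b)
```

A matrix of such pairwise distances could then be passed to any clustering routine that accepts precomputed distances in order to form the N speaker groups.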
The conventional speaker adaptation device adopts a speaker adaptation method based on standard pattern selection. Under this method, a plurality of reference speaker-dependent standard patterns are prepared beforehand through use of a hidden Markov model [HMM, or a speaker-independent standard pattern which is described in detail in, e.g., "Fundamentals of Speech Recognition" and is prepared beforehand, through standard pattern learning operations, from speech data (such as words or sentences) of unspecified speakers]. A speaker-dependent standard pattern is then selected on the basis of the characteristics of the speaker's speech.
The reference speaker-group-dependent standard pattern learning means 8 estimates, by means of Equation 1 provided below, the k-th mean vector μgn(j,k) and the covariance matrix Ugn(j,k) in state "j" of the reference speaker-group-dependent standard pattern generated for group "n." Here, μai(j,k) represents the corresponding mean vector of the i-th reference speaker-dependent standard pattern in the group "n," and Uai(j,k) represents its covariance matrix. Further, I represents the number of reference speaker-dependent standard patterns in the group "n," and "t" represents matrix transposition. ##EQU1##
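Equation 1 itself is not reproduced in this text. One form consistent with the quantities named in the preceding paragraph (the group mean vector and covariance matrix, the member mean vectors and covariance matrices, the member count I, and the transpose "t") merges the member Gaussians by matching their first and second moments with equal weights. It is given here as an assumption, not as the equation of the cited reference:

```latex
\mu_{gn}(j,k) = \frac{1}{I}\sum_{i=1}^{I}\mu_{ai}(j,k)

U_{gn}(j,k) = \frac{1}{I}\sum_{i=1}^{I}
  \left[\, U_{ai}(j,k) + \mu_{ai}(j,k)\,\mu_{ai}(j,k)^{t} \,\right]
  - \mu_{gn}(j,k)\,\mu_{gn}(j,k)^{t}
```

Under this form the group covariance grows with the spread of the member means, so a group of dissimilar speakers yields broader Gaussians than any single member.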
The reference speaker-dependent standard pattern storage means 9 uses, per standard pattern, an HMM having 810 Gaussian distributions (the Gaussian distribution number of the initial HMM), each with a mean vector of 34 dimensions. For example, for 484 standard patterns, which is the sum of 279 reference speaker-dependent standard patterns and 205 reference speaker-group-dependent standard patterns, 13,329,360 data values (=484×810×34) must be stored for the mean vectors alone.
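The storage figure can be checked with a few lines of arithmetic. The conversion to megabytes assumes 4 bytes per stored value; that byte width is an assumption for illustration and is not stated in the source, and the function name is hypothetical.

```python
def mean_vector_values(num_patterns, gaussians_per_pattern, dims):
    """Number of scalar values needed to store every mean vector."""
    return num_patterns * gaussians_per_pattern * dims

speaker_patterns = 279   # reference speaker-dependent standard patterns
group_patterns = 205     # reference speaker-group-dependent standard patterns
total_patterns = speaker_patterns + group_patterns

values = mean_vector_values(total_patterns, gaussians_per_pattern=810, dims=34)

BYTES_PER_VALUE = 4      # assumption: 32-bit floating point, not given in the source
megabytes = values * BYTES_PER_VALUE / 1e6

print(total_patterns, values, round(megabytes, 2))
```

Covariance parameters, mixture weights, and transition probabilities would add to this total, so the figure is a lower bound on the data to be stored.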
The speaker's voice 101 corresponds to speech which a speaker using the system produces beforehand by uttering predetermined words or sentences.
The reference speaker speech data feature vector 102 corresponds to a feature vector (e.g., a physical quantity expressing the voice characteristics in a small amount of data, such as a cepstrum or a cepstrum differential) which is extracted by subjecting the voice data of multiple speakers to an acoustic feature quantity analysis. Where the number of reference speakers is M, there are feature vector time-series data O(1) to O(M) [O(1) designates the time-series signal {o(1,1), o(1,2), . . . , o(1,T1)}, where T1 is the number of speech data frames of reference speaker 1].
The initial standard pattern 103 corresponds to an initial standard pattern λI prepared beforehand [e.g., a phoneme HMM with 200 states (5 mixtures per state) and a silence HMM with 1 state (10 mixtures)].
For example, as shown in FIG. 8, in a common speech recognition device which uses the conventional speaker adaptation method based on standard pattern selection, voice feature quantity extraction means 11 processes a separately input speaker's voice 101a to be recognized (i.e., speech which a speaker using the system produces by uttering the words and sentences to be recognized), in the same manner as the voice feature quantity extraction means 1 shown in FIG. 6A. Matching means 12 recognizes speech by comparing the feature vector time-series data produced by the voice feature quantity extraction means 11 with the speaker-dependent standard pattern 104 produced by the speaker adaptation device based on the standard pattern selection method.
The conventional speaker adaptation device based on the standard pattern selection method enables accurate speaker adaptation with a smaller amount of learning data than does a speaker adaptation method based on a mapping method [a method which introduces a structural model of individual differences between speakers and under which a mapping relationship between an initial standard pattern and a speaker's standard pattern is derived from a small amount of learning data, e.g., a specific standard pattern learning method which uses a conversion factor obtained by means of a multiple regression mapping model and which is described in M. J. F. Gales et al., "Mean and Variance Adaptation within the MLLR Framework," Computer Speech and Language 10, pp. 249-264, 1996] or a speaker adaptation method based on a statistical estimation method [e.g., a method which, at the time of estimation of a standard pattern from newly acquired learning data, utilizes knowledge based on a previously obtained initial standard pattern, the method being described in C. H. Lee et al., "A Study on Speaker Adaptation of the Parameters of Continuous Density Hidden Markov Models," IEEE Transactions on Signal Processing, Vol. 39, No. 4, pp. 806-814, 1991]. However, in speaker adaptation in which a speaker-dependent standard pattern is selected from reference speaker-dependent standard patterns on the basis of the speaker's voice, an increase in the number of standard patterns to be stored naturally increases the amount of data representing the reference speaker-dependent standard patterns.