1. Field of the Invention
The present invention relates to a speaker normalization processor apparatus and to a speech recognition apparatus provided with the speaker normalization processor apparatus. In particular, it relates to a speaker normalization processor apparatus for generating a speaker-normalized optimal hidden Markov model (hereinafter, a hidden Markov model will be referred to as an HMM) by speaker-normalizing speech waveform data of a plurality of training speakers using a function that normalizes input frequencies toward average formant frequencies, and by then training an initial HMM based on the speaker-normalized speech waveform data; and it also relates to a speech recognition apparatus for performing speech recognition by using the generated HMM.
2. Description of the Prior Art
Conventionally, as a technique for speaker normalization, a technique using frequency warping with attention focused on vocal tract length (hereinafter referred to as a prior art example) has been proposed, and its effectiveness has been reported (see, for example, Prior Art Document 1, P. Zhan et al., "Speaker Normalization Based on Frequency Warping", Proceedings of ICASSP, pp. 1039-1042, 1997). The likelihood-based speaker normalization technique of this prior art example comprises the steps of performing frequency warping with each of a plurality of frequency warping functions prepared in advance, performing acoustic analysis on each warped result, determining the likelihood at which the resulting acoustic parameters are outputted from an initial acoustic model, and selecting the warping function having the highest likelihood. Hereinbelow, the method of selecting an optimal frequency warping function based on the likelihood, as well as the procedure for speaker normalization training, are explained.
First of all, the method of selecting a frequency warping function will be explained. As shown in FIG. 17, a frequency warping function optimal for each speaker is selected from a plurality of N frequency warping functions $F \in \{f_1, f_2, \ldots, f_N\}$ according to the following procedure:
(A1) Feature extractors 31-1 to 31-N perform a frequency warping process on speech waveform data of one speaker m, using the frequency warping functions $F \in \{f_1, f_2, \ldots, f_N\}$ prepared in advance, and then perform acoustic analysis;
(A2) A likelihood calculator 32 determines a likelihood for each of the acoustic analysis results obtained in (A1) above, by Viterbi search using correct-solution phoneme series with reference to a predetermined phoneme HMM 33;
(A3) A maximum likelihood selector 34 selects, based on the results of (A2) above, the frequency warping function $f_{max}$ that gives the maximum likelihood among the frequency warping functions $f_1, f_2, \ldots, f_N$; and
(A4) A feature extractor 35 performs a frequency warping process on inputted speech waveform data of the speaker m using the frequency warping function $f_{max}$, and then performs acoustic analysis, thereby outputting normalized feature parameters. These feature parameters are used for, for example, speech recognition.
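As an illustration only, steps (A1) to (A4) can be sketched as a search over the candidate warping functions. The names `select_warping_function`, `toy_extract`, and `toy_log_likelihood` are hypothetical and do not come from the prior art example. The toy feature extractor and toy likelihood merely stand in for the acoustic analysis and for the Viterbi scoring against the phoneme HMM 33, neither of which is reproduced here.

```python
from typing import Callable, Sequence, TypeVar

W = TypeVar("W")  # a warping function, or any handle that identifies one


def select_warping_function(
    waveform: Sequence[float],
    warping_functions: Sequence[W],
    extract_features: Callable[[Sequence[float], W], object],
    log_likelihood: Callable[[object], float],
) -> W:
    """Return the candidate whose warped features score the highest likelihood."""
    if not warping_functions:
        raise ValueError("at least one candidate warping function is required")
    # (A1) warp and analyze with every candidate, (A2) score each result.
    scores = [
        log_likelihood(extract_features(waveform, warp))
        for warp in warping_functions
    ]
    # (A3) keep the candidate with the maximum likelihood.
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return warping_functions[best_index]


# --- Toy stand-ins (assumptions made purely for illustration) ---
def toy_extract(waveform, alpha):
    # Pretend "warping plus acoustic analysis" is a simple scaling by alpha.
    return [alpha * x for x in waveform]


def toy_log_likelihood(features):
    # Pretend the acoustic model prefers features close to a fixed target.
    target = [1.1, 2.2, 3.3]
    return -sum((x - t) ** 2 for x, t in zip(features, target))


waveform = [1.0, 2.0, 3.0]
chosen = select_warping_function(
    waveform, [0.9, 1.0, 1.1], toy_extract, toy_log_likelihood
)
# (A4) re-extract features with the selected warping function.
normalized = toy_extract(waveform, chosen)
```

Because the candidates are scored exhaustively, the cost of this sketch grows linearly with the number N of prepared warping functions.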
Next, the procedure for speaker normalization training will be explained. It is assumed here that, for the training, two different speech data sets, speech data for the selection of a frequency warping function and speech data for training, are used.
(B1) Acoustic analysis of speech waveform data for adaptation or training of all the training speakers is performed, by which acoustic feature parameters are obtained. For these acoustic feature parameters, mel-frequency cepstrum coefficients or the like, which are known to those skilled in the art, are used;
(B2) The frequency warping function $f_{max}$ that gives the maximum likelihood on the speech data for the selection of a frequency warping function of each training speaker is selected based on a trained acoustic model $\Lambda_i$;
(B3) Frequency warping using the frequency warping function selected for each speaker is performed, followed by acoustic analysis of the speech data for training, by which the acoustic feature parameters are determined;
(B4) The acoustic model $\Lambda_i$ is trained based on the acoustic analysis results obtained in (B3) above; and
(B5) Then, the process of (B2)-(B4) is repeated a designated number of times.
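The iteration in (B2) to (B5) can be sketched as the loop below. This is a minimal sketch under stated assumptions, not the prior art implementation. The `Speaker` class, `train_speaker_normalized`, and the three toy callables are hypothetical names. The acoustic model is reduced to a single number, and the initial model passed in is assumed to have been trained already from the unwarped features of (B1).

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Sequence, Tuple


@dataclass
class Speaker:
    name: str
    selection_data: List[float]  # speech data for selecting the warping function
    training_data: List[float]   # speech data for training


def train_speaker_normalized(
    speakers: Sequence[Speaker],
    model: Any,
    candidates: Sequence[float],
    select_warp: Callable[[List[float], Sequence[float], Any], float],
    extract: Callable[[List[float], float], List[float]],
    retrain: Callable[[Any, List[float]], Any],
    n_iterations: int,
) -> Tuple[Any, Dict[str, float]]:
    """Alternate warping-function selection and model training."""
    chosen: Dict[str, float] = {}
    for _ in range(n_iterations):                        # (B5) repeat
        features: List[float] = []
        for spk in speakers:
            warp = select_warp(spk.selection_data, candidates, model)  # (B2)
            chosen[spk.name] = warp
            features.extend(extract(spk.training_data, warp))          # (B3)
        model = retrain(model, features)                                # (B4)
    return model, chosen


# --- Toy stand-ins (assumptions made purely for illustration) ---
def toy_extract(data, alpha):
    return [alpha * x for x in data]


def toy_select(data, candidates, model):
    # Pick the alpha whose warped mean lies closest to the scalar "model".
    def distance(alpha):
        warped = toy_extract(data, alpha)
        return abs(sum(warped) / len(warped) - model)
    return min(candidates, key=distance)


def toy_retrain(model, features):
    return sum(features) / len(features)


speakers = [
    Speaker("A", selection_data=[2.0], training_data=[2.0, 2.0]),
    Speaker("B", selection_data=[4.0], training_data=[4.0, 4.0]),
]
final_model, warps = train_speaker_normalized(
    speakers, 4.0, [0.5, 1.0, 2.0], toy_select, toy_extract, toy_retrain, 3
)
```

In this toy setting speaker A is mapped onto the model by the candidate 2.0 and speaker B by the candidate 1.0, so the retrained model stays at 4.0 across iterations.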
FIG. 18 is a graph showing examples of frequency warping functions in the prior art example. The function shown in FIG. 18 represents the correspondence between frequencies before and after frequency warping by a linear frequency warping function determined by a frequency warping coefficient $\alpha$. With a coefficient $\phi$ determined, if the normalized frequency $f$ of input speech is not more than $\phi$, the frequency warping function is given by the following equation:
$$f' = \alpha \cdot f \quad \text{for } 0 < f \le \phi, \tag{1}$$
and when the frequency $f$ of input speech is within a range of $\phi$ to one, the frequency warping function is given by the following line that interconnects coordinates $(\phi, \alpha \cdot \phi)$ and coordinates $(1.0, 1.0)$ shown in FIG. 18:
$$f' = \frac{(\alpha \cdot \phi - 1) \cdot f - (\alpha - 1) \cdot \phi}{\phi - 1} \quad \text{for } \phi < f \le 1.0. \tag{2}$$
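Equations (1) and (2) can be checked numerically with a short sketch such as the following. The function name `warp_frequency` and the example values of alpha and phi are assumptions chosen for illustration; the prior art example does not specify them here.

```python
def warp_frequency(f: float, alpha: float, phi: float) -> float:
    """Two-segment linear frequency warping of Equations (1) and (2).

    f     : normalized input frequency in [0, 1]
    alpha : frequency warping coefficient
    phi   : breakpoint, strictly between 0 and 1
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be a normalized frequency in [0, 1]")
    if not 0.0 < phi < 1.0:
        raise ValueError("phi must lie strictly between 0 and 1")
    if f <= phi:
        # Equation (1): straight line through the origin with slope alpha.
        return alpha * f
    # Equation (2): line through (phi, alpha * phi) and (1.0, 1.0).
    return ((alpha * phi - 1.0) * f - (alpha - 1.0) * phi) / (phi - 1.0)


# Example with assumed values alpha = 1.1 and phi = 0.8.
at_breakpoint = warp_frequency(0.8, 1.1, 0.8)  # both segments meet here
at_upper_edge = warp_frequency(1.0, 1.1, 0.8)  # the upper edge maps to itself
```

The two segments agree at the breakpoint, where both give alpha times phi, so the warping is continuous, and the upper band edge 1.0 is always mapped to 1.0 regardless of alpha.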
For the execution of speaker normalization, a plurality of frequency warping functions that differ from one another in the frequency warping coefficient $\alpha$ are prepared, and among those, the frequency warping function having the maximum likelihood is selected. The term "frequency warping" herein refers to a process of shifting each frequency of speech waveform data of one target speaker to its corresponding average frequency of all the speakers by using, for example, the frequency warping functions of FIG. 18.
However, the method of the prior art example requires the configuration of the frequency warping function to be specified in advance. Also, since the frequency warping coefficient $\alpha$ is given as a discrete value, there has been a problem that detailed frequency warping functions could not be estimated. Further, when speech recognition is performed using an HMM speaker-normalized and trained by the speaker normalization method of the prior art example, there has been a problem that significant improvement in the speech recognition rate by normalization could not be obtained.