1. Field of the Invention
The present invention relates to a speech recognition apparatus, and more particularly to a speech recognition apparatus which has an improved speaker adaptation function.
2. Description of the Related Art
A conventional speaker adaptation system is described in a paper entitled "Speaker Adaptation Which Makes Use of Prior Knowledge Regarding Correlation of Movement Vectors", Collection of Lecture Papers of the Autumn Meeting for Reading Research Papers in 1997, the Acoustic Society of Japan, Separate Volume I, pp. 23-24, September, 1997.
FIG. 3 shows a speaker adaptation system of a conventional speech recognition apparatus based on a hidden Markov model (HMM), and FIG. 4 shows a prior learning system of the conventional speech recognition apparatus of FIG. 3.
Referring to FIGS. 3 and 4, upon speaker adaptation, learning is performed by an HMM learning section 33 using adaptation utterances of a new speaker stored in an adaptation utterance storage section 31 and using speaker independent HMMs (hereinafter referred to as "SI-HMMs") stored in an SI-HMM storage section 32 in advance as initial models, and HMMs (hereinafter referred to as "BW-HMMs") obtained as a result of the learning are stored into a BW-HMM storage section 34.
A subtraction section 35 stores finite differences between parameters of the BW-HMMs and the SI-HMMs into a first finite difference storage section 36. Only parameter finite differences of those HMMs which appear in the adaptation utterances are stored into the first finite difference storage section 36. For example, if the adaptation utterances include three utterances of "a", "u" and "o", since the parameters of the HMMs corresponding to "a", "u" and "o" are learned by the HMM learning section 33, finite differences between the BW-HMMs and the SI-HMMs are produced for them.
However, since "i" and "e" do not appear in the adaptation utterances, the corresponding HMMs are not learned. The parameters of their BW-HMMs therefore remain the same as the parameters of the SI-HMMs, and the finite differences remain equal to 0.
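The subtraction step in this example can be illustrated with a minimal Python sketch. It is not the implementation of the cited system: each HMM is reduced to a single two-element parameter vector, and all numeric values and the names `SI_HMM`, `BW_HMM` and `first_finite_differences` are assumptions chosen only for illustration.

```python
# Toy sketch: each HMM is reduced to one small parameter vector. This is an
# assumption made for illustration; real HMMs carry many more parameters.
SI_HMM = {
    "a": [1.0, 2.0],
    "i": [0.5, 1.5],
    "u": [2.0, 0.0],
    "e": [1.0, 1.0],
    "o": [0.0, 3.0],
}


def first_finite_differences(bw_hmm, si_hmm):
    """Subtract the SI-HMM parameters from the BW-HMM parameters, unit by unit."""
    return {
        unit: [b - s for b, s in zip(bw_hmm[unit], si_hmm[unit])]
        for unit in si_hmm
    }


# Hypothetical result of learning on utterances that contain only "a", "u"
# and "o". Units that never appeared keep their SI-HMM values unchanged.
BW_HMM = dict(SI_HMM)
BW_HMM.update({"a": [1.5, 2.0], "u": [2.0, -0.5], "o": [0.25, 3.5]})

diffs = first_finite_differences(BW_HMM, SI_HMM)
for unit in ("a", "i", "u", "e", "o"):
    print(unit, diffs[unit])
```

In this sketch the entries for "i" and "e" come out as zero vectors, which corresponds to the finite differences remaining equal to 0 for HMMs that did not appear in the adaptation utterances.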
An interpolation parameter storage section 37 stores interpolation parameters determined in prior learning (which will be hereinafter described).
An interpolation section 38 outputs second finite differences as linear sums of the interpolation parameters and the finite differences stored in the first finite difference storage section 36 so that the second finite differences may be stored into a second finite difference storage section 39.
The second finite differences calculated by the interpolation section 38 are finite differences between parameters of those HMMs which have not appeared in the adaptation utterances and parameters of the SI-HMMs.
In the example described above, finite differences regarding the HMMs of "i" and "e" are calculated as second finite differences.
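The interpolation step can be sketched as follows. The sketch assumes that each interpolation parameter is a single scalar weight linking one HMM that appeared to one HMM that did not; this weight layout, the function name `interpolate` and all numeric values are hypothetical and are not taken from the cited paper.

```python
def interpolate(first_diffs, interpolation_params):
    """Return second finite differences as linear sums of first differences.

    interpolation_params[unseen][seen] is a hypothetical scalar weight.
    """
    dim = len(next(iter(first_diffs.values())))
    second = {}
    for unseen, weights in interpolation_params.items():
        acc = [0.0] * dim
        for seen, w in weights.items():
            for d in range(dim):
                acc[d] += w * first_diffs[seen][d]
        second[unseen] = acc
    return second


# First finite differences of the HMMs that appeared ("a", "u", "o").
first_diffs = {"a": [0.5, 0.0], "u": [0.0, -0.5], "o": [0.25, 0.5]}

# Illustrative interpolation parameters for the HMMs that did not appear.
params = {
    "i": {"a": 0.5, "u": 0.5, "o": 0.0},
    "e": {"a": 0.0, "u": 0.0, "o": 1.0},
}

second_diffs = interpolate(first_diffs, params)
print(second_diffs)
```

With these assumed weights, the difference for "i" is the average of the differences for "a" and "u", and the difference for "e" is copied from "o".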
A re-estimation parameter storage section 41 stores re-estimation parameters determined in prior learning which will be hereinafter described.
A re-estimation section 40 receives the re-estimation parameters and the first and second finite differences as inputs thereto, calculates third finite differences for all HMM parameters, and stores the third finite differences into a third finite difference storage section 42. In the example described above, the third finite differences are finite differences for parameters of all of the HMMs of "a", "i", "u", "e" and "o".
An addition section 43 adds the parameters of the SI-HMM and the third finite differences to determine specific speaker HMMs adapted to the new speaker and stores the specific speaker HMMs into an SD-HMM storage section 44.
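The re-estimation and addition steps can be sketched in the same toy setting. The sketch assumes one scalar parameter per HMM and models re-estimation as a single matrix applied to the merged first and second finite differences; the matrix values and the names `reestimate` and `add_to_si` are hypothetical.

```python
UNITS = ["a", "i", "u", "e", "o"]


def reestimate(combined_diffs, reest_matrix):
    """Third finite differences as a linear transform of the merged differences."""
    return [
        sum(r * c for r, c in zip(row, combined_diffs))
        for row in reest_matrix
    ]


def add_to_si(si_params, third_diffs):
    """Add the third finite differences to the SI-HMM parameters."""
    return [s + t for s, t in zip(si_params, third_diffs)]


# One scalar per unit, in UNITS order (illustrative values only).
si_params = [1.0, 0.5, 2.0, 1.0, 0.0]

# First differences for "a", "u", "o" merged with second differences for
# "i", "e", again in UNITS order.
combined = [0.5, 0.25, -0.5, 0.25, 0.25]

# Hypothetical re-estimation parameters: the row for "a" mixes in the
# interpolated difference of "i"; all other rows pass their value through.
reest = [
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]

third = reestimate(combined, reest)
adapted = add_to_si(si_params, third)
print(dict(zip(UNITS, adapted)))
```

The row for "a" shows how a parameter learned directly from the adaptation utterances is also modified by differences obtained through interpolation.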
Upon prior learning, specific speaker HMMs (SD-HMMs) of a large number of speakers are stored into the SD-HMM storage section 44, and finite differences (referred to as "third finite differences") between the parameters of the SD-HMMs of the individual speakers and the parameters of the SI-HMMs calculated by the subtraction section 47 are stored into the third finite difference storage section 42. Of the third finite differences, those third finite differences for the parameters of the HMMs which appeared in the adaptation utterances upon speaker adaptation are represented by "S", and the other third finite differences (those for the parameters of the HMMs which did not appear in the adaptation utterances) are referred to as "U".
An interpolation parameter learning section 45 determines the interpolation parameters so that the square sum of errors, which are differences U-U1 between linear sums (referred to as "U1") of the third finite differences S and the interpolation parameters and the third finite differences U, for the large number of speakers may be minimum, and stores the determined interpolation parameters into the interpolation parameter storage section 37.
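The least-squares determination of the interpolation parameters can be sketched as below for a single element of U. The sketch solves the normal equations with a small Gauss-Jordan routine so that it needs only the standard library; the training values and the names `solve` and `learn_interpolation` are assumptions, and a practical system would more likely use a numerical library.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def learn_interpolation(s_list, u_list):
    """Weights w minimizing the sum over speakers of (U - w . S) ** 2."""
    dim = len(s_list[0])
    gram = [
        [sum(s[i] * s[j] for s in s_list) for j in range(dim)]
        for i in range(dim)
    ]
    rhs = [sum(s[i] * u for s, u in zip(s_list, u_list)) for i in range(dim)]
    return solve(gram, rhs)


# Hypothetical third finite differences S for four training speakers.
S = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]

# Synthetic targets U generated from known weights, so that the fit can be
# checked against them.
true_w = [0.5, -0.25]
U = [sum(w * x for w, x in zip(true_w, s)) for s in S]

w = learn_interpolation(S, U)
print(w)
```

Because the synthetic U values are exact linear sums of S, the recovered weights match `true_w` up to rounding; with real speaker data a residual error U-U1 would remain.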
Then, the linear sums of the determined interpolation parameters and the third finite differences S are outputted as second finite differences so that they are stored into the second finite difference storage section 39.
A re-estimation parameter learning section 46 determines the re-estimation parameters so that the square sum of errors, which are differences U-U3 between linear sums (referred to as "U3") of the second finite differences and the re-estimation parameters and the third finite differences U, for the large number of speakers may be minimum, and stores the re-estimation parameters into the re-estimation parameter storage section 41.
The conventional speech recognition apparatus described above, however, has the following problems.
The first problem resides in that, upon speaker adaptation, interpolation and re-estimation are performed using finite differences (first finite differences) between BW-HMMs produced using adaptation utterances of a new speaker stored in the adaptation utterance storage section and SI-HMMs, but in prior learning for determination of interpolation parameters and re-estimation parameters, only SD-HMMs of a large number of speakers are used to perform learning.
In particular, in prior learning, first finite differences which are used upon speaker adaptation are not used, but third finite differences are used in substitution. Where the number of words of adaptation utterances is sufficiently large, since the SD-HMMs and the BW-HMMs substantially coincide with each other, this substitution is a good approximation.
However, in speaker adaptation, the most important objective is to minimize the number of words of adaptation utterances, since this reduces the burden of utterance on the user.
Where the number of words of adaptation utterances is small, since parameters of the SD-HMMs and the BW-HMMs are significantly different from each other, the approximation accuracy in such substitution as described above upon prior learning (that is, substitution of the first finite differences by the third finite differences) is very low, and it is difficult to estimate interpolation parameters or re-estimation parameters with a high degree of accuracy.
The second problem resides in that, in order to perform speaker adaptation, two linear transforms of interpolation and re-estimation are performed using a single finite difference (stored in the first finite difference storage section).
Where the number of words of adaptation utterances is small, the ratio of HMMs appearing in the utterances is very small. Therefore, (finite differences of) parameters of the greater part of the HMMs must inevitably be estimated by linear interpolation, that is, by a linear transform of (finite differences of) parameters of the small number of HMMs which actually appear, and consequently, the accuracy of the second finite differences is very low.
Further, the parameters of those HMMs which have appeared in the adaptation utterances are also modified by re-estimation using finite differences (second finite differences having a low accuracy) of parameters of a large number of HMMs which have not appeared. Therefore, the parameters of the HMMs which have appeared in the adaptation utterances are also deteriorated.