The present invention relates to adaptation techniques in speech recognition and, more particularly, to techniques for improving recognition performance by adapting for the difference between an input speech and a speech reference pattern.
It is well known in the art that speech recognition efficiency is reduced by differences in character between the input speech and the speech reference pattern. The causes of those differences that significantly reduce recognition efficiency are largely classified into two types. In one of these types, the causes are attributable to the environment in which the speech is produced by the speaker. In the other type, the causes are attributable to the speech of the speaker himself or herself. The environmental causes are further classified into two categories. One of these categories consists of background noises or like additive noises, which are introduced simultaneously with the speaker's speech and additively affect the speech spectrum. The other category consists of line distortions, such as distortions due to microphone or telephone line transmission characteristics, which distort the spectrum itself.
Various adaptation methods have been proposed to cope with the differences in character which are attributable to the speech environment. One such adaptation method aims at coping with both environmental cause categories, i.e., the additive noises and the line distortions, to prevent the reduction of speech recognition efficiency caused by the environment. As an example, a speech adaptation system used for a speech recognition system is disclosed in Takagi, Hattori and Watanabe, "Speech Recognition with Environmental Adaptation Function Based on Spectral Copy Images", Spring Proceedings of the Acoustics Engineer's Association, 2-P-8, pp. 173-174, March 1994 (hereinafter referred to as Reference No. 1).
FIG. 4 shows the speech adaptation system noted above. The method disclosed in Reference No. 1 will now be described in detail. An input speech, which has been distorted by additive noises and transmission line distortions, is converted in an analyzing unit 41 into a time series of feature vectors. A reference pattern storing unit 43 stores, as a word reference pattern, time series data of each recognition subject word, obtained by analyzing a training speech in the same manner as in the analyzing unit 41. Each word reference pattern is given beforehand labels discriminating a speech section and a noise section. A matching unit 42 matches the time series of feature vectors of the input speech against the time series of the word reference patterns stored in the reference pattern storing unit 43, and selects a first order word reference pattern. It also obtains the correlation between the input speech and the word reference patterns thereof with respect to the time axis. From the correlation between the first order word reference pattern and the input speech feature vectors (pattern) obtained in the matching unit 42, an environment adapting unit 44 calculates the mean vectors of the speech and noise sections of the input speech and of each word reference pattern. The speech and noise section mean vectors of the input speech are denoted by $S_v$ and $N_v$, and the speech and noise section mean vectors of the word reference patterns are denoted by $S_w$ and $N_w$. The environment adapting unit 44 performs the adaptation of the reference patterns by using the four mean vectors based on Equation 1 given below. The adapted reference patterns are stored in an adapted reference pattern storing unit 45.

$$W'(k) = \frac{S_v - N_v}{S_w - N_w}\,\bigl(W(k) - N_w\bigr) + N_v \qquad (1)$$

where W(k) represents the reference patterns before the adaptation (k being an index of all the reference patterns), and W'(k) represents the adapted reference patterns. This adaptation permits elimination of the environmental difference between the reference patterns and the input speech and provision of a speech adaptation system, which is stable and provides excellent performance irrespective of input environment variations.
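The environmental adaptation described above can be sketched in a few lines. This is only an illustrative sketch, not the implementation of Reference No. 1: it assumes that the scaling factor is the ratio of the input-speech difference (S_v - N_v) to the reference-pattern difference (S_w - N_w), so that all four mean vectors are used, and that the operations are applied element by element to spectral feature vectors. The function names `section_mean` and `adapt_patterns` and the numeric values are hypothetical.

```python
def section_mean(frames):
    """Mean vector of a non-empty list of equal-length feature vectors."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def adapt_patterns(patterns, s_v, n_v, s_w, n_w):
    """Apply W'(k) = (S_v - N_v) / (S_w - N_w) * (W(k) - N_w) + N_v element-wise.

    Assumption: S_w and N_w differ in every component, so the division is defined.
    """
    gain = [(sv - nv) / (sw - nw) for sv, nv, sw, nw in zip(s_v, n_v, s_w, n_w)]
    return [
        [g * (x - nw) + nv for g, x, nw, nv in zip(gain, w, n_w, n_v)]
        for w in patterns
    ]


# Hypothetical two-dimensional mean vectors.
s_w, n_w = [4.0, 6.0], [1.0, 2.0]    # reference speech / noise section means
s_v, n_v = [8.0, 11.0], [2.0, 3.0]   # input speech / noise section means

# Sanity check of the mapping: the reference speech mean is carried onto the
# input speech mean, and the reference noise mean onto the input noise mean.
adapted = adapt_patterns([s_w, n_w], s_v, n_v, s_w, n_w)
```

Under these assumptions the transformation is an affine map per spectral component: it removes the reference noise level, rescales to the input's speech-to-noise span, and adds the input noise level.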
A different prior art adaptation technique, commonly termed a speaker adaptation technique, has been proposed for adapting the speaker-related difference between a reference speaker's speech and a recognition subject speaker's speech to improve the speech recognition efficiency. This technique is disclosed in Shinoda, Iso and Watanabe, "Speaker Adaptation with Spectrum Insertion for Speech Recognition", Proceedings of the Electronic Communication Engineer's Association, A, Vol. J 77-A, No. 2, pp. 120-127, February 1994 (hereinafter referred to as Reference No. 2). FIG. 5 shows an example of the speech adaptation system employed in this technique. In the system, an analyzing unit 51 converts an input speech, collected from a speaker having a different character from the reference speaker, into a time series of feature vectors. A reference pattern storing unit 53 stores respective reference patterns, which are obtained by analyzing a training speech of the reference speaker in the same manner as in the analyzing unit 51 and which hold time series data of the recognition subject words. A matching unit 52 matches the input speech feature vector time series against each word reference pattern time series stored in the reference pattern storing unit 53, and selects the first order word reference patterns. It also obtains the correlation between the input speech and the word reference patterns with respect to the time axis. While in this example the matching unit 52 selects the first order word reference patterns by itself (speaker adaptation without trainer), in the case where the first order word reference patterns are given beforehand (speaker adaptation with trainer), the matching unit 52 may be constructed such that it obtains only the correlation between the input speech and the word reference patterns thereof with respect to the time axis.
A speaker adapting unit 54 performs the following adaptation for each acoustical unit (or distribution, according to Reference No. 2) on the basis of the correlation between the first order word reference patterns obtained in the matching unit 52 and the input speech feature vectors. The adapted vector $\Delta_j$ for each distribution is obtained as shown below by using the mean value $\mu_j$ of reference pattern distribution j stored in the reference pattern storing unit 53 and the mean value $\mu'_j$ of the input correlated to the distribution j.

$$\Delta_j = \mu'_j - \mu_j \qquad (2)$$

For a distribution i of the reference pattern in the reference pattern storing unit 53 that has no correlation with the input, the adaptation is performed by using so-called spectrum insertion on the basis of the following Equation 3, which is described in Reference No. 2.

$$\Delta_i = \sum_j W_{ij}\,\Delta_j \qquad (3)$$

where j ranges over the reference pattern categories whose acoustical category is present in the input speech. In effect, all the reference pattern distributions are adapted with respect to the speaker according to one or the other of the two equations noted above. The adapted reference patterns are outputted from the speaker adapting unit 54 and stored in an adapted reference pattern storing unit 55.
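The two cases of the speaker adaptation can be illustrated with a short sketch. It is not the method of Reference No. 2 as published: it assumes that the adapted vector for an observed distribution is the input mean minus the reference mean, that the weights W_ij are supplied from elsewhere (the text above does not say how they are computed), and that a distribution is adapted by adding its adapted vector to its reference mean. All function names, labels and numeric values are hypothetical.

```python
def observed_deltas(ref_means, input_means):
    """Delta_j = mu'_j - mu_j for each distribution j correlated to the input."""
    return {
        j: [m_in - m_ref for m_in, m_ref in zip(input_means[j], ref_means[j])]
        for j in input_means
    }


def interpolated_deltas(deltas, weights):
    """Delta_i = sum_j W_ij * Delta_j for each distribution i absent from the input.

    `weights` maps i -> {j: W_ij}; the W_ij values are assumed to be given.
    """
    result = {}
    for i, row in weights.items():
        dim = len(next(iter(deltas.values())))
        acc = [0.0] * dim
        for j, w_ij in row.items():
            for d in range(dim):
                acc[d] += w_ij * deltas[j][d]
        result[i] = acc
    return result


def adapt_means(ref_means, deltas):
    """Assumption: the adapted mean is the reference mean plus its adapted vector."""
    return {j: [m + d for m, d in zip(ref_means[j], deltas[j])] for j in deltas}


# Hypothetical data: distributions "a" and "b" occur in the input, "c" does not.
ref_means = {"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]}
input_means = {"a": [2.0, 2.5], "b": [3.0, 5.0]}
weights = {"c": {"a": 0.5, "b": 0.5}}

deltas = observed_deltas(ref_means, input_means)
deltas.update(interpolated_deltas(deltas, weights))
adapted_means = adapt_means(ref_means, deltas)
```

In this sketch every distribution receives an adapted vector, either directly from the input or as a weighted sum of the vectors of the observed distributions, which mirrors the statement that all reference pattern distributions are adapted by one of the two equations.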
The prior art speech adaptation system using the environmental adaptation as shown in FIG. 4, however, aims solely at adapting the mean environmental differences appearing in the speech as a whole, and is incapable of performing highly accurate adaptation for each acoustical unit as the speaker adaptation does. Theoretically, therefore, the system cannot perform sufficient adaptation with respect to speech which is free from environmental differences and involves speaker differences alone.
The prior art speech adaptation system using the speaker adaptation as shown in FIG. 5 performs adaptation of differences appearing in the speech as a whole (mainly environmental causes) as well. The result of the adaptation thus reflects both speaker differences and environmental differences. Therefore, where the speech used for the adaptation and the speech at the time of the speech recognition are produced in different environments, a satisfactory result of adaptation cannot be obtained because of the environmental differences. Likewise, a satisfactory result of adaptation cannot be obtained where the speech used for the adaptation consists of speeches collected in various different environments.