1. Field of the Invention
The present invention relates generally to speech recognition, and more particularly to an improvement in the technique of calculating the mean value of each acoustic category that is necessary to effect speaker adaptation of input speech or reference patterns.
2. Description of the Related Art
Various speech recognition techniques are known, differing in approach and level of sophistication. The basic principle behind the existing speech recognition techniques is as follows: Utterances to be recognized are analyzed in a training or registering mode, and stored as reference patterns. An unknown utterance uttered by a speaker is analyzed in a recognition or testing mode, and the pattern produced as a result of the analysis is compared successively with the reference patterns. Then, a result that corresponds to the reference pattern which best matches the pattern is outputted as the recognized utterance.
Among various speech recognition systems, a speaker-independent speech recognition system is widely used, in which utterances of many speakers are registered as reference patterns to cover the distribution of speaker individualities. Therefore, the speaker-independent speech recognition system is capable of recognizing utterances of an unknown speaker at a relatively high speech recognition rate regardless of speech sound variations among different speakers.
However, the speaker-independent speech recognition system is disadvantageous in that it cannot achieve high performance if the unknown utterances that are inputted differ greatly from those registered as reference patterns. It is also known that the speech recognition rate of the system is degraded if the microphone used to record test utterances is different from the microphones that were used to record the utterances providing the reference patterns.
A technique known as "speaker adaptation" has been proposed to improve the speech recognition rate. The speaker adaptation process employs relatively few utterances provided by a specific speaker or a specific microphone to adapt the reference patterns to those utterances. One example of the speaker adaptation method is disclosed by K. Shinoda et al., "Speaker Adaptation Using Spectral Interpolation for Speech Recognition", Trans. of IEICE (Jap.), vol. J77-A, pp. 120-127, February 1994 (hereinafter referred to as "literature 1").
A conventional speech recognition system used for speaker adaptation will be described below with reference to FIG. 1 of the accompanying drawings.
As shown in FIG. 1, the conventional speech recognition system comprises: an analyzer 1 for converting input speech into a time sequence of feature vectors; a reference pattern memory 2 for storing reference patterns, i.e., time sequences of feature vectors that have been converted from training utterances and contain weighting information for each acoustic category; a matching unit 12 for comparing the time sequence of feature vectors of input utterances with the reference patterns to determine an optimum path and a time-alignment between the input utterances and the reference patterns; a backtracking information memory 14 for storing the two-dimensional association information produced by the matching unit 12; a template information memory 16 for storing template information, i.e., index information indicating which template has been used at each grid point when the reference pattern is a multiple template having a plurality of templates; and a mean vector calculator 18 for carrying out a backtracking process to determine which reference pattern frame is associated with the input speech at each time, based on the two-dimensional association information stored in the backtracking information memory 14. Both the backtracking information memory 14 and the template information memory 16 have a two-dimensional storage area of size (length of input speech) × (length of reference pattern).
The analyzer 1 may convert input speech into a time sequence of feature vectors according to any of various spectral analysis processes. These various spectral analysis processes include a method of employing output signals from a band-pass filter bank in 10 through 30 channels, a nonparametric spectral analysis method, a linear prediction coding (LPC) method, and a method of obtaining various multidimensional vectors representing short-time spectrums of input speech with various parameters including a spectrum directly calculated from a waveform by Fast Fourier Transform (FFT), a cepstrum which is an inverse Fourier transform of the logarithm of a short-time amplitude spectrum of a waveform, an autocorrelation function, and a spectral envelope produced by LPC.
Generally, feature vectors that are extracted as representing speech features from input speech using discrete times as a frame include power information, a change in power information, a cepstrum, and a linear regression coefficient of a cepstrum. Spectrums themselves and logarithmic spectrums are also used as feature vectors.
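As a concrete illustration of such an analysis, the framing and power features described above can be sketched in Python. This is a minimal sketch, not the analyzer 1 of FIG. 1: the frame length, hop size, and the choice of log power plus its first difference as the feature vector are assumptions for illustration only.

```python
import math

def analyze(samples, frame_len=256, hop=128):
    """Convert a waveform into a time sequence of simple feature vectors.

    Each frame yields [log power, delta log power]; a real analyzer
    would add cepstra, spectra, regression coefficients, etc.
    """
    log_powers = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        power = sum(s * s for s in frame) / frame_len
        log_powers.append(math.log(power + 1e-10))  # floor avoids log(0)
    # Delta (change in power information) as a simple first difference.
    features = []
    for t, lp in enumerate(log_powers):
        delta = lp - log_powers[t - 1] if t > 0 else 0.0
        features.append([lp, delta])
    return features
```

The output of such an analyzer is the time sequence X(i, c) used in the matching described below, with i the frame index and c the vector component.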
Speech of a standard speaker is analyzed and converted into a time sequence of feature vectors in the same manner as the analysis process employed by the analyzer 1, and the feature vectors are registered as reference patterns in units of isolated words, connected words, or phonemes in the reference pattern memory 2. Weighting information for respective categories to be classified is established in advance with respect to these reference patterns.
The matching unit 12 carries out matching by dynamic time warping between the time sequence of the feature vectors of the input speech converted by the analyzer 1 and the reference patterns stored in the reference pattern memory 2. The matching algorithm between the two patterns is preferably one that takes into account nonlinear expansion and contraction in the time domain, because the time sequence of the input speech and the reference patterns are easily expanded and contracted in the time domain. Such algorithms include the DP (Dynamic Programming) matching method, the HMM (Hidden Markov Model) matching method, and so on. In the description given below, the DP matching, which is widely used in present speech recognition, will be explained.
If it is assumed that symbols "i", "j" represent time frames (i=0 to I), (j=0 to J) of a respective input speech and a reference pattern, and the symbol "c" represents a vector component, then the time sequence of the feature vectors of input speech are indicated by X(i, c), and the time sequence of the reference pattern are indicated by Y(j, c).
The input speech and the reference patterns make up a two-dimensional space composed of grid points (i, j), and a minimum path of accumulated distances, among paths from a starting end (0, 0) to a terminal end (I, J), is regarded as an optimum association between the two patterns, and the accumulated distances are referred to as the distance between the patterns. According to speech recognition based on the DP matching, distances between the input speech and all the reference patterns are calculated, and the acoustic category of one of the reference patterns which gives a minimum distance is outputted as the result of speech recognition.
If the DP matching is carried out for adaptation or learning, then, since the reference pattern and the speech to be compared are already determined, the object of the DP matching is not speech recognition but the determination of a mean value of the feature vectors in each acoustic category once an optimum time-alignment between the two patterns has been obtained.
Distances d(i, j) between the vectors at the grid points (i, j) of the time sequence X(i, c) of the feature vectors of the input speech and the time sequence Y(j, c) of the feature vectors of the reference patterns are defined as follows:

d(i, j) = min_(k) Σ_(c) {X(i, c) − Y_k(j, c)}²

where k represents the kth template at the respective grid point. The distance for each grid point corresponds to the minimum of the distances given by the plural k templates.
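The local distance with multiple templates can be written directly from this definition. In the following sketch the helper name and the data layout, with the templates indexed as Y[k][j][c], are assumptions for illustration:

```python
def local_distance(X, Y, i, j):
    """d(i, j): minimum over the k templates of the squared Euclidean
    distance between input frame X[i] and reference frame Y[k][j]."""
    return min(
        sum((X[i][c] - Yk[j][c]) ** 2 for c in range(len(X[i])))
        for Yk in Y
    )
```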
According to the DP matching, the accumulated distances D(i, j) associated with the grid points (i, j) are indicated by the following recursive equation:

D(i, j) = d(i, j) + min { D(i−1, j), D(i−1, j−1), D(i−1, j−2) }
Specifically, accumulated distances D are calculated in a direction for the input speech to increase in time, using the grid point (0, 0) as a starting point and the initial value D(0, 0) as d(0, 0), and when accumulated distances up to the final grid point (I, J) are determined, an optimum matching path between the two patterns is considered to be determined.
The backtracking information that is stored in the backtracking information memory 14 is the transition information B(i, j) of the respective grid points, which is expressed as follows:

B(i, j) = argmin_(j') D(i−1, j'),  j' ∈ {j, j−1, j−2}

where argmin_(j') represents the selection of whichever of the values j, j−1, j−2 gives D the minimum value, as the value of the j component.
The template information T(i, j) which is stored in the template information memory 16, i.e., the index of the template that gives the minimum distance at the grid point (i, j), is represented by:

T(i, j) = argmin_(k) Σ_(c) {X(i, c) − Y_k(j, c)}²
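Under the same assumed data layout (X[i][c] for the input, Y[k][j][c] for the templates), the recursive computation of the accumulated distances together with the recording of B(i, j) and T(i, j) might be sketched as follows. Indices here run from 0, and the transitions j, j−1, j−2 follow the recursive equation above; the function names are assumptions for illustration:

```python
INF = float("inf")

def dp_forward(X, Y):
    """DP matching forward pass.

    X: input frames, X[i][c].  Y: plural templates, Y[k][j][c].
    Returns accumulated distances D, backtracking info B, template info T.
    """
    I, J = len(X), len(Y[0])

    def dist_and_template(i, j):
        # d(i, j) and T(i, j): best template at this grid point.
        best_k = min(range(len(Y)), key=lambda k: sum(
            (X[i][c] - Y[k][j][c]) ** 2 for c in range(len(X[i]))))
        d = sum((X[i][c] - Y[best_k][j][c]) ** 2 for c in range(len(X[i])))
        return d, best_k

    D = [[INF] * J for _ in range(I)]
    B = [[0] * J for _ in range(I)]
    T = [[0] * J for _ in range(I)]
    d0, k0 = dist_and_template(0, 0)   # starting point (0, 0)
    D[0][0], T[0][0] = d0, k0
    for i in range(1, I):
        for j in range(J):
            # Transition from (i-1, j'), j' in {j, j-1, j-2}.
            prev_js = [p for p in (j, j - 1, j - 2) if p >= 0]
            best_prev = min(prev_js, key=lambda p: D[i - 1][p])
            if D[i - 1][best_prev] == INF:
                continue  # unreachable grid point
            d, k = dist_and_template(i, j)
            D[i][j] = d + D[i - 1][best_prev]
            B[i][j] = best_prev
            T[i][j] = k
    return D, B, T
```

When the loop finishes, D[I−1][J−1] is the distance between the two patterns, and B holds the transition information needed for the backtracking process described below.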
The backtracking process that has heretofore been carried out by the conventional mean vector calculator 18 will be described below with respect to a simple example where the number of acoustic categories to be classified is 2, i.e., input speech is divided into a noise portion and a speech portion, and their mean values are determined.
If the mean values of noise and speech portions are indicated respectively by N(c), S(c), then the mean values in the respective acoustic categories back along the optimum path from a grid point (I, J) to a grid point (0, 0) are calculated as follows:
In a first step, the values of i, j, N(c), S(c) are set respectively to I, J, 0, 0 as follows:
i = I,
j = J,
N(c) = 0, and
S(c) = 0.
In a second step, the type of the acoustic category of the grid point (i, j) is checked. If it is a speech category, then S(c)=S(c)+X(i, c) is calculated, and if it is a noise category, then N(c)=N(c)+X(i, c) is calculated.
In a third step, the values of i and j are checked. If both are 0, then the processing jumps to a fifth step, and if i or j is not 0, then the processing proceeds to a fourth step.
In the fourth step, i is decremented by 1, and the transition information B(i, j) of the grid point (i, j) is put in j as follows:

i = i − 1, and
j = B(i, j).
Thereafter, the processing returns to the second step, and the second and following steps are repeated.
In the fifth step, the contents of N(c) and S(c) are each divided by the number of frames that were summed into them, yielding the mean values of the respective acoustic categories. The processing is now completed.
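The five steps above might be sketched as follows. This is a sketch under the assumptions that indices run from 0, that category[j] labels each reference frame as "speech" or "noise", and that B is the transition information produced during matching (a valid B traces every path back to the starting point (0, 0)):

```python
def category_means(X, B, category):
    """Backtrack from the terminal grid point to (0, 0), accumulating
    input frames X[i] into the acoustic category of the aligned
    reference frame j, then divide to obtain the mean vectors."""
    C = len(X[0])
    sums = {"noise": [0.0] * C, "speech": [0.0] * C}  # N(c), S(c)
    counts = {"noise": 0, "speech": 0}
    i, j = len(X) - 1, len(B[0]) - 1        # first step: start at (I, J)
    while True:
        label = category[j]                  # second step: accumulate
        for c in range(C):
            sums[label][c] += X[i][c]
        counts[label] += 1
        if i == 0 and j == 0:                # third step: reached (0, 0)?
            break
        i, j = i - 1, B[i][j]                # fourth step: follow B
    return {lab: [s / counts[lab] for s in sums[lab]]  # fifth step
            for lab in sums if counts[lab] > 0}
```

Note that the whole two-dimensional arrays B and category must be retained until the matching finishes before this backtracking can begin, which is precisely the memory and pipelining drawback discussed next.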
In the conventional acoustic category mean value calculating apparatus, the backtracking process is carried out by going from a grid point position composed of a terminal end point of input speech and a terminal end point of a reference pattern back toward a starting end to associate the input speech and the reference pattern in a two-dimensional space. Mean vectors of the input speech are calculated in respective categories of the reference pattern that has been associated by the backtracking process, and outputted as acoustic category mean values.
Since the conventional acoustic category mean value calculating apparatus is required to search the two-dimensional space both in the matching process executed by the matching unit 12 and in the backtracking process executed by the mean vector calculator 18, it has been disadvantageous in that it requires a large amount of calculation and hence is not suitable for real-time operation. Furthermore, inasmuch as the backtracking process executed by the mean vector calculator 18 cannot be started until the matching process executed by the matching unit 12 is finished, the two processes cannot be executed simultaneously in parallel with each other, i.e., by way of so-called pipeline processing. This also makes the conventional acoustic category mean value calculating apparatus incapable of real-time operation.
Even if the number of acoustic categories to be classified is small, the conventional acoustic category mean value calculating apparatus necessarily needs a large memory as a two-dimensional storage area for carrying out the backtracking process. For this reason, it has been impossible to make the conventional acoustic category mean value calculating apparatus inexpensive.