The present invention relates to an indefinite speaker's voice recognition system and method, as well as an acoustic model learning method and a recording medium with a voice recognition program recorded therein, and more particularly to a voice recognition system capable of normalizing speakers on the frequency axis, a learning system for normalization, a voice recognition method, a learning method for normalization, and a recording medium in which a program for voice recognition and a learning program for normalization are stored.
Spectrum converters in prior art voice recognition systems are disclosed in, for instance, Japanese Patent Laid-Open No. 6-214596 (referred to as Literature 1) and Puming Zhan and Martin Westphal, “Speaker Normalization Based on Frequency Warping”, ICASSP, pp. 1039-1042, 1997 (referred to as Literature 2).
For example, Literature 1 discloses a voice recognition system which comprises a frequency characteristic correcting means for correcting the frequency characteristic of an input voice signal on the basis of a plurality of predetermined different frequency characteristic correction coefficients, a frequency axis converting means for converting the frequency axis of the input voice signal on the basis of a plurality of predetermined frequency axis conversion coefficients, a feature quantity extracting means for extracting the feature quantity of the input voice signal as an input voice feature quantity, a reference voice storing means for storing a reference voice feature quantity, and a collating means for collating the input voice feature quantity, obtained as a result of the processes in the frequency characteristic correcting means and the frequency axis converting means, with the reference voice feature quantity stored in the reference voice storing means. The voice recognition system includes a speaker adapting phase function and a voice recognition phase function. In the speaker adapting phase of the voice recognition process in this system, an unknown speaker's voice signal having a known content is processed in the frequency characteristic correcting means, the frequency axis converting means and the feature quantity extracting means for each of the plurality of different frequency characteristic correction coefficients and the plurality of different frequency axis conversion coefficients, the input voice feature quantity obtained for each coefficient is collated with a reference voice feature quantity of the same content as the above known content, and the frequency characteristic correction coefficient and the frequency axis conversion coefficient giving the minimum distance are selected.
In the voice recognition phase, the input voice feature quantity is determined by using the selected frequency characteristic correction coefficient and frequency axis conversion coefficient and collated with the reference voice feature quantity.
In these prior art voice recognition systems, the spectrum converter improves the recognition performance by elongating or contracting the spectrum of the voice signal on the frequency axis according to the sex, age, physical condition, etc. of the individual speaker. For this purpose a function is defined whose outline can be varied with an adequate parameter, and this function is used to elongate or contract the spectrum of the voice signal on the frequency axis. The function is referred to as the “warping function”, and the parameter defining the outline of the warping function is referred to as the “elongation/contraction parameter”.
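By way of illustration only, one possible warping function is sketched below. The text does not prescribe a particular form, so the piecewise-linear shape is an assumption, and the names `warp_frequency`, `alpha`, `f_max` and `knee` are hypothetical.

```python
def warp_frequency(f, alpha, f_max=8000.0, knee=0.8):
    """Map a frequency f (Hz) to a warped frequency (illustrative assumption).

    alpha is the elongation/contraction parameter: alpha > 1 elongates the
    low band, alpha < 1 contracts it, and alpha == 1 is the identity.
    Above the break point the line is bent so that f_max maps to f_max.
    """
    if not 0.0 <= f <= f_max:
        raise ValueError("f must lie in [0, f_max]")
    # Break point chosen so the first segment never exceeds knee * f_max.
    f0 = knee * f_max / max(alpha, 1.0)
    if f <= f0:
        return alpha * f
    # Second segment joins (f0, alpha * f0) to (f_max, f_max).
    slope = (f_max - alpha * f0) / (f_max - f0)
    return alpha * f0 + slope * (f - f0)


# One input frequency under three hypothetical parameter values.
examples = {alpha: warp_frequency(1000.0, alpha) for alpha in (0.9, 1.0, 1.2)}
```

A single parameter controls the whole outline here, which is what allows a small set of candidate values to be stored and tried in turn.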
Heretofore, a plurality of values have been prepared for the elongation/contraction parameter of the warping function (the “warping parameter”). The spectrum of the voice signal is elongated or contracted on the frequency axis by using each of these values, an input pattern is calculated from each elongated or contracted spectrum and matched against a reference pattern to obtain a distance, and the value giving the minimum distance is set as the warping parameter value at the time of recognition.
The spectrum converter in the prior art voice recognition system will now be described with reference to the drawings. FIG. 9 is a view showing an example of the construction of the spectrum converter in the prior art voice recognition system. Referring to FIG. 9, this prior art spectrum converter comprises an FFT (Fast Fourier Transform) unit 301, an elongation/contraction parameter memory 302, a frequency converter 303, an input pattern calculating unit 304, a matching unit 306, a reference pattern unit 305 and an elongation/contraction parameter selecting unit 307. The FFT unit 301 cuts out the input voice signal for every unit interval of time and applies a Fourier transform to the cut-out signal to obtain a frequency spectrum.
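The operation of the FFT unit 301 can be sketched as follows. This is a minimal illustration, assuming non-overlapping rectangular frames and a directly evaluated discrete Fourier transform in place of a fast algorithm; the names `frame_spectra`, `frame_len` and `hop` are hypothetical and do not appear in the cited literature.

```python
import cmath
import math


def frame_spectra(signal, frame_len, hop):
    """Cut the signal into frames and return one magnitude spectrum per frame."""
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        mags = []
        for k in range(frame_len // 2 + 1):  # non-negative frequencies only
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n, x in enumerate(frame))
            mags.append(abs(acc))
        spectra.append(mags)
    return spectra


# A cosine completing two cycles per 8-sample frame peaks at bin 2.
tone = [math.cos(2 * math.pi * 2 * n / 8) for n in range(16)]
spectra = frame_spectra(tone, frame_len=8, hop=8)
peak_bin = spectra[0].index(max(spectra[0]))
```

The direct transform costs on the order of the square of the frame length per frame; a practical unit would use a fast algorithm, but the output has the same meaning.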
A plurality of elongation/contraction parameter values for determining the elongation or contraction of frequency are stored in the elongation/contraction parameter memory 302. The frequency converter 303 executes a frequency elongation/contraction process on the spectrum fed out from the FFT unit 301 by using a warping function whose outline is determined by the elongation/contraction parameter, and feeds out the spectrum obtained after the frequency elongation/contraction process as an elongation/contraction spectrum. The input pattern calculating unit 304 calculates and outputs an input pattern by using the elongation/contraction spectrum fed out from the frequency converter 303. The input pattern is, for instance, a time series of parameters representing an acoustical feature, such as the cepstrum.
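The frequency elongation/contraction process of the frequency converter 303 might be sketched as a resampling of the spectrum along the frequency axis. The sketch assumes a purely linear warping function and linear interpolation between bins, neither of which is specified in the text; `warp_spectrum` and `alpha` are hypothetical names.

```python
def warp_spectrum(spectrum, alpha):
    """Resample a magnitude spectrum along the frequency axis.

    Output bin k reads the input at fractional bin alpha * k by linear
    interpolation, clamped to the last bin. alpha == 1 leaves the spectrum
    unchanged; alpha < 1 moves spectral features toward higher bins.
    """
    last = len(spectrum) - 1
    warped = []
    for k in range(len(spectrum)):
        pos = min(alpha * k, float(last))  # source position on the input axis
        lo = int(pos)                      # bin at or below the source position
        hi = min(lo + 1, last)             # next bin, clamped to the range
        frac = pos - lo
        warped.append((1.0 - frac) * spectrum[lo] + frac * spectrum[hi])
    return warped


spectrum = [0.0, 1.0, 4.0, 9.0, 16.0]
unchanged = warp_spectrum(spectrum, 1.0)  # [0.0, 1.0, 4.0, 9.0, 16.0]
elongated = warp_spectrum(spectrum, 0.5)  # [0.0, 0.5, 1.0, 2.5, 4.0]
```

In the second call the value found at input bin 2 appears at output bin 4, that is, the spectrum is elongated by a factor of two on the frequency axis.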
The reference pattern is formed by using a large number of input patterns and averaging the phoneme unit input patterns belonging to the same class by a certain type of averaging means. For the preparation of the reference pattern, see “Fundamentals of Voice Recognition”, Part I, translated and edited by Yoshii, NTT Advanced Technology Co., Ltd., 1995, p. 63 (Literature 3).
Reference patterns can be classified by the recognition algorithm. For example, time series reference patterns with input patterns arranged in the phoneme time series order are obtained in the case of DP (Dynamic Programming) matching, and state series and connection data thereof are obtained in the case of the HMM (Hidden Markov Model).
The matching unit 306 calculates a distance by using the input pattern and the reference pattern of the reference pattern unit 305 that is matched to the content of the voice inputted to the FFT unit 301. The calculated distance corresponds to the likelihood concerning the reference pattern in the HMM (Hidden Markov Model) case and to the distance of the optimum route in the DP matching case. The elongation/contraction parameter selecting unit 307 selects the best matched elongation/contraction parameter in view of the matching property obtained in the matching unit 306.
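For the DP matching case, the distance of the optimum route can be illustrated by the following sketch. It assumes a Euclidean local distance and the basic three-way step pattern, which are common choices rather than details taken from the text; `dp_matching_distance` is a hypothetical name.

```python
def dp_matching_distance(input_pattern, reference_pattern):
    """Cumulative distance along the optimum alignment route between two
    time series of feature vectors (basic dynamic programming recurrence)."""
    inf = float("inf")
    n, m = len(input_pattern), len(reference_pattern)
    # cost[i][j]: best cumulative distance aligning the first i input
    # frames with the first j reference frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = input_pattern[i - 1], reference_pattern[j - 1]
            local = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
            cost[i][j] = local + min(cost[i - 1][j],      # input frame repeated
                                     cost[i][j - 1],      # reference frame repeated
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# A time-stretched copy of the reference aligns with zero distance.
reference = [[0.0], [1.0], [2.0]]
stretched = [[0.0], [1.0], [1.0], [2.0]]
same_content = dp_matching_distance(stretched, reference)            # 0.0
different = dp_matching_distance([[0.0], [1.0]], [[0.0], [3.0]])     # 2.0
```

The recurrence absorbs differences in speaking rate, so the remaining distance reflects spectral mismatch, which is the quantity the parameter selection relies on.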
FIG. 10 is a flow chart for describing a process executed in the prior art spectrum converter. The operation of the prior art spectrum converter will now be described with reference to FIGS. 9 and 10. The FFT unit 301 executes the FFT operation on the voice signal to obtain the spectrum thereof (step D101 in FIG. 10). The frequency converter 303 executes elongation or contraction of the spectrum on the frequency axis by using the input elongation/contraction parameter (D106) (step D102). The input pattern calculating unit 304 calculates the input pattern by using the spectrum elongated or contracted on the frequency axis (step D103). The matching unit 306 determines the distance between the reference pattern (D107) and the input pattern (step D104). The sequence of processes from step D101 to step D104 is executed for all the elongation/contraction parameter values stored in the elongation/contraction parameter memory 302 (step D105).
When 10 elongation/contraction parameter values are stored in the elongation/contraction parameter memory 302, the process sequence from step D101 to D104 is repeated 10 times to obtain 10 different distances. The elongation/contraction parameter selecting unit 307 compares the distances corresponding to all the elongation/contraction parameters, and selects the elongation/contraction parameter corresponding to the shortest distance (step D108).
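The selection over the stored values can be sketched as an exhaustive loop. The sketch is a simplification under stated assumptions: it starts from an already computed spectrum, so step D101 is omitted; the warped spectrum itself stands in for the input pattern of step D103; and a plain Euclidean distance stands in for the matching of step D104. The names `warp_spectrum` and `select_parameter` are hypothetical.

```python
def warp_spectrum(spectrum, alpha):
    """Linear frequency warp with linear interpolation (illustrative assumption)."""
    last = len(spectrum) - 1
    out = []
    for k in range(len(spectrum)):
        pos = min(alpha * k, float(last))
        lo = int(pos)
        hi = min(lo + 1, last)
        out.append((1.0 - (pos - lo)) * spectrum[lo] + (pos - lo) * spectrum[hi])
    return out


def select_parameter(spectrum, reference, candidates):
    """Try every stored value and keep the one giving the shortest distance."""
    best_alpha, best_dist = None, float("inf")
    for alpha in candidates:                     # repetition of step D105
        warped = warp_spectrum(spectrum, alpha)  # step D102
        pattern = warped                         # step D103 (stand-in, no cepstrum)
        dist = sum((p - r) ** 2
                   for p, r in zip(pattern, reference)) ** 0.5  # step D104
        if dist < best_dist:
            best_alpha, best_dist = alpha, dist
    return best_alpha, best_dist                 # step D108


spectrum = [0.0, 1.0, 4.0, 9.0, 16.0]
reference = [0.0, 0.5, 1.0, 2.5, 4.0]
best_alpha, best_dist = select_parameter(spectrum, reference, [0.5, 0.75, 1.0, 1.25])
```

The loop body runs once per stored value, and only values present in the candidate list can ever be returned.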
However, the above prior art spectrum converter has the following problems.
The first problem is that a large amount of computation is required to determine the elongation/contraction parameter value. This is so because in the prior art spectrum converter it is necessary to prepare a plurality of elongation/contraction parameter values and to execute the FFT process, the spectrum frequency elongation/contraction process and the input pattern calculation repeatedly, a number of times corresponding to the number of these values.
The second problem is that the frequency elongation and contraction may fail to produce a sufficient effect on the voice recognition system. This is so because the elongation/contraction parameter values are all predetermined, and none of these values may be optimum for an unknown speaker.