In recent years, significant advances have been achieved in the field of isolated word recognition. However, in continuous speech recognition, especially for a large vocabulary, many problems remain unsolved. Both the computational complexity and the massive storage requirements make isolated and connected word recognition strategies based on a word model infeasible for the large vocabulary needed in continuous speech recognition applications. One possible solution is to recognize the basic phonetic units of the input speech.
One of the methods studied most extensively these days in this field is the phoneme recognition method. The term "phoneme recognition" means conversion of an input voice to a series of phonemes which are substantially equal to pronunciation symbols. The voice converted to such a series of phonemes is then converted, for example, to a suitable letter string (i.e. sentence) by using a word dictionary, grammatical rules and the like.
An advantage of phoneme discrimination is that expansion of vocabularies, recognizable sentence types, etc. can be achieved by separating acoustic level processing and letter string level processing from each other. A method for phoneme discrimination is proposed in "Multi-Level Clustering of Acoustic Features for Phoneme Recognition Based on Mutual Information," Proc. ICASSP-89, pp. 604-607 (May 1989).
Further, there is motivation to attempt to recognize speaker-independent phonemes. Good phonetic decoding leads to good word decoding. The ability to recognize the various (e.g. English) phonemes accurately will undoubtedly provide the basis for an accurate word recognizer.
The conventional phoneme discrimination method disclosed in the above publication is outlined below.
According to this phoneme discrimination method, the power of each frame and acoustic parameters (LPC Mel-cepstrum coefficients) derived by LPC analysis are obtained from the input voice signals. After the four quantization codes described below have been computed, the phoneme label (the train of phoneme symbols) of each frame is determined from the combination of these quantization codes.
(1) With respect to each frame, a power-change pattern (PCP), formed from the differences between the power of the frame of interest and the powers of its preceding and succeeding frames, is vector-quantized, so that a PCP code indicative of the power-change pattern of the voice waveform is obtained.
(2) As acoustic parameters, cepstrum codes are obtained by vector-quantizing the LPC Mel-cepstrum coefficients while using codebooks classified in advance in accordance with PCP codes.
(3) The gradient of a least squares approximation line of the acoustic parameters is vector-quantized to determine a regression coefficient.
(4) The time-series pattern of the PCP codes is vector-quantized to obtain a PCP code sequence.
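Steps (1) to (3) above can be illustrated with a minimal sketch. It is not the implementation of the cited publication: the function names, the one-frame neighbourhood, the clamping of edge frames, the squared Euclidean distance measure, and the example codebooks are all assumptions introduced here for illustration only.

```python
from typing import List, Sequence


def power_change_pattern(powers: Sequence[float], t: int, width: int = 1) -> List[float]:
    """Differences between the power of frame t and its neighbouring frames.

    Assumption: frames beyond either end of the signal are clamped to the
    nearest existing frame.
    """
    last = len(powers) - 1
    pattern = []
    for offset in range(-width, width + 1):
        if offset == 0:
            continue  # the frame of interest itself contributes no difference
        neighbour = min(max(t + offset, 0), last)
        pattern.append(powers[neighbour] - powers[t])
    return pattern


def quantize(vector: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Vector quantization: index of the nearest codebook entry.

    Assumption: nearness is measured by squared Euclidean distance.
    """
    def distance(entry: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(vector, entry))

    return min(range(len(codebook)), key=lambda i: distance(codebook[i]))


def cepstrum_code(cepstrum, pcp_code, codebooks_by_pcp) -> int:
    """Step (2): quantize with the codebook selected by the frame's PCP code."""
    return quantize(cepstrum, codebooks_by_pcp[pcp_code])


def regression_slope(values: Sequence[float]) -> float:
    """Step (3), per parameter: gradient of the least squares line through
    values observed at equally spaced times 0, 1, ..., n-1 (n >= 2)."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den


# Hypothetical frame powers and a hypothetical three-entry PCP codebook.
powers = [1.0, 2.0, 4.0, 3.0]
pcp_codebook = [[0.0, 0.0], [-1.0, 1.0], [1.0, -1.0]]

pcp = power_change_pattern(powers, t=1)   # [-1.0, 2.0]
pcp_code = quantize(pcp, pcp_codebook)    # nearest entry is index 1
```

In this sketch the gradients returned by regression_slope for the individual acoustic parameters would be collected into one vector and passed to quantize, and step (4) would apply quantize to a window of successive PCP codes.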
To achieve a high level of discriminating ability in phoneme discrimination, it is necessary to effectively analyze parameters which serve as various keys of a voice. Various experiments have shown that, when a human being distinguishes a voice, the intensity variations of the voice and the time variance of its spectrum (dynamic information on the voice) are important keys, in addition to static information on the voice, namely, the intensity of the voice at a given moment and the tonal feature (spectrum) of the voice.
The above-described conventional phoneme discrimination method analyzes power variations, one of the key parameters in phoneme discrimination, in the form of a characteristic power-change pattern (PCP), and it also takes into consideration the static information on the spectrum by relying upon acoustic parameters (LPC Mel-cepstrum coefficients). However, it takes nothing into consideration in connection with variations in the voice spectrum. These variations are the most important key parameter for the discrimination of similar phonemes. Namely, the conventional phoneme discrimination method involves the problem that its phoneme discriminating ability is insufficient, because it relies upon indirect evaluation by a PCP code sequence or the like, or upon approximate evaluation by the gradient of the least squares approximation line of the acoustic parameters.
Thus, applying conventional methods to speaker-independent voice recognition systems is difficult because such systems require precise analysis of the spectrum structure, which these methods do not perform.
When a human being distinguishes a voice, the listener clusters the voice quality through a series of utterances in addition to making a judgment based on the static information about the voice. The series of utterances includes a particular spectrum structure defined by the voice quality inherent to the speaker, so that the spectrum structure differs when the same content is uttered by a different speaker. A speaker-independent voice recognition system is therefore required to analyze this spectrum structure precisely. However, this aspect has not been taken into consideration. Namely, there is only one codebook for analyzing the features of individual spectra, so the coding of all voices is conducted using this single codebook. This results in the frequent allotment of combinations of codes which would not occur in the utterances of a single speaker, which is one reason why recognition performance has failed to improve.