The present invention relates to a method of analyzing an input speech signal and a speech analysis apparatus therefor.
In a conventional speech-recognition apparatus, an utterance-practicing apparatus for hearing-impaired people, a communications system using speech analysis and synthesis, or a speech synthesizing apparatus, an input speech signal is analyzed and its features are extracted, so as to perform desired processing. The input speech signal is analyzed on the basis of its frequency spectrum. Human hearing is less sensitive to temporal changes in the waveform of a speech signal than to its spectrum, so signals having an identical spectrum are recognized as an identical phoneme.
A voiced sound portion of a speech signal has the structure of a cyclic signal generated by vibrations of the vocal cords. The frequency spectrum of the voiced sound therefore has a harmonic structure. An unvoiced sound portion of the speech signal, however, is not accompanied by vibrations of the vocal cords. The sound source of the unvoiced sound is noise generated by an air stream flowing through the vocal tract. As a result, the frequency spectrum of the unvoiced sound does not have a cyclic structure like that of the harmonic spectrum. There are two conventional speech analysis schemes in accordance with these frequency spectra. One scheme assumes a cyclic pulse source as the sound source of the input speech signal, and the other assumes a noise source. The former is known as speech analysis using cepstrum analysis, and the latter is known as speech analysis using an autoregressive (AR) model. According to these speech analysis schemes, microstructures are removed from the spectrum of the input speech signal, to obtain a so-called spectrum envelope.
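As an illustration of the AR scheme only, the following minimal sketch fits AR coefficients to an autocorrelation sequence with the Levinson-Durbin recursion and evaluates the resulting smooth power envelope. The function names, the model order, and the test autocorrelation (that of an ideal first-order AR process with coefficient 0.9) are assumptions chosen for illustration and do not come from the text above.

```python
import cmath
import math

def autocorrelation(frame, max_lag):
    """Biased autocorrelation r[0..max_lag] of one (already windowed) frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t + k] for t in range(n - k)) / n
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations.

    Returns (a, err) where a[0] == 1 and A(z) = sum(a[k] * z**-k),
    and err is the residual (prediction error) power.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err

def ar_envelope(a, err, n_points):
    """Power envelope err / |A(e^{jw})|^2 at n_points frequencies from 0 to pi."""
    env = []
    for m in range(n_points):
        w = math.pi * m / (n_points - 1)
        big_a = sum(a[k] * cmath.exp(-1j * w * k) for k in range(len(a)))
        env.append(err / abs(big_a) ** 2)
    return env

# Hypothetical check: ideal autocorrelation of a first-order AR process.
rho = 0.9
r = [rho ** k for k in range(3)]
coeffs, err = levinson_durbin(r, order=2)
envelope = ar_envelope(coeffs, err, n_points=5)
```

In practice `r` would come from `autocorrelation` applied to a windowed speech frame; the envelope contains no harmonic fine structure because the model has only `order` poles.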
In the analysis of the input speech signal according to the AR model or the cepstrum analysis scheme to obtain the spectrum envelope, both schemes assume a stationary stochastic process. If the phoneme changes as a function of time, such a conventional analysis scheme cannot be applied. In order to solve this problem, the signal is extracted in a short time region such that the system does not greatly change. The extracted signal is multiplied by a window function, such as a Hamming window or a Hanning window, so as to eliminate the influence of an end point, thereby obtaining a quasi-stationary signal as a function of time. The quasi-stationary signal is analyzed to obtain the spectrum envelope. This envelope is defined as the spectrum envelope at the extraction timing of the signal.
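The short-time extraction and windowing step described above can be sketched as follows. This is a minimal illustration under assumed values: the helper names, the 200-sample frame, and the constant test signal are hypothetical, and the Hamming window uses the common symmetric definition with coefficients 0.54 and 0.46.

```python
import math

def hamming(n):
    """Return an n-point symmetric Hamming window."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1))
            for i in range(n)]

def windowed_frame(signal, start, frame_len):
    """Cut frame_len samples beginning at `start` and taper both ends."""
    frame = signal[start:start + frame_len]
    if len(frame) != frame_len:
        raise ValueError("frame extends past the end of the signal")
    window = hamming(frame_len)
    return [s * w for s, w in zip(frame, window)]

# Hypothetical input: a constant signal, so the output equals the window itself.
signal = [1.0] * 400
frame = windowed_frame(signal, start=100, frame_len=200)
```

The samples near both ends of the frame are attenuated to 0.08 of their original value, which is what suppresses the influence of the end points before the spectrum is computed.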
In order to obtain the spectrum of the input speech signal according to the conventional speech analysis scheme, an average spectrum of a signal portion extracted for a given length of time (to be referred to as a frame length hereinafter) is obtained. For this reason, in order to adequately capture an abrupt change in spectrum, the frame length must be shortened. In particular, at a leading edge of a consonant, the spectrum changes abruptly within several milliseconds, and the frame length must be on the order of several milliseconds. With this arrangement, the frame length is approximately equal to the pitch period of vibrations of the vocal cords. The precision of spectrum extraction then largely depends on the timing and degree of the vocal cord pulses included within the frame length. As a result, the spectrum cannot be stably extracted.
The problem described above is assumed to arise because a dynamic spectrum, which changes as a function of time, is analyzed by a model that assumes a stationary stochastic process.
In conventional spectrum extraction, the time interval by which the frame position is shifted to extract the signal (to be referred to as a frame period hereinafter) must be shortened so as to follow rapid changes in the spectrum. However, if the frame period is halved, for example, the number of frames to be analyzed is doubled. In this manner, shortening the frame period greatly increases the amount of data to be processed. For example, A/D-converting a 1-second continuous speech signal at a 50-μs sampling interval yields 20,000 samples. However, if this signal is analyzed using a 10-ms frame length and a 2-ms frame period, the number of frames to be analyzed is:
1 s ÷ 0.002 s = 500
As a result, the amount of data to be analyzed is:
(10 ms ÷ 0.05 ms) × 500 = 100,000
so the amount of data is increased by a factor of five.
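The arithmetic of this example can be checked with a few lines of code. The variable names are illustrative, and the frame count follows the simplified count used in the text (signal duration divided by frame period, ignoring edge effects).

```python
sampling_interval_s = 50e-6   # 50-microsecond sampling interval
duration_s = 1.0              # 1-second speech signal
frame_length_s = 10e-3        # 10-ms frame length
frame_period_s = 2e-3         # 2-ms frame period

samples_total = round(duration_s / sampling_interval_s)
samples_per_frame = round(frame_length_s / sampling_interval_s)
n_frames = round(duration_s / frame_period_s)
samples_analyzed = samples_per_frame * n_frames
growth_factor = samples_analyzed / samples_total

print(samples_total, samples_per_frame, n_frames, samples_analyzed, growth_factor)
```

The growth factor equals the ratio of frame length to frame period, because each sample falls inside that many overlapping frames.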
As described above, in a conventional speech analysis scheme based on the stationary stochastic process, abrupt changes in spectrum at a dynamic portion, such as a leading edge of a consonant, cannot be stably analyzed with high precision. If the frame period is shortened, the amount of data which must be processed is greatly increased.
Another conventional method for effectively analyzing a speech signal is frequency analysis using a filter bank. According to this analysis method, an input speech signal is supplied to a plurality of bandpass filters having different center frequencies, and outputs from the filters are used to constitute a speech-power spectrum. The advantages of this method are a simple hardware arrangement and the capability of real-time processing.
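A software analogue of such a filter bank can be sketched as follows. This is only an illustration: the second-order (biquad) bandpass design, the four center frequencies, the Q value of 4, and the 1000-Hz test tone are assumptions, not parameters given in the text. The 20-kHz sampling rate corresponds to a 50-μs sampling interval.

```python
import math

def bandpass_coeffs(center_hz, q, fs):
    """Second-order bandpass with unity gain at the center frequency."""
    w0 = 2.0 * math.pi * center_hz / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)                   # feed-forward taps
    a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)   # feedback taps
    return b, a

def band_power(signal, b, a):
    """Filter the signal and return the mean power of the filter output."""
    x1 = x2 = y1 = y2 = 0.0
    total = 0.0
    for x0 in signal:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        x2, x1 = x1, x0
        y2, y1 = y1, y0
        total += y0 * y0
    return total / len(signal)

def filter_bank_spectrum(signal, centers_hz, q, fs):
    """One power value per bandpass channel."""
    return [band_power(signal, *bandpass_coeffs(fc, q, fs)) for fc in centers_hz]

# Hypothetical input: a pure 1000-Hz tone lasting 0.1 s.
fs = 20000.0
centers_hz = [500.0, 1000.0, 2000.0, 4000.0]
tone = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(2000)]
spectrum = filter_bank_spectrum(tone, centers_hz, q=4.0, fs=fs)
loudest_hz = centers_hz[spectrum.index(max(spectrum))]
```

Each channel processes the input sample by sample, which reflects why the filter-bank approach lends itself to real-time operation.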
Most of the conventional speech analysis methods determine spectrum envelopes of input speech signals. One method of further analyzing the speech signal from the determined spectrum envelope is known as formant analysis, in which a formant frequency and width are extracted from a local peak of the envelope. This analysis method is based on the facts that each vowel has a specific formant frequency and width, and that each consonant is characterized by the change in formant frequency in the transition from the consonant to a vowel. For example, the five Japanese vowels ("a", "i", "u", "e", and "o") can be defined by two formant frequencies F1 and F2, F1 being the lowest formant frequency and F2 being the next lowest. Frequencies F1 and F2 are substantially equal for voices uttered by persons of the same sex and about the same age. Therefore, the vowels can be identified by detecting formant frequencies F1 and F2.
Another conventional method is also known, for extracting local peaks of the spectrum envelope and for analyzing these peaks, based on their frequencies and temporal changes. This method is based on the assumption that phonemic features appear in the frequencies of local peaks of the vowel portion, or in the temporal changes in local peaks of the consonant portion.
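Local-peak extraction from a sampled spectrum envelope can be sketched as below. The function name, the frequency grid, and the envelope values are made up purely for illustration and are not measured data.

```python
def local_peaks(envelope, freqs_hz):
    """Return (frequency, level) pairs where the envelope has a local maximum."""
    peaks = []
    for i in range(1, len(envelope) - 1):
        # Strictly rising on the left, not rising on the right.
        if envelope[i - 1] < envelope[i] >= envelope[i + 1]:
            peaks.append((freqs_hz[i], envelope[i]))
    return peaks

# Hypothetical envelope sampled every 500 Hz.
freqs_hz = [0, 500, 1000, 1500, 2000, 2500, 3000]
envelope = [1.0, 4.0, 2.0, 1.5, 3.0, 1.0, 0.5]
peaks = local_peaks(envelope, freqs_hz)
print(peaks)  # [(500, 4.0), (2000, 3.0)]
```

Applying the same function to the envelope of each successive frame gives the temporal changes in the peaks that this method relies on for consonant portions.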
Still another conventional method has been proposed, in which the spectrum envelope curve itself is defined as a feature parameter of the speech signal and is used in the subsequent identification, classification, or display.
In the analysis of a speech signal, it is important to extract the spectrum envelope. In addition to the spectrum envelope itself, the formant frequency and width derived from the envelope, and the frequency and transition of the local peaks, can be used as feature parameters.
When a person utters a sound, its phoneme is assumed to be defined by resonance/antiresonance of the vocal tract. For example, a resonant frequency appears as a formant on the spectrum envelope. Therefore, if different persons have an identical vocal tract structure, substantially identical spectra are obtained for an identical phoneme.
However, in general, if speakers (for example, a male and a female, or a child and an adult) have greatly different vocal tract lengths, their resonant or antiresonant frequencies differ from each other, and the resultant spectrum envelopes differ accordingly. In this case, the local peaks and formant frequencies are shifted from each other for an identical phoneme. This fact is inconvenient for an analysis aiming at extracting identical results for identical phonemes, regardless of the speakers, as in the cases of speech recognition and visual display of speech for hearing-impaired persons.
In order to solve the above problems, two conventional methods are known. One is a method for preparing a large number of standard patterns, and the other is a method for determining a formant frequency ratio.
In the former method, a large number of different spectrum envelopes of males and females, adults and children, are registered as the standard patterns. Unknown input patterns are classified on the basis of similarities between these unknown patterns and the standard patterns. Therefore, a large number of different indefinite input speech signals can be recognized. According to this method, in order to recognize similarities between the standard patterns and any input speech patterns, a very large number of standard patterns must be prepared. In addition, it takes a long period of time to compare input patterns with the standard patterns. Furthermore, this method does not extract the results normalized by the vocal tract lengths, and therefore cannot be used for displaying phonemic features not dependent on the vocal tract lengths.
The latter method, i.e., the method of determining the formant frequency ratio, is known as a method of extracting phonemic features that do not depend on the vocal tract lengths. More specifically, among the local peaks in the spectrum envelope, first, second, and third formant frequencies F1, F2, and F3, which are assumed to be relatively stable, are extracted for vowels, and ratios F1/F3 and F2/F3 are calculated to determine the feature parameter values. If the vocal tract length is multiplied by a factor a, the formant frequencies are multiplied by 1/a, i.e., they become F1/a, F2/a, and F3/a. However, the ratios of the formant frequencies remain the same.
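The invariance of the ratios under a change in vocal tract length can be verified numerically. The formant values and the scaling factor below are hypothetical numbers chosen only to illustrate the relationship; they are not measurements.

```python
import math

def formant_ratios(f1, f2, f3):
    """Feature values F1/F3 and F2/F3 used by the ratio method."""
    return f1 / f3, f2 / f3

def scale_vocal_tract(formants, a):
    """Lengthening the vocal tract by a factor a divides every formant by a."""
    return tuple(f / a for f in formants)

# Hypothetical formants in Hz and a hypothetical 20% longer vocal tract.
formants = (700.0, 1200.0, 2600.0)
scaled = scale_vocal_tract(formants, a=1.2)

ratios_original = formant_ratios(*formants)
ratios_scaled = formant_ratios(*scaled)
ratios_match = all(math.isclose(x, y)
                   for x, y in zip(ratios_original, ratios_scaled))
```

The factor 1/a cancels in each quotient, so the comparison holds for any positive `a`, provided that all three formants are extracted correctly.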
The above method is effective if the first, second, and third formants of the vowels can be stably extracted. However, if these formants cannot be stably extracted, the analytic reliability is greatly degraded. Furthermore, this method is not applicable to consonants. That is, the formant as a resonant characteristic of the vocal tract cannot be defined for consonants, and the local peaks corresponding to the first, second, and third formants cannot always be observed on the spectrum envelope. As a result, frequencies F1, F2, and F3 cannot be extracted or used to calculate their ratios. At a leading or trailing edge of a vowel, as well as for a consonant, the formants are not necessarily stable, and a wrong formant frequency is often extracted. In this case, the formant frequency ratio changes discontinuously and takes a completely wrong value. Therefore, the above method is applicable only to stable portions of vowels in the speech signal. Another method must be used to analyze the leading and trailing edges of the vowels and the consonants. Since different extraction parameters must be used for the stable portions of the vowels and for the other portions including the consonants, it is impossible to describe continuous changes from a consonant to a vowel. In short, the method of calculating the formant frequency ratio is applicable only to stationary vowel portions.
No conventional methods have been proposed to extract feature parameters inherent to phonemes from a large number of indefinite spectrum envelopes derived from different vocal tract lengths.