The present invention relates to a speech recognition apparatus and, more particularly, to a speech recognition technique using the intensity information of fundamental frequency components.
In speech recognition, the most popular scheme analyzes the waveform of the input speech within a short analysis window (frame) that moves at predetermined time intervals, converts each frame into a feature vector, and handles the entire input speech as a time series of feature vectors on which matching is performed. Various analysis schemes for this feature vector have been proposed. Although not all of them can be described here, they include cepstrum analysis, spectrum analysis, power analysis, and the like.
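The frame-based analysis described above can be sketched as follows. This is only an illustrative sketch, not the analysis used by any particular apparatus: the function names, the 400-sample window, the 160-sample shift, the Hamming window, and the choice of 12 cepstral coefficients plus log power are all assumptions made for the example.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames, one row per frame."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def frame_features(x, frame_len=400, hop=160, n_cep=12):
    """Return one feature vector per frame: n_cep cepstral terms + log power.

    All parameter defaults are assumed values for illustration only.
    """
    frames = frame_signal(x, frame_len, hop) * np.hamming(frame_len)
    # Power analysis: log energy of each windowed frame.
    log_power = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    # Spectrum analysis: magnitude spectrum of each frame.
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    # Cepstrum analysis: inverse transform of the log magnitude spectrum,
    # keeping the low-order coefficients (the 0th term is skipped).
    cepstrum = np.fft.irfft(np.log(spectrum + 1e-10), axis=1)[:, 1:n_cep + 1]
    return np.hstack([cepstrum, log_power[:, None]])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.standard_normal(16000)  # stand-in for one second of input
    print(frame_features(speech).shape)
```

Each row of the returned array corresponds to one position of the analysis window, so the whole utterance becomes a time series of feature vectors that a matcher can consume.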
A scheme that uses fundamental frequency (to be referred to as "pitch" hereinafter) information as part of a feature vector has conventionally been used to improve speech recognition performance. This scheme has an arrangement like the one shown in FIG. 4. More specifically, a feature extraction section 41 converts input speech into feature vectors (e.g., a cepstrum or the like) of the kind used for general speech recognition. A pitch extraction section 42 converts the input speech into pitch frequencies or the time derivatives thereof, which are output as feature vectors for recognition to a recognition section 43, together with the feature vectors output from the feature extraction section 41. The recognition section 43 performs matching between the feature vectors output from the feature extraction section 41 and the pitch extraction section 42 and standard patterns analyzed in advance with the same feature vector configuration, and outputs the pattern with the highest likelihood as the recognition result.
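As a rough illustration of the arrangement just described, the sketch below estimates a pitch frequency per frame and appends it, together with its time derivative, to an existing per-frame feature matrix. The autocorrelation method, the 60 to 400 Hz search range, and the names `pitch_autocorr` and `append_pitch` are assumptions introduced for this example; the text does not state how the pitch extraction section 42 computes pitch.

```python
import numpy as np

def pitch_autocorr(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate pitch (Hz) of one frame from its autocorrelation peak.

    The search range f_min..f_max is an assumed value for illustration.
    """
    frame = frame - np.mean(frame)
    # Keep non-negative lags only: ac[k] is the autocorrelation at lag k.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return sample_rate / lag

def append_pitch(features, pitch):
    """Append pitch and its frame-to-frame derivative to each feature vector."""
    pitch = np.asarray(pitch, dtype=float)
    delta = np.gradient(pitch)  # finite-difference time derivative
    return np.hstack([features, pitch[:, None], delta[:, None]])

if __name__ == "__main__":
    sr = 16000
    t = np.arange(640) / sr
    voiced = np.sin(2 * np.pi * 200.0 * t)  # synthetic 200 Hz "vowel"
    print(pitch_autocorr(voiced, sr))
```

Note that this estimator always returns some frequency, even for an aperiodic frame; a practical system would also need a voicing decision, which is omitted here.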
A speech recognition apparatus of this type is designed to avoid vowel/consonant (voiced sound/unvoiced sound) segmentation errors in matching by including pitch information as part of a feature vector, thereby realizing high performance. Vowels of speech are generated when the vocal tract is driven by a pulse-like sound source generated by opening/closing of the glottis. The vowels therefore have clear periodic structures and are observed as pitches.
In contrast, consonants (especially unvoiced consonants) are generated by an aperiodic sound source other than the glottis. Consonants therefore do not have a clear periodic structure, and no clear pitch is observed. For these reasons, errors in matching between vowel and consonant portions can be reduced by using pitch information.
Another purpose of a speech recognition apparatus of this type is to identify the pitch pattern of a tonal language such as Chinese. However, since this purpose differs from the objects of the present invention, a detailed description thereof will be omitted.
In a conventional speech recognition apparatus using pitch information, frequency information about pitches is used either directly or in the form of its time derivatives. When frequency information about pitches is used directly, the information varies greatly among individuals and between the sexes. In addition, the frequency information obtained from a given individual is easily influenced by his/her physical condition and psychological factors, and hence is unstable. That is, such information disturbs the speech recognition apparatus and cannot be an effective parameter. In particular, because this information varies greatly among individuals, it is a feature vector unsuited for use as a parameter for unspecific speaker speech recognition.
In a speech recognition apparatus using the time derivatives of pitch frequency information, such information varies greatly among individuals and regions. For example, even utterances with the same contents vary greatly in pitch among dialects and the like. This tendency is directly reflected in the time derivatives of pitch frequency information. This information therefore becomes a parameter that varies greatly among individuals and regions. That is, this parameter is not useful for unspecific speaker speech recognition.