1. Field of the Invention
This invention relates to a method for extracting the pitch of speech during speech processing, such as speech encoding and speech synthesis. More specifically, it relates to a pitch extracting method which extracts the pitch of sequential (continuous) speech efficiently.
2. Description of the Related Art
As demand for communication terminals rapidly increases with the development of technology, typical communication lines cannot provide the capacity needed to support such terminals. To solve this problem, methods have been provided for encoding speech at a bit rate below 8 kilobits/second (kbit/s). When speech is processed according to those encoding methods, however, tone quality deteriorates. Many investigators are conducting wide-ranging studies aimed at improving tone quality when speech is processed at a low bit rate.
In order to improve tone quality, psychological properties such as musical interval, sound volume, and timbre must be improved. At the same time, the physical properties corresponding to those psychological properties, such as pitch, amplitude, and waveform structure, must be reproduced close to the corresponding properties of the original sound. The pitch is called a "fundamental frequency" or "pitch frequency" in the frequency domain, and is called a "pitch interval" or a "pitch" in the time domain. Pitch is an indispensable parameter in judging a speaker's gender and in distinguishing between voiced and voiceless sounds in uttered speech, especially when encoding speech at a low bit rate.
At present, three major methods are available for extracting the pitch, namely, a method of extracting in the time domain, a method of extracting in the frequency domain, and a method of extracting in both the time domain and the frequency domain. The autocorrelation method is representative of the time-domain methods, the cepstrum method is representative of the frequency-domain methods, and the average magnitude difference function (AMDF) method and a method in which linear prediction coding (LPC) and AMDF are combined are representative of the methods that extract in both the time domain and the frequency domain.
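For orientation only, the autocorrelation and AMDF methods named above can be sketched as follows. This is a minimal illustration written for this description, not the method of the invention and not any particular prior-art implementation. The function names, the 60 to 400 Hz search range, and the 8 kHz sampling rate in the usage example are assumptions chosen for illustration. Both functions search a range of candidate lags within one frame: the autocorrelation method takes the lag that maximizes the sum of products of the frame with its delayed copy, and the AMDF method takes the lag that minimizes the mean absolute difference.

```python
import math


def _lag_range(frame_len, sample_rate, f_min, f_max):
    # Convert the assumed pitch search range (Hz) into a lag range (samples).
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(int(sample_rate / f_min), frame_len - 1)
    return lag_min, lag_max


def autocorrelation_pitch(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Return (pitch_hz, lag) for the lag with the largest autocorrelation."""
    lag_min, lag_max = _lag_range(len(frame), sample_rate, f_min, f_max)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Sum of products between the frame and its copy delayed by `lag`.
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag, best_lag


def amdf_pitch(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Return (pitch_hz, lag) for the lag with the smallest AMDF value."""
    lag_min, lag_max = _lag_range(len(frame), sample_rate, f_min, f_max)
    best_lag, best_score = lag_min, float("inf")
    for lag in range(lag_min, lag_max + 1):
        count = len(frame) - lag
        # Mean absolute difference between the frame and its delayed copy.
        score = sum(abs(frame[n] - frame[n + lag]) for n in range(count)) / count
        if score < best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag, best_lag


if __name__ == "__main__":
    # Hypothetical test frame: 30 ms of a 100 Hz sine sampled at 8 kHz.
    rate = 8000
    frame = [math.sin(2 * math.pi * 100 * n / rate) for n in range(240)]
    print(autocorrelation_pitch(frame, rate))
    print(amdf_pitch(frame, rate))
```

Both functions return a single pitch value for the whole frame, so they implicitly assume that the pitch interval is constant within the frame.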
In the above conventional methods, a pitch is extracted from a frame of speech data, where a frame corresponds to tens of milliseconds of the speech data, and a speech waveform is then reproduced by repeatedly applying a voiced sound at every interval of that reconstructed pitch. In real sequential speech, however, the properties of the vocal cords or of the sound change when a phoneme varies, and the pitch interval is subtly altered by interference even within a frame of tens of milliseconds of speech data. Where neighboring phonemes influence each other, so that speech waveforms having different frequencies exist together in one frame of sequential speech, an error occurs in extracting the pitch. For example, an error in extracting the pitch occurs at the beginning or end of speech, at a transition of the original sound, in a frame in which silence and a voiced sound exist together, or in a frame in which a voiceless consonant and a voiced sound exist together. As described above, the conventional methods are vulnerable to errors when processing sequential speech.