1. Field of the Invention
This invention relates to speech processing, such as speech synthesis, and more particularly to a pitch conversion process.
2. Description of the Related Art
Concatenative synthesis is a known method of speech synthesis. In this method, speech is synthesized by concatenating prepared sound waveforms. However, natural-sounding speech cannot be obtained simply by concatenating the prepared waveforms, because the intonation cannot be controlled.
To solve this problem, the PSOLA (Pitch Synchronous Overlap Add) method has been proposed. In this method, speech with a different pitch length is obtained by filtering two pitch-unit speech waveforms through a Hanning window and overlapping them slightly (E. Moulines et al., “Pitch-Synchronous waveform processing techniques for text-to-speech synthesis using diphones,” Speech Communication, 1990.9).
The PSOLA method is described below with reference to FIG. 22 and FIG. 23. FIG. 22 shows part of a speech waveform. The waveform repeats almost periodically, and one repeating unit is called a pitch. The pitch of the sound varies with the length of this unit (the pitch length).
In the PSOLA method, a waveform is first clipped out using a Hanning window centered on its peak M, as shown in FIG. 23. Next, the clipped waveforms are overlapped until their pitch length agrees with the target pitch length. The width of the Hanning window is set so that adjacent clipped waveforms overlap by one half. In this way, the pitch can be converted while minimizing the generation of undesirable frequency components. Therefore, by converting the pitch (that is, by modifying the fundamental frequency) with the PSOLA method, the intonation can be controlled.
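The clipping and overlapping steps described above can be illustrated with a minimal sketch. It is not taken from the cited reference: the function names `hann` and `psola`, the one-to-one mapping of input peaks to output peaks, and the fixed target pitch length are all assumptions made for illustration. The window is given a width of twice the target pitch length, so that adjacent clipped waveforms overlap by one half as stated above.

```python
import math

def hann(n):
    """Symmetric Hanning window of n samples (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def psola(signal, peaks, target_period):
    """Re-space windowed pitch segments so that their peaks lie
    target_period samples apart (hypothetical, simplified helper)."""
    half = target_period                 # half-width: neighbours overlap by one half
    window = hann(2 * half + 1)          # window centred on each peak M
    out = [0.0] * (2 * half + target_period * (len(peaks) - 1) + 1)
    for k, peak in enumerate(peaks):
        centre = half + k * target_period      # new position of the k-th peak
        for j in range(-half, half + 1):
            src = peak + j
            if 0 <= src < len(signal):         # ignore samples outside the input
                out[centre + j] += signal[src] * window[j + half]
    return out

# Peaks originally 10 samples apart are moved to 8 samples apart.
impulses = [0.0] * 50
for p in (10, 20, 30):
    impulses[p] = 1.0
converted = psola(impulses, [10, 20, 30], 8)
new_peaks = [i for i, v in enumerate(converted) if abs(v) > 1e-9]
```

With a half-overlapped Hanning window, the window weights of two adjacent segments sum to one between the re-spaced peaks, which is why this choice of width avoids introducing a gain ripple of its own. A practical implementation would also have to duplicate or drop segments to preserve duration, which this sketch omits.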
However, the PSOLA method still has the following problems.
First, as shown in FIG. 24 to FIG. 27, an unnatural reduction in amplitude may occur in the segment where the waveforms overlap. FIG. 24 shows an original waveform (represented by a damped sine wave for ease of understanding). FIG. 25 shows the waveform filtered through the left-side portion of a Hanning window. FIG. 26 shows the waveform filtered through the right-side portion of a Hanning window. FIG. 27 shows the composite waveform. As indicated in FIG. 27, an unnatural reduction in amplitude appears in the middle part of a pitch. This amplitude reduction distorts the microstructure of the speech waveform represented by the formants.
Second, echoes are produced by the contiguous pitch peaks, as shown in FIG. 28. This is pointed out in H. Kawai et al., “A study of a text-to-speech system based on waveform splicing,” Tech. Rep. of the Institute of Electronics, Information and Communication Engineers, SP93–9, pp. 49–54, Japan (1993,5) (in Japanese, with an English abstract). In that report, the authors propose the use of a trapezoidal window. However, the trapezoidal window may still produce undesirable frequency components during overlapping, which make the synthesized sound unnatural.
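The first problem, the amplitude reduction in the middle of a pitch, can be reproduced numerically under stated assumptions. The sketch below models one pitch as a damped sine wave, as in FIG. 24, and overlap-adds the falling half of the window on one peak with the rising half of the window on the next peak. All constants (`T_ORIG`, `T_NEW`, `FORMANT`, `DECAY`) are hypothetical values chosen so that the two overlapped components are in opposite phase; the figures themselves do not state such values, and the phase mismatch is only one plausible cause of the reduction.

```python
import math

T_ORIG = 100   # original pitch length in samples (hypothetical)
T_NEW = 90     # target pitch length in samples (hypothetical)
FORMANT = 20   # period of the damped sine in samples (hypothetical)
DECAY = 0.02   # damping per sample (hypothetical)

def pitch_unit(t):
    """Damped sine standing in for one pitch, t samples after its start."""
    return math.exp(-DECAY * t) * math.sin(2.0 * math.pi * t / FORMANT)

def composite(t):
    """Overlap-added sample t (0 <= t < T_NEW) between two re-spaced peaks."""
    falling = 0.5 + 0.5 * math.cos(math.pi * t / T_NEW)  # right half, earlier window
    rising = 1.0 - falling                               # left half, later window
    # The later segment was clipped around a peak T_ORIG after the earlier one
    # but is placed only T_NEW after it, so at output time t it contributes
    # the sample that sat (T_ORIG - T_NEW) later in the original pitch.
    earlier = falling * pitch_unit(t)
    later = rising * pitch_unit(t + (T_ORIG - T_NEW))
    return earlier + later

# Compare peak amplitudes in the middle part of the pitch.
middle = range(35, 56)
peak_original = max(abs(pitch_unit(t)) for t in middle)
peak_composite = max(abs(composite(t)) for t in middle)
```

With these values the shift of 10 samples equals half the period of the damped sine, so the two windowed components partly cancel and `peak_composite` falls well below `peak_original` in the middle of the pitch. Other parameter choices would give a smaller or larger reduction.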
As shown in FIG. 1, a pitch-unit speech waveform can be considered to be divided into two segments: 1) the segment β, which starts from the minus peak and in which the waveform that depends on the shape of the vocal tract appears, and 2) the segment γ, in which that waveform attenuates and converges toward the next minus peak. In addition, α in FIG. 1 is the point at which a minus peak appears along with the glottal closure. In the PSOLA method described above, the center of the Hanning window is set near the peak M within a pitch, with the aim of maintaining the contour of the waveform around the peak M. However, placing too much emphasis on maintaining the waveform contour around the peak has brought about the problems described above.