This invention relates to speech and, more particularly, to a technique that enables the modification of a speech signal so as to enhance the naturalness of speech sounds generated from the signal.
Concatenative text-to-speech synthesizers, for example, generate speech by piecing together small units of speech from a recorded-speech database and processing the pieced units to smooth the concatenation boundaries and to match the desired prosodic targets (e.g., speaking speed and pitch contour) accurately. These speech units may be phonemes, half-phones, diphones, etc. One of the more important processing steps taken by prior art systems to enhance the naturalness of the speech is modification of the pitch (i.e., the fundamental frequency, F0) of the concatenated units, where pitch modification is defined as the altering of F0. Typically, the prior art systems do not modify the magnitude spectrum of the signal. However, it has been observed that large modification factors for F0 lead to a perceptible decrease in speech quality, and it has been shown that at least one of the reasons for this degradation is the assumption by these prior art systems that the magnitude spectrum can remain unaltered. In particular, T. Hirahara has shown in “On the Role of Fundamental Frequency in Vowel Perception,” The Second Joint Meeting of ASA and ASJ, November 1988, that an increase in F0 causes a vowel boundary shift or a vowel height change. Also, in “Vowel F1 as a Function of Speaker Fundamental Frequency,” 110th Meeting of JASA, vol. 78, Fall 1985, A. K. Syrdal and S. A. Steele showed that speakers generally increase the first formant as they increase F0. These results suggest that the magnitude spectrum must be altered during pitch modification. Recognizing this need, K. Tanaka and M. Abe suggested, in “A New fundamental frequency modification algorithm with transformation of spectrum envelope according to F0,” ICASSP vol. 2, pp. 951-954, 1997, that the spectrum should be modified by a stretched difference vector of a codebook mapping. A shortcoming of this method is that only three ranges of F0 (high, middle, and low) are encoded.
A smoother evolution of the magnitude spectrum (of an actual speech signal), or the spectrum envelope (of a synthesized speech signal), as a function of changing F0 is desirable.