One conventional speech synthesis method generates desired synthetic speech by segmenting each speech segment, recorded and stored in advance, into a plurality of micro-segments and re-arranging the micro-segments thus obtained. During re-arrangement, the micro-segments undergo processes such as interval change, repetition, and skipping (thinning out), thereby yielding synthetic speech having a desired duration and fundamental frequency.
FIG. 17 illustrates the method of segmenting a speech waveform into micro-segments. The speech waveform shown in FIG. 17 is segmented into micro-segments by a cutting window function (to be referred to as a window function hereinafter). For a voiced sound part (the latter half of the speech waveform), a window function synchronized with the pitch interval of the source speech is used. For an unvoiced sound part, on the other hand, a window function with an appropriate interval is used.
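The pitch-synchronous cutting described above can be sketched roughly as follows. This is only an illustrative sketch: the Hann window shape, the helper names `hann` and `cut_micro_segments`, and the assumption that pitch marks are already available as sample indices are all hypothetical choices, since the text does not specify the window function or how the marks are obtained.

```python
import math

def hann(n):
    """Symmetric Hann window of length n (n >= 2); an assumed window shape."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def cut_micro_segments(wave, marks):
    """Cut one windowed micro-segment around each interior mark.

    Each window spans from the previous mark to the next mark, so its
    length follows the local pitch interval in a voiced part.
    Returns a list of (offset_of_mark_in_segment, samples) pairs.
    """
    segments = []
    for prev, cur, nxt in zip(marks, marks[1:], marks[2:]):
        w = hann(nxt - prev + 1)
        seg = [wave[prev + i] * w[i] for i in range(nxt - prev + 1)]
        segments.append((cur - prev, seg))
    return segments

# Hypothetical voiced waveform with a pitch interval of 50 samples.
wave = [math.sin(2.0 * math.pi * i / 50.0) for i in range(201)]
marks = [0, 50, 100, 150, 200]
segments = cut_micro_segments(wave, marks)
```

For an unvoiced part, the same function could be called with marks placed at a fixed, suitably chosen spacing instead of at pitch marks.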
By skipping one or a plurality of micro-segments and using the remaining micro-segments, as shown in FIG. 17, the duration of speech can be shortened. Conversely, by using micro-segments repeatedly, the duration of speech can be extended. Furthermore, by narrowing the intervals between neighboring micro-segments in a voiced sound part, as shown in FIG. 17, the fundamental frequency of synthetic speech can be increased. Conversely, by broadening the intervals between neighboring micro-segments in a voiced sound part, the fundamental frequency of synthetic speech can be decreased.
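As a rough illustration of how skipping, repetition, and interval change interact, the sketch below builds a placement plan for a voiced part. The function name `rearrange`, its parameters, and the assumption of a constant source pitch interval are hypothetical simplifications and are not taken from the method described here.

```python
def rearrange(n_segments, src_interval, duration_scale, pitch_scale):
    """Plan the re-arrangement of micro-segments.

    n_segments     -- number of source micro-segments
    src_interval   -- source spacing between segments, in samples (assumed constant)
    duration_scale -- target duration / source duration
    pitch_scale    -- target fundamental frequency / source fundamental frequency

    Returns a list of (source_index, output_position) pairs.
    """
    # Narrower spacing gives a higher fundamental frequency, and vice versa.
    out_interval = src_interval / pitch_scale
    out_duration = n_segments * src_interval * duration_scale
    n_out = max(1, round(out_duration / out_interval))
    plan = []
    for k in range(n_out):
        # Map output slot k back to a source micro-segment. Indices repeat
        # when more slots than segments are needed and are skipped otherwise.
        src = min(n_segments - 1, int(k * n_segments / n_out))
        plan.append((src, round(k * out_interval)))
    return plan

shorter = rearrange(4, 100, 0.5, 1.0)   # half duration: every other segment is skipped
longer = rearrange(4, 100, 2.0, 1.0)    # double duration: each segment is used twice
higher = rearrange(4, 100, 1.0, 2.0)    # double pitch: spacing halves, segments repeat
```

Note that raising the fundamental frequency while keeping the duration unchanged needs more micro-segments per unit time, so in this sketch the interval change is accompanied by repetition.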
Superposing the re-arranged micro-segments that have undergone the aforementioned repetition, skipping, and interval change processes yields the desired synthetic speech. Speech segments are recorded and stored in units such as phonemes, CV·VC, or VCV. CV·VC is a unit in which the segment boundary is set within phonemes, and VCV is a unit in which the segment boundary is set within vowels.
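The superposition step can be pictured as a simple overlap-add, sketched below. The function name `overlap_add` and the representation of each placed micro-segment as a `(start, samples)` pair are assumptions made for illustration only.

```python
def overlap_add(placed, length):
    """Sum micro-segments into an output buffer of the given length.

    placed -- list of (start, samples) pairs, where start is the output
              sample index at which the micro-segment begins
    """
    out = [0.0] * length
    for start, seg in placed:
        for i, value in enumerate(seg):
            j = start + i
            if 0 <= j < length:   # samples falling outside the buffer are dropped
                out[j] += value
    return out

# Two hypothetical three-sample micro-segments overlapping by one sample.
mixed = overlap_add([(0, [1.0, 1.0, 1.0]), (2, [1.0, 1.0, 1.0])], 5)
```

Where micro-segments overlap, their samples add, which is why tapered window functions are typically used when cutting them.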
However, in the above conventional method, applying a window function to obtain micro-segments from a speech waveform causes so-called “blur” in the speech spectrum. That is, phenomena such as broadened formants and less sharp peaks and valleys in the spectrum envelope occur, thus deteriorating the sound quality of the synthetic speech.
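The blur can be understood from a general property of windowing: multiplying a waveform by a window in the time domain smears its spectrum, and shorter windows smear it more. The sketch below is a generic illustration of that property on a single sinusoid, not an analysis of actual speech; the Hann window, the direct DFT, the segment lengths, and the names `magnitude_spectrum` and `peak_width` are all assumptions chosen for the example.

```python
import cmath
import math

def hann(n):
    """Symmetric Hann window of length n (n >= 2); an assumed window shape."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def magnitude_spectrum(x, nfft):
    """Magnitudes of the first nfft // 2 DFT bins of x, zero-padded to nfft."""
    return [
        abs(sum(v * cmath.exp(-2j * math.pi * k * i / nfft) for i, v in enumerate(x)))
        for k in range(nfft // 2)
    ]

def peak_width(mag):
    """Number of bins at or above half of the peak magnitude."""
    top = max(mag)
    return sum(1 for m in mag if m >= 0.5 * top)

nfft = 512
tone = [math.sin(2.0 * math.pi * 0.1 * i) for i in range(256)]

# The same tone cut with a long window and with a short, micro-segment-sized one.
long_seg = [t * w for t, w in zip(tone[:256], hann(256))]
short_seg = [t * w for t, w in zip(tone[:32], hann(32))]

narrow = peak_width(magnitude_spectrum(long_seg, nfft))
wide = peak_width(magnitude_spectrum(short_seg, nfft))
```

In this sketch `wide` comes out larger than `narrow`: the spectral peak of the short, windowed segment is spread over more bins, which is analogous to the broadened formants described above.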