1. Field of the Invention
The present invention relates to a speech synthesis apparatus that produces synthetic speech by driving a vocal tract filter according to a speech source signal, and more particularly to a speech synthesis apparatus that produces synthetic speech from pieces of information including a phoneme symbol string, pitch, and phoneme duration for text-to-speech synthesis.
2. Description of the Related Art
The act of artificially producing a speech signal from a given sentence is known as text-to-speech synthesis. A text-to-speech synthesis system usually comprises a speech processor, a phoneme processor, and a speech signal generator. The input text is subjected to morphological analysis and syntax analysis at the speech processor. Next, the phoneme processor subjects the analysis results to accent processing and intonation processing to produce information including phoneme symbol strings, pitch patterns, and phoneme durations. Finally, the speech signal generator, or speech synthesis apparatus, selects feature parameters of small basic units (synthesis units), such as syllables, phonemes, and one-pitch intervals, according to such information as the phoneme symbol strings, pitch patterns, and phoneme durations, and connects them while controlling their pitch and duration, thereby producing synthetic speech.
One known speech synthesis apparatus that can synthesize any phoneme symbol string by controlling the pitch and phoneme duration uses a residual waveform as the voiced speech source in a vocoder system. The vocoder system, as is well known, generates synthetic sound by modeling a speech signal in a manner that separates the speech signal into speech source information and vocal tract information. Normally, a voiced speech source is modeled as an impulse train and an unvoiced speech source is modeled as noise.
A typical conventional speech synthesis apparatus in the vocoder system comprises a frame information generator, a voiced speech source generator, an unvoiced speech source generator, a filter coefficient storage section, and a vocal tract filter. According to the phoneme symbol string, pitch pattern, and phoneme duration, the frame information generator outputs frame average pitch, frame average power, voiced/unvoiced speech source information, and filter coefficient selecting information for each frame to be synthesized. Using the frame average pitch and frame average power, the voiced speech source generator generates, in each voiced interval identified by the voiced/unvoiced speech source information, a voiced speech source expressed by impulse trains spaced at regular frame average pitch intervals. Using the frame average power, the unvoiced speech source generator generates, in each unvoiced interval identified by the voiced/unvoiced speech source information, an unvoiced speech source expressed by white noise. The filter coefficient storage section outputs filter coefficients according to the filter coefficient selecting information. The vocal tract filter, set to those filter coefficients, is driven by the voiced speech source or the unvoiced speech source and outputs synthetic speech.
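The conventional structure described above can be illustrated with a minimal Python sketch. It is not taken from any particular apparatus. The function names, the dictionary layout of the frame information, the all-pole filter form y[n] = x[n] - sum_k a[k] y[n-k], and the restart of the impulse train at the start of each frame are all assumptions made for illustration.

```python
import math
import random


def voiced_source(n_samples, pitch_period, power):
    """Impulse train with one impulse every pitch_period samples.

    The amplitude is scaled so that the mean power over the frame
    equals `power` (an assumed normalization).
    """
    positions = range(0, n_samples, pitch_period)
    amp = math.sqrt(power * n_samples / len(positions))
    src = [0.0] * n_samples
    for i in positions:
        src[i] = amp
    return src


def unvoiced_source(n_samples, power, rng):
    """White Gaussian noise whose variance equals `power`."""
    sigma = math.sqrt(power)
    return [rng.gauss(0.0, sigma) for _ in range(n_samples)]


def vocal_tract_filter(source, coeffs, state):
    """All-pole filter: y[n] = x[n] - sum_k a[k] * y[n-k].

    `state` holds previous outputs, most recent first, and is updated
    in place so that the filter memory carries across frames.
    """
    out = []
    for x in source:
        y = x - sum(a * s for a, s in zip(coeffs, state))
        state.insert(0, y)
        state.pop()
        out.append(y)
    return out


def synthesize(frames, coeff_table, frame_len, seed=0):
    """Drive the vocal tract filter frame by frame.

    Each frame is a dict with the hypothetical keys "voiced", "pitch",
    "power", and "coeff_id"; coeff_table stands in for the filter
    coefficient storage section.
    """
    rng = random.Random(seed)
    order = len(next(iter(coeff_table.values())))
    state = [0.0] * order
    speech = []
    for f in frames:
        if f["voiced"]:
            src = voiced_source(frame_len, f["pitch"], f["power"])
        else:
            src = unvoiced_source(frame_len, f["power"], rng)
        speech += vocal_tract_filter(src, coeff_table[f["coeff_id"]], state)
    return speech
```

In this sketch the only information the voiced speech source carries within a frame is the pitch period and the power, which is why the fine structure of each pitch interval is lost.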
Because impulse trains are used as the speech source, such a vocoder system loses the fine features of each pitch interval of voiced speech, which degrades the sound quality of the synthetic speech. To solve this problem, an improved method capable of preserving the fine structure of speech has been developed. The method uses, as the voiced speech source signal, a residual signal waveform representing the prediction residual obtained by analyzing speech with an inverse filter. Specifically, a voiced speech source signal is generated by repeating a one-pitch-long residual signal waveform, instead of impulses, at regular frame average pitch intervals. In this case, because the residual signal waveform must be changed according to the vocal tract characteristic, the residual signal waveform is changed frame by frame.
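The residual-driven voiced speech source can be sketched in the same way. The following Python fragment is an illustration only: the function names are hypothetical, the inverse filter is assumed to be the inverse of an all-pole filter y[n] = x[n] - sum_k a[k] y[n-k], and copies of the residual waveform that overlap are simply summed, which is one of several possible choices.

```python
def inverse_filter(speech, coeffs):
    """Prediction residual e[n] = s[n] + sum_k a[k] * s[n-k].

    Samples before the start of `speech` are treated as zero.
    """
    residual = []
    for n, s in enumerate(speech):
        pred = sum(
            a * speech[n - k - 1]
            for k, a in enumerate(coeffs)
            if n - k - 1 >= 0
        )
        residual.append(s + pred)
    return residual


def residual_source(n_samples, pitch_period, residual):
    """Repeat a one-pitch-long residual waveform at regular intervals.

    A copy of `residual` is placed every pitch_period samples; copies
    that extend past the frame are truncated, and overlapping copies
    are added together (an assumption of this sketch).
    """
    src = [0.0] * n_samples
    for start in range(0, n_samples, pitch_period):
        for k, r in enumerate(residual):
            if start + k < n_samples:
                src[start + k] += r
    return src
```

Compared with an impulse train, the repeated waveform keeps the shape of the excitation within each pitch interval, but the same waveform and the same pitch period are still used throughout a frame.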
In the improved speech synthesis method, however, the voiced speech source signal within each frame is generated by repeating a typical waveform, which serves as the basis of the voiced speech source, at regular pitch intervals. As a result, the residual signal waveform and the pitch are discontinuous at the boundaries between frames, which causes the problem that the phonemes and the pitch changes of the synthetic speech sound unnatural.
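The pitch discontinuity can be made concrete with a small Python sketch. It assumes, for illustration only, that pitch periods are expressed in samples and that pitch marks continue across frame boundaries without being reset. The function names are hypothetical.

```python
def pitch_marks(frame_pitches, frame_len):
    """Pitch mark positions when each frame uses one constant pitch period.

    frame_pitches[i] is the frame average pitch period (in samples)
    of frame i; every frame is frame_len samples long.
    """
    marks = []
    pos = 0
    for i, period in enumerate(frame_pitches):
        frame_end = (i + 1) * frame_len
        while pos < frame_end:
            marks.append(pos)
            pos += period
    return marks


def intervals(marks):
    """Distances between consecutive pitch marks."""
    return [b - a for a, b in zip(marks, marks[1:])]
```

For two 12-sample frames with pitch periods of 4 and 6 samples, the pitch marks fall at 0, 4, 8, 12, and 18, so the interval jumps from 4 to 6 in a single step at the frame boundary, with no gradual transition between the two values.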