1. Field of the Invention
The present invention relates to text-to-speech synthesis, and particularly to a speech synthesis method of generating synthesized speech from information such as a phoneme symbol string, pitch, and phoneme duration.
2. Description of the Related Art
“Text-to-speech synthesis” means producing artificial speech from text. A text-to-speech synthesis system comprises three stages: a linguistic processor, a prosody processor, and a speech signal generator.
First, the input text is subjected to morphological analysis or syntax analysis in the linguistic processor. Accent and intonation processing is then performed in the prosody processor, which outputs information such as a phoneme symbol string, a pitch pattern (the pattern of change in voice pitch), and phoneme durations. The speech signal generator, that is, the speech synthesizer, synthesizes a speech signal from this information.
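The data flow through the three stages can be sketched as follows. This is only an illustrative toy, not the processing of any actual system: the names `ProsodyInfo`, `linguistic_processor`, `prosody_processor`, and `speech_signal_generator` are hypothetical, each letter is treated as a stand-in "phoneme", and the generator simply emits one sine tone per phoneme.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class ProsodyInfo:
    """Information handed from the prosody processor to the signal generator."""
    phonemes: List[str]        # phoneme symbol string
    pitch_hz: List[float]      # pitch pattern (one value per phoneme, a simplification)
    durations_ms: List[float]  # phoneme durations


def linguistic_processor(text: str) -> List[str]:
    # Placeholder for morphological/syntax analysis: keep alphabetic words only.
    return [w for w in text.lower().split() if w.isalpha()]


def prosody_processor(words: List[str]) -> ProsodyInfo:
    # Placeholder: one "phoneme" per letter, linearly falling pitch, fixed duration.
    phonemes = [ch for w in words for ch in w]
    n = len(phonemes)
    pitch = [120.0 - 20.0 * i / max(n - 1, 1) for i in range(n)]
    return ProsodyInfo(phonemes, pitch, [80.0] * n)


def speech_signal_generator(info: ProsodyInfo, sample_rate: int = 8000) -> List[float]:
    # Placeholder: a sine tone at the requested pitch for the requested duration.
    samples: List[float] = []
    for f0, dur in zip(info.pitch_hz, info.durations_ms):
        count = int(round(sample_rate * dur / 1000.0))
        samples.extend(math.sin(2.0 * math.pi * f0 * i / sample_rate) for i in range(count))
    return samples


def text_to_speech(text: str) -> List[float]:
    return speech_signal_generator(prosody_processor(linguistic_processor(text)))
```

The point of the sketch is only the interface between stages: the speech signal generator sees nothing but the phoneme symbols, the pitch pattern, and the durations.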
According to the operational principle of a speech synthesis apparatus that synthesizes speech from a given phoneme symbol string, basic characteristic parameter units (hereinafter referred to as “synthesis units”) such as phones, syllables, diphones, and triphones are stored in a storage and selectively read out. The read-out synthesis units are connected while their pitches and phoneme durations are controlled, whereby speech is synthesized.
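As a minimal sketch of this store, select, and connect principle, the following assumes diphone units, a dictionary as the storage, and a single number per frame standing in for a characteristic parameter vector. The names `UNIT_STORE`, `to_diphones`, `stretch`, and `concatenate_units`, the `#` silence marker, and the stored values are all hypothetical; only duration control is shown, and pitch control is omitted.

```python
from typing import Dict, List

# Hypothetical storage: diphone name -> sequence of parameter frames.
UNIT_STORE: Dict[str, List[float]] = {
    "#-k": [0.0, 0.2],
    "k-a": [0.4, 0.6, 0.8],
    "a-#": [0.8, 0.3],
}


def to_diphones(phonemes: List[str]) -> List[str]:
    """Turn a phoneme symbol string into diphone names ('#' marks silence)."""
    padded = ["#"] + list(phonemes) + ["#"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def stretch(frames: List[float], target_len: int) -> List[float]:
    """Duration control by repeating or dropping frames (nearest-frame resampling)."""
    if target_len <= 0 or not frames:
        return []
    return [frames[i * len(frames) // target_len] for i in range(target_len)]


def concatenate_units(phonemes: List[str],
                      store: Dict[str, List[float]],
                      frames_per_unit: int) -> List[float]:
    """Read out the units selected by the phoneme string and connect them."""
    out: List[float] = []
    for name in to_diphones(phonemes):
        out.extend(stretch(store[name], frames_per_unit))
    return out
```

For example, `concatenate_units(["k", "a"], UNIT_STORE, 4)` reads out the units `#-k`, `k-a`, and `a-#` and stretches each to four frames before joining them.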
As a method for generating a speech signal with a desired pitch pattern and phoneme duration from synthesis unit information, the PSOLA (Pitch-Synchronous Overlap-Add) method is known. When the pitch period variation is small, PSOLA-based synthesis causes little speech quality degradation and yields good speech quality. However, speech quality deteriorates when the pitch period variation is large. Further, when the spectrum is discontinuous at a junction between combined synthesis units, the smoothing process applied there distorts the spectrum, which also degrades speech quality. Furthermore, because the waveform itself is used as the synthesis unit, PSOLA makes it difficult to change the voice variety and lacks flexibility.
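The overlap-add idea behind PSOLA can be sketched as below. This is a heavily simplified time-domain variant under stated assumptions: the pitch marks are given rather than detected, the source pitch period is taken as constant (the spacing of the first two marks), and the function name `psola` and its parameters are hypothetical. Two-period Hann-windowed segments centered on the analysis pitch marks are re-placed at the new pitch period and summed.

```python
import math
from typing import List


def psola(signal: List[float], marks: List[int],
          new_period: int, out_len: int) -> List[float]:
    """Simplified PSOLA: re-space pitch-synchronous windowed segments.

    marks      -- analysis pitch-mark sample indices (assumed evenly spaced)
    new_period -- desired pitch period in samples
    """
    half = marks[1] - marks[0]  # source pitch period (assumed constant)
    # Hann window spanning two source periods, equal to 1 at the pitch mark.
    window = [0.5 + 0.5 * math.cos(math.pi * k / half) for k in range(-half, half + 1)]
    out = [0.0] * out_len
    t = marks[0]  # synthesis pitch-mark position
    while t < out_len:
        # Use the analysis mark nearest to the synthesis time, so duration is kept.
        m = min(marks, key=lambda x: abs(x - t))
        for k in range(-half, half + 1):
            src, dst = m + k, t + k
            if 0 <= src < len(signal) and 0 <= dst < out_len:
                out[dst] += signal[src] * window[k + half]
        t += new_period
    return out
```

When `new_period` equals the source period, adjacent windows sum to one and the interior of the signal is reproduced unchanged. As `new_period` departs from the source period, the segments overlap differently from the original.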
An alternative method is formant synthesis, which was designed to emulate the way humans speak. A formant synthesis system generates a speech signal by exciting a filter that models the properties of the vocal tract with a speech source signal obtained by modeling the signal generated by the vocal cords.
In this system, the phonemes (/a/, /i/, /u/, etc.) and voice variety (male voice, female voice, etc.) of synthesized speech are determined by the combination of formant frequencies and bandwidths. Therefore, the synthesis unit information consists of combinations of formant frequency and bandwidth rather than waveforms. Since the formant synthesis system can control parameters relating to phoneme and voice variety, it has the advantage that variations in voice variety and the like can be flexibly controlled. However, it has the disadvantage that the modeling lacks precision.
In other words, because only the formant frequency and bandwidth are used, the formant synthesis system cannot reproduce the finely detailed spectrum of a real speech signal, and the resulting speech quality is unacceptable.
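The source-filter structure of formant synthesis described above can be sketched as follows. The sketch assumes an impulse train as the speech source signal and a cascade of second-order digital resonators as the vocal tract filter, each specified only by a formant frequency and a bandwidth. The function names and the formant values used in the usage note are illustrative assumptions and are not taken from any particular system.

```python
import math
from typing import List, Tuple


def impulse_train(f0_hz: float, n: int, sample_rate: int) -> List[float]:
    """Crude voice source: one unit impulse per pitch period."""
    period = int(round(sample_rate / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def resonator(x: List[float], freq_hz: float, bw_hz: float,
              sample_rate: int) -> List[float]:
    """Second-order resonator y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to one
    y1 = y2 = 0.0
    out: List[float] = []
    for s in x:
        y = a * s + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def formant_synth(f0_hz: float, formants: List[Tuple[float, float]],
                  n: int, sample_rate: int = 16000) -> List[float]:
    """Excite a cascade of formant resonators with the source signal."""
    y = impulse_train(f0_hz, n, sample_rate)
    for freq_hz, bw_hz in formants:
        y = resonator(y, freq_hz, bw_hz, sample_rate)
    return y
```

A call such as `formant_synth(120.0, [(700.0, 60.0), (1200.0, 90.0), (2600.0, 120.0)], 1600)` produces a vowel-like buzz. Changing only the `(frequency, bandwidth)` pairs changes the vowel or voice character, which illustrates the flexibility noted above. The coarse spectral envelope that a few resonators can represent illustrates the limited modeling precision.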
It is an object of the present invention to provide a speech synthesizer that improves speech quality and allows voice variety to be controlled flexibly.