1. Field of the Invention
This invention relates to speech synthesis and speech analysis, and, more particularly, to improvements in speed and quality thereof.
2. Description of the Related Art
Two popular methods of speech synthesis are speech synthesis by rule and concatenative synthesis using a speech corpus.
In speech synthesis by rule, a given phoneme symbol string is divided into speech units such as phonemes (which correspond to roman letters such as “a” or “k”). Then, a fundamental frequency contour and a vocal tract transfer function are determined by rule for each speech unit. Finally, the waveforms generated for the individual speech units are concatenated to synthesize speech.
However, the concatenation procedure often introduces continuity distortion. To eliminate this distortion, rules for converting the waveforms during concatenation can be prepared for each kind of speech unit, but this solution requires complex rules and time-consuming procedures.
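The rule-based pipeline and the distortion at unit boundaries can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the method of any particular synthesizer: the rule table `RULES`, its values, and the function names are hypothetical, a plain sine wave stands in for the excitation source, the vocal tract transfer function is omitted entirely, and the unvoiced unit “k” is modeled as silence.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this sketch)

# Hypothetical rule table: speech unit -> (fundamental frequency in Hz,
# duration in ms). A real system would also assign a vocal tract transfer
# function per unit; that step is left out here.
RULES = {
    "a": (122.5, 100),
    "k": (0.0, 50),    # unvoiced unit, modeled as silence
    "i": (140.0, 100),
}

def synthesize_unit(phoneme):
    """Generate the waveform of one speech unit from its rule entry."""
    f0, dur_ms = RULES[phoneme]
    n = SAMPLE_RATE * dur_ms // 1000
    return [math.sin(2 * math.pi * f0 * t / SAMPLE_RATE) for t in range(n)]

def synthesize(phonemes):
    """Concatenate per-unit waveforms and record the jump at each join."""
    wave, joins = [], []
    for p in phonemes:
        unit = synthesize_unit(p)
        if wave:
            # Amplitude step between the last sample of the previous unit
            # and the first sample of this one: a crude stand-in for
            # continuity distortion.
            joins.append(abs(unit[0] - wave[-1]))
        wave.extend(unit)
    return wave, joins

wave, joins = synthesize(["a", "k", "i"])
```

Because every unit is generated independently and starts at phase zero, the join between “a” and “k” shows a large amplitude step, while the join between “k” and “i” happens to be smooth. Removing such steps in general is what requires the unit-specific conversion rules mentioned above.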
In concatenative synthesis using a speech corpus, the speech waveform to be synthesized is obtained by extracting sample speech waveform data from a prepared speech corpus and concatenating the extracted pieces. The speech corpus (speech database) stores a large number of speech waveforms of natural speech utterances together with their corresponding phonetic information.
References on concatenative synthesis using a speech corpus include Yoshinori Sagisaka: “Speech Synthesis of Japanese Using Non-Uniform Phoneme Sequence Units,” Technical Report SP87-136, IEICE; W. N. Campbell and A. W. Black: “Chatr: a multi-lingual speech re-sequencing synthesis system,” Technical Report SP96-7, IEICE; and Yoshinori Sagisaka: “Corpus Based Speech Synthesis,” Journal of Signal Processing.
With these conventional technologies, concatenative synthesis using a speech corpus obtains the waveform for a given phoneme symbol string as follows. First, the phoneme symbol string is divided into phonemes. Next, sample speech waveforms are extracted from the corpus according to the longest phoneme string-matching method. Then, the speech waveform is obtained by concatenating the extracted pieces of sample speech waveform.
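The three steps above can be sketched as a greedy longest-match search. This is only one plausible reading of “longest phoneme string-matching”; the cited reports may differ in detail. The toy corpus `CORPUS`, the function names, and the `(utterance, start, length)` segment format are assumptions made for illustration, and waveform extraction is reduced to recording which corpus span would be copied.

```python
# Hypothetical toy corpus: each entry is the phoneme sequence of one stored
# utterance. A real corpus would also hold the aligned waveform samples.
CORPUS = [
    ["k", "o", "n", "n", "i", "ch", "i", "w", "a"],
    ["s", "a", "k", "a", "n", "a"],
]

def find_span(corpus, target):
    """Return (utterance index, start offset) of the first occurrence of
    the phoneme sequence `target` in the corpus, or None if it is absent."""
    n = len(target)
    for u, utterance in enumerate(corpus):
        for start in range(len(utterance) - n + 1):
            if utterance[start:start + n] == target:
                return u, start
    return None

def longest_match_segments(corpus, phonemes):
    """Cover `phonemes` left to right, at each position taking the longest
    phoneme string that occurs somewhere in the corpus."""
    segments = []
    pos = 0
    while pos < len(phonemes):
        # Try the longest remaining candidate first, then shorten it.
        for end in range(len(phonemes), pos, -1):
            hit = find_span(corpus, phonemes[pos:end])
            if hit is not None:
                segments.append((hit[0], hit[1], end - pos))
                pos = end
                break
        else:
            raise ValueError(f"phoneme {phonemes[pos]!r} not in corpus")
    return segments

# Each tuple names the corpus span whose waveform would be copied.
segments = longest_match_segments(CORPUS, ["s", "a", "k", "o", "n"])
```

In this sketch every candidate string triggers a scan of the whole corpus, phoneme by phoneme, so the work grows quickly with both the corpus size and the length of the input string.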
However, since the speech corpus is searched phoneme by phoneme, the search requires a massive amount of time. In addition, regardless of how much time is spent searching, the synthesized speech often sounds unnatural even when the longest matching phoneme string is extracted.