The present invention relates generally to speech synthesis systems, and more particularly to a text-to-speech synthesizer.
Two approaches are available for text-to-speech synthesis systems. In the first approach, speech parameters are extracted from human speech by analyzing semisyllables, consonants, vowels and their various combinations, and are stored in a memory. Text inputs are used to address the memory to read out the speech parameters, and a sound corresponding to an input character string is reconstructed by concatenating the speech parameters. As described in "Japanese Text-to-Speech Synthesizer Based On Residual Excited Speech Synthesis", Kazuo Hakoda et al., ICASSP '86 (International Conference On Acoustics Speech and Signal Processing '86, Proceedings 45-8, pages 2431 to 2434), a Linear Predictive Coding (LPC) technique is employed to analyze human speech into consonant-vowel (CV) sequences, vowel (V) sequences, vowel-consonant (VC) sequences and vowel-vowel (VV) sequences as speech units, and speech parameters known as LSP (Line Spectrum Pair) are extracted from the analyzed speech units. A text input is represented by a sequence of speech units, and the speech parameters corresponding to those speech units are concatenated to produce continuous speech parameters, which are supplied to an LSP synthesizer. Although a high degree of articulation can be obtained if a sufficient number of high-quality speech units are collected, the sounds recorded as isolated speech units differ substantially from the same sounds as they occur in running text, resulting in a loss of naturalness. For example, speech synthesized by concatenating recorded semisyllables lacks smoothness and gives the impression that the units were simply linked together.
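The first approach can be illustrated by the following minimal sketch, which is not taken from the cited reference. It assumes a hypothetical inventory (`INVENTORY`) that maps romanized unit names to short lists of parameter frames; the frame values are invented placeholders rather than real LSP data, and the greedy longest-match segmentation in `to_units` is an assumption made only for illustration. The helper `join_jumps` measures the parameter discontinuity at each unit boundary, which is one way to picture the lack of smoothness noted above.

```python
from typing import Dict, List

Frame = List[float]  # one frame of speech parameters (placeholder values)

# Hypothetical unit inventory: names and numbers are illustrative only.
INVENTORY: Dict[str, List[Frame]] = {
    "sa": [[0.10, 0.30], [0.20, 0.40]],
    "ku": [[0.50, 0.70], [0.55, 0.75]],
    "ra": [[0.25, 0.45], [0.30, 0.50]],
    "a": [[0.20, 0.40]],
}


def to_units(text: str, inventory: Dict[str, List[Frame]]) -> List[str]:
    """Split text into stored unit names by greedy longest match."""
    units: List[str] = []
    max_len = max(len(name) for name in inventory)
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in inventory:
                units.append(candidate)
                i += size
                break
        else:
            raise KeyError(f"no stored unit matches text at position {i}")
    return units


def concatenate(units: List[str], inventory: Dict[str, List[Frame]]) -> List[Frame]:
    """Read the stored frames for each unit and link them end to end."""
    frames: List[Frame] = []
    for name in units:
        frames.extend(inventory[name])
    return frames


def join_jumps(units: List[str], inventory: Dict[str, List[Frame]]) -> List[float]:
    """Largest per-parameter jump across each boundary between units."""
    jumps: List[float] = []
    for left, right in zip(units, units[1:]):
        last_frame = inventory[left][-1]
        first_frame = inventory[right][0]
        jumps.append(max(abs(a - b) for a, b in zip(last_frame, first_frame)))
    return jumps


units = to_units("sakura", INVENTORY)
frames = concatenate(units, INVENTORY)
jumps = join_jumps(units, INVENTORY)
```

In this sketch the frames within a unit vary gradually, while the values jump at each join; a real system would pass the concatenated frames to a synthesizer rather than inspect them directly.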
According to the second approach, formant rules are derived from strings of phonemes and stored in a memory, as described in "Speech Synthesis And Recognition", pages 81 to 101, J. N. Holmes, Van Nostrand Reinhold (UK) Co. Ltd. Speech sounds are synthesized from formant transition patterns by reading the formant rules from the memory in response to an input character string. While this technique is advantageous in that the naturalness of speech can be improved through repeated synthesis experiments, the formant rules are difficult to improve for consonants because of their short durations and low power levels, resulting in a low degree of articulation with respect to consonants.
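As a rough illustration of how a formant transition pattern might be generated from stored rules, the following sketch holds each phoneme at a stored target and interpolates linearly toward the next target. It is not the method of the cited text: the table `FORMANT_TARGETS`, its approximate first and second formant values in hertz, the frame counts, and the use of linear interpolation are all assumptions chosen for simplicity.

```python
from typing import Dict, List, Tuple

Formants = Tuple[float, float]  # (F1, F2) in Hz

# Hypothetical rule table: rough, illustrative vowel targets only.
FORMANT_TARGETS: Dict[str, Formants] = {
    "a": (700.0, 1200.0),
    "i": (300.0, 2300.0),
}


def formant_track(
    phonemes: List[str],
    targets: Dict[str, Formants],
    steady_frames: int = 2,
    transition_frames: int = 3,
) -> List[Formants]:
    """Build a frame-by-frame (F1, F2) track from per-phoneme targets."""
    track: List[Formants] = []
    for index, phoneme in enumerate(phonemes):
        current = targets[phoneme]
        # Hold the target for the steady part of the phoneme.
        track.extend([current] * steady_frames)
        if index + 1 < len(phonemes):
            following = targets[phonemes[index + 1]]
            # Move linearly toward the next target, excluding both endpoints.
            for k in range(1, transition_frames + 1):
                t = k / (transition_frames + 1)
                track.append(
                    (
                        current[0] + t * (following[0] - current[0]),
                        current[1] + t * (following[1] - current[1]),
                    )
                )
    return track


track = formant_track(["a", "i"], FORMANT_TARGETS)
```

With the default settings, the two-phoneme input yields seven frames: two held at the first target, three transitional frames, and two held at the second target. Consonants are omitted from the table because, as noted above, their short durations and low power levels make suitable rules hard to establish.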