1. Field of the Invention:
The present invention relates to a method of and an apparatus for generating synthesized speech from either a sequence of character codes or a series of phonetic symbols and prosodic information associated therewith.
2. Description of the Prior Art:
Recently, there have been developed various speech synthesizers for analyzing Japanese sentences composed of a mixture of Kanji (Chinese) characters and Kana (Japanese syllabary) characters and generating synthesized speech from phonetic and prosodic information represented by the analyzed sentences according to the synthesis-by-rule process. Such speech synthesis systems are finding wide use in telephone information services in the banking business, newspaper revising systems, document readers, and other apparatus employing synthesized speech.
Basically, the speech synthesizer based on the synthesis-by-rule process operates as follows: The speech synthesizer has a speech segment file which stores phonetic information that has been obtained by LSP (line spectrum pair) analysis or cepstrum analysis from each unit of human speech, which may be of a syllable structure CV (consonant-vowel), a syllable structure CVC (consonant-vowel-consonant), a syllable structure VCV (vowel-consonant-vowel), or a syllable structure VC (vowel-consonant). When a text is input to the speech synthesizer, the speech synthesizer analyzes the text, produces phonetic and prosodic parameters for the text by referring to the speech segment file, and generates and filters sound sources based on the phonetic and prosodic parameters to generate synthesized speech of the text.
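The flow described above (segment lookup, parameter generation, sound source generation, and filtering) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the table `SEGMENT_FILE`, its toy coefficient values, the functions `lookup_parameters` and `synthesize`, the fixed pitch, and the second-order all-pole filter stand in for the LSP or cepstrum parameters and the synthesis filter of an actual synthesizer, which the text does not specify.

```python
# Hypothetical speech segment file: each CV unit maps to a list of frames,
# and each frame holds two all-pole filter coefficients (toy values, not
# real LSP or cepstrum data).
SEGMENT_FILE = {
    "ka": [(0.5, -0.2), (0.6, -0.3)],
    "na": [(0.4, -0.1), (0.7, -0.25)],
}


def lookup_parameters(units, pitch_hz):
    """Combine phonetic coefficients from the segment file with a prosodic
    parameter (here only a constant pitch) into a sequence of frames."""
    frames = []
    for unit in units:
        for coeffs in SEGMENT_FILE[unit]:
            frames.append({"coeffs": coeffs, "pitch_hz": pitch_hz})
    return frames


def synthesize(frames, sample_rate=8000, frame_len=80):
    """Generate a pulse-train sound source and pass it through a
    second-order all-pole filter whose coefficients change per frame."""
    samples = []
    prev1, prev2 = 0.0, 0.0  # the two most recent output samples
    phase = 0                # sample counter used to place pitch pulses
    for frame in frames:
        period = max(1, round(sample_rate / frame["pitch_hz"]))
        a1, a2 = frame["coeffs"]
        for _ in range(frame_len):
            source = 1.0 if phase % period == 0 else 0.0
            phase += 1
            y = source + a1 * prev1 + a2 * prev2
            prev1, prev2 = y, prev1
            samples.append(y)
    return samples


frames = lookup_parameters(["ka", "na"], pitch_hz=100.0)
samples = synthesize(frames)
```

In this sketch the text analysis step is assumed to have already produced the unit sequence `["ka", "na"]`; a real system would derive both the units and a time-varying pitch contour from the analyzed sentence.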
It has heretofore been customary to construct the speech synthesizer of dedicated hardware components that are required for real-time data processing. There are primarily two system designs available for the dedicated-hardware speech synthesizer. According to one system, a host computer such as a personal computer converts a sentence of Kanji and Kana characters into phonetic and prosodic information, and a dedicated hardware device generates phonetic and prosodic parameters based on the converted phonetic and prosodic information, generates and filters sound sources, and converts the filtered sound sources into an analog speech signal for generating synthesized speech. According to the other system, all the above processing steps are executed by a dedicated hardware device. Usually, the dedicated hardware device of each of the above systems comprises an LSI circuit called a DSP (digital signal processor) which is capable of high-speed logic operations including ANDing and ORing, and a general-purpose MPU (microprocessor unit).
Recent years have seen another approach, in which the above processing is implemented in software on a real-time basis. The software-implemented system has been made possible by a personal computer or an engineering work station having a high processing capability combined with a D/A converter, an analog output device, and a loudspeaker.
The software-implemented system is free of problems with respect to speech synthesis while it is processing relatively few tasks. However, when many tasks must be processed simultaneously by the system, the system may not be able to generate real-time synthesized speech. If the system fails to generate real-time synthesized speech, then unvoiced intervals are inserted in synthesized words, making it difficult for the user to hear the synthesized words clearly. Specifically, a certain constant period of time is needed for the CPU (central processing unit) of the system to carry out the process of speech synthesis. Therefore, insofar as the CPU of the system operates to process a relatively small number of tasks, it can produce synthesized speech on a real-time basis. However, when the CPU of the system is required to process an increased number of tasks, the CPU requires a longer execution time to process those tasks, possibly failing to generate real-time synthesized speech.
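The timing argument above can be made concrete with a small simulation. It is only a sketch under stated assumptions: the function `count_unvoiced_gaps`, the fixed frame duration, and the single `load_factor` that stretches the synthesis time of every frame are hypothetical simplifications and are not taken from any particular system.

```python
def count_unvoiced_gaps(frame_ms, synth_ms, load_factor, n_frames):
    """Count frames that are not yet synthesized when playback needs them.

    frame_ms    -- playback duration of one frame of speech (assumed constant)
    synth_ms    -- CPU time to synthesize one frame on an otherwise idle system
    load_factor -- assumed slowdown caused by other tasks (1.0 means no slowdown)
    """
    cost = synth_ms * load_factor  # effective synthesis time per frame
    play_clock = cost              # playback starts once frame 0 is ready
    gaps = 0
    for i in range(n_frames):
        ready = (i + 1) * cost     # time at which frame i finishes synthesis
        if ready > play_clock:
            # Output has run dry: an unvoiced interval is heard until the
            # frame becomes available.
            gaps += 1
            play_clock = ready
        play_clock += frame_ms     # the frame is played back
    return gaps


light_load = count_unvoiced_gaps(frame_ms=10, synth_ms=4, load_factor=1.0, n_frames=50)
heavy_load = count_unvoiced_gaps(frame_ms=10, synth_ms=4, load_factor=3.0, n_frames=50)
```

Under this model no gap occurs as long as the effective synthesis time per frame does not exceed the frame's playback duration; once the load pushes it beyond that duration, every frame after the first arrives late.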
The present speech synthesizer that operates according to the synthesis-by-rule process can produce synthesized speech in different patterns that reflect such differences as sex, age, pronunciation rate, pitch, and stress. The user of the speech synthesizer can select any one of the different speech patterns according to his preference. However, the user cannot change the quality of the synthesized speech.
Most speech synthesizers that are available today generate crisp synthesized speech sounds that can be heard clearly. If the user of the speech synthesizer hears such crisp synthesized speech sounds for the first time, then the user will find them acceptable as they are sharp and clear. However, if the user who has become accustomed to synthesized speech hears crisp synthesized speech sounds for a continued period of time, then the user finds them physically and mentally fatiguing. Since the quality of synthesized speech, i.e., the quality of being crisp, cannot be changed, the conventional speech synthesizer does not lend itself to continuous usage for a long period of time.