With the rapid development of computer technology and the growth of information-related industrial applications, computing has shifted from its original operation-oriented focus to a focus on communication and information exchange. In this process, most early studies concentrated on how to provide the most useful and valuable information, including information indexing systems, Internet search engines, and data mining technology. However, information is ultimately intended for end-users, who should be able to exchange information with the computer system in the most natural and direct way, so as to maximize its usefulness to them. As the most natural way for people to receive information is by means of speech, Chinese Text-To-Speech (TTS) synthesis technology has long been an important part of man-machine communication and interaction.
Prior-art systems differ in how they generate sound waveforms. Text-To-Speech (TTS) systems can be classified into two major types, namely the VOCODER (voice coder-decoder) and the Concatenative Synthesizer. The former recalculates the speech parameters and transforms them into speech waveforms by means of an articulation model, so that the modulation range of the speech parameters is wider but the quality of the synthesized speech is poorer. The latter concatenates human-recorded sound fragments (synthesis units) into the waveform of the target sentence; its modulation range is narrower, but its synthesis quality is better.
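The concatenative approach described above can be illustrated with a minimal sketch. It assumes that each synthesis unit is stored as a list of samples and that adjacent units are joined with a short linear crossfade. The sample rate, the crossfade, the sine tones standing in for recorded fragments, and the names `make_tone`, `concatenate_units`, and `inventory` are hypothetical choices made for illustration only and are not taken from any system discussed here.

```python
import math

SAMPLE_RATE = 16000  # assumed sample rate in Hz (illustrative only)

def make_tone(freq_hz, n_samples, sample_rate=SAMPLE_RATE):
    """Stand-in for a human-recorded synthesis unit: a short sine tone."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]

def concatenate_units(units, overlap):
    """Join unit waveforms, linearly crossfading `overlap` samples per joint."""
    if not units:
        return []
    out = list(units[0])
    for unit in units[1:]:
        # Never crossfade over more samples than either side contains.
        k = min(overlap, len(out), len(unit))
        start = len(out) - k
        for i in range(k):
            w = (i + 1) / (k + 1)  # fade-in weight of the incoming unit
            out[start + i] = (1 - w) * out[start + i] + w * unit[i]
        out.extend(unit[k:])
    return out

# Hypothetical unit inventory keyed by syllable label.
inventory = {
    "ni": make_tone(220.0, 1600),
    "hao": make_tone(330.0, 1920),
}

# Build the target "sentence" by looking up and joining its units.
waveform = concatenate_units([inventory[s] for s in ["ni", "hao"]], overlap=80)
```

Because the output is assembled directly from stored fragments, its quality follows that of the recordings, while the only modification applied is the smoothing at each joint. This reflects the trade-off noted above between synthesis quality and modulation range.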
Of these two major types of TTS systems, the VOCODER has the longer history. In the mid-20th century, H. K. Dunn, George, & Noriko, et al. proposed Articulatory Synthesis based on the human articulatory organs; Walter Lawrence and Gunnar proposed the Formant Synthesizer based on formant parameters; and in 1968, Itakura and Saito applied Linear Predictive Coding (LPC) technology, from which the LPC synthesizer evolved. However, the sound quality produced by these methods was usually poor. By the end of the 1970s, some scholars had started to directly concatenate speaker-dependent sound fragments (synthesis units) so as to generate higher-quality synthetic speech. In 1978, Fallside and Young proposed a word-unit synthesis (or content-to-speech) architecture based on a finite vocabulary; in the same year, Fujimura and Lovins proposed a syllable-based speech synthesizer. In addition, a large number of methods using phones, diphones, and triphones as the synthesis units were made public. In the 21st century, some scholars started to use variable-length unit selection schemes, among which the Multiform Unit proposed by Satoshi Takano and the Variable Length Unit proposed by Yi are the more notable representatives.
In this field, Chinese syllables are now most commonly used as the synthesis units; after the sound fragments have been concatenated, a variety of prosodic module technologies are applied to modulate the rhythm of the synthesized speech. However, synthesis units based only on syllables cannot preserve prosodic information above the word level. No matter how mature prosodic module technology becomes, the effect of such methods remains limited unless signal processing technology achieves a breakthrough.