Voice transformation involves parameterizing a speech signal into a mathematical representation that can be extensively manipulated, so that properties of the original speech, for example pitch, speed, relative length of phones, prosody, and speaker identity, can be changed while the result still sounds natural. A straightforward application of voice transformation is singing synthesis. If the new parametric representation is demonstrated to work well in voice transformation, it can also be used for speech synthesis and automatic speech recognition.
Speech synthesis, or text-to-speech (TTS), uses a computer-based system to convert written text into audible speech. A good TTS system should generate natural (human-like) and highly intelligible speech. The early TTS systems were rule-based systems, also known as formant synthesizers. These systems generate intelligible speech, but the speech sounds robotic and unnatural.
Currently, the great majority of commercial TTS systems are concatenative TTS systems using the unit-selection method. In this approach, a very large body of speech is recorded and stored. During synthesis, the input text is first analyzed and the required prosodic features are predicted. Then, appropriate units are selected from the large speech database and stitched together. There are always mismatches at the boundaries between consecutive segments from different origins, and there are always cases in which a required segment does not exist in the speech database. Therefore, modifications of the recorded speech segments are necessary. Currently, the most popular methods of speech modification are the time-domain pitch-synchronous overlap-add method (TD-PSOLA) and methods based on LPC (linear prediction coefficients), mel-cepstral coefficients, and sinusoidal representations. However, with those methods, the quality of the voice is severely degraded. Voice transformation is the key to improving the quality of speech synthesis and to allowing the use of a small database. (See Part D of Springer Handbook of Speech Processing, Springer Verlag 2008).
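The unit-selection step described above can be illustrated with a minimal sketch. It is not taken from any particular TTS system. It assumes, purely for illustration, that each database unit is summarized by a single pitch value in Hz, that the target cost is the absolute difference between the unit pitch and the predicted pitch, and that the join cost is the absolute pitch difference between consecutive units. The function name `select_units` and the parameter `join_weight` are hypothetical. The search itself is an ordinary dynamic-programming (Viterbi-style) minimization of the summed target and join costs.

```python
from typing import List, Sequence, Tuple


def select_units(
    candidates: Sequence[Sequence[float]],
    targets: Sequence[float],
    join_weight: float = 1.0,
) -> Tuple[List[int], float]:
    """Choose one candidate unit per position with minimal total cost.

    candidates[i] holds the pitch (Hz) of each database unit available
    for position i; targets[i] is the predicted pitch for that position.
    Both the one-number unit description and the cost functions are
    illustrative assumptions.
    """
    if not candidates or len(candidates) != len(targets):
        raise ValueError("need one non-empty candidate list per target")

    # best[j]: lowest cost of any path ending in candidate j of the
    # current position (target costs and join costs summed so far).
    best = [abs(p - targets[0]) for p in candidates[0]]
    back: List[List[int]] = []  # back[i-1][j]: best predecessor of j at i

    for i in range(1, len(candidates)):
        prev = candidates[i - 1]
        new_best: List[float] = []
        pointers: List[int] = []
        for p in candidates[i]:
            # Cost of reaching this unit from each possible predecessor;
            # the join cost penalizes a pitch mismatch at the boundary.
            costs = [
                best[k] + join_weight * abs(p - prev[k])
                for k in range(len(prev))
            ]
            k_min = min(range(len(prev)), key=costs.__getitem__)
            new_best.append(costs[k_min] + abs(p - targets[i]))
            pointers.append(k_min)
        best = new_best
        back.append(pointers)

    # Trace the cheapest path back from the last position.
    j = min(range(len(best)), key=best.__getitem__)
    total = best[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path, total
```

With `candidates = [[100.0, 200.0], [100.0, 200.0]]` and `targets = [100.0, 160.0]`, setting `join_weight=0.0` selects the units `[0, 1]` at a cost of 40.0, whereas `join_weight=1.0` selects `[0, 0]` at a cost of 60.0. The boundary mismatch outweighs the closer target match, which is the trade-off that makes later modification of the selected segments necessary.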
Automatic speech recognition (ASR) is the inverse process of speech synthesis. The first step, acoustic processing, reduces the speech signal to a parametric representation. Then, typically using a hidden Markov model (HMM) together with a statistical language model, the most likely text is produced. The state-of-the-art parametric representations for speech are LPC (linear prediction coefficients) and mel-cepstral coefficients. The accuracy of the speech parameterization directly affects the overall recognition accuracy. (See Part E of Springer Handbook of Speech Processing, Springer Verlag 2008).
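As an illustration of reducing a speech frame to a parametric representation, the following sketch estimates linear prediction coefficients with the standard autocorrelation method and the Levinson-Durbin recursion. It is a simplified example under stated assumptions, not the acoustic front end of any particular recognizer: the frame is a plain list of samples, no pre-emphasis or windowing is applied, and the function names `autocorrelation` and `lpc` are chosen here for illustration.

```python
from typing import List, Sequence, Tuple


def autocorrelation(frame: Sequence[float], max_lag: int) -> List[float]:
    """Return r[0..max_lag], with r[lag] = sum_t frame[t] * frame[t + lag]."""
    n = len(frame)
    return [
        sum(frame[t] * frame[t + lag] for t in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def lpc(frame: Sequence[float], order: int) -> Tuple[List[float], float]:
    """Estimate predictor coefficients by the Levinson-Durbin recursion.

    Returns (a, err), where the prediction of sample x[n] is
    sum(a[k] * x[n - 1 - k] for k in range(order)) and err is the
    remaining prediction-error energy.
    """
    if order < 1 or order >= len(frame):
        raise ValueError("order must be >= 1 and smaller than the frame")
    r = autocorrelation(frame, order)
    if r[0] == 0.0:
        raise ValueError("frame has zero energy")

    a: List[float] = []
    err = r[0]
    for i in range(order):
        # Reflection coefficient for raising the model order to i + 1.
        acc = r[i + 1] - sum(a[k] * r[i - k] for k in range(i))
        k_i = acc / err
        # Update the lower-order coefficients, then append the new one.
        a = [a[k] - k_i * a[i - 1 - k] for k in range(i)] + [k_i]
        err *= 1.0 - k_i * k_i
    return a, err
```

For a decaying exponential such as `[0.9 ** n for n in range(200)]`, each sample is 0.9 times the previous one, so `lpc(frame, 1)` returns a first coefficient very close to 0.9. In practice a frame of 20 to 30 ms of windowed speech and a higher model order would be used. Those values are typical choices and are not specified in the text above.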