Concatenative speech synthesis is commonly used in text-to-speech and concept-to-speech software devices. In text-to-speech devices, text is converted to speech. In concept-to-speech devices, a concept (such as “What is the stock price for X company today?”) is converted to speech.
In concatenative speech synthesis, speech is generated by concatenating stored speech segments. The stored speech segments are selected to conform to the text or concept being synthesized, then the speech segments are concatenated to create a synthesized utterance. Prior to concatenation, acoustic features of the stored speech segments are modified to make the speech segments match requested features of the synthesized utterance. These features comprise duration, energy, fundamental frequency (called “pitch” herein), and spectral envelope of the speech segments. The features are determined by modules in the concatenative speech synthesis system, and are determined in such a way as to make the resultant speech sound relatively natural.
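By way of illustration only, the select, modify, and concatenate steps described above might be sketched as follows. The sketch assumes a hypothetical inventory that maps unit names to lists of samples, adjusts only the energy feature, and joins segments with a short linear crossfade. The function names, the crossfade, and the `overlap` parameter are assumptions made for this example and are not taken from any particular synthesizer.

```python
import math


def rms(samples):
    """Root-mean-square energy of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def match_energy(samples, target_rms):
    """Scale a segment so that its RMS energy equals target_rms."""
    current = rms(samples)
    if current == 0.0:
        return list(samples)  # a silent segment cannot be rescaled
    gain = target_rms / current
    return [s * gain for s in samples]


def crossfade_concat(left, right, overlap):
    """Join two segments, blending `overlap` samples linearly at the seam."""
    overlap = min(overlap, len(left), len(right))
    if overlap == 0:
        return list(left) + list(right)
    out = list(left[:len(left) - overlap])
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight ramps from left toward right
        out.append((1.0 - w) * left[len(left) - overlap + i] + w * right[i])
    out.extend(right[overlap:])
    return out


def synthesize(unit_names, inventory, target_rms, overlap=4):
    """Select stored segments by name, match their energy, and concatenate."""
    utterance = []
    for name in unit_names:
        segment = match_energy(inventory[name], target_rms)
        if utterance:
            utterance = crossfade_concat(utterance, segment, overlap)
        else:
            utterance = segment
    return utterance
```

A practical system would also modify duration, pitch, and spectral envelope before concatenation; those steps are omitted here for brevity.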
There are many algorithms for modifying the pitch of speech segments. Among these are parametric techniques, such as linear predictive coding, which are generally considered to have poor output quality. Most popular concatenative speech synthesizers use time domain techniques because of their simplicity and high quality output. For example, U.S. Pat. Nos. 5,327,498 and 5,524,172, the disclosures of which are hereby incorporated by reference, describe a time domain technique that is commonly used in concatenative speech synthesizers. However, these time domain techniques can produce poor quality when the pitch of a speech segment is changed to a high degree, especially at low sampling rates, where pitch has a larger impact.
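For illustration, one generic form of time domain pitch modification is pitch-synchronous overlap-add, in which short windowed grains are extracted around pitch marks and re-spaced at a new period. The following is a minimal sketch under simplifying assumptions: the input has a constant, known pitch period in samples, pitch marks fall at multiples of that period, and the output is normalized by the summed window weights. It is not asserted to be the method of the patents cited above, and the function names are hypothetical.

```python
import math


def hann(n):
    """Symmetric Hann window of odd length n (peak value 1 at the center)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def td_pitch_shift(signal, period, factor):
    """Scale the pitch of `signal` by `factor` while keeping its duration.

    Assumes a constant pitch period (in samples) with pitch marks at
    multiples of `period`. Two-period Hann-windowed grains centered on
    the marks are overlap-added at a spacing of period / factor.
    """
    new_period = max(1, round(period / factor))
    window = hann(2 * period + 1)
    out = [0.0] * len(signal)
    norm = [0.0] * len(signal)
    # Largest mark index whose grain still fits inside the signal.
    last_k = (len(signal) - 1 - period) // period
    t_out = period
    while t_out + period < len(signal):
        # Pick the analysis mark nearest to the current synthesis time.
        k = min(max(1, round(t_out / period)), last_k)
        t_in = k * period
        for j in range(-period, period + 1):
            w = window[j + period]
            out[t_out + j] += w * signal[t_in + j]
            norm[t_out + j] += w
        t_out += new_period
    # Divide by the summed window weights to even out the overlap gain.
    return [o / n if n > 1e-9 else 0.0 for o, n in zip(out, norm)]
```

For example, `td_pitch_shift(x, 100, 2.0)` places grains every 50 samples, which doubles the fundamental frequency of a signal whose period is 100 samples. Large factors reuse or skip many grains, which is one way the quality loss noted above can arise.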
To overcome the problems of the time domain techniques, more complex algorithms have been used to modify the pitch of the speech segments. For example, an algorithm that performs the pitch modification in the frequency domain rather than the time domain has been used. Great success has also been achieved with algorithms that use a sinusoidal representation of the speech signal. Results show that these techniques outperform the time domain methods, in terms of speech output as judged in human tests, and leave room for further research and enhancement, whereas the time domain methods do not.
However, the latter algorithms are known for their computational complexity, which makes them impractical to use in commercial concatenative speech synthesizers. To overcome this problem, i.e., to enhance the performance of speech synthesizers that use these techniques, fast algorithms for each particular technique were introduced. For example, many realizations of fast Fourier transform algorithms have been used to reduce the complexity of the frequency domain techniques, while quick methods for calculating a cosine function are used in techniques based on the sinusoidal representation of speech signals. Nonetheless, the computational complexity of the latter algorithms is still high, as is the time required to execute them.
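As one illustration of a quick cosine calculation, successive samples of a sinusoid can be generated with the well-known recurrence cos((n+1)w) = 2 cos(w) cos(nw) - cos((n-1)w), which needs one multiplication and one subtraction per sample after two initial cosine evaluations. The sketch below is offered only as an example of this class of method; it is an assumption that it resembles the quick methods referred to above, and the function name and parameters are hypothetical.

```python
import math


def cosine_sequence(omega, phase, count):
    """Return cos(omega * n + phase) for n = 0 .. count - 1.

    Only the first two values call math.cos; the rest follow from the
    recurrence c[n] = 2 * cos(omega) * c[n - 1] - c[n - 2].
    """
    if count <= 0:
        return []
    out = [math.cos(phase)]
    if count == 1:
        return out
    out.append(math.cos(omega + phase))
    k = 2.0 * math.cos(omega)
    for _ in range(2, count):
        out.append(k * out[-1] - out[-2])
    return out
```

Rounding error accumulates slowly along the recurrence, so long sequences are typically re-seeded with exact cosine values at intervals.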
Thus, even though improvements in concatenative speech synthesis have been made, there still exists a need for increasing the speed of concatenative speech synthesis while maintaining output voice signal quality.