Effective text-to-speech (TTS) conversion requires not only that the acoustic TTS output be phonetically correct, but also that it faithfully reproduce the sound and prosody of human speech. When the range of phrases and sentences to be reproduced is fixed, and the TTS converter has sufficient memory resources, it is possible simply to record a collection of all of the phrases and sentences that will be used, and to recall them as required. This approach is not practical, however, when the text input is arbitrarily variable, or when speech is to be synthesized by a device having only limited memory resources, such as an embedded speech synthesizer in a mobile computing or communication device.
Concatenative TTS synthesis has been developed in order to synthesize high-quality speech from an arbitrary text input. For this purpose, a large database is created, containing speech segments in a variety of different phonetic contexts. For any given text input, the synthesizer then selects the optimal segments from the database. The “optimal” segments are generally those that, when concatenated with the previous segments, provide the appropriate phonetic output with the least discontinuity and best match the required prosody. For example, U.S. Pat. No. 5,740,320, whose disclosure is incorporated herein by reference, describes a method of text-to-speech synthesis by concatenation of representative phoneme waveforms selected from a memory. The representative waveforms are chosen by clustering phoneme waveforms recorded in natural speech, and selecting the waveform closest to the centroid of each cluster as the representative waveform for the cluster.
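The selection of "optimal" segments described above can be illustrated by a minimal sketch, which is not drawn from any of the cited references. It assumes that each candidate segment is summarized by three hypothetical numbers (a pitch value and one boundary feature at each end), that the prosody mismatch is the absolute pitch difference, and that the discontinuity is the absolute difference between adjoining boundary features. Under these assumptions, a dynamic-programming search finds the sequence of candidates with the lowest total cost.

```python
from typing import List, Sequence, Tuple

# Hypothetical segment summary: (pitch_hz, start_feature, end_feature).
Segment = Tuple[float, float, float]


def target_cost(seg: Segment, wanted_pitch: float) -> float:
    """Assumed prosody mismatch: absolute pitch difference."""
    return abs(seg[0] - wanted_pitch)


def join_cost(prev: Segment, nxt: Segment) -> float:
    """Assumed discontinuity: mismatch between adjoining boundary features."""
    return abs(prev[2] - nxt[1])


def select_segments(candidates: Sequence[Sequence[Segment]],
                    wanted_pitch: Sequence[float]) -> List[int]:
    """Return, for each position, the index of the chosen candidate.

    candidates[t] lists the database segments usable at position t
    (each list is assumed to be non-empty).
    """
    # best[j] = lowest total cost of any path ending in candidate j.
    best = [target_cost(s, wanted_pitch[0]) for s in candidates[0]]
    back: List[List[int]] = []  # back[t-1][j] = best predecessor of j at t
    for t in range(1, len(candidates)):
        new_best: List[float] = []
        ptrs: List[int] = []
        for seg in candidates[t]:
            costs = [best[j] + join_cost(prev, seg)
                     for j, prev in enumerate(candidates[t - 1])]
            j_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[j_min] + target_cost(seg, wanted_pitch[t]))
            ptrs.append(j_min)
        best = new_best
        back.append(ptrs)
    # Trace back from the cheapest final candidate.
    idx = min(range(len(best)), key=best.__getitem__)
    path = [idx]
    for ptrs in reversed(back):
        idx = ptrs[idx]
        path.append(idx)
    path.reverse()
    return path
```

Searching over whole sequences matters because the segment that best matches the required prosody at one position may join poorly with every available successor; a practical synthesizer would use richer costs than the two assumed here.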
In some systems, the encoding of speech segments in the database and the selection of segments for concatenation are based on a feature representation of the speech, such as mel-frequency cepstral coefficients (MFCCs). (These coefficients are computed by integration of the spectrum of the recorded speech segments over triangular bins on a mel-frequency axis, followed by log and discrete cosine transform operations.) Methods of feature-based concatenative speech synthesis are described, for example, in U.S. Pat. No. 6,725,190 and in U.S. patent application Publication US 2001/0056347 A1, whose disclosures are incorporated herein by reference. Further aspects of concatenative speech synthesis are described in U.S. Pat. Nos. 4,896,359, 5,165,008, 5,751,907, 5,913,193, and 6,041,300, whose disclosures are also incorporated herein by reference.
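The MFCC computation outlined in the parenthetical above may be sketched as follows for a single frame of samples. The sketch is illustrative only: the mel-scale formula (2595·log10(1 + f/700)), the number of bins and coefficients, the small logarithm floor, and the direct (slow) discrete Fourier transform are all assumptions chosen for self-containment, and are not taken from the cited references.

```python
import cmath
import math
from typing import List, Sequence


def hz_to_mel(f: float) -> float:
    # One common mel-scale convention (an assumption of this sketch).
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m: float) -> float:
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame: Sequence[float]) -> List[float]:
    """Power spectrum via a direct DFT (slow, but needs no libraries)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        spec.append(abs(acc) ** 2)
    return spec


def mel_filterbank(num_bins: int, n_fft: int,
                   sample_rate: float) -> List[List[float]]:
    """Triangular weights whose edges are equally spaced on the mel axis."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (num_bins + 1)) for i in range(num_bins + 2)]
    freqs = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for b in range(num_bins):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        row = []
        for f in freqs:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising edge
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(frame: Sequence[float], sample_rate: float,
         num_bins: int = 12, num_coeffs: int = 8) -> List[float]:
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_bins, len(frame), sample_rate)
    # Integrate the spectrum over each triangular bin, then take the log
    # (the 1e-10 floor only guards against log(0)).
    log_e = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
             for row in bank]
    # Unnormalized DCT-II of the log bin energies.
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / num_bins)
                for m, e in enumerate(log_e))
            for c in range(num_coeffs)]
```

Because the logarithm turns an overall gain change into a constant offset across bins, and the cosine transform of a constant is zero for every coefficient after the first, the higher coefficients in this sketch are essentially insensitive to signal level.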
A number of TTS products using concatenative speech generation methods are now commercially available. These products generally use a large speech database (typically 100 MB to 1 GB) in order to avoid auditory discontinuities and produce pleasant-sounding speech with widely variable pitch. For some applications, however, this memory requirement is excessive, and new TTS techniques are needed in order to reduce the database size without compromising the quality of synthesized speech. Chazan et al. describe work directed toward this objective in a paper entitled “Reducing the Footprint of the IBM Trainable Speech Synthesis System,” in ICSLP-2002 Conference Proceedings (Denver, Colo.), pages 2381-2384, which is incorporated herein by reference.