1. Field of the Invention
The present invention relates to the field of concatenative text-to-speech (TTS) voice generation and, more particularly, to reducing recording time when constructing a concatenative TTS voice using a reduced script and pre-recorded speech assets.
2. Description of the Related Art
Concatenative text-to-speech (TTS) synthesis is based on a concatenation of units of recorded speech. Generally, concatenative TTS systems produce more natural-sounding speech than other synthesis methods, such as formant synthesis. The three main sub-types of concatenative synthesis are diphone synthesis, domain-specific synthesis, and unit selection synthesis.
Diphone synthesis uses a minimal speech database containing all the diphones occurring in a language. Only one example of each diphone is contained in a diphone synthesis database. At runtime, target prosody of a sentence is superposed on the diphone units using digital signal processing (DSP) techniques. Diphone synthesis suffers from sonic abnormalities, which are especially pronounced at boundary or splice points. Abnormalities are caused by differences in pitch, volume, time shifting, and other speech characteristics. Few commercial programs use diphone synthesis because it produces results that sound significantly less natural (approximately equivalent to formant results) than other concatenative TTS sub-types and it lacks the robust customization of formant synthesis techniques.
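By way of illustration only, the following minimal Python sketch shows how a phone sequence can be mapped onto diphones and how the single stored example of each diphone can be concatenated. The function names, the "sil" silence label, and the dictionary-based inventory are hypothetical and are not taken from any particular system; the prosody modification performed by DSP at runtime is omitted.

```python
def to_diphones(phones, silence="sil"):
    """Return the diphone names for a phone sequence padded with silence."""
    padded = [silence, *phones, silence]
    # Each diphone spans the transition from one phone to the next.
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def concatenate_diphones(phones, inventory):
    """Join the one stored waveform per diphone (hypothetical inventory dict)."""
    samples = []
    for name in to_diphones(phones):
        # A KeyError here means the diphone was never recorded.
        samples.extend(inventory[name])
    return samples
```

Because the inventory holds only one example per diphone, every mismatch in pitch or volume between adjacent entries appears directly at the splice points, which is consistent with the abnormalities described above.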
Domain-specific synthesis concatenates prerecorded words and phrases to create complete utterances. Domain-specific synthesis is often used in applications having limited output options. Output quality of domain-specific synthesis can be very high, but vocabulary breadth for domain-specific synthesis can be low. As the size of the domain increases, the set of needed phrases increases geometrically. When a needed vocabulary is large, a synthesis technique capable of generating an unlimited vocabulary (such as unit selection synthesis) should be used in place of domain-specific synthesis.
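As a hedged illustration, the sketch below assembles an utterance from prerecorded clips using a slot-filled template, and counts how many distinct utterances a set of slots can produce. The template format, the slot names, and both function names are assumptions made for this example. The count is one possible reading of why the needed material grows quickly as a domain widens: each added slot multiplies the number of utterances that must sound natural.

```python
from math import prod


def build_utterance(template, slot_values, recordings):
    """Return the prerecorded clips for a template, in playback order."""
    clips = []
    for part in template:
        # A slot name maps to its chosen phrase; fixed carrier text passes through.
        phrase = slot_values.get(part, part)
        clips.append(recordings[phrase])
    return clips


def distinct_utterances(slot_options):
    """Number of different utterances the slots can produce (product of slot sizes)."""
    return prod(len(options) for options in slot_options.values())
```

For a hypothetical talking clock with an hour slot of 12 values and a minute slot of 60 values, `distinct_utterances` returns 720.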
Unit selection synthesis relies on a corpus of recorded speech. This corpus is used to create a database of speech assets that together represent a concatenative TTS voice. During database creation, each recorded utterance is segmented into one or more units of varying size, which include phones, syllables, morphemes, words, phrases, and sentences. Each unit in the database is indexed based on acoustic parameters that can include pitch, duration, power, position in a syllable, neighboring phones, and/or the like. At runtime, a desired utterance is produced by determining a best set of candidate units from the database. The determination is typically made using one or more weighted decision trees. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. A vocabulary of unit selection synthesis is unlimited so long as enough units of speech are provided for complete phonetic coverage. Maximum naturalness typically requires unit selection speech databases to be very large. In many natural sounding unit selection synthesis systems, gigabytes of storage are needed for the recorded units of speech. In some circumstances, compression technologies can reduce the amount of storage space needed for unit selection synthesis to a more manageable size. A minimum recording time of dozens of hours may be required to generate speech recordings for a concatenative TTS voice (for unit selection synthesis).
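The runtime selection step can be sketched as follows. This is an illustrative simplification, not the weighted decision tree approach mentioned above: it substitutes a dynamic-programming search over a weighted target cost and a join cost, which is a common textbook formulation of unit selection. The `Unit` fields, the cost weights, and all function names are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    phone: str
    pitch: float     # Hz
    duration: float  # seconds


def target_cost(unit, spec, w_pitch=1.0, w_dur=100.0):
    """Weighted distance between a candidate unit and the requested prosody."""
    return (w_pitch * abs(unit.pitch - spec["pitch"])
            + w_dur * abs(unit.duration - spec["duration"]))


def join_cost(left, right):
    """Penalty for a pitch jump at the splice point between two units."""
    return abs(left.pitch - right.pitch)


def select_units(specs, database):
    """Pick one unit per spec so that total target + join cost is minimal."""
    if not specs:
        return []
    candidates = [[u for u in database if u.phone == s["phone"]] for s in specs]
    if any(not c for c in candidates):
        raise ValueError("database lacks coverage for a requested phone")
    # best[i][j] = (cheapest cost of a path ending at candidate j, back-pointer)
    best = [[(target_cost(u, specs[0]), None) for u in candidates[0]]]
    for i in range(1, len(specs)):
        row = []
        for u in candidates[i]:
            cost, prev = min(
                (best[i - 1][j][0] + join_cost(p, u), j)
                for j, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(u, specs[i]), prev))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(specs) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]
```

The `ValueError` branch reflects the coverage condition stated above: the vocabulary is unlimited only if the database contains at least one unit for every requested phone. A larger database gives the search more candidates per position, which is why naturalness tends to improve with database size.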
Accordingly, considerable development effort and cost are required to record speech and then to process the recorded speech to generate the speech assets needed for full phonetic coverage of a single TTS voice (for unit selection synthesis). This effort must be repeated for every concatenative TTS voice generated. Many parties interested in creating custom TTS voices, such as custom voices for a telematics system, find the cost of creating new voices prohibitive. Additionally, uniform recording conditions are necessary to generate a clean speech corpus. Conventionally, a voice talent reads a reference script in a recording studio, where the reference script is specifically constructed to result in a speech corpus that produces a TTS voice having full phonetic coverage. Costs of producing a TTS voice for unit selection synthesis would be substantially lower if the size of the script, which the voice talent speaks, were minimized.