The present invention relates to speech synthesis. In particular, the present invention relates to adaptation of general-purpose text-to-speech systems to specific domains.
Text-to-speech (TTS) technology enables a computerized system to communicate with users through synthesized speech. With burgeoning applications such as spoken dialog systems, call center services, and voice-enabled web and email services, increasing emphasis is placed on generating natural-sounding speech. The quality of synthesized speech is typically evaluated in terms of how natural or human-like the produced speech sounds.
Simply replaying a recording of an entire sentence or paragraph of speech can produce very natural-sounding speech. However, the complexity of human languages and the limitations of computer storage make it impossible to store every conceivable sentence that may occur in a text. Instead, systems have been developed that use a concatenative approach to speech synthesis. This concatenative approach combines stored speech samples representing small speech units such as phonemes, diphones, triphones, syllables or the like to form a larger speech signal unit.
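By way of illustration only, the concatenative approach described above can be sketched as follows. The sketch assumes diphone units and a toy inventory; the names `UNIT_INVENTORY`, `to_diphones` and `synthesize` are hypothetical, the sample values are placeholders rather than recorded audio, and practical systems also smooth the joins between units, which is omitted here.

```python
from typing import Dict, List

# Hypothetical unit inventory: each diphone label maps to a short list of
# waveform samples. The numbers are placeholders, not real recorded speech.
UNIT_INVENTORY: Dict[str, List[float]] = {
    "_-h": [0.0, 0.1],
    "h-ai": [0.2, 0.3, 0.4],
    "ai-_": [0.1, 0.0],
}


def to_diphones(phonemes: List[str]) -> List[str]:
    """Turn a phoneme sequence into diphone labels, padded with silence ('_')."""
    padded = ["_"] + phonemes + ["_"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def synthesize(phonemes: List[str], inventory: Dict[str, List[float]]) -> List[float]:
    """Concatenate the stored samples of each required diphone, in order."""
    signal: List[float] = []
    for label in to_diphones(phonemes):
        if label not in inventory:
            # A missing unit is exactly the coverage problem an inventory faces.
            raise KeyError(f"unit {label!r} is missing from the inventory")
        signal.extend(inventory[label])
    return signal
```

Calling `synthesize(["h", "ai"], UNIT_INVENTORY)` joins the three stored units into one longer sample sequence, while a phoneme sequence that needs an unstored diphone raises an error.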
Concatenation-based speech synthesis has been widely adopted and rapidly developed. To some extent, this type of speech synthesis involves collecting, annotating, indexing and retrieving speech units within large databases. It follows that the naturalness of the synthesized speech depends to some extent on the size and coverage of a given unit inventory. Due to the complexity of human languages and the limitations of computer storage and processing, expanding the unit inventory across the board is not a particularly efficient way to increase the naturalness of speech for a general-purpose TTS system. However, expanding the unit inventory is a reasonable method for increasing the naturalness of speech within a specific domain for a domain-specific TTS system.
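As a rough illustration of what the coverage of a unit inventory can mean for a given domain, the following sketch computes the fraction of unit occurrences in a domain text that an inventory already contains. The unit type (adjacent-symbol pairs), the function names and the token-level definition of coverage are assumptions made for this example and are not taken from any particular system.

```python
from typing import Iterable, List, Set


def units_of(symbols: List[str]) -> List[str]:
    """Hypothetical unit extraction: label each pair of adjacent symbols."""
    return [f"{a}-{b}" for a, b in zip(symbols, symbols[1:])]


def coverage(domain_sentences: Iterable[List[str]], inventory: Set[str]) -> float:
    """Fraction of unit occurrences in the domain text found in the inventory."""
    total = 0
    covered = 0
    for sentence in domain_sentences:
        for unit in units_of(sentence):
            total += 1
            if unit in inventory:
                covered += 1
    # Assumption: an empty domain text counts as fully covered.
    return covered / total if total else 1.0
```

Under this definition, a low coverage value for a domain suggests that the domain mismatches the inventory, so additional domain-specific units would be worth collecting.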
The simplest way to generate speech prompts in domain-specific applications is to play back a collection of pre-stored waveforms for words, phrases and sentences. When the domain is narrow and closed, very natural speech prompts can be generated with this method at relatively low cost. However, when the domain is not closed or is broader, or when the number of domains increases, the cost of constructing and maintaining such prompt systems increases greatly.
In such cases, a general-purpose TTS system is preferred. However, general-purpose TTS systems sometimes cannot generate high-quality speech for some domains, especially when the domain does not match the speech corpus that is used as the unit inventory. It would be desirable to have a general-purpose TTS system that can produce reasonably natural speech without domain restrictions and that can generate more natural speech for a specific domain after domain adaptation. Domain adaptation is a concept that has been explored in many research areas; however, few studies have been conducted in the context of TTS systems. Efficient domain adaptation of a general-purpose TTS system can be accomplished through generation of an optimized script for collecting domain-specific speech.
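One possible way to picture script generation, offered only as a sketch under stated assumptions, is a greedy selection that repeatedly picks the candidate domain sentence contributing the most units not yet present in the inventory. The greedy criterion, the function name `select_script` and the representation of each sentence as a set of unit labels are assumptions for illustration; the text above does not specify how the script is optimized.

```python
from typing import Dict, List, Set


def select_script(
    candidates: Dict[str, Set[str]],
    existing_units: Set[str],
    max_sentences: int,
) -> List[str]:
    """Greedily choose sentences that add the most not-yet-covered units.

    candidates maps a sentence identifier to the set of units it contains.
    existing_units are the units already in the general-purpose inventory.
    """
    covered = set(existing_units)
    remaining = dict(candidates)
    script: List[str] = []
    while remaining and len(script) < max_sentences:
        # Iterate in sorted order so that ties resolve deterministically.
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # no remaining sentence adds a new unit
        script.append(best)
        covered |= gain
        del remaining[best]
    return script
```

A greedy pass of this kind keeps the recording script short because sentences that only repeat already-covered units are never selected.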