Conventional CTTS systems use a database of speech segments (e.g., phonemes, syllables, and/or entire words) recorded from a single speaker, from which speech segments are selected and concatenated based on an input text string. In order to achieve high-quality synthetic speech, however, a large amount of data must be collected from the single speaker, which makes the development of such a database time-consuming and costly.
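By way of illustration only, the select-and-concatenate operation described above may be sketched as follows. This is a minimal sketch under stated assumptions, not the method of any particular system: the names `synthesize`, `word_db`, `phoneme_db` and `lexicon` are hypothetical, short integer lists stand in for recorded audio samples, and the rule of preferring a whole-word recording over phoneme recordings is an assumption made for the example.

```python
from typing import Dict, List

Samples = List[int]  # toy stand-in for recorded audio samples


def synthesize(
    text: str,
    word_db: Dict[str, Samples],      # hypothetical: whole-word recordings
    phoneme_db: Dict[str, Samples],   # hypothetical: phoneme recordings
    lexicon: Dict[str, List[str]],    # hypothetical: word -> phoneme sequence
) -> Samples:
    """Concatenate recorded segments for each word of the input text.

    Assumed selection rule: use a whole-word recording when one exists,
    otherwise fall back to the word's phoneme recordings.
    """
    waveform: Samples = []
    for word in text.lower().split():
        if word in word_db:
            waveform.extend(word_db[word])
        else:
            # Raises KeyError if the word or one of its phonemes was never recorded.
            for phoneme in lexicon[word]:
                waveform.extend(phoneme_db[phoneme])
    return waveform


word_db = {"hello": [10, 11]}
phoneme_db = {"W": [1], "ER": [2], "L": [3], "D": [4]}
lexicon = {"world": ["W", "ER", "L", "D"]}

result = synthesize("Hello world", word_db, phoneme_db, lexicon)
```

In this sketch, any word or phoneme missing from the recordings raises an error, which mirrors why coverage of the database depends on how much material was collected from the speaker.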
Reference with regard to some conventional approaches may be had, for example, to U.S. Pat. No. 6,725,199 B2, “Speech Synthesis Apparatus and Selection Method”, Brittan et al.; U.S. Pat. No. 5,878,393, “High Quality Concatenative Reading System”, Hata et al.; and U.S. Pat. No. 5,860,064, “Method and Apparatus for Automatic Generation of Vocal Emotion in a Synthetic Text-to-Speech System”, Caroline G. Henton. For example, the system described in U.S. Pat. No. 5,878,393 employs a dictionary of sampled sounds, where the dictionary may include separate dictionaries of sounds sampled at different sampling rates. The dictionary may also store all pronunciation variants of a word for each of a plurality of prosodic environments.
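For illustration, one possible organization of such a dictionary is sketched below. It is an assumption made for this example and is not taken from the cited patent: the layout (sampling rate, then word and prosodic environment, then a list of pronunciation variants), the function names `add_variant` and `lookup`, and the environment labels such as `"phrase_final"` are all hypothetical.

```python
from typing import Dict, List, Tuple

Samples = List[int]  # toy stand-in for sampled audio

# Hypothetical layout:
# sampling rate in Hz -> (word, prosodic environment) -> pronunciation variants
SoundDictionary = Dict[int, Dict[Tuple[str, str], List[Samples]]]


def add_variant(
    d: SoundDictionary, rate_hz: int, word: str, environment: str, samples: Samples
) -> None:
    """Store one pronunciation variant under its sampling rate and environment."""
    d.setdefault(rate_hz, {}).setdefault((word, environment), []).append(samples)


def lookup(
    d: SoundDictionary, rate_hz: int, word: str, environment: str
) -> List[Samples]:
    """Return all stored variants, or an empty list if none were recorded."""
    return d.get(rate_hz, {}).get((word, environment), [])


sounds: SoundDictionary = {}
add_variant(sounds, 16000, "read", "phrase_final", [1, 2, 3])
add_variant(sounds, 16000, "read", "phrase_final", [4, 5])
add_variant(sounds, 8000, "read", "phrase_initial", [6])

final_variants = lookup(sounds, 16000, "read", "phrase_final")
missing = lookup(sounds, 8000, "read", "phrase_final")
```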
New domains for deploying text-to-speech invariably arise, usually accompanied by a desire to supplement the database of recordings used to build a CTTS system with additional data corresponding to words, phrases and/or sentences which are highly relevant to the new domain, such as specific company names or technical phrases not present in the original script.
However, if the original speaker whose voice was recorded and sampled to populate the dictionary is no longer available to make an additional recording, a new speaker may be required to re-record the entire original script in addition to the new domain-specific script. Such a process would be inefficient for a number of reasons.