Text-to-speech (TTS) systems generate output speech based upon input text. FIG. 1 depicts a representative conventional TTS system 100 which performs concatenative speech generation. In representative system 100, input text 105 (e.g., received from a user, an application, or one or more other entities) is processed by linguistic analysis component (LAC) 110 to generate phonetic transcription 115. Unit selection module 120 processes the phonetic transcription generated by LAC 110 to select speech units from speech base 125 that correspond to the sounds (e.g., phonemes) in the phonetic transcription and concatenates those speech units to generate speech output 130.
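The flow through representative system 100 can be sketched roughly as follows. This is a minimal illustration, not an implementation of any particular TTS system: the names `LEXICON`, `SpeechUnit`, `build_toy_speech_base`, `linguistic_analysis`, `unit_selection` and `synthesize` are hypothetical, the lexicon lookup stands in for linguistic analysis component 110, and the speech base is assumed to hold exactly one placeholder unit per phoneme, whereas an actual speech base would typically hold many candidate units for each sound.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical toy lexicon standing in for linguistic analysis component 110.
LEXICON: Dict[str, List[str]] = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


@dataclass
class SpeechUnit:
    phoneme: str
    samples: List[float]  # placeholder audio samples, not real recordings


def build_toy_speech_base() -> Dict[str, SpeechUnit]:
    """Build a stand-in for speech base 125 with one unit per phoneme."""
    phonemes = sorted({p for seq in LEXICON.values() for p in seq})
    return {p: SpeechUnit(p, [float(i)] * 3) for i, p in enumerate(phonemes)}


def linguistic_analysis(text: str) -> List[str]:
    """Map input text 105 to a phonetic transcription 115."""
    transcription: List[str] = []
    for token in text.lower().split():
        word = token.strip(".,!?")
        if word:
            # Raises KeyError for words outside the toy lexicon.
            transcription.extend(LEXICON[word])
    return transcription


def unit_selection(transcription: List[str],
                   speech_base: Dict[str, SpeechUnit]) -> List[float]:
    """Select the unit for each phoneme and concatenate the samples in order."""
    output: List[float] = []
    for phoneme in transcription:
        output.extend(speech_base[phoneme].samples)
    return output


def synthesize(text: str, speech_base: Dict[str, SpeechUnit]) -> List[float]:
    """Run the full pipeline: text -> transcription -> concatenated output."""
    return unit_selection(linguistic_analysis(text), speech_base)
```

Under these assumptions, calling `synthesize("Hello, world!", build_toy_speech_base())` yields one run of placeholder samples per phoneme, in transcription order.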
Conventional TTS systems may be capable of generating output speech in different styles. A style of speech is defined mainly by the tone, attitude and/or mood which the speech adopts toward a subject to which it is directed. For example, a didactic speech style is typically characterized by a slow, calm tone which an adult would typically adopt in teaching a child, with pauses interspersed between spoken words to enhance intelligibility. Other speech styles which conventional TTS systems may generate include neutral, joyful, sad and ironic speech styles.
A speech style is characterized to some extent by a combination of underlying speech parameters (e.g., speech rate, volume, duration, pitch height, pitch range, intonation, rhythm, the presence or absence of pauses, etc.), and by how those parameters vary over time, both within words and across multiple words. However, while speech in a first style may be characterized by a different range of values for a specific parameter than speech in a second style (e.g., speech in a joyful style may have a faster speech rate than speech in a neutral style), simply modifying the speech in the first style to exhibit the parameter values characteristic of the second style does not produce speech in the second style (e.g., one cannot produce speech in a joyful style simply by speeding up speech in a neutral style).
Conventional concatenative TTS systems generate speech output in more than one style by employing a different “voice” for each style, with each “voice” having an associated style-specific linguistic analysis component (LAC) and speech base. A style-specific linguistic analysis component may include programmatically implemented linguistic rules relating to a particular speech style. A style-specific speech base may store speech units generated from recordings of a speaker speaking in the particular speech style, or derivations of such recordings (e.g., produced by applying filters, pitch modifications or other post-processing).
A representative conventional concatenative TTS architecture 200 operative to generate output speech in neutral, joyful or didactic styles is depicted in FIG. 2. Architecture 200 includes systems 200A, 200B and 200C, with system 200A being operative to generate speech in a neutral style, system 200B being operative to generate speech in a joyful style, and system 200C being operative to generate speech in a didactic style. Each system includes an associated style-specific linguistic analysis component (LAC) and speech base. Thus, system 200A includes neutral style-specific linguistic analysis component (LAC) 210A and neutral style-specific speech base 225A. Similarly, system 200B includes joyful style-specific linguistic analysis component (LAC) 210B and joyful style-specific speech base 225B, and system 200C includes didactic style-specific linguistic analysis component (LAC) 210C and didactic style-specific speech base 225C. Linguistic analysis components 210A, 210B and 210C process respective input text 205A, 205B and 205C to generate phonetic transcriptions 215A, 215B and 215C. The phonetic transcriptions are processed by respective unit selection modules 220A, 220B and 220C to generate speech output. That is, unit selection module 220A processes phonetic transcription 215A to select and concatenate speech units from neutral style-specific speech base 225A to produce neutral speech output 230A, unit selection module 220B processes phonetic transcription 215B to select and concatenate speech units from joyful style-specific speech base 225B to produce joyful speech output 230B, and unit selection module 220C processes phonetic transcription 215C to select and concatenate speech units from didactic style-specific speech base 225C to produce didactic speech output 230C.
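The one-voice-per-style arrangement of architecture 200 might be sketched as below. This is an illustrative simplification under stated assumptions: `Voice`, `make_lac`, `make_speech_base`, `VOICES` and `synthesize` are hypothetical names, individual letters stand in for phonemes, each speech unit is represented by a labeled string rather than recorded audio, and the only style-specific linguistic rule modeled is the insertion of pauses between words for the didactic style.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List
import string

PAUSE = "_"  # hypothetical pause symbol in the transcription


@dataclass
class Voice:
    analyze: Callable[[str], List[str]]  # style-specific LAC (210A/210B/210C)
    speech_base: Dict[str, str]          # style-specific speech base (225A/225B/225C)


def make_lac(insert_pauses: bool) -> Callable[[str], List[str]]:
    """Return a toy LAC; letters stand in for phonemes (an assumption)."""
    def analyze(text: str) -> List[str]:
        symbols: List[str] = []
        for i, word in enumerate(text.lower().split()):
            if i > 0 and insert_pauses:
                symbols.append(PAUSE)  # pause between words, as in didactic speech
            symbols.extend(word)
        return symbols
    return analyze


def make_speech_base(style: str) -> Dict[str, str]:
    """Each style keeps its own units; labels stand in for recordings."""
    return {s: f"{style}:{s}" for s in string.ascii_lowercase + PAUSE}


# One complete voice (LAC plus speech base) per supported style.
VOICES: Dict[str, Voice] = {
    "neutral": Voice(make_lac(False), make_speech_base("neutral")),
    "joyful": Voice(make_lac(False), make_speech_base("joyful")),
    "didactic": Voice(make_lac(True), make_speech_base("didactic")),
}


def synthesize(style: str, text: str) -> List[str]:
    """Select and concatenate units drawn only from the chosen style's base.

    Input is assumed to contain only letters and spaces.
    """
    voice = VOICES[style]
    transcription = voice.analyze(text)
    return [voice.speech_base[symbol] for symbol in transcription]
```

In this sketch, supporting an additional style means adding another complete entry to `VOICES`, with its own analysis rules and its own speech base, which mirrors the duplication of components across systems 200A, 200B and 200C.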