The present invention relates generally to text-to-speech (TTS) systems and speech synthesis. More particularly, the invention relates to a system for generating duration templates which can be used in a text-to-speech system to provide more natural-sounding speech synthesis.
The task of generating natural, human-sounding prosody for text-to-speech and speech synthesis has historically been one of the most challenging problems that researchers and developers have had to face. Text-to-speech systems have in general become infamous for their unnatural prosody, such as "robotic" intonations or incorrect sentence rhythm and timing. To address this problem, some prior systems have used neural networks and vector clustering algorithms in an attempt to simulate natural-sounding prosody. Aside from being only marginally successful, these "black box" computational techniques give the developer no feedback regarding which parameters are crucial for natural-sounding prosody.
The present invention builds upon a different approach which was disclosed in a prior patent application entitled "Speech Synthesis Employing Prosody Templates". In the disclosed approach, samples of actual human speech are used to develop prosody templates. The templates define a relationship between syllabic stress patterns and certain prosodic variables such as intonation (F0) and duration, especially focusing on F0 templates. Thus, unlike prior algorithmic approaches, the disclosed approach uses naturally occurring lexical and acoustic attributes (e.g., stress pattern, number of syllables, intonation, duration) that can be directly observed and understood by the researcher or developer.
The previously disclosed approach stores the prosody templates for intonation (F0) and duration information in a database that is accessed by specifying the number of syllables and the stress pattern associated with a given word. A word dictionary is provided to supply the system with the requisite information concerning number of syllables and stress patterns. The text processor generates phonemic representations of input words, using the word dictionary to identify the stress pattern of the input words. A prosody module then accesses the database of templates, using the number of syllables and the stress pattern as the lookup key. A prosody template for the given word is thereby obtained from the database and used to supply prosody information to the sound generation module, which generates synthesized speech based on the phonemic representation and the prosody information.
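By way of illustration only, the template lookup described above might be sketched as follows. The dictionary entries, the "1"/"0" stress-pattern encoding, the numeric template values, and the names WORD_DICTIONARY, TEMPLATE_DATABASE, and lookup_template are all hypothetical and chosen for the example; they are not taken from the disclosed system.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class ProsodyTemplate(NamedTuple):
    """One prosody template: per-syllable F0 and duration values (illustrative numbers)."""
    f0: List[float]
    duration: List[float]


# Hypothetical word dictionary: word -> stress pattern, one character per
# syllable ("1" = stressed, "0" = unstressed). The encoding is an assumption.
WORD_DICTIONARY: Dict[str, str] = {
    "computer": "010",
    "table": "10",
}

# Hypothetical template database keyed by (number of syllables, stress pattern).
TEMPLATE_DATABASE: Dict[Tuple[int, str], ProsodyTemplate] = {
    (2, "10"): ProsodyTemplate(f0=[1.10, 0.90], duration=[1.20, 0.85]),
    (3, "010"): ProsodyTemplate(f0=[0.95, 1.15, 0.90], duration=[0.90, 1.25, 0.95]),
}


def lookup_template(word: str) -> Optional[ProsodyTemplate]:
    """Return the template for a word, or None if the word or its pattern is unknown."""
    stress_pattern = WORD_DICTIONARY.get(word.lower())
    if stress_pattern is None:
        return None
    # The number of syllables follows from the length of the stress pattern.
    key = (len(stress_pattern), stress_pattern)
    return TEMPLATE_DATABASE.get(key)


template = lookup_template("computer")
```

In this sketch, every word sharing the same syllable count and stress pattern retrieves the same template, which is why the template set stays small.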
The previously disclosed approach focuses on speech at the word level. Words are subdivided into syllables, which thus represent the basic unit of prosody. The stress pattern defined by the syllables determines the most perceptually important characteristics of both intonation (F0) and duration. At this level of granularity, the template set is quite small in size and easily implemented in text-to-speech and speech synthesis systems. While a word-level prosodic analysis using syllables is presently preferred, the prosody template techniques of the invention can be used in systems exhibiting other levels of granularity. For example, the template set can be expanded to allow for more grouping features, both at the sentence and word level. In this regard, duration modification (e.g., lengthening) caused by phrase or sentence position and type, segmental structure in a syllable, and phonetic representation can be used as attributes with which to categorize certain prosodic patterns.
Although text-to-speech systems based upon prosody templates that are derived from samples of actual human speech have held out the promise of greatly improved speech synthesis, those systems have been limited by the difficulty of constructing suitable duration templates. To obtain temporal prosody patterns, the purely segmental timing quantities must be factored out from the larger-scale prosodic effects. This has proven to be much more difficult than constructing F0 templates, wherein intonation information can be obtained by visually examining individual F0 data.
The present invention presents a method of separating high-level prosodic behavior from purely articulatory constraints so that high-level timing information can be extracted from human speech. The extracted timing information is used to construct duration templates that are employed for speech synthesis. Initially, the words of input text are segmented into phonemes and syllables, and the associated stress pattern is assigned. The stress-assigned words can then be assigned grouping features by a text grouping module. A phoneme cluster module groups the phonemes into phoneme pairs and single phonemes. A static duration associated with each phoneme pair and single phoneme is retrieved from a global static table. A normalization module generates a normalized duration value for a syllable based upon lengthening or shortening of the global static durations associated with the phonemes that comprise the syllable. The normalized duration value is stored in a duration template based upon the grouping features associated with that syllable.
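The steps summarized above can be illustrated with a minimal sketch. Several details here are assumptions made only for the example and are not stated in this summary: the greedy left-to-right pairing rule, the expression of the normalized value as the ratio of the observed syllable duration to the summed global static durations, the choice of (stress pattern, syllable index) as grouping features, and all table values and names.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Cluster = Tuple[str, ...]

# Hypothetical global static table: a static duration in milliseconds for each
# single phoneme and for each phoneme pair treated as a unit. Values are made up.
GLOBAL_STATIC_TABLE: Dict[Cluster, int] = {
    ("k",): 80,
    ("ae",): 120,
    ("n",): 70,
    ("t",): 60,
    ("ae", "n"): 200,
}


def cluster_phonemes(phonemes: Sequence[str]) -> List[Cluster]:
    """Group phonemes into pairs and singles.

    Assumption: scan left to right and form a pair whenever the two adjacent
    phonemes have an entry in the static table; otherwise emit a single.
    """
    clusters: List[Cluster] = []
    i = 0
    while i < len(phonemes):
        pair = tuple(phonemes[i:i + 2])
        if len(pair) == 2 and pair in GLOBAL_STATIC_TABLE:
            clusters.append(pair)
            i += 2
        else:
            clusters.append((phonemes[i],))
            i += 1
    return clusters


def static_duration(phonemes: Sequence[str]) -> int:
    """Sum the static durations of the clusters that make up a syllable.

    Raises KeyError if a single phoneme is missing from the table.
    """
    return sum(GLOBAL_STATIC_TABLE[c] for c in cluster_phonemes(phonemes))


def normalized_duration(observed_ms: float, phonemes: Sequence[str]) -> float:
    """Assumed normalization: values above 1 mean lengthening, below 1 shortening."""
    return observed_ms / static_duration(phonemes)


# Duration template: grouping features -> normalized values collected for them.
# The features used here (stress pattern, syllable index) are illustrative.
duration_template: Dict[Tuple[str, int], List[float]] = defaultdict(list)

syllable = ["k", "ae", "n"]                 # clusters: ("k",) and ("ae", "n")
value = normalized_duration(350, syllable)  # 350 ms observed vs. 280 ms static
duration_template[("10", 0)].append(value)
```

Under these assumptions, a syllable measured at 350 ms whose clusters sum to 280 ms in the static table yields a normalized value of 1.25, that is, a 25 percent lengthening attributable to prosodic context rather than to the phonemes themselves.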
For a more complete understanding of the invention, its objectives and advantages, refer to the following specification and to the accompanying drawings.