The present invention relates generally to text-to-speech (TTS) systems and speech synthesis. More particularly, the invention relates to a system for providing more natural sounding prosody through the use of prosody templates.
The task of generating natural human-sounding prosody for text-to-speech and speech synthesis has historically been one of the most challenging problems that researchers and developers have had to face. Text-to-speech systems have in general become infamous for their "robotic" intonations. To address this problem some prior systems have used neural networks and vector clustering algorithms in an attempt to simulate natural sounding prosody. Aside from being only marginally successful, these "black box" computational techniques give the developer no feedback regarding what the crucial parameters are for natural sounding prosody.
The present invention takes a different approach, in which samples of actual human speech are used to develop prosody templates. The templates define a relationship between syllabic stress patterns and certain prosodic variables such as intonation (F0) and duration. Thus, unlike prior algorithmic approaches, the invention uses naturally occurring lexical and acoustic attributes (e.g., stress pattern, number of syllables, intonation, duration) that can be directly observed and understood by the researcher or developer.
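By way of illustration only, the following Python sketch shows one plausible way that templates could be derived from human speech samples: measurements are grouped by stress pattern and averaged syllable by syllable. The averaging step, the function name build_templates, the encoding of a stress pattern as a string of "1" (stressed) and "0" (unstressed) characters, the units (Hz and seconds) and all sample values are assumptions made for this example; they are not taken from the description above.

```python
from collections import defaultdict
from statistics import mean


def build_templates(samples):
    """Group speech samples by stress pattern and average them per syllable.

    Each sample is (stress_pattern, f0_per_syllable, duration_per_syllable).
    The stress pattern encoding ("1" = stressed, "0" = unstressed) is an
    assumption made for this sketch.
    """
    grouped = defaultdict(list)
    for stress, f0, duration in samples:
        if len(f0) != len(stress) or len(duration) != len(stress):
            raise ValueError("expected one F0 and one duration value per syllable")
        grouped[stress].append((f0, duration))

    templates = {}
    for stress, items in grouped.items():
        # zip(*...) transposes the samples so each column is one syllable position.
        f0_columns = zip(*(f0 for f0, _ in items))
        duration_columns = zip(*(duration for _, duration in items))
        templates[stress] = {
            "f0": [mean(column) for column in f0_columns],
            "duration": [mean(column) for column in duration_columns],
        }
    return templates


# Hypothetical measurements: F0 in Hz and duration in seconds, per syllable.
samples = [
    ("10", (220.0, 180.0), (0.30, 0.20)),
    ("10", (200.0, 160.0), (0.26, 0.18)),
    ("01", (170.0, 230.0), (0.15, 0.35)),
]
templates = build_templates(samples)
```

Because each template is an average of directly observed values, a developer can inspect it and see how a given stress pattern shapes intonation and duration, in contrast to the opaque parameters of a trained network.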
The presently preferred implementation stores the prosody templates in a database that is accessed by specifying the number of syllables and stress pattern associated with a given word. A word dictionary is provided to supply the system with the requisite information concerning number of syllables and stress patterns. The text processor generates phonemic representations of input words, using the word dictionary to identify the stress pattern of the input words. A prosody module then accesses the database of templates, using the number of syllables and stress pattern as the lookup key. A prosody template for the given word is thereby obtained from the database and used to supply prosody information to the sound generation module, which generates synthesized speech based on the phonemic representation and the prosody information.
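The data flow just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the preferred implementation itself: the names WORD_DICTIONARY, TEMPLATE_DB, text_processor, prosody_module and prepare_for_sound_generation are hypothetical, as are the phoneme notation, the stress pattern encoding ("1" = stressed, "0" = unstressed), the units and every numeric value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProsodyTemplate:
    f0: tuple        # one intonation value per syllable (Hz assumed)
    duration: tuple  # one duration value per syllable (seconds assumed)


# Hypothetical word dictionary: word -> (phonemic representation, stress pattern).
WORD_DICTIONARY = {
    "table": ("T EY B AH L", "10"),
    "about": ("AH B AW T", "01"),
}

# Hypothetical template database keyed by (number of syllables, stress pattern).
TEMPLATE_DB = {
    (2, "10"): ProsodyTemplate(f0=(210.0, 170.0), duration=(0.28, 0.19)),
    (2, "01"): ProsodyTemplate(f0=(170.0, 230.0), duration=(0.15, 0.35)),
}


def text_processor(word):
    """Return the phonemic representation and stress pattern of a word."""
    return WORD_DICTIONARY[word.lower()]


def prosody_module(stress_pattern):
    """Fetch the template for a stress pattern; one character per syllable."""
    key = (len(stress_pattern), stress_pattern)
    return TEMPLATE_DB[key]


def prepare_for_sound_generation(word):
    """Bundle what a sound generation module would need for one word."""
    phonemes, stress_pattern = text_processor(word)
    template = prosody_module(stress_pattern)
    return {
        "phonemes": phonemes,
        "f0": template.f0,
        "duration": template.duration,
    }


result = prepare_for_sound_generation("Table")
```

In this sketch a word missing from the dictionary, or a stress pattern with no stored template, raises a KeyError; a practical system would need a fallback for such cases, which the description above does not specify.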
The presently preferred implementation focuses on speech at the word level. Words are subdivided into syllables, which thus represent the basic unit of prosody. The preferred system assumes that the stress pattern defined by the syllables determines the most perceptually important characteristics of both intonation (F0) and duration. At this level of granularity, the template set is quite small and easily implemented in text-to-speech and speech synthesis systems. While a word-level prosodic analysis using syllables is presently preferred, the prosody template techniques of the invention can be used in systems exhibiting other levels of granularity. For example, the template set can be expanded to allow for more feature determiners, both at the syllable and word level. In this regard, microscopic F0 perturbations caused by consonant type, voicing, intrinsic pitch of vowels and segmental structure in a syllable can be used as attributes with which to categorize certain prosodic patterns. In addition, the techniques can be extended beyond word-level F0 contours and duration patterns to phrase-level and sentence-level analyses.
For a more complete understanding of the invention, its objectives and advantages, refer to the following specification and to the accompanying drawings.