Prosody refers to the characteristics that contribute to the melodic and rhythmic vividness of speech, such as pitch, loudness, and syllabic duration. Concatenative speech synthesis systems that use a small unit inventory typically have a prosody-prediction component (as well as signal manipulation techniques). Such a prosody-prediction component, however, is generally unable to recreate the prosodic richness found in natural speech. As a result, the prosody of these systems is too dull to be convincingly human.
One previous approach to prosody generation used instance-based learning techniques for classification [see, for example, “Machine Learning”, Tom M. Mitchell, McGraw-Hill Series in Computer Science, 1997; incorporated herein by reference]. In contrast to learning methods that construct a general, explicit description of the target function when training examples are provided, instance-based learning methods simply store the training examples. Generalizing beyond these examples is postponed until a new instance must be classified. Each time a new query instance is encountered, its relationship to the previously stored examples is examined to assign a target function value to the new instance. The family of instance-based learning methods includes nearest neighbor and locally weighted regression methods, which assume that instances can be represented as points in a Euclidean space. It also includes case-based reasoning methods, which use more complex, symbolic representations for instances. A key advantage of this kind of delayed, or lazy, learning is that instead of estimating the target function once for the entire space, these methods can estimate it locally and differently for each new instance to be classified.
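The lazy behavior described above can be illustrated with a distance-weighted nearest neighbor sketch. This is only a minimal illustration, not code from any cited work: the class name `LazyKNNRegressor`, the inverse-distance weighting, and the choice of `k` are assumptions made for the example.

```python
import math


class LazyKNNRegressor:
    """Instance-based learner: training only stores examples (hypothetical name)."""

    def __init__(self, k=3):
        self.k = k
        self.examples = []  # list of (feature tuple, target value)

    def store(self, features, target):
        # "Training" is just memorization; no general model is built here.
        self.examples.append((tuple(features), float(target)))

    @staticmethod
    def _distance(a, b):
        # Euclidean distance between two points of equal dimension.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def predict(self, query):
        # Generalization is postponed until a query arrives.
        if not self.examples:
            raise ValueError("no stored examples")
        query = tuple(query)
        nearest = sorted(
            ((self._distance(f, query), t) for f, t in self.examples),
            key=lambda pair: pair[0],
        )[: self.k]
        # An exact match returns its stored target (avoids division by zero).
        for dist, target in nearest:
            if dist == 0.0:
                return target
        # Assumed weighting: closer neighbors count more (inverse distance).
        weights = [1.0 / dist for dist, _ in nearest]
        weighted_sum = sum(w * t for w, (_, t) in zip(weights, nearest))
        return weighted_sum / sum(weights)
```

Because the estimate is computed from the neighbors of each query, the approximation of the target function is local and can differ from one query to the next.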
One specific approach to prosody generation using instance-based learning was described in F. Malfrère, T. Dutoit, P. Mertens, “Automatic Prosody Generation Using Suprasegmental Unit Selection,” in Proc. of ESCA/COCOSDA Workshop on Speech Synthesis, Jenolan Caves, Australia, 1998; incorporated herein by reference. That paper describes a system that uses prosodic databases extracted from natural speech to generate the rhythm and intonation of texts written in French. The rhythm of the synthetic speech is generated with a CART tree trained on a large mono-speaker speech corpus. The acoustic aspect of the intonation is derived from the same speech corpus. At synthesis time, patterns are chosen on the fly from the database so as to minimize a total selection cost composed of a pattern target cost and a pattern concatenation cost. The patterns that are used in the selection mechanism describe intonation on a symbolic level as a series of accent types. The elementary units that are used for intonation generation are intonational groups, each of which consists of a sequence of syllables. This prosody generation algorithm is currently freely available as part of the EULER framework for the development of text-to-speech (TTS) systems for non-commercial and non-military applications at http://tcts.fpms.ac.be/synthesis/euler.
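Minimizing a total cost made up of target costs and concatenation costs is commonly done with dynamic programming. The following is a minimal sketch of that general idea and not the implementation of the cited system: the `Pattern` fields, the mismatch-count target cost, and the f0-jump concatenation cost (with its scaling by 100) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    """A stored intonation pattern (hypothetical structure)."""
    accents: tuple   # symbolic accent types, one per syllable
    f0_start: float  # f0 in Hz at the start of the pattern
    f0_end: float    # f0 in Hz at the end of the pattern


def target_cost(wanted, pattern):
    # Assumed cost: accent mismatches plus a penalty for length difference.
    mismatches = sum(a != b for a, b in zip(wanted, pattern.accents))
    return mismatches + abs(len(wanted) - len(pattern.accents))


def concat_cost(prev, nxt):
    # Assumed cost: size of the f0 jump at the join, scaled arbitrarily.
    return abs(prev.f0_end - nxt.f0_start) / 100.0


def select_patterns(targets, database):
    """Return (patterns, total cost) minimizing target + concatenation costs."""
    if not targets or not database:
        return [], 0.0
    # best[j] = (cost, path) of the cheapest sequence ending in database[j]
    best = [(target_cost(targets[0], p), [j]) for j, p in enumerate(database)]
    for wanted in targets[1:]:
        new_best = []
        for j, cand in enumerate(database):
            cost, path = min(
                ((c + concat_cost(database[p[-1]], cand), p) for c, p in best),
                key=lambda item: item[0],
            )
            new_best.append((cost + target_cost(wanted, cand), path + [j]))
        best = new_best
    total, path = min(best, key=lambda item: item[0])
    return [database[j] for j in path], total
```

With a database of a few patterns, `select_patterns([("H", "L"), ("L", "H")], database)` returns the sequence whose accent types match the request and whose joins have the smallest f0 jumps under these assumed costs.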
U.S. Pat. No. 5,905,972 “Prosodic Databases Holding Fundamental Frequency Templates For Use In Speech Synthesis” (incorporated herein by reference) describes an algorithm that is very similar to the one in Malfrère et al. Prosodic templates are identified by a tonal emphasis marker pattern, which is matched with a pattern that is predicted from text. The patterns (or templates) consist of a sequence of tonal markings applied to syllables: high emphasis, low emphasis, or no special emphasis. This method generates only fundamental frequency (f0) contours; it does not generate phoneme durations.
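Template lookup by marker pattern can be pictured as a keyed table of f0 contours. The sketch below is only an illustration of that general idea and not the procedure of the cited patent: the marker symbols `"H"`, `"L"`, and `"N"`, the f0 values, and the fallback to the closest same-length template are assumptions.

```python
def find_template(predicted, templates):
    """Look up an f0 contour by tonal emphasis marker pattern.

    predicted: markers predicted from text, one per syllable
               ("H" = high, "L" = low, "N" = no special emphasis; assumed symbols)
    templates: dict mapping marker tuples to f0 contours (Hz per syllable)
    """
    key = tuple(predicted)
    if key in templates:
        return templates[key]  # exact pattern match
    # Assumed fallback: the same-length template with the fewest mismatches.
    same_length = [k for k in templates if len(k) == len(key)]
    if not same_length:
        return None
    closest = min(same_length, key=lambda k: sum(a != b for a, b in zip(k, key)))
    return templates[closest]


# Hypothetical template database with made-up f0 values.
example_templates = {
    ("H", "N", "L"): [180.0, 150.0, 110.0],
    ("N", "H", "N"): [140.0, 190.0, 150.0],
}
```

The lookup returns only an f0 contour, so durations would have to come from a separate component in a scheme like this.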