The field of sound synthesis, and in particular speech synthesis, has received less attention historically than fields such as speech recognition. This may be because early in the research process, the problem of generating intelligible speech was solved, while the problem of recognition is only now being solved. However, these traditional speech synthesis solutions still suffer from many disadvantages. For example, conventional speech synthesis systems are difficult and tiring to listen to, can garble the meaning of an utterance, are inflexible, unchanging, unnatural-sounding and generally `robotic` sounding. These disadvantages stem from difficulties in reproducing or generating the subtle changes in pitch, cadence (segmental duration), and other vocal qualities (often referred to as prosodics) which characterize natural speech. The same is true of the transitions between speech segments themselves (formants, diphones, LPC parameters, etc.).
The traditional approaches in the art to generating these subtler qualities of speech tend to operate under the assumption that the small variations in quantities such as pitch and duration observed in natural human speech are just noise and can be discarded. As a result, these approaches have primarily used inflexible methods involving fixed formulas, rules and the concatenation of a relatively small set of prefigured geometric contour segments. These approaches thus eliminate or ignore what might be referred to as microprosody and other microvariations within small pieces of speech.
Recently, the art has seen some attempts to use learning machines to create more flexible systems which respond more reasonably to context and which generate somewhat more complex and evolving parameter (e.g., pitch) contours. For example, U.S. Pat. No. 5,668,926 issued to Karaali et al. describes such a system. However, these approaches are also flawed. First, they organize their learning architecture around fixed-width time slices, typically on the order of 10 ms per time slice. These fixed time segments, however, are not inherently or meaningfully related to speech or text. Second, they have difficulty making use of the context of any particular element of the speech: what context is present is represented at the same level as the fixed time slices, severely limiting the effective width of context that can be used at one time. Similarly, different levels of context are confused, making it difficult to exploit the strengths of each. Additionally, by marrying context to fixed-width time slices, the learning engine is not presented with a stable number of symbolic elements (e.g., phonemes or words.) over different patterns.
Finally, none of these models from the prior art attempt application of learning models to non-verbal sound modulation and generation, such as musical phrasing, non-lexical vocalizations, etc. Nor do they address the modulation and generation of emotional speech, voice quality variation (whisper, shout, gravelly, accent), etc.