Speech synthesis seeks to model the actions of the human vocal tract at some level of detail. Conventional speech synthesis systems, such as resonance, vocal-tract, and LPC (linear predictive coding) synthesizers, typically use sets of equations to compute the next sound sample from a given input, or source, and a short list of previous outputs. Resonance synthesizers, for example, use a set of equations for each resonance below 4 kHz. Vocal-tract and LPC synthesizers use sets of equations to describe the sound at different places in the human vocal tract.
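By way of illustration only, the computation of a next sample from a source sample and a short list of previous outputs can be sketched with a generic second-order digital resonator of the kind commonly used in formant synthesis. This is a minimal sketch and not the equations of any particular synthesizer discussed here; the function names, the coefficient formulas, and the example frequency, bandwidth, and sample rate are assumptions chosen for the example.

```python
import math


def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz):
    """Coefficients of a generic two-pole resonator (assumed textbook form)."""
    t = 1.0 / sample_rate_hz
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to one
    return a, b, c


def resonate(source, freq_hz, bandwidth_hz, sample_rate_hz):
    """Compute each output from the current input and the two previous outputs."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz)
    y1 = 0.0  # previous output, y[n-1]
    y2 = 0.0  # output before that, y[n-2]
    out = []
    for x in source:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


# Hypothetical usage: one resonance at 500 Hz, 60 Hz wide, sampled at 10 kHz,
# driven by a single impulse as the source.
impulse = [1.0] + [0.0] * 99
response = resonate(impulse, 500.0, 60.0, 10000.0)
```

A resonance synthesizer of the kind described above would run one such computation per resonance, each with its own frequency and bandwidth.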
Because human muscle tissue changes shape slowly in comparison to the durations of speech sounds, the human vocal tract produces smooth transitions from one speech state to another. Accordingly, it is not enough for conventional synthesizers to string together sequences of steady, invariant sounds. For one thing, abrupt jumps between sounds create distracting, non-speech-like clicks and pops. For another, much of the identity of consonants, as well as of some vowels, is conveyed not by steady states but by the manner of change from one state of speech sound to the next. Nuances in the character of various speech elements convey sentence structure, emphasis, and a host of less tangible communications, such as happiness, determination, and skepticism. Further, details with no direct communicative value may still be important, because any audible deviation from what listeners expect is a distraction or, worse, a misdirection. Sounding natural and pleasant therefore requires correctness in fine detail. Approaches to reproducing transitional details in speech synthesis typically follow one of two methods: transitions by rule, or transitions by use of stored data.
The rules approach, which is used by many commercial synthesizers, describes transitions between speech elements as geometric curves plotted against time. The rules approach can describe the motions of vocal-tract resonances, or the motions of the tongue, lips, jaw, etc. The stored-data approach, by comparison, typically records and analyzes natural speech and excerpts from it examples of transitions between pairs of speech elements or, more generally, sequences beginning with half of one speech element and ending with half of another. Both approaches have several problems. They are constrained to reproducing only first-order interactions between adjacent speech elements, and their strict rules for reproducing each speech element fail to account for the variation in real speech elements that arises from stress and from position relative to syllable and word boundaries. The rules approach typically settles for a simplistic representation of excitation, in part because the transient behavior of excitation appears to be too complex to describe by a rule. In contrast, the stored-data approach reproduces these transitions, but only for the cases stored in a processing system, and those cases are inherently limited by the quantity of marked and collected combinations of speech elements, stress and boundary examples, and context, not to mention the processing resources and storage devices available. The foregoing problems and constraints remain a dominant obstacle to producing accurate, and hence commercially desirable, speech synthesizers.
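The contrast between the two approaches can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of any particular commercial system: the raised-cosine curve, the function names, the element labels, and the frequency values are all hypothetical placeholders rather than measured data.

```python
import math


def rule_transition(start_hz, end_hz, duration_s, frame_rate_hz=200):
    """Rules approach: describe a transition as a curve plotted against time.

    A raised-cosine curve is assumed here purely for illustration.
    """
    n = max(2, int(round(duration_s * frame_rate_hz)))
    track = []
    for i in range(n):
        u = i / (n - 1)                          # normalized time, 0 to 1
        w = 0.5 - 0.5 * math.cos(math.pi * u)    # smooth weight, 0 to 1
        track.append(start_hz + (end_hz - start_hz) * w)
    return track


# Stored-data approach: transitions excerpted from recorded speech, keyed by
# the pair of speech elements. The single entry below is a made-up placeholder.
STORED_TRANSITIONS = {
    ("a", "i"): [700.0, 640.0, 520.0, 380.0, 300.0],
}


def stored_transition(first, second):
    """Return the stored track for a pair, or None if it was never collected."""
    return STORED_TRANSITIONS.get((first, second))


rule_track = rule_transition(700.0, 300.0, 0.05)  # available for any pair of targets
known = stored_transition("a", "i")               # stored, so it can be reproduced
missing = stored_transition("i", "u")             # None: this pair was not collected
```

The rule produces a curve for any pair of targets but only in the shape the rule prescribes, whereas the table reproduces recorded detail but returns nothing for a combination that was never collected.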