Numerous applications require the modeling of phenomena which are smooth and subject to constraints. A notable example of such an application is a text to speech system. Generation of speech is typically smooth because the muscles used to produce speech have a nonzero mass and therefore cannot be subjected to instantaneous acceleration. Moreover, the particular size, shape, placement and other properties of the muscles producing speech impose constraints on the speech which can be produced. A text to speech system preferably produces speech which changes smoothly and constrains the speech which is produced so that the speech sounds as natural as possible.
A text to speech system receives text inputs, typically words and sentences, and converts these inputs into spoken words and sentences. The text to speech system employs a model of a specific speaker's speech to construct an inventory of speech units and models of prosody in response to each pronounceable unit of text. Prosodic characteristics of speech are the rhythmic and intonational characteristics of speech. The system then arranges the speech units into the sequence represented by the text and plays the sequence of speech units. A typical text to speech system performs text analysis to predict phone sequences, duration modeling to predict the length of each phone, intonation modeling to predict pitch contours, and signal processing to combine the results of the different analyses and modules in order to create speech sounds.
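The stages named above can be illustrated with a minimal sketch, offered only under stated assumptions. The lexicon, the vowel set, the fixed durations, the linear pitch fall, and every function name below are hypothetical and are not taken from any particular text to speech system. The signal processing stage is omitted; the sketch stops at a phone sequence annotated with durations and pitch targets.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Phone:
    symbol: str
    duration_ms: float = 0.0
    pitch_hz: float = 0.0


# Hypothetical toy lexicon and vowel set, for illustration only.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}
VOWELS = {"AH", "OW", "ER"}


def analyze_text(text: str) -> List[Phone]:
    """Text analysis: map each known word to its phone sequence."""
    phones = []
    for word in text.lower().split():
        for symbol in LEXICON.get(word.strip(".,!?"), []):
            phones.append(Phone(symbol))
    return phones


def model_durations(phones: List[Phone]) -> List[Phone]:
    """Duration modeling: assumed fixed lengths for vowels and consonants."""
    for phone in phones:
        phone.duration_ms = 120.0 if phone.symbol in VOWELS else 60.0
    return phones


def model_intonation(phones: List[Phone],
                     start_hz: float = 130.0,
                     end_hz: float = 100.0) -> List[Phone]:
    """Intonation modeling: an assumed neutral, linearly falling contour."""
    n = len(phones)
    for i, phone in enumerate(phones):
        frac = i / (n - 1) if n > 1 else 0.0
        phone.pitch_hz = start_hz + frac * (end_hz - start_hz)
    return phones


def synthesize(text: str) -> List[Phone]:
    """Run the three modeling stages in order; waveform generation is omitted."""
    return model_intonation(model_durations(analyze_text(text)))


if __name__ == "__main__":
    for phone in synthesize("Hello world."):
        print(phone)
```

In this sketch every utterance receives the same falling contour regardless of its meaning, which is one simple way to picture a prosody model that has no information beyond the text itself.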
Many prior art text to speech systems deduce prosodic information from the text from which speech is to be generated. Prosodic information includes speech rhythms, pitches, accents, volume and other characteristics. The text typically includes little information from which prosodic information can be deduced. Therefore, prior art text to speech systems tend to be designed conservatively. A conservatively designed system will produce a neutral prosody if the correct prosody cannot be determined, on the theory that a neutral prosody is superior to an incorrect one. Consequently, the prosody model tends to be designed conservatively as well, and does not have the capability to model the prosodic variations found in natural speech. The ability to model variations such as those that occur in natural speech is essential in order to match any given pitch contour, or to convey a wide range of effects such as personal speaking styles and emotions. The lack of such variations in speech produced by prior art text to speech systems contributes strongly to the artificial sound produced by many such systems.
In many applications, it is desirable to use text to speech systems which can carry on a dialog. For example, a text to speech system may be used to produce speech for a telephone menu system which provides spoken responses to customer inputs. Such a system may suitably include state information corresponding to concepts, goals and intentions. For example, if a system produces a set of words which represents a single proper noun, such as “Wells Fargo Bank,” the generated speech should include sound characteristics conveying that the set of words is a single noun. In other cases, the impression may need to be conveyed that a word is particularly important, or that a word needs confirmation. In order to convey correct impressions, the generated speech must have appropriate prosodic characteristics. Prosodic characteristics which may advantageously be defined for the generated speech include pitch, amplitude, and any other characteristics needed to give the speech a natural sound and convey the desired impressions.
There exists, therefore, a need for a system of tags which can define phenomena, such as the prosodic characteristics of speech, in sufficient detail to model the phenomena such that they have the desired characteristics, and a system for processing tags in order to produce phenomena having the characteristics defined by the tags.