1. Field of the Invention
The present invention relates to text-to-speech generation and more specifically to a method for predicting prosodic parameters from preprocessed text using a bootstrapping method.
2. Discussion of Related Art
The present invention relates to an improved process for automating prosodic labeling in a text-to-speech (TTS) system. As is known, a typical spoken dialog service includes some basic modules for receiving speech from a person and generating a response. For example, most such systems include an automatic speech recognition (ASR) module to recognize the speech provided by the user, a natural language understanding (NLU) module that receives the text from the ASR module to determine the substance or meaning of the speech, a dialog management (DM) module that receives the interpretation of the speech from the NLU module and generates a response, and a TTS module that receives the generated text from the DM module and generates synthetic speech to “speak” the response to the user.
Each TTS system first analyzes input text in order to identify what the speech should sound like before generating an output waveform. Text analysis includes part-of-speech (POS) tagging, text normalization, grapheme-to-phoneme conversion, and prosody prediction. Prosody prediction itself often consists of two steps: First, a symbolic description is generated, which indicates the locations of accents and prosodic phrase boundaries (or simply “boundaries”). More information regarding predicting accents and prosodic boundaries or pauses may be found in X. Huang, A. Acero, H. Hon, Spoken Language Processing, Prentice Hall, 2001, pages 739-782, incorporated herein by reference.
Frequently, the symbols used for prosody prediction are Tone and Break Indices (“ToBI”) labels, which are also an abstract description of an F0 (fundamental frequency) contour. See, e.g., K. Silverman, M. Beckman, J. Pitrelli, M. Ostendorf, C. Wightman, P. Price, J. Pierrehumbert, and J. Hirschberg, “ToBI: A standard for labeling English prosody,” in Proc. Int. Conf. on Spoken Language Processing, 1992, pp. 867-870, incorporated herein by reference. The second step involves using the ToBI labels to calculate numerical F0 values and phone durations. The rationale behind this two-step approach is the belief that linguistic features are more strongly correlated with symbolic prosody than with the acoustic realization. This not only makes it easier for a human to write rules that predict prosody, it also makes it easier for a machine to learn these rules from a database.
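By way of illustration only, the second step may be sketched as follows for the F0 values. The sketch assumes a linearly declining baseline and a fixed F0 offset per pitch-accent label. The function name f0_targets, the offset table, and all numerical values are hypothetical and are not taken from the references cited above.

```python
# Illustrative sketch only: hypothetical mapping from symbolic pitch-accent
# labels to numerical F0 targets. All values below are assumptions.
ACCENT_OFFSET_HZ = {"H*": 30.0, "L*": -20.0, None: 0.0}

def f0_targets(accents, start_hz=120.0, end_hz=100.0):
    """Return one F0 target (in Hz) per syllable.

    accents: list of ToBI-style pitch-accent labels ("H*", "L*"), or None
             for an unaccented syllable.
    The baseline declines linearly from start_hz to end_hz over the phrase,
    and each accent label shifts the target above or below that baseline.
    """
    n = len(accents)
    targets = []
    for i, accent in enumerate(accents):
        frac = i / (n - 1) if n > 1 else 0.0
        baseline = start_hz + frac * (end_hz - start_hz)
        targets.append(baseline + ACCENT_OFFSET_HZ[accent])
    return targets

print(f0_targets([None, "H*", None, "L*", None]))
# -> [120.0, 145.0, 110.0, 85.0, 100.0]
```

In this sketch the symbolic labels carry all of the linguistic information, and the numerical stage needs only a small table of offsets, which reflects the motivation for separating the two steps.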
Unfortunately, ToBI labeling is very slow and expensive. See A. Syrdal and J. Hirschberg, “Automatic ToBI prediction and alignment to speed manual labeling of prosody,” Speech Communication, Special Issue on Speech Annotation and Corpus Tools, 2001, no. 33, pp. 135-151. Having several labelers available may speed up the process, but doing so does not address the cost factor or other issues such as inter-labeler inconsistency. Therefore, a more automatic procedure is highly desirable.
FIG. 1 illustrates a known method of prosody prediction using ToBI prediction and alignment. This method involves receiving speech files annotated with orthography (words and punctuation), pronunciation (phones and their durations, word and syllable boundaries, and lexical stress), and parts of speech (102). Other linguistic features relevant to prosody are added, such as “is a yes/no question?” (104). For each word or syllable, the method predicts symbolic prosody (e.g., ToBI) by applying a rule set such as a classification and regression tree (CART) (106). For each phone, the method predicts its duration by applying a rule set (108) and, for each syllable, predicts parameters describing the syllable's F0 contour (110). Finally, from the contour parameters and phone durations, the method calculates the actual F0 contour (112).
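For illustration, the sequence of steps 102 through 112 may be sketched as a chain of small functions. This is a minimal sketch under stated assumptions and is not the implementation of FIG. 1: the Syllable class, the toy phone inventory, the single hand-written rules standing in for a CART, and all numerical values are hypothetical.

```python
from dataclasses import dataclass, field

VOWELS = {"a", "e", "i", "o", "u"}  # toy phone inventory (assumption)

@dataclass
class Syllable:
    # Step 102: annotated input (phones and lexical stress).
    phones: list
    stressed: bool
    accent: str = ""                                 # filled in at step 106
    durations: list = field(default_factory=list)    # seconds, step 108
    f0_peak: float = 0.0                             # contour parameter, step 110

def add_features(text):
    # Step 104: crude stand-in for an "is a yes/no question?" feature.
    return {"is_question": text.strip().endswith("?")}

def predict_symbolic(syllables):
    # Step 106: one hand-written rule standing in for a CART.
    for s in syllables:
        s.accent = "H*" if s.stressed else ""

def predict_durations(syllables):
    # Step 108: assumed fixed durations for vowels and consonants.
    for s in syllables:
        s.durations = [0.12 if p in VOWELS else 0.06 for p in s.phones]

def predict_contour_params(syllables, feats):
    # Step 110: assumed peak values; a question raises the final syllable.
    for s in syllables:
        s.f0_peak = 140.0 if s.accent == "H*" else 110.0
    if feats["is_question"] and syllables:
        syllables[-1].f0_peak += 20.0

def calc_contour(syllables):
    # Step 112: one (time, F0) point at the midpoint of each syllable.
    t, points = 0.0, []
    for s in syllables:
        dur = sum(s.durations)
        points.append((t + dur / 2, s.f0_peak))
        t += dur
    return points

def predict_prosody(syllables, text):
    feats = add_features(text)
    predict_symbolic(syllables)
    predict_durations(syllables)
    predict_contour_params(syllables, feats)
    return calc_contour(syllables)

sylls = [Syllable(["r", "e"], True), Syllable(["d", "i"], False)]
contour = predict_prosody(sylls, "ready?")
print(contour)
```

Each function consumes the annotations produced by the preceding step, so a learned rule set could replace any of the hand-written rules without changing the rest of the chain.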
Known prosody-prediction modules are based on rule sets that are either manually written or generated by machine learning techniques, in which a few parameters are adapted to create all of the rules. When the rules are derived from training data by applying machine learning methods, the training data needs to be labeled prosodically. The known method of labeling the training data prosodically is a manual process. What is needed in the art is a method of automating the process of creating prosodic labels without expensive, manual intervention.