1. Field of the Invention
The invention relates to speech processing based primarily on data-driven modules.
2. Description of the Related Art
Methods and systems for speech processing are known, for example, from U.S. Pat. Nos. 6,029,135 and 5,732,388 and from German Patentschrift Nos. 19636739 C1 and 19719381 C1. In particular, the implementation of multilingual and speech-space-independent speech synthesis systems is based primarily on data-driven modules. These modules, for example prosody generation modules, generally use learning methods. In general, the learning methods are well suited to a number of languages and applications. However, the input variables often have to be optimized laboriously by hand.
The following learning techniques have been used for the case of symbolic prosody, that is to say in particular for predicting phrase boundaries and for predicting accented words, for example by appropriate fundamental frequency production. Phrase boundary prediction has been carried out using rules based on classification and regression trees (CARTs), see Julia Hirschberg and Pilar Prieto: “Training Intonational Phrasing Rules Automatically for English and Spanish Text-to-speech”, Speech Communication, 18, pages 281-290, 1996, and Michelle Q. Wang and Julia Hirschberg: “Automatic Classification of Intonational Phrasing Boundaries”, Computer Speech and Language, 6, pages 175-196, 1992; using rules based on hidden Markov models (HMMs), see Alan W. Black and Paul Taylor: “Assigning Phrase Breaks from Part-of-Speech Sequences”, Eurospeech, 1997; and using rules based on neural networks, see Achim F. Müller, Hans Georg Zimmermann and Ralf Neuneier: “Robust Generation of Symbolic Prosody by a Neural Classifier Based on Autoassociators”, ICASSP, 2000. For the prediction of accents or accented words, CARTs were used by Julia Hirschberg: “Pitch Accent in Context: Predicting Prominence from Text”, Artificial Intelligence, 63, pages 305-340, 1993, whereas neural networks were used by Christina Widera, Thomas Portele and Maria Wolters: “Prediction of Word Prominence”, Eurospeech, 1997.
With these techniques, it is generally impossible to interpret the influence of the input variables used. This applies in particular to neural networks. The same problem is known for the case of fundamental frequency production (f0 generation). For example, the input variables are optimized heuristically in Gerit P. Sonntag, Thomas Portele and Barbara Heuft: “Prosody Generation with a Neural Network: Weighing the Importance of Input Parameters”, ICASSP, 1997.