A duration model predicts a reasonable duration for each speech unit according to the unit's linguistic and phonetic attributes. Traditional methods include SOP (Sum of Products), CART (Classification and Regression Tree) and ANN (Artificial Neural Network).
The Sum of Products (SOP) has been described in detail, for example, in the article “An RNN-based prosodic information synthesizer for Mandarin text-to-speech”, S. H. Chen, S. H. Hwang et al., IEEE Trans. Speech Audio Processing, Vol. 6, No. 3, pp. 226-239, 1998, and in the article “Polynomial regression model for duration prediction in Mandarin”, Sun Lu, Yu Hu, Ren-Hua Wang, INTERSPEECH-2004, pp. 769-77.
The Classification and Regression Tree (CART) has been described in detail, for example, in the article “Linguistic factors affecting timing in Korean with application to speech synthesis”, Chung, H. and Huckvale, M. A., Proceedings of Eurospeech 2001, Aalborg, vol. 2, pp. 815-819.
The Artificial Neural Network (ANN) has been described in detail, for example, in the article “Modeling vowel duration for Japanese text-to-speech synthesis”, Venditti, Jennifer J., Santen, Jan P. H. van, ICSLP-1998, pp. 786-789. All of these articles are incorporated herein by reference.
However, the traditional methods have the following shortcomings:
1) The traditional methods suffer from two main problems, data sparsity and attribute interaction, which are mainly caused by an imbalance between model complexity and database size. The coefficients of the existing models can be computed in a data-driven manner, but the attributes and attribute combinations are selected manually rather than by a data-driven method. These “partially” data-driven modeling methods therefore depend on subjective experience.
2) Speaking rate is not introduced as an attribute for duration modeling, although existing prosody research shows that segmental duration is clearly affected by speaking rate. As a result, a speech synthesizer has no choice but to linearly shorten or lengthen the segmental durations when users need to adjust the speaking rate. In fact, the effects of different attributes on segmental durations differ widely, so uniform linear shortening and lengthening is not reasonable.
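By way of illustration only, the linear shortening and lengthening referred to in item 2) may be sketched as follows. The function name `scale_uniformly`, the parameter `rate_factor`, and the segment durations are hypothetical and are chosen purely for this example; they are not taken from any of the cited articles.

```python
def scale_uniformly(durations_ms, rate_factor):
    """Divide every segment duration by the same speaking-rate factor.

    rate_factor > 1 speeds speech up; rate_factor < 1 slows it down.
    """
    if rate_factor <= 0:
        raise ValueError("rate_factor must be positive")
    return [d / rate_factor for d in durations_ms]


# Hypothetical segment durations in milliseconds (illustrative values only).
original = [90.0, 140.0, 60.0]

# Doubling the speaking rate halves every segment alike.
fast = scale_uniformly(original, rate_factor=2.0)

# Each segment keeps exactly the same share of the total duration,
# regardless of its phonetic or linguistic attributes.
shares_before = [d / sum(original) for d in original]
shares_after = [d / sum(fast) for d in fast]
```

Because every segment is scaled by one common factor, the relative proportions of the segments cannot change, which is the limitation item 2) describes: the adjustment cannot reflect any attribute-dependent difference in how segments respond to speaking rate.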