Technical Field
The present invention relates to speech synthesis, and more particularly to a hybrid parametric/exemplar-based predictive model for enhancing prosodic expressiveness in speech synthesis.
Description of the Related Art
Prosody is an inherent feature of spoken languages realized by the pitch, stress, duration and other features in speech. Data-driven speech synthesis systems can be broadly contrasted in terms of the ways in which they make use of the data during the learning and run-time stages of the process to infer and predict prosodic properties of the acoustic waveform. For unit-selection systems, typical architectures exploit prosodic models to generate desired target values that serve as a component of the cost function driving the unit search. At the other end of the continuum, fully parametric, model-based systems use training data only during the learning stage to adapt the model parameters, and then use the models at run-time to generate prosodic parameters that can be used directly in the speech-generation stage. Since the data plays no further role after training, these systems have a small footprint, which is one of their desirable properties.
Fully parametric, model-based systems usually rely on statistical averaging, leading to predicted prosody that suffers from low prosodic expressiveness due to flat intonation. On the other hand, exemplar-based models tend to be more expressive, but less robust, because their selection is based on low-level or high-dimensional features.
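By way of illustration only, the flattening effect of statistical averaging may be sketched with a toy example. The sketch below is not drawn from any particular system: the Gaussian-shaped pitch accent, the frame count, the pitch values in Hz, and the helper names contour, pitch_range and select_exemplar are all hypothetical assumptions chosen for clarity. It compares a frame-wise mean over several stored pitch contours with the selection of a single stored contour.

```python
import math

def contour(peak_pos, n=11, base=100.0, excursion=40.0, width=1.5):
    """Hypothetical F0 contour in Hz: one Gaussian-shaped accent at peak_pos."""
    return [base + excursion * math.exp(-((i - peak_pos) ** 2) / (2 * width ** 2))
            for i in range(n)]

def pitch_range(c):
    """Difference between the highest and lowest F0 value of a contour."""
    return max(c) - min(c)

# Toy exemplars that share one context label but place the accent at
# different frames (an assumption made purely for illustration).
exemplars = [contour(p) for p in (2, 5, 8)]

# Averaging-style prediction: frame-wise mean over all exemplars.
averaged = [sum(frame) / len(exemplars) for frame in zip(*exemplars)]

def select_exemplar(target_peak, pool):
    """Exemplar-style prediction: return the stored contour whose accent
    frame is closest to the requested one (a deliberately simple criterion)."""
    return min(pool, key=lambda c: abs(c.index(max(c)) - target_peak))

selected = select_exemplar(5, exemplars)

print(f"range of averaged contour: {pitch_range(averaged):.1f} Hz")
print(f"range of selected contour: {pitch_range(selected):.1f} Hz")
```

In this toy setting the averaged contour spans a much narrower pitch range than any single stored contour, because accents located at different frames partially cancel in the mean. The selected contour retains its full excursion, but its quality depends entirely on the selection criterion, which mirrors the robustness concern noted above.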