The present invention relates to concatenative speech synthesis systems. In particular, the invention relates to a system and method for identifying appropriate edge boundary regions for concatenating speech units. The system employs a speech unit database populated using speech unit models.
Concatenative speech synthesis exists in a number of different forms today, which depend on how the concatenative speech units are stored and processed. These forms include time domain waveform representations, frequency domain representations (such as a formant representation or a linear predictive coding (LPC) representation), or some combination of these.
Regardless of the form of speech unit, concatenative synthesis is performed by identifying appropriate boundary regions at the edges of each unit, where units can be smoothly overlapped to synthesize new sound units, including words and phrases. Speech units in concatenative synthesis systems are typically diphones or demisyllables. As such, their boundary overlap regions are phoneme-medial. Thus, for example, the word "tool" could be assembled from the units `tu` and `ul` derived from the words "tooth" and "fool." What must be determined is how much of the source words should be saved in the speech units, and how much they should overlap when put together.
In prior work on concatenative text-to-speech (TTS) systems, a number of methods have been employed to determine overlap regions. In the design of such systems, three factors come into consideration:
Seamless Concatenation: Overlapping two speech units should provide a smooth enough transition between one unit and the next that no abrupt change can be heard. Listeners should have no idea that the speech they are hearing is being assembled from pieces.
Distortion-free Transition: Overlapping two speech units should not introduce any distortion of its own. Units should be mixed in such a way that the result is indistinguishable from non-overlapped speech.
Minimal System Load: The computational and/or storage requirements imposed on the synthesizer should be as small as possible.
In current systems there is a tradeoff between these three goals. No system is optimal with respect to all three. Current approaches can generally be grouped according to two choices they make in balancing these goals. The first is whether they employ short or long overlap regions. A short overlap can be as quick as a single glottal pulse, while a long overlap can comprise the bulk of an entire phoneme. The second choice involves whether the overlap regions are consistent or allowed to vary contextually. In the former case, like portions of each sound unit are overlapped with the preceding and following units, regardless of what those units are. In the latter case, the portions used are varied each time the unit is used, depending on adjacent units.
Long overlap has the advantage of making transitions between units more seamless, because there is more time to iron out subtle differences between them. However, long overlaps are prone to create distortion. Distortion results from mixing unlike signals.
Short overlap has the advantage of minimizing distortion. With short overlap it is easier to ensure that the overlapping portions are well matched. Short overlapping regions can be approximately characterized as instantaneous states (as opposed to dynamically varying states). However, short overlap sacrifices the seamless concatenation found in long overlap systems.
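The tradeoff between long and short overlap can be illustrated with a minimal sketch of overlapping two time domain units. It assumes a linear cross-fade, which the text above does not prescribe; the function name `crossfade_concatenate` and the parameter `overlap` are hypothetical and chosen only for illustration. A small `overlap` mixes few samples (less opportunity for distortion, more abrupt transition), while a large `overlap` mixes many (smoother transition, more risk of mixing unlike signals).

```python
def crossfade_concatenate(left, right, overlap):
    """Join two sample sequences, mixing the last `overlap` samples of
    `left` with the first `overlap` samples of `right`.

    Assumption: a linear cross-fade; real systems may use other windows.
    """
    if overlap < 0 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must fit inside both units")
    if overlap == 0:
        # No overlap: plain abutment of the two units.
        return list(left) + list(right)

    head = list(left[:len(left) - overlap])   # part of left used unmixed
    tail = list(right[overlap:])              # part of right used unmixed
    offset = len(left) - overlap

    mixed = []
    for i in range(overlap):
        # Weight ramps from the left unit toward the right unit.
        w = (i + 1) / (overlap + 1)
        mixed.append((1.0 - w) * left[offset + i] + w * right[i])
    return head + mixed + tail
```

For two four-sample units and `overlap=2`, the result has six samples: two taken from the left unit, two mixed, and two taken from the right unit.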
While it would be desirable to have the seamlessness of long overlap techniques and the low distortion of short overlap techniques, to date no systems have been able to achieve this. Some contemporary systems have experimented with using variable overlap regions in an effort to minimize distortion while retaining the benefits of long overlap. However, such systems rely heavily on computationally expensive processing, making them impractical for many applications.
The present invention employs a statistical modeling technique to identify the nuclear trajectory regions within sound units; these regions are then used to identify the optimal overlap boundaries. In the presently preferred embodiment, time-series data is statistically modeled using Hidden Markov Models that are constructed on the phoneme region of each sound unit and then optimally aligned through training or embedded re-estimation.
In the preferred embodiment, the initial and final phonemes of each sound unit are each considered to consist of three elements: the nuclear trajectory, a transition element preceding the nuclear region and a transition element following the nuclear region. The modeling process optimally identifies these three elements, such that the nuclear trajectory region remains relatively consistent for all instances of the phoneme in question.
With the nuclear trajectory region identified, the beginning and ending boundaries of the nuclear region serve to delimit the overlap region that is thereafter used for concatenative synthesis.
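As a rough illustration of how an aligned model could yield these boundaries, the following sketch performs a Viterbi alignment of a frame sequence against a three-state left-to-right chain (leading transition, nuclear trajectory, trailing transition) and reads off the first and last frames assigned to the nucleus. Several simplifications are assumptions made here, not details from the text: each frame is a single scalar rather than a spectral vector, each state has one Gaussian with already-trained parameters, and transition probabilities are treated as uniform. The function names are hypothetical.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def align_three_states(frames, means, variances):
    """Viterbi path through states 0 (leading transition), 1 (nucleus),
    2 (trailing transition). The path starts in state 0, ends in state 2,
    and at each frame either stays in its state or advances by one.
    """
    n_states = 3
    n = len(frames)
    if n < n_states:
        raise ValueError("need at least one frame per state")
    neg_inf = float("-inf")
    score = [[neg_inf] * n_states for _ in range(n)]
    back = [[0] * n_states for _ in range(n)]
    score[0][0] = gaussian_logpdf(frames[0], means[0], variances[0])

    for t in range(1, n):
        for s in range(n_states):
            stay = score[t - 1][s]
            advance = score[t - 1][s - 1] if s > 0 else neg_inf
            best, prev = (advance, s - 1) if advance > stay else (stay, s)
            if best == neg_inf:
                continue  # state s is not reachable at frame t
            score[t][s] = best + gaussian_logpdf(frames[t], means[s], variances[s])
            back[t][s] = prev

    # Trace back from the final state at the last frame.
    path = [n_states - 1]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

def nucleus_bounds(path):
    """Return (first, last) frame indices assigned to the nucleus state."""
    first = path.index(1)
    last = len(path) - 1 - path[::-1].index(1)
    return first, last
```

The pair returned by `nucleus_bounds` plays the role of the beginning and ending boundaries of the nuclear region; in this sketch they are frame indices, which would still need converting to sample positions.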
The presently preferred implementation employs a statistical model that has a data structure for separately modeling the nuclear trajectory region of a vowel, a first transition element preceding the nuclear trajectory region and a second transition element following the nuclear trajectory region. The data structure may be used to discard a portion of the sound unit data, corresponding to that portion of the sound unit that will not be used during the concatenation process.
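One way to picture such a data structure is sketched below. The class and function names are hypothetical, and the trimming rule is an assumption: it supposes that the unused portion is the transition element lying on the outer side of the nucleus, that is, the trailing transition of a unit's final phoneme and the leading transition of a unit's initial phoneme. The text above does not state which portion is discarded.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PhonemeSegmentation:
    """Frame boundaries of the three elements of one boundary phoneme."""
    nucleus_start: int  # first frame of the nuclear trajectory region
    nucleus_end: int    # one past the last frame of the nuclear region
    length: int         # total number of frames in the phoneme

    def __post_init__(self):
        if not 0 <= self.nucleus_start < self.nucleus_end <= self.length:
            raise ValueError("boundaries must satisfy 0 <= start < end <= length")

    @property
    def leading(self):
        """Frames of the transition element preceding the nucleus."""
        return range(0, self.nucleus_start)

    @property
    def nucleus(self):
        """Frames of the nuclear trajectory region."""
        return range(self.nucleus_start, self.nucleus_end)

    @property
    def trailing(self):
        """Frames of the transition element following the nucleus."""
        return range(self.nucleus_end, self.length)

def trim_final_phoneme(frames, seg):
    """Assumed rule: keep frames through the nucleus, drop the trailing transition."""
    return list(frames[:seg.nucleus_end])

def trim_initial_phoneme(frames, seg):
    """Assumed rule: drop the leading transition, keep from the nucleus onward."""
    return list(frames[seg.nucleus_start:])
```

Under these assumptions, a seven-frame phoneme whose nucleus spans frames 2 through 4 would be stored as five frames in either case, which reduces the data kept per unit.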
The invention has a number of advantages and uses. It may be used as a basis for automated construction of speech unit databases for concatenative speech synthesis systems. The automated techniques both improve the quality of derived synthesized speech and save a significant amount of labor in the database collection process.
For a more complete understanding of the invention, its objects and advantages, refer to the following specification and to the accompanying drawings.