Concatenative text-to-speech (“TTS”) synthesis generates the speech waveform corresponding to a given sequence of phonemes through the sequential assembly of pre-recorded segments of speech. These segments may be extracted from sentences uttered by a professional speaker, and stored in a database. Each such segment is usually referred to as a unit. During synthesis, the database may be searched for the most appropriate unit to be spoken at any given time, a process known as unit selection. This selection typically relies on a plurality of characteristics reflecting, for example, the degree of discontinuity from the previous unit, the departure from ideal values for pitch and duration, the spectral quality relative to the average matching unit present in the database, and the location of the candidate unit in the recorded utterance.
To select the unit, two requirements need to be fulfilled: (i) each individual characteristic needs to meaningfully score each potential candidate relative to all other available candidates, and (ii) these individual scores need to be appropriately combined into a final score, which then may serve as the basis for unit selection.
The typical approach to achieving requirement (ii) has been to consider a linear combination of the various scores, where the weights are empirically determined via careful human listening. In that case the synthesized material is inherently limited to a tractably small number of sentences, sometimes not even particularly representative of the eventual (unknown) domain of use. That is, in the existing techniques, the weights are manually tuned in a global fashion by listening to a necessarily small amount of synthesized material. Additionally, the existing techniques define a single set of weights for the entire corpus of samples and apply that same set across all samples.
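By way of illustration only, the globally weighted linear combination described above may be sketched as follows. The characteristic names, the weight values, the candidate scores, and the convention that each score is a cost (lower is better) are all assumptions made for this sketch; none of them is taken from any particular existing system.

```python
from typing import Dict, List

# Hypothetical global weights, one per characteristic (illustrative values only).
# In the existing techniques these would be hand-tuned by listening.
GLOBAL_WEIGHTS: Dict[str, float] = {
    "discontinuity": 0.4,
    "pitch": 0.3,
    "duration": 0.2,
    "spectral": 0.1,
}


def combined_cost(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Linear combination of per-characteristic costs (assumed: lower is better)."""
    return sum(weights[name] * scores[name] for name in weights)


def select_unit(candidates: List[Dict[str, float]], weights: Dict[str, float]) -> int:
    """Return the index of the candidate unit with the lowest combined cost."""
    return min(range(len(candidates)),
               key=lambda i: combined_cost(candidates[i], weights))


# Two hypothetical candidate units for one position in the phoneme sequence.
candidates = [
    {"discontinuity": 0.2, "pitch": 0.9, "duration": 0.5, "spectral": 0.3},
    {"discontinuity": 0.6, "pitch": 0.1, "duration": 0.2, "spectral": 0.4},
]

best = select_unit(candidates, GLOBAL_WEIGHTS)  # selects index 1 with these values
```

Note that the same `GLOBAL_WEIGHTS` are applied to every concatenation, regardless of context, which is the limitation at issue.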
These strategies have obvious drawbacks, including a lack of scalability and the need for human supervision. Most importantly, they often lead to a set of weights which fails to generalize beyond the initial set of sentences considered. In other words, in the existing techniques there is no guarantee that the weights obtained by a “trial and error” approach will generalize to new material. In fact, because no single combination of scores can possibly be optimal for all concatenations, these techniques are essentially counter-productive.
Alternatively, it is also possible to view each scoring source as generating a separate stream of information, and apply standard voting methods and other known learning/classification techniques to try to combine the ensuing outcomes. Unfortunately, the various streams tend to (i) be correlated with each other in complex, time-varying ways, and (ii) differ unpredictably in their discriminative value depending on context, thereby violating many of the assumptions implicitly underlying such techniques.
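As one hedged example of such a voting method, a Borda count over the scoring streams may be sketched as follows. The stream names, the cost values, and the choice of the Borda count itself are assumptions made for illustration; the text above does not single out any specific voting scheme.

```python
from typing import Dict, List


def borda_select(stream_costs: Dict[str, List[float]]) -> int:
    """Combine per-stream rankings with a Borda count.

    stream_costs maps a (hypothetical) stream name to one cost per candidate,
    where a lower cost is assumed to be better. Each stream awards n-1 points
    to its best candidate, n-2 to the next, and so on down to 0.
    Returns the index of the candidate with the most points.
    """
    n = len(next(iter(stream_costs.values())))
    points = [0] * n
    for costs in stream_costs.values():
        ranking = sorted(range(n), key=lambda i: costs[i])  # best candidate first
        for rank, idx in enumerate(ranking):
            points[idx] += n - 1 - rank
    return max(range(n), key=lambda i: points[i])


# Three hypothetical streams scoring three candidate units.
streams = {
    "discontinuity": [0.2, 0.6, 0.4],
    "pitch": [0.9, 0.1, 0.5],
    "duration": [0.5, 0.2, 0.3],
}

winner = borda_select(streams)  # candidate 1 wins with these values
```

Each stream here casts an equally weighted, independent vote. That is precisely the kind of implicit assumption that breaks down when the streams are correlated or vary in discriminative value with context.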