In concatenative text-to-speech synthesis, the speech waveform corresponding to a given sequence of phonemes is generated by concatenating pre-recorded segments of speech. These segments are extracted from carefully selected sentences uttered by a professional speaker, and stored in a database known as a voice table. Each such segment is typically referred to as a unit. A unit may be a phoneme, a diphone (the span between the middle of a phoneme and the middle of another), or a sequence thereof. A phoneme is a phonetic unit in a language that corresponds to a set of similar speech realizations (like the velar \k\ of cool and the palatal \k\ of keel) perceived to be a single distinctive sound in the language. In diphone synthesis, the voice table contains exactly one exemplar of each possible diphone. This “canonical” exemplar is usually hand-picked from a suitable inventory by a trained acoustician, in order to maximize the perceived quality of the associated phoneme-to-phoneme transition. Although this solution is expedient in terms of data collection cost and memory footprint, it inherently limits the quality of the resulting synthetic speech, because no set of canonical diphones can possibly perform acceptably in all conceivable situations.
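By way of illustration only, the diphone lookup described above may be sketched as follows. The function names, the "sil" padding symbol, and the representation of a voice table as a dictionary of sample lists are assumptions made for this sketch and are not drawn from any particular synthesizer.

```python
def phonemes_to_diphones(phonemes, silence="sil"):
    """Pair each phoneme with its successor, one pair per diphone.

    Padding both ends with a silence symbol is an assumption of this sketch.
    """
    padded = [silence, *phonemes, silence]
    return list(zip(padded, padded[1:]))


def synthesize(phonemes, voice_table):
    """Concatenate the single stored exemplar of each required diphone.

    voice_table is a hypothetical dict mapping (left, right) phoneme pairs
    to lists of waveform samples.
    """
    waveform = []
    for diphone in phonemes_to_diphones(phonemes):
        waveform.extend(voice_table[diphone])
    return waveform


# Toy voice table for the word "cool"; the sample values are placeholders.
toy_table = {
    ("sil", "k"): [0, 1],
    ("k", "uw"): [2, 3],
    ("uw", "l"): [4, 5],
    ("l", "sil"): [6, 0],
}
cool_waveform = synthesize(["k", "uw", "l"], toy_table)
```

Because each diphone maps to exactly one exemplar, the same samples are reused regardless of context, which is the limitation noted above.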
To make synthetic speech sound more natural, it is highly desirable to process longer speech segments, so as to reduce the number of discontinuities at segment boundaries. This is referred to as polyphone synthesis. In this approach, the voice table includes several exemplars of each diphone, each extracted from a different phrase. The voice table may also contain contiguity information to recover longer speech segments from which the diphones are extracted. At synthesis time, it is therefore necessary to select the most appropriate segment at a given point, a procedure known as unit selection. Unit selection is typically performed on the basis of two criteria: unit cost and concatenation cost. Unit cost is related to the intrinsic properties of the unit, such as pitch and duration behavior, which tend to be relatively easy to quantify. Concatenation cost attempts to capture the amount of perceived discontinuity with respect to the previous segment, and has proven considerably more difficult to quantify.
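One common way to combine the two criteria is a dynamic-programming (Viterbi-style) search over the candidate exemplars, sketched below. The text above does not prescribe a search procedure, so this is an assumption of the sketch, as are the unit representation (phrase identifier, position in phrase, pitch), the target pitch values, and the fixed join penalty.

```python
def select_units(candidates, unit_cost, concat_cost):
    """Return (total_cost, units) minimizing summed unit and concatenation costs.

    candidates[i] lists the candidate units for target position i.
    unit_cost(i, unit) and concat_cost(previous_unit, unit) return numbers.
    """
    if not candidates or any(not c for c in candidates):
        raise ValueError("every target position needs at least one candidate")
    # best[j]: cost of the cheapest path ending in candidates[i][j]
    best = [unit_cost(0, u) for u in candidates[0]]
    back = []  # back[i - 1][j]: index of the best predecessor of candidates[i][j]
    for i in range(1, len(candidates)):
        new_best, pointers = [], []
        for unit in candidates[i]:
            costs = [best[j] + concat_cost(prev, unit)
                     for j, prev in enumerate(candidates[i - 1])]
            j_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[j_min] + unit_cost(i, unit))
            pointers.append(j_min)
        best = new_best
        back.append(pointers)
    # Trace the cheapest path back from the last position.
    j = min(range(len(best)), key=best.__getitem__)
    total = best[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return total, [candidates[i][j] for i, j in enumerate(path)]


# Hypothetical units: (phrase id, position within phrase, pitch in Hz).
TARGET_PITCH = [100, 100, 100]
JOIN_PENALTY = 50  # illustrative cost of joining non-contiguous units


def toy_unit_cost(i, unit):
    # Intrinsic mismatch: distance from the desired pitch.
    return abs(unit[2] - TARGET_PITCH[i])


def toy_concat_cost(prev, unit):
    # Units that were adjacent in the same recorded phrase join for free.
    contiguous = prev[0] == unit[0] and prev[1] + 1 == unit[1]
    return 0 if contiguous else JOIN_PENALTY


toy_candidates = [
    [("p1", 0, 110), ("p2", 5, 100)],
    [("p1", 1, 105), ("p3", 2, 100)],
    [("p1", 2, 110), ("p3", 3, 100)],
]
toy_total, toy_units = select_units(toy_candidates, toy_unit_cost, toy_concat_cost)
```

In this toy example the search prefers the three units taken contiguously from phrase p1, even though each has a slightly worse unit cost, because doing so avoids every join penalty. This reflects the role of contiguity information in recovering longer segments.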
The concatenation cost between two segments S1 and S2 is typically computed via a metric d(S1, S2) defined on some appropriate features extracted from S1 and S2. Briefly, given two feature vectors (one associated with S1 and one with S2), some expression of the “difference” between the two is used as an estimate of the perceived discontinuity at the boundary between S1 and S2. Not surprisingly, the choice of features heavily influences the accuracy of this estimate. Conventional feature extraction draws on features such as the Fast Fourier Transform (FFT) amplitude spectrum, perceptual spectrum, Linear Predictive Coding (LPC) coefficients, mel-frequency cepstral coefficients (MFCC), formant frequencies, or line spectral frequencies. All of these features are spectral in nature, meaning that they represent different ways to encapsulate the frequency content of the signal. This is motivated by a history of speech research underscoring the importance of spectral features to speech perception. Phase information, on the other hand, is typically ignored.
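As a minimal sketch of such a metric, the following computes d(S1, S2) as the Euclidean distance between the amplitude spectra of the two boundary frames. The Euclidean form, the frame length, and the naive discrete Fourier transform (used here in place of an FFT to keep the example self-contained) are assumptions of the sketch; any of the other spectral features listed above could be substituted.

```python
import cmath
import math


def amplitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 of a real-valued frame (naive O(N^2) DFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc))
    return spectrum


def concatenation_cost(end_of_s1, start_of_s2):
    """Euclidean distance between the amplitude spectra of two boundary frames."""
    f1 = amplitude_spectrum(end_of_s1)
    f2 = amplitude_spectrum(start_of_s2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))


# Illustrative 16-sample frames: two tones at the same frequency but with
# different phase, and one tone at a different frequency.
N = 16
sine_bin2 = [math.sin(2 * math.pi * 2 * t / N) for t in range(N)]
cosine_bin2 = [math.cos(2 * math.pi * 2 * t / N) for t in range(N)]
sine_bin3 = [math.sin(2 * math.pi * 3 * t / N) for t in range(N)]

same_freq_cost = concatenation_cost(sine_bin2, cosine_bin2)
diff_freq_cost = concatenation_cost(sine_bin2, sine_bin3)
```

Here same_freq_cost is essentially zero although the two frames differ by a quarter-cycle phase shift, whereas diff_freq_cost is clearly positive. A purely spectral-magnitude feature therefore cannot register a phase mismatch at the boundary.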