In concatenative text-to-speech synthesis, the speech waveform corresponding to a given sequence of phonemes is generated by concatenating pre-recorded segments of speech. These segments are extracted from carefully selected sentences uttered by a professional speaker, and stored in a database known as a voice table. Each such segment is typically referred to as a unit. A unit may be a phoneme, a diphone (the span between the middle of a phoneme and the middle of another), or a sequence thereof. A phoneme is a phonetic unit in a language that corresponds to a set of similar speech realizations (like the velar \k\ of cool and the palatal \k\ of keel) perceived to be a single distinctive sound in the language.
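As a rough illustration of the diphone unit described above, the following sketch splits a phoneme sequence into the diphone labels that would be looked up in a voice table. The function name `to_diphones`, the `sil` silence padding, the `a-b` label format, and the phoneme symbols are all assumptions made for this example, not part of any particular system.

```python
def to_diphones(phonemes, silence="sil"):
    """Return the diphone labels spanning a phoneme sequence.

    Each diphone runs from the middle of one phoneme to the middle of
    the next, so a sequence of N phonemes padded with silence on both
    sides yields N + 1 diphones. The padding is an assumption.
    """
    padded = [silence, *phonemes, silence]
    # Pair every phoneme with its successor.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# Hypothetical phoneme symbols for the word "cool".
units = to_diphones(["k", "uw", "l"])
print(units)
```

A real system would map each label to one or more recorded segments in the voice table; that lookup is omitted here.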
The quality of the synthetic speech resulting from concatenative text-to-speech (TTS) synthesis is heavily dependent on the underlying inventory of units. A great deal of attention is typically paid to issues such as coverage (i.e. whether all possible units are represented in the voice table), consistency (i.e. whether the speaker is adhering to the same style throughout the recording process), and recording quality (i.e. whether the signal-to-noise ratio is as high as possible at all times). However, another important aspect of the unit inventory is the placement of unit boundaries, i.e. how the segments are cut after recording. This aspect is important because the defined boundaries influence the degree of discontinuity after concatenation, and therefore how natural the synthetic speech will sound. Early TTS systems based on phoneme units had difficulty ensuring a good transition between two phonemes due to coarticulation effects. Systems based on diphone units, or sequences thereof, are generally better since there is typically less coarticulation at the ensuing concatenation points. Nevertheless, the finite size of the unit inventory implies that discontinuities are inevitable. As a result, minimizing their number and salience is important in concatenative TTS.
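One simple way to make the notion of discontinuity at a concatenation point concrete is to compare the acoustic frames on either side of the join. The sketch below assumes each unit is represented as a list of feature vectors (for example, spectral coefficients per frame) and uses Euclidean distance between the last frame of the left unit and the first frame of the right unit. The function name `join_cost`, the feature representation, and the choice of distance are illustrative assumptions, not a measure prescribed by the text.

```python
import math


def join_cost(left_frames, right_frames):
    """Euclidean distance between the frames that meet at a join.

    left_frames and right_frames are lists of equal-length feature
    vectors (an assumed representation). A larger value suggests a
    more audible discontinuity at the concatenation point.
    """
    last = left_frames[-1]    # final frame of the left unit
    first = right_frames[0]   # initial frame of the right unit
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(last, first)))


# Two toy units with two-dimensional features.
left = [[0.0, 0.0], [1.0, 2.0]]
right = [[4.0, 6.0], [5.0, 5.0]]
print(join_cost(left, right))
```

Moving a unit boundary changes which frames meet at the join, which is why the boundary definitions affect this kind of measure.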
In diphone synthesis, the number of diphone units is small enough (e.g. about 2000 in English) to enable manual boundary optimization. In that case, the unit boundaries are adjusted manually so as to achieve, on average, as good a concatenation as possible given any possible pair of compatible diphones. This tends to eliminate the most egregious discontinuities, but typically introduces many compromises that may degrade naturalness. In contrast, polyphone synthesis allows multiple instances of every unit, usually recorded under complementary, carefully controlled conditions. Due to the much larger size of the unit inventory, adjusting unit boundaries manually is no longer feasible.
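The averaged criterion behind boundary adjustment can be written down as a small search: among candidate cut positions in a unit, keep the one whose edge frame is, on average, closest to the starting frames of all compatible partner units. This is only a sketch of the criterion, not the manual procedure itself; the names `best_cut` and `distance`, the frame representation, and the Euclidean distance are assumptions made for illustration.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_cut(frames, candidate_cuts, partner_start_frames):
    """Pick the cut index with the lowest mean distance to all partners.

    frames: feature vectors of the unit being cut (assumed representation).
    candidate_cuts: frame indices at which the unit could end.
    partner_start_frames: first frame of each compatible following unit.
    """
    def mean_cost(cut):
        edge = frames[cut]
        total = sum(distance(edge, p) for p in partner_start_frames)
        return total / len(partner_start_frames)

    return min(candidate_cuts, key=mean_cost)


# Toy one-dimensional features: the frame at index 2 best matches both partners.
frames = [[0.0], [1.0], [2.0], [3.0]]
partners = [[2.0], [2.2]]
print(best_cut(frames, [1, 2, 3], partners))
```

Because a single cut must serve every compatible partner, the chosen boundary is a compromise, which mirrors the trade-off noted above. With many instances of every unit, the number of candidate pairs grows quickly, which is one way to see why doing this by hand stops being practical.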