1. Field of the Invention
The invention generally relates to the field of speech synthesis and, more particularly, to a system and method for compressing concatenative acoustic inventories for speech.
2. Description of the Related Art
Concatenative speech synthesis is used for various types of speech synthesis applications including text-to-speech and voice response systems. Most text-to-speech conversion systems convert an input text string into a corresponding string of linguistic units such as consonants and vowel phonemes, or phoneme variants such as allophones, diphones, or triphones. An allophone is a variant of the phoneme based on surrounding sounds. For example, the aspirated p of the word pawn and the unaspirated p of the word spawn are both allophones of the phoneme p. Phonemes are the basic building blocks of speech corresponding to the sounds of a particular language or dialect. Diphones and triphones are sequences of phonemes and are related to allophones in that the pronunciation of each of the phonemes depend on the other phonemes, diphones or triphones.
Diphone synthesis and acoustic unit selection synthesis (concatenative speech synthesis) are two categories of speech synthesis techniques which are frequently used today. Concatenative speech synthesis techniques involve concatenating diphone phonetic sequences obtained from recorded speech to form new words and sentences. Such concatenative synthesis uses actual pre-recorded speech to form a large database, or corpus which is segmented based on phonological features of a language. Commonly, the phonological features include transitions from one phoneme to at least one other phoneme. For instance, the phonemes can be segmented into diphone units, syllables or even words.
Diphone concatenation systems are particularly prominent. A diphone is an acoustic unit which extends from the middle of one phoneme to the middle of the next phoneme. In other words, the diphone includes the transition between each partial phoneme. It is generally believed that synthesis using concatenation of diphones provides a reproduced voice of high quality, since each diphone is concatenated with adjoining diphones at the point where the beginning and the ending phonemes have reached steady state, and since each diphone records the actual transition from phoneme to phoneme.
In diphone synthesis, a diphone is defined as the second half of one phoneme followed by the initial half of the following phoneme. At the cost of having N.times.N (capital N being the number of phonemes in a language or dialect) speech recordings, i.e., diphones in a database, high quality synthesis can be achieved. For example, in English, N would equal between 40–45 phonemes depending on regional accents and the definition of the phoneme set. An appropriate sequence of diphones is concatenated into one continuous signal using a variety of techniques (e.g., time-domain Pitch Synchronous Overlap and Add (TD-PSOLA)).
This approach does not, however, completely solve the problem of providing smooth concatenations, nor does it solve the problem of generating synthetic speech which sounds natural. Generally, there is some spectral envelope mismatch at the concatenation boundaries. For severe cases, depending on how the signals are treated, a speech signal may exhibit glitches, or degradation in the clarity of the speech signal may occur. Consequently, a great deal of effort is often expended to choose appropriate diphone units that will not possess such defects, irrespective of which other units they are matched with. Thus, in general, a considerable effort is devoted to preparing a diphone set and selecting sequences that are suitable for recording and to verifying that the recordings are suitable for the diphone set.
In addition to the foregoing problems, other significant problems exist in conventional diphone concatenation systems. In order to achieve a suitable concatenation system, a minimum of 1500 to 2000 individual diphones must be used. When segmented from pre-recorded continuous speech, suitable diphones may be unobtainable because many phonemes (where concatenation is to take place) have not reached a steady state. Thus, a mismatch or distortion can occur from phoneme to phoneme at the point where the diphones are concatenated together. To reduce this distortion, conventional diphone concatenative synthesizers, as well as others, often select their units from carrier sentences or monotone speech and/or often perform spectral smoothing. As a result, a decrease in the naturalness of the speech can occur. Consequently, the synthesized speech may not resemble the original speech.
Another approach to concatenative synthesis is unit selection synthesis. Here, a very large database for recorded speech that has been segmented and labeled with prosodic and spectral characteristics is used, such as the fundamental frequency (F0) for voiced speech, the energy or gain of the signal, and the spectral distribution of the signal (i.e., how much of the signal is present at any given frequency). The database contains multiple instances of phoneme sequences. This permits the possibility of having units in the database which are much less stylized than would occur in a diphone database where generally only one instance of any given diphone is assumed. As a result, the ability to achieve natural sounding speech is enhanced.
A key problem in either of the prior approaches is that acoustic units require a substantial storage space. This is true regardless of whether the acoustic units are obtained from a database during diphone synthesis or if the acoustic units are permitted to remain in an actual database during concatenative synthesis.