The present invention relates generally to generating speech using a concatenative synthesizer. More particularly, an apparatus and a method are disclosed for storing and generating speech using decision tree based context-dependent phonemes-based units that are clustered based on the contexts associated with the phonemes-based units.
Speech signal generators or synthesizers in a text-to-speech (TTS) system can be classified into three distinct categories: articulatory synthesizers; formant synthesizers; and concatenative synthesizers. Articulatory synthesizers are based on the physics of sound generation in the vocal apparatus. Individual parameters related to the position and movement of vocal chords are provided. The sound generated therefrom is determined according to physics. In view of the complexity of the physics, practical applications of this type of synthesizer are considered to be far off.
Formant synthesizers do not use equations of physics to generate speech, but rather, model acoustic features or the spectra of the speech signal, and use a set of rules to generate speech. In a formant synthesizer, a phoneme is modeled with formants wherein each formant has a distinct frequency "trajectory" and a distinct bandwidth which varies over the duration of the phoneme. An audio signal is synthesized by using the frequency and bandwidth trajectories to control a formant synthesizer. While the formant synthesizer can achieve high intelligibility, its "naturalness" is typically low, since it is very difficult to accurately describe the process of speech generation in a set of rules. In some systems, in order to mimic natural speech, the synthetic pronunciation of each phoneme is determined by a set of rules which analyzes the phonetic context of the phoneme. U.S. Pat. No. 4,979,216 issued to Malsheen et al. describes a text-to-speech synthesis system and method using context dependent vowel allophones.
Concatenation systems and methods for generating text-to-speech operate under an entirely different principle. Concatenative synthesis uses pre-recorded actual speech forming a large database or corpus. The corpus is segmented based on phonological features of a language. Commonly, the phonological features include transitions from one phoneme to at least one other phoneme. For instance, the phonemes can be segmented into diphone units, syllables or even words. Diphone concatenation systems are particularly prominent. A diphone is an acoustic unit which extends from the middle of one phoneme to the middle of the next phoneme. In other words, the diphone includes the transition between each partial phoneme. It is believed that synthesis using concatenation of diphones provides good voice quality since each diphone is concatenated with adjoining diphones where the beginning and the ending phonemes have reached steady state, and since each diphone records the actual transition from phoneme to phoneme.
However, significant problems in fact exist in current diphone concatenation systems. In order to achieve a suitable concatenation system, a minimum of 1500 to 2000 individual diphones must be used. When segmented from prerecorded continuous speech, suitable diphones may not be obtainable because many phonemes (where concatenation is to be taken place) have not reached a steady state. Thus, a mismatch or distortion can occur from phoneme to phoneme when the diphones are concatenated together. To reduce this distortion, diphone concatenative synthesizers, as well as others, often select their units from carrier sentences or monotone speech, and/or perform spectral smoothing, all of which can lead to a decrease of naturalness. The resulting synthetic speech may not resemble the donor speaker. In addition, the other neighboring contextual influence of a diphone unit could seriously introduce potential distortion at the concatenation points.
Another known concatenative synthesizer is described in an article entitled "Improvements in an HMM-Based Speech Synthesizer" by R. E. Donovan et al., Proc. Eurospeech '95, Madrid, September, 1995. The system uses a set of cross-word decision-tree state-clustered triphone HMMs to segment a database into approximately 4000 cluster states, which are then used as the units for synthesis. In other words, the system uses a senone as the synthesis unit. A senone is a context-dependent sub-phonetic unit which is equivalent to a HMM state. During synthesis, each state is synthesized for a duration equal to the average state duration plus a constant. Thus, the synthesis of each phoneme requires a number of concatenation points. Each concatenation point can contribute to distortion.
There is an ongoing need to improve text-to-speech synthesizers. In particular, there is a need to provide an improved concatenation synthesizer that minimizes or avoids the problems associated with known systems.