The present invention relates generally to speech synthesis and more particularly to a concatenative synthesizer based on a source-filter model in which the source signal and filter parameters are generated by independent cross fade mechanisms.
Modern-day speech synthesis involves many tradeoffs. For limited vocabulary applications, it is usually feasible to store entire words as digital samples to be concatenated into sentences for playback. Given a good prosody algorithm to place the stress on the appropriate words, these systems tend to sound quite natural, because the individual words can be accurate reproductions of actual human speech. However, for larger vocabularies it is not feasible to store complete word samples of actual human speech. Therefore, a number of speech synthesists have been experimenting with breaking speech into smaller units and concatenating those units into words, phrases and ultimately sentences.
Unfortunately, when concatenating sub-word units, speech synthesists must confront several very difficult problems. To reduce system memory requirements to something manageable, it is necessary to develop versatile sub-word units that can be used to form many different words. However, such versatile sub-word units often do not concatenate well. During playback of concatenated sub-word units, there is often a very noticeable distortion or glitch where the sub-word units are joined. Also, because the sub-word units must be modified in pitch and duration to realize the intended prosodic pattern, current techniques for making these modifications most often introduce distortion. Finally, because most speech segments are strongly influenced by neighboring segments, there is no simple set of concatenation units (such as phonemes or diphones) that can adequately represent human speech.
A number of speech synthesists have suggested various solutions to the above concatenation problems, but so far no one has successfully solved them. Human speech generates complex time-varying waveforms that defy simple signal processing solutions. Our work has convinced us that a successful solution to the concatenation problems will arise only in conjunction with the discovery of a robust speech synthesis model. In addition, we will need an adequate set of concatenation units, and the further capability of modifying these units dynamically to reflect adjacent segments.
The formant-based speech synthesizer of the invention is based upon a source-filter model that closely ties the source and filter synthesizer components to physical structures within the human vocal tract. Specifically, the source model is based on a best estimate of the source signal produced at the glottis, and the filter model is based on the resonant (formant-producing) structures generally above the glottis. For this reason, we call our synthesis technique "formant-based" synthesis. We believe that modeling the source and filter components as closely as possible to actual speech production mechanisms produces far more natural-sounding synthesis than other existing techniques.
Our synthesis technique involves identifying and extracting the formants from an actual speech signal (labeled to identify approximate demi-syllable areas) and then using this information to construct demi-syllable segments each represented by a set of filter parameters and a source signal waveform. The invention provides a novel cross fade technique to smoothly concatenate consecutive demi-syllable segments. Unlike conventional blending techniques, our system allows us to perform cross fade in the filter parameter domain while simultaneously but independently performing "cross fade" (parameter interpolation) of the source waveforms in the time domain. The filter parameters model vocal tract effects, while the source waveforms model the glottal source. The technique has the advantage of restricting prosodic modification to only the glottal source, if desired. This can reduce distortion usually associated with the conventional blending techniques.
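By way of non-limiting illustration, the two independent cross fade operations described above can be sketched as follows. The sketch assumes a simple linear weighting across the overlap region, which the description does not specify, and the function names (linear_ramp, cross_fade_filter_params, cross_fade_source) are hypothetical and chosen only for this example. Filter parameters are blended frame by frame in the parameter domain, while source samples are blended sample by sample in the time domain, and the two overlap regions need not have the same length.

```python
def linear_ramp(n):
    """Return n weights rising linearly from 0.0 to 1.0.

    Assumption: a linear ramp; the weighting curve is not given in the text.
    """
    if n < 2:
        raise ValueError("overlap region needs at least two points")
    return [i / (n - 1) for i in range(n)]


def cross_fade_filter_params(left_frames, right_frames):
    """Blend two equal-length runs of per-frame filter parameters.

    Each frame is a list of parameter values (for example, formant
    frequencies in Hz). Interpolation happens in the parameter domain.
    """
    if len(left_frames) != len(right_frames):
        raise ValueError("overlap regions must contain the same number of frames")
    weights = linear_ramp(len(left_frames))
    return [
        [(1.0 - w) * a + w * b for a, b in zip(left, right)]
        for w, left, right in zip(weights, left_frames, right_frames)
    ]


def cross_fade_source(left_samples, right_samples):
    """Blend two equal-length runs of source samples in the time domain."""
    if len(left_samples) != len(right_samples):
        raise ValueError("overlap regions must contain the same number of samples")
    weights = linear_ramp(len(left_samples))
    return [
        (1.0 - w) * a + w * b
        for w, a, b in zip(weights, left_samples, right_samples)
    ]
```

Because the two functions share no state, the filter overlap (measured in frames) and the source overlap (measured in samples) can be chosen separately, which reflects the independence of the two cross fade mechanisms.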
The invention further provides a system whereby interaction between initial and final demi-syllables can be taken into account. Demi-syllables represent the presently preferred concatenation unit. Ideally, concatenation units are selected at points of least co-articulatory effect. The syllable is a natural unit for this purpose, but choosing the syllable requires a large amount of memory. For systems with limited available memory, the demi-syllable is preferred. In the preferred embodiment we take into account how the initial and final demi-syllables within a given syllable interact with each other. We further take into account how demi-syllables across word boundaries and sentence boundaries interact with each other. This interaction information is stored in a waveform database containing not only the source waveform data and filter parameter data, but also the necessary label or marker data and context data used by the system in applying formant modification rules. The system operates upon an input phoneme string by first performing unit selection, then building an acoustic string of syllable objects, and then rendering those objects by performing the cross fade operations in both the source signal and filter parameter domains. The resulting outputs are source waveforms and filter parameters that may then be used in a source-filter model to generate synthesized speech.
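As a rough, non-limiting sketch of the first two stages described above, unit selection and construction of the acoustic string might be organized as shown below. Every name here (DemiSyllable, SyllableObject, select_units, build_acoustic_string) is hypothetical, as are the use of a dictionary as the waveform database and the assumption that the input has already been reduced to an alternating sequence of initial and final demi-syllable labels. The rendering stage is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class DemiSyllable:
    """One database entry (hypothetical layout)."""
    label: str              # label or marker data identifying the unit
    source_waveform: list   # time-domain source samples
    filter_params: list     # per-frame filter parameters
    context: dict = field(default_factory=dict)  # context data for modification rules


@dataclass
class SyllableObject:
    """An initial and a final demi-syllable forming one syllable."""
    initial: DemiSyllable
    final: DemiSyllable


def select_units(demi_labels, database):
    """Unit selection: look up each requested demi-syllable label.

    Assumption: the database is a dict keyed by label.
    """
    missing = [lab for lab in demi_labels if lab not in database]
    if missing:
        raise KeyError(f"no database entry for: {missing}")
    return [database[lab] for lab in demi_labels]


def build_acoustic_string(units):
    """Pair consecutive (initial, final) units into syllable objects."""
    if len(units) % 2 != 0:
        raise ValueError("expected alternating initial and final demi-syllables")
    return [
        SyllableObject(initial=units[i], final=units[i + 1])
        for i in range(0, len(units), 2)
    ]
```

Keeping the context data on each unit, rather than in a separate table, is one way to let a later rendering stage consult the neighboring demi-syllables when applying formant modification rules.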
The result is a natural-sounding speech synthesizer that can be incorporated into many different consumer products. Although the techniques can be applied to any speech coding application, the invention is well suited for use as a concatenative speech synthesizer in text-to-speech applications. This system is designed to work within the current memory and processor constraints found in many consumer applications. In other words, the synthesizer is designed to fit into a small memory footprint, while providing better-sounding synthesis than other synthesizers of larger size.
For a more complete understanding of the invention, its objects and advantages, refer to the following specification and to the accompanying drawings.