The present invention relates to speech recognition and synthesis systems and in particular to speech systems that exploit formants in speech.
In human speech, a great deal of information is contained in the first three resonant frequencies or formants of the speech signal. In particular, when a speaker is pronouncing a vowel, the frequencies and bandwidths of the formants indicate which vowel is being spoken.
To detect formants, some systems of the prior art utilize the speech signal""s frequency spectrum, where formants appear as peaks. In theory, simply selecting the first three peaks in the spectrum should provide the first three formants. However, due to noise in the speech signal, non-formant peaks can be confused for formant peaks and true formant peaks can be obscured. To account for this, prior art systems qualify each peak by examining the bandwidth of the peak. If the bandwidth is too large, the peak is eliminated as a candidate formant. The lowest three peaks that meet the bandwidth threshold are then selected as the first three formants.
Although such systems provided a fair representation of the formant track, they are prone to errors such as discarding true formants, selecting peaks that are not formants, and incorrectly estimating the bandwidth of the formants. These errors are not detected during the formant selection process because prior art systems select formants for one segment of the speech signal at a time without making reference to formants that had been selected for previous segments.
To overcome this problem, some systems use heuristic smoothing after all of the formants have been selected. Although such post-decision smoothing removes some discontinuities between the formants, it is less than optimal.
In speech synthesis, the quality of the formant track in the synthesized speech depends on the technique used to create the speech. Under a concatenative system, sub-word units are spliced together without regard for their respective formant values. Although this produces sub-word units that sound natural by themselves, the complete speech signal sounds unnatural because of discontinuities in the formant track at sub-word boundaries. Other systems use rules to control how a formant changes over time. Such rule-based synthesizers never exhibit the discontinuities found in concatenative synthesizers, but their simplified model of how the formant track should change over time produces an unnatural sound.
The present invention utilizes a formant-based model to improve the creation of formant tracks in synthesized speech. Text is divided into a sequence of formant model states, which are used to retrieve a sequence of stored excitation segments. The states are also provided to a formant path generator, which determines a set of most likely formant paths given the sequence of model states and the formant models for each state. The formant paths are then used to control a series of resonators, which introduce the formants into the sequence of excitation segments. This produces a sequence of speech segments that are later combined to form the synthesized speech signal.