The present invention relates to speech recognition and synthesis systems and in particular to speech systems that exploit formants in speech.
In human speech, a great deal of information is contained in the first three resonant frequencies or formants of the speech signal. In particular, when a speaker is pronouncing a vowel, the frequencies and bandwidths of the formants indicate which vowel is being spoken.
To detect formants, some systems of the prior art utilize the speech signal's frequency spectrum, where formants appear as peaks. In theory, simply selecting the first three peaks in the spectrum should provide the first three formants. However, due to noise in the speech signal, non-formant peaks can be mistaken for formant peaks and true formant peaks can be obscured. To account for this, prior art systems qualify each peak by examining the bandwidth of the peak. If the bandwidth is too large, the peak is eliminated as a candidate formant. The lowest three peaks that meet the bandwidth threshold are then selected as the first three formants.
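The bandwidth-qualified peak selection described above can be illustrated with a minimal sketch. The function name `select_formants`, the (frequency, bandwidth) tuple layout, and the 400 Hz default threshold are assumptions made for illustration only; the text does not specify a threshold value.

```python
def select_formants(peaks, max_bandwidth_hz=400.0, count=3):
    """Pick the lowest-frequency peaks whose bandwidth is narrow enough.

    peaks: iterable of (frequency_hz, bandwidth_hz) spectral peaks.
    max_bandwidth_hz: hypothetical qualification threshold (assumed value).
    """
    # Discard peaks that are too broad to be plausible formants.
    qualified = [p for p in peaks if p[1] <= max_bandwidth_hz]
    # Keep the `count` lowest-frequency survivors as F1, F2, F3.
    return sorted(qualified)[:count]


peaks = [(2500.0, 120.0), (300.0, 900.0), (700.0, 80.0),
         (1200.0, 100.0), (3500.0, 150.0)]
formants = select_formants(peaks)
```

In this example the broad 300 Hz peak is rejected, so the peak at 700 Hz is taken as the first formant. Because each segment is processed in isolation, nothing in such a routine can detect that a choice is inconsistent with neighboring segments.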
Although such systems provide a fair representation of the formant track, they are prone to errors such as discarding true formants, selecting peaks that are not formants, and incorrectly estimating the bandwidth of the formants. These errors are not detected during the formant selection process because prior art systems select formants for one segment of the speech signal at a time, without reference to the formants selected for previous segments.
To overcome this problem, some systems use heuristic smoothing after all of the formants have been selected. Although such post-decision smoothing removes some discontinuities between the formants, it is less than optimal.
In speech synthesis, the quality of the formant track in the synthesized speech depends on the technique used to create the speech. Under a concatenative system, sub-word units are spliced together without regard for their respective formant values. Although this produces sub-word units that sound natural by themselves, the complete speech signal sounds unnatural because of discontinuities in the formant track at sub-word boundaries. Other systems use rules to control how a formant changes over time. Such rule-based synthesizers never exhibit the discontinuities found in concatenative synthesizers, but their simplified model of how the formant track should change over time produces an unnatural sound.
The present invention utilizes a formant-based model to improve formant tracking and to improve the creation of formant tracks in synthesized speech.
Under one aspect of the invention, a formant-based model is used to track formants in an input speech signal. Under this part of the invention, the input speech signal is divided into segments and each segment is examined to identify candidate formants. The candidate formants are grouped together and sequences of groups are identified for a sequence of speech segments. Using the formant model, the probability of each sequence of groups is then calculated with the most likely sequence being selected. This sequence of groups then defines the formant tracks for the sequence of segments.
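One way to select the most likely sequence of groups without enumerating every sequence is dynamic programming. The sketch below is illustrative only. It assumes a Gaussian score of each group against a per-segment model mean, plus a Gaussian penalty on frame-to-frame frequency change; the scoring terms, the variance values, and all function names are assumptions and are not taken from the text above.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def best_formant_track(candidates, state_means,
                       state_var=1.0e4, delta_var=4.0e4):
    """Return the highest-scoring sequence of formant groups.

    candidates[t]: list of groups for segment t; each group is a tuple
        of formant frequencies in Hz.
    state_means[t]: model mean frequencies for the state of segment t.
    state_var, delta_var: assumed variances (Hz^2) for the model fit and
        for the change between consecutive segments.
    """
    def emit(t, group):
        return sum(log_gauss(f, m, state_var)
                   for f, m in zip(group, state_means[t]))

    def trans(prev, cur):
        return sum(log_gauss(c - p, 0.0, delta_var)
                   for p, c in zip(prev, cur))

    scores = [emit(0, g) for g in candidates[0]]
    back = []  # back[t-1][j]: best predecessor index of group j at t
    for t in range(1, len(candidates)):
        prev_groups = candidates[t - 1]
        new_scores, pointers = [], []
        for g in candidates[t]:
            totals = [scores[i] + trans(prev_groups[i], g)
                      for i in range(len(prev_groups))]
            best_i = max(range(len(totals)), key=totals.__getitem__)
            new_scores.append(totals[best_i] + emit(t, g))
            pointers.append(best_i)
        scores = new_scores
        back.append(pointers)

    # Trace back from the best final group to recover the whole track.
    idx = max(range(len(scores)), key=scores.__getitem__)
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    path.reverse()
    return [candidates[t][i] for t, i in enumerate(path)]


candidates = [
    [(500.0, 1500.0, 2500.0), (500.0, 2500.0, 3500.0)],
    [(520.0, 1480.0, 2520.0), (1480.0, 2520.0, 3400.0)],
    [(510.0, 1510.0, 2490.0), (510.0, 2490.0, 3600.0)],
]
state_means = [(500.0, 1500.0, 2500.0)] * 3
track = best_formant_track(candidates, state_means)
```

Because the score of a group depends on the group chosen for the preceding segment, a selection that looks acceptable in isolation can be rejected when it produces an implausible jump in the track.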
Under one embodiment of the invention, the formant tracking system is used to train the formant model. Under this embodiment, the formant track selected for the sequence of segments is analyzed to generate a mean frequency and mean bandwidth for each formant in each formant model state. These mean frequencies and bandwidths are then used in place of the existing values in the formant model.
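The re-estimation step can be sketched as a per-state average. The data layout (one list of (frequency, bandwidth) pairs per segment and one state label per segment) and the function name `reestimate_means` are assumptions made for illustration; how segments are aligned to model states is outside the scope of this sketch.

```python
def reestimate_means(track, states):
    """Average frequency and bandwidth of each formant within each state.

    track[t]: list of (frequency_hz, bandwidth_hz), one pair per formant,
        for segment t of the selected formant track.
    states[t]: label of the formant model state assigned to segment t
        (the alignment is assumed to be given).
    Returns {state: [(mean_frequency, mean_bandwidth), ...]}.
    """
    sums = {}
    counts = {}
    for formants, state in zip(track, states):
        acc = sums.setdefault(state, [[0.0, 0.0] for _ in formants])
        for k, (freq, bw) in enumerate(formants):
            acc[k][0] += freq
            acc[k][1] += bw
        counts[state] = counts.get(state, 0) + 1
    return {
        state: [(f / counts[state], b / counts[state]) for f, b in acc]
        for state, acc in sums.items()
    }


track = [
    [(500.0, 60.0), (1500.0, 90.0)],
    [(520.0, 80.0), (1480.0, 110.0)],
    [(300.0, 50.0), (2200.0, 100.0)],
]
states = ["a", "a", "b"]
model = reestimate_means(track, states)
```

The returned values would replace the existing means in the formant model, after which tracking can be repeated with the updated model.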
Another aspect of the present invention is the compression of a speech signal based on a formant model. Under this aspect of the invention, the formant track is determined for the speech signal using the technique described above. The formant track is then used to control a set of filters, which remove the formants from the speech signal to produce a residual excitation signal. Under some embodiments, this residual excitation signal is further compressed by decomposing the signal into a voiced portion and an unvoiced portion. The magnitude spectra of both of these portions are then compressed into a smaller set of representative values.
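As an illustration of removing formants with filters, the sketch below uses a conventional second-order inverse (all-zero) filter for each formant, with zeros placed at radius r = exp(-pi * B / fs) and angle 2 * pi * F / fs. This particular filter form, the function names, and the example values are assumptions; the text above does not specify the filter structure.

```python
import math


def formant_coefficients(freq_hz, bandwidth_hz, sample_rate_hz):
    """Second-order coefficients for a formant at freq_hz with bandwidth_hz."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate_hz)
    theta = 2.0 * math.pi * freq_hz / sample_rate_hz
    return 2.0 * r * math.cos(theta), r * r


def remove_formant(signal, freq_hz, bandwidth_hz, sample_rate_hz):
    """Apply y[n] = x[n] - a1*x[n-1] + a2*x[n-2] to cancel one formant."""
    a1, a2 = formant_coefficients(freq_hz, bandwidth_hz, sample_rate_hz)
    out = []
    for n, x in enumerate(signal):
        x1 = signal[n - 1] if n >= 1 else 0.0
        x2 = signal[n - 2] if n >= 2 else 0.0
        out.append(x - a1 * x1 + a2 * x2)
    return out


def residual_excitation(signal, formants, sample_rate_hz):
    """Cascade one inverse filter per (frequency, bandwidth) formant."""
    for freq_hz, bandwidth_hz in formants:
        signal = remove_formant(signal, freq_hz, bandwidth_hz, sample_rate_hz)
    return signal


impulse = [1.0, 0.0, 0.0, 0.0]
residual = remove_formant(impulse, 2000.0, 100.0, 8000.0)
```

In a full system the coefficients would be updated from the formant track for each segment; the sketch holds them fixed for a single segment.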
A third aspect of the present invention uses the formant model to synthesize speech. Under this aspect, text is divided into a sequence of formant model states, which are used to retrieve a sequence of stored excitation segments. The states are also provided to a formant path generator, which determines a set of most likely formant paths given the sequence of model states and the formant models for each state. The formant paths are then used to control a series of resonators, which introduce the formants into the sequence of excitation segments. This produces a sequence of speech segments that are later combined to form the synthesized speech signal.
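The resonators that introduce formants into an excitation segment can be sketched as a cascade of second-order all-pole filters, one per formant. The recursive form y[n] = x[n] + b1*y[n-1] - b2*y[n-2], the function names, and the example values are assumptions made for illustration rather than details given in the text above.

```python
import math


def resonate(excitation, freq_hz, bandwidth_hz, sample_rate_hz):
    """Pass a segment through one second-order resonator.

    Poles sit at radius exp(-pi * B / fs) and angle 2 * pi * F / fs, so the
    output is boosted near freq_hz with roughly the given bandwidth.
    """
    r = math.exp(-math.pi * bandwidth_hz / sample_rate_hz)
    theta = 2.0 * math.pi * freq_hz / sample_rate_hz
    b1, b2 = 2.0 * r * math.cos(theta), r * r
    y1 = y2 = 0.0  # previous two outputs; reset for each segment here
    out = []
    for x in excitation:
        y = x + b1 * y1 - b2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_segment(excitation, formants, sample_rate_hz):
    """Cascade one resonator per (frequency, bandwidth) pair on the path."""
    for freq_hz, bandwidth_hz in formants:
        excitation = resonate(excitation, freq_hz, bandwidth_hz, sample_rate_hz)
    return excitation


impulse = [1.0, 0.0, 0.0, 0.0]
response = resonate(impulse, 2000.0, 100.0, 8000.0)
```

This sketch resets the filter memory for every segment. A practical synthesizer would more likely carry the resonator state across segment boundaries so that the combined speech signal remains continuous.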