The present invention deals with formant tracking. More specifically, the present invention deals with formant tracking using a formant synthesizer.
The human vocal tract has a number of resonances, which are referred to as formants. The speaker can change the frequency of these resonances to produce different sounds. For example, the speaker can change the configuration of the vocal tract by movement of the tongue or lips and the inclusion or exclusion of the nasal tract. These resonances are excited by the movement of the vocal cords or by noise generated at a constriction of the vocal tract. Each sound has an associated set of resonances, and when sounds are strung together in sequence over time, they form words.
In speech analysis, the first three resonances (or formants) are generally of primary interest. Higher frequency formants vary minimally, and are usually based on the length of the particular speaker's vocal tract. Thus, the higher frequency formants do not carry a great deal of information with respect to the words being spoken.
The formants associated with each sound can vary a great deal from speaker-to-speaker. Further, formants can vary from one utterance to another, even for the same speaker. Thus, tracking formants is quite difficult.
Formant trackers are conventionally used to identify and track formants in human speech. This information is useful in speech analysis. Standard formant trackers perform linear prediction on the speech signal in order to identify the resonances or formants associated with the speech signal. In other words, at some point in time, n, the speech signal is represented as follows:

$$s(n) = a_1 s(n-1) + a_2 s(n-2) + \dots + x(n) = \sum_{i=1}^{p} a_i \, s(n-i) + x(n) \tag{1}$$
where $s(n)$ is the speech signal, $x(n)$ is the excitation, $p$ is the prediction order, and the coefficients $a_i$ are the linear prediction coefficients, which characterize the filtering effect of the vocal tract.
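By way of illustration only, the coefficients of the linear prediction model can be estimated from a block of samples. The following minimal sketch assumes the autocorrelation method with the Levinson-Durbin recursion, which is one common choice and is not stated in this description. The function names `autocorrelation`, `lpc_coefficients`, and `synthesize` are hypothetical.

```python
def autocorrelation(signal, max_lag):
    """Return r[0..max_lag], where r[k] is the sum over n of s(n) * s(n - k)."""
    n = len(signal)
    return [sum(signal[t] * signal[t - lag] for t in range(lag, n))
            for lag in range(max_lag + 1)]


def lpc_coefficients(signal, order):
    """Estimate a_1..a_p in s(n) = sum_i a_i * s(n - i) + x(n).

    Assumption: autocorrelation method solved by Levinson-Durbin.
    """
    r = autocorrelation(signal, order)
    a = [0.0] * (order + 1)      # a[i] holds a_i; a[0] is unused
    error = r[0]                 # prediction error energy so far
    for m in range(1, order + 1):
        if error <= 0.0:         # silent or perfectly predicted signal
            break
        # Reflection coefficient for model order m.
        k = (r[m] - sum(a[i] * r[m - i] for i in range(1, m))) / error
        updated = a[:]
        updated[m] = k
        for i in range(1, m):
            updated[i] = a[i] - k * a[m - i]
        a = updated
        error *= 1.0 - k * k
    return a[1:]


def synthesize(coeffs, excitation):
    """Run the model forward: s(n) = sum_i a_i * s(n - i) + x(n)."""
    out = []
    for n, x in enumerate(excitation):
        past = sum(c * out[n - i]
                   for i, c in enumerate(coeffs, start=1) if n - i >= 0)
        out.append(past + x)
    return out


# Usage: drive a known two-coefficient model with a single impulse,
# then recover the coefficients from the resulting signal.
signal = synthesize([1.2, -0.8], [1.0] + [0.0] * 399)
estimated = lpc_coefficients(signal, order=2)   # close to [1.2, -0.8]
```

The recursion is used here because it avoids a general matrix solve. Any method that solves the same normal equations would serve equally well for this sketch.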
The roots of the prediction polynomial defined by the coefficients $a_i$ represent poles, and a single complex-conjugate pole pair has a specific frequency response, namely a single resonance. Thus, each formant track (each set of three formants) corresponds to three pole pairs.
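The relationship between a pole pair and the resonance it produces can be sketched as follows. The conversion uses the standard discrete-time relations in which the pole angle gives the center frequency and the pole radius gives the bandwidth. The function names and the sample rate argument are assumptions made for illustration.

```python
import cmath
import math


def pole_to_formant(pole, sample_rate):
    """Map one pole of a conjugate pair to (frequency_hz, bandwidth_hz)."""
    frequency = abs(cmath.phase(pole)) * sample_rate / (2.0 * math.pi)
    bandwidth = -math.log(abs(pole)) * sample_rate / math.pi
    return frequency, bandwidth


def formant_to_pole(frequency, bandwidth, sample_rate):
    """Inverse mapping: the upper-half-plane pole for a given resonance."""
    radius = math.exp(-math.pi * bandwidth / sample_rate)
    angle = 2.0 * math.pi * frequency / sample_rate
    return cmath.rect(radius, angle)


# Usage: a 500 Hz resonance with an 80 Hz bandwidth at an 8 kHz sample rate.
pole = formant_to_pole(500.0, 80.0, 8000)
frequency, bandwidth = pole_to_formant(pole, 8000)
```

A pole close to the unit circle yields a narrow bandwidth, which corresponds to a sharp, strongly resonant peak.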
A conventional formant tracker divides the speech signal into consecutive frames having a predetermined duration (such as 10 milliseconds). By taking the roots of the filter defined by Equation 1, the resonances for each frame can be found. However, for each 10 millisecond frame, the linear prediction algorithm may identify a relatively large number (such as seven) of resonances. Although this number can be controlled in performing the linear prediction calculations, more than three resonances must be calculated, in order to model any noise or non-linearities present in the signal. The formant tracker then attempts to find smooth paths for three primary formants at each frame, given the seven resonances identified by the linear prediction algorithm.
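The per-frame analysis described above can be sketched as follows. This is a minimal illustration and not the method of any particular tracker: it assumes NumPy is available, solves the linear prediction normal equations directly, applies no analysis window, and uses the hypothetical names `split_into_frames`, `lpc`, and `candidate_resonances`. A prediction order of 14 would yield up to seven candidate resonances per frame.

```python
import numpy as np


def split_into_frames(signal, sample_rate, frame_ms=10):
    """Cut the signal into consecutive, non-overlapping frames."""
    step = int(sample_rate * frame_ms / 1000)
    return [signal[i:i + step] for i in range(0, len(signal) - step + 1, step)]


def lpc(frame, order):
    """Solve the autocorrelation normal equations for a_1..a_p.

    Raises numpy.linalg.LinAlgError for a silent (all-zero) frame.
    """
    frame = np.asarray(frame, dtype=float)
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    big_r = np.array([[r[abs(i - j)] for j in range(order)]
                      for i in range(order)])
    return np.linalg.solve(big_r, r[1:])


def candidate_resonances(frame, order, sample_rate):
    """Return (frequency_hz, bandwidth_hz) pairs, sorted by frequency."""
    a = lpc(frame, order)
    # Poles are the roots of z^p - a_1 z^(p-1) - ... - a_p.
    poles = np.roots(np.concatenate(([1.0], -a)))
    poles = poles[np.imag(poles) > 0]        # keep one pole of each pair
    freqs = np.angle(poles) * sample_rate / (2.0 * np.pi)
    bands = -np.log(np.abs(poles)) * sample_rate / np.pi
    order_idx = np.argsort(freqs)
    return [(float(freqs[i]), float(bands[i])) for i in order_idx]
```

A tracker would call `candidate_resonances` on each frame returned by `split_into_frames` and then choose three of the candidates per frame. That selection step is where the difficulties described below arise.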
Conventional formant trackers have problems. The primary problem is that they can fail to select the proper resonances from among those identified by linear prediction, and thus fail to find the proper formants. Also, conventional formant trackers can produce discontinuous formant tracks when resonances are identified inaccurately.
Formant synthesizers are a type of speech synthesizer used to produce speech from a phonetic description of an utterance. Formant synthesizers are generally trained by phoneticians, who in essence codify their knowledge of speech production into the mathematical codes and data tables that the formant synthesizer uses to generate formants from a phonetic representation of an utterance.
During synthesis, the input text is typically broken into phonemic units, and those units are provided to the formant synthesizer. The formant synthesizer then generates formants or formant tracks which are reasonable and expected based on the speech units input into the synthesizer. Normally, the formant tracks are then used to create synthetic speech.
Formants corresponding to input speech units are generated from a formant synthesizer. A frequency response is generated based on the synthesized formants. A second frequency response is generated based on a speech signal which is received and which corresponds to utterances of the speech units. The synthesized formants are modified based on a comparison of the frequency response corresponding to the synthesized formants and the frequency response of the input speech signal.
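The comparison and modification steps summarized above can be illustrated with a minimal sketch. Several choices below are assumptions, since this summary does not specify them: the frequency response is modeled as a cascade of two-pole resonators, the comparison is a gain-invariant RMS difference in decibels, and the modification is a simple greedy search over formant frequencies. The names `formant_response`, `spectral_distance`, and `refine_formants` are hypothetical.

```python
import cmath
import math


def formant_response(formants, sample_rate, n_points=128):
    """Log-magnitude response (dB) of cascaded two-pole resonators,
    sampled from 0 Hz up to just below the Nyquist frequency."""
    response = []
    for k in range(n_points):
        z_inv = cmath.exp(-1j * math.pi * k / n_points)
        denom = 1.0
        for freq, bandwidth in formants:
            radius = math.exp(-math.pi * bandwidth / sample_rate)
            theta = 2.0 * math.pi * freq / sample_rate
            denom *= 1.0 - 2.0 * radius * math.cos(theta) * z_inv \
                + radius * radius * z_inv * z_inv
        response.append(-20.0 * math.log10(abs(denom)))
    return response


def spectral_distance(resp_a, resp_b):
    """RMS difference in dB after removing the mean (overall gain) offset."""
    diffs = [a - b for a, b in zip(resp_a, resp_b)]
    mean = sum(diffs) / len(diffs)
    return math.sqrt(sum((d - mean) ** 2 for d in diffs) / len(diffs))


def refine_formants(formants, target, sample_rate, step_hz=10.0, max_passes=50):
    """Greedily shift each formant frequency while the match to the
    target response improves (an assumed, illustrative strategy)."""
    current = list(formants)
    n_points = len(target)
    best = spectral_distance(
        formant_response(current, sample_rate, n_points), target)
    for _ in range(max_passes):
        improved = False
        for i, (freq, bandwidth) in enumerate(current):
            for delta in (-step_hz, step_hz):
                if not 0.0 < freq + delta < sample_rate / 2.0:
                    continue
                trial = list(current)
                trial[i] = (freq + delta, bandwidth)
                dist = spectral_distance(
                    formant_response(trial, sample_rate, n_points), target)
                if dist < best:      # accept only strict improvements
                    current, best, improved = trial, dist, True
                    break
        if not improved:
            break
    return current
```

In use, `target` would be a frequency response computed from the received speech signal for the same frame, and `formants` would be the values produced by the formant synthesizer for the corresponding speech unit. Because only improving moves are accepted, the refined formants never match the target worse than the synthesized starting point.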