The present invention relates generally to speech and waveform synthesis. The invention further relates to the extraction of formant-based source-filter data from complex waveforms. The technology of the invention may be used to construct text-to-speech and music synthesizers and speech coding systems. In addition, the technology can be used to realize high quality pitch tracking and pitch epoch marking. The cost functions employed by the present invention can be used as discriminatory functions or feature detectors in speech labeling and speech recognition.
One way of analyzing and synthesizing complex waveforms, such as waveforms representing synthesized speech or musical instruments, is to employ a source-filter model. Using the source-filter model, a source signal is generated and then run through a filter that adds resonances and coloration to the source signal. The combination of source and filter, if properly chosen, can produce a complex waveform that simulates human speech or the sound of a musical instrument.
In source-filter modeling, the source waveform can be comparatively simple: white noise or a simple pulse train, for example. In that case, the filter is typically complex. The complex filter is needed because it is the cumulative effect of source and filter that produces the complex waveform. Alternatively, the source waveform can be comparatively complex, in which case the filter can be simpler. Generally speaking, the source-filter configuration offers numerous design choices.
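The source-filter idea described above can be illustrated with a minimal sketch, assuming NumPy. The function names (`pulse_train`, `resonator`) and all parameter values are hypothetical choices for illustration only, not part of the invention: a simple impulse-train source is passed through a single second-order all-pole resonator, which adds one formant-like resonance.

```python
import numpy as np

def pulse_train(n_samples, period):
    """Simple glottal-like source: an impulse every `period` samples."""
    src = np.zeros(n_samples)
    src[::period] = 1.0
    return src

def resonator(signal, center_hz, bandwidth_hz, fs):
    """Second-order all-pole resonator (one formant-like resonance)."""
    r = np.exp(-np.pi * bandwidth_hz / fs)      # pole radius from bandwidth
    theta = 2.0 * np.pi * center_hz / fs        # pole angle from center frequency
    a1, a2 = 2.0 * r * np.cos(theta), -r * r    # recursion coefficients
    out = np.zeros_like(signal)
    for n in range(len(signal)):
        out[n] = signal[n]
        if n >= 1:
            out[n] += a1 * out[n - 1]
        if n >= 2:
            out[n] += a2 * out[n - 2]
    return out

fs = 8000
source = pulse_train(800, period=80)            # 100 Hz pitch at 8 kHz
speechlike = resonator(source, center_hz=500, bandwidth_hz=60, fs=fs)
```

The simple source carries only the pitch information; the resonance and spectral coloration come entirely from the filter, which is the cumulative effect the paragraph above describes.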
We favor a model that most closely represents the naturally occurring degree of separation between the human glottal source and the vocal tract filter. When analyzing the complex waveform of human speech, it is quite challenging to ascertain which aspects of the waveform may be attributed to the glottal source and which may be attributed to the vocal tract filter. It is theorized, and even expected, that the vocal tract acoustically interacts with the glottal waveform generated at the glottis. In many cases this interaction may be negligible; hence in synthesis it is common to ignore the interaction and treat source and filter as independent.
We believe that many synthesis systems fall short because their source-filter model strikes a poor balance between source complexity and filter complexity. The source model is often dictated by ease of generation rather than by sound quality. For instance, linear predictive coding (LPC) can be understood in terms of a source-filter model in which the source tends to be white (i.e., flat spectrum). This model is considerably removed from the natural separation between the human vocal tract and glottal source, and it results in poor estimates of the first formant and many discontinuities in the filter parameters.
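The LPC model mentioned above can be sketched as follows, assuming NumPy; the function name `lpc_coeffs` is hypothetical and the all-pole signal is a toy example, not speech. Standard autocorrelation LPC via the Levinson-Durbin recursion yields an analysis filter whose residual (the implied "source") is approximately white.

```python
import numpy as np

def lpc_coeffs(x, order):
    """Levinson-Durbin recursion on the autocorrelation of x.

    Returns a = [1, a1, ..., ap] such that the residual
    e[n] = x[n] + a1*x[n-1] + ... + ap*x[n-p] is least-squares minimal.
    """
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err                      # reflection coefficient
        new_a = a.copy()
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)                # residual energy shrinks each order
    return a

# Toy all-pole signal driven by white noise, then inverse-filtered:
rng = np.random.default_rng(0)
e = rng.standard_normal(4000)
x = np.zeros_like(e)
for n in range(len(e)):
    x[n] = e[n]
    if n >= 1:
        x[n] += 0.9 * x[n - 1]
    if n >= 2:
        x[n] -= 0.5 * x[n - 2]

a = lpc_coeffs(x, 2)
residual = np.convolve(a, x)[:len(x)]       # approximately white residual
```

Because the least-squares criterion whitens the residual, the source is flat by construction; this is precisely the white-source assumption criticized above as removed from the natural glottal source.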
An approach heretofore taken to overcome the shortcomings of LPC involves a procedure called "analysis by synthesis." Analysis by synthesis is a parametric approach that involves selecting a set of source parameters and a set of filter parameters, and then using these parameters to generate a source waveform. The source waveform is then passed through the corresponding filter, and the output waveform is compared with the original waveform by a distance measure. Different parameter sets are then tried until the distance is reduced to a minimum. The parameter set that achieves the minimum is then used as a coded form of the input signal.
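The analysis-by-synthesis loop just described can be reduced to a minimal sketch, assuming NumPy. The one-parameter model, the function name `synthesize`, and the grid of candidates are all hypothetical simplifications for illustration; practical systems search many source and filter parameters jointly.

```python
import numpy as np

def synthesize(decay, n=200):
    """Toy one-parameter model: an impulse through a one-pole filter."""
    y = np.zeros(n)
    y[0] = 1.0
    for i in range(1, n):
        y[i] = decay * y[i - 1]
    return y

# The "original" waveform to be coded (here made with a known parameter):
target = synthesize(0.9)

# Try candidate parameter sets; keep the one minimizing the distance measure.
candidates = np.arange(0.5, 0.99, 0.01)
errors = [np.sum((synthesize(d) - target) ** 2) for d in candidates]
best = candidates[int(np.argmin(errors))]
```

The minimizing parameter set (`best`) then serves as the coded form of the input signal, exactly as the paragraph above describes.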
Although analysis by synthesis does a good job of optimizing a parametric voice source with a vocal tract modeling filter, it imposes a parametric source model assumption that is difficult to work with.
The present invention takes a different approach. The present invention employs a filter and an inverse filter. The filter has an associated set of filter parameters, for example, the center frequency and bandwidth of each resonator. The inverse filter is designed as the inverse of the filter (e.g. poles of one become zeros of the other and vice versa). Thus the inverse filter has parameters that bear a relationship to the parameters of the filter. A speech signal is then supplied to the inverse filter to generate a residual signal. The residual signal is processed to extract a set of data points that define a line or curve (e.g. waveform) that may be represented as plural segments.
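The filter/inverse-filter relationship described above can be illustrated with a short sketch, assuming NumPy; the coefficient values are a hypothetical example, not parameters taught by the invention. An all-pole filter with one resonant pole pair is applied, and the corresponding all-zero inverse filter (the poles of one become the zeros of the other) recovers the input as the residual.

```python
import numpy as np

# Analysis polynomial a(z) with a pole pair at radius 0.9, angle 0.4 rad:
a = np.array([1.0, -1.8 * np.cos(0.4), 0.81])

rng = np.random.default_rng(0)
x = rng.standard_normal(256)                # arbitrary input signal

# All-pole filter: y[n] = x[n] - a1*y[n-1] - a2*y[n-2]
y = np.zeros_like(x)
for n in range(len(x)):
    y[n] = x[n]
    for k in range(1, 3):
        if n >= k:
            y[n] -= a[k] * y[n - k]

# Inverse (all-zero) filter: the filter's poles are this filter's zeros.
residual = np.convolve(a, y)[:len(y)]       # recovers x exactly
```

In the invention's setting, the speech signal plays the role of `y`: passing it through the inverse filter strips out the resonances described by the filter parameters, leaving a residual that approximates the source.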
Different processing steps may be employed to extract and analyze the data points, depending on the application. These processing steps include extracting time domain data from the residual signal and extracting frequency domain data from the residual signal, either performed separately or in combination with other signal processing steps.
The processing steps involve a cost calculation based on a length measure of the line or waveform, which we term "arc-length." The arc-length or its square is calculated and used as a cost parameter associated with the residual signal. The filter parameters are then selectively adjusted through iteration until the cost parameter is minimized. Once the cost parameter is minimized, the residual signal is used to represent an extracted source signal. The filter parameters associated with the minimized cost parameter may also then be used to construct the filter for a source-filter model synthesizer.
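The arc-length cost can be sketched as follows, assuming NumPy; the function name `arc_length` and the test waveforms are hypothetical illustrations, not the invention's actual cost function. The cost sums the straight-line lengths of the segments joining consecutive samples, so a jagged residual incurs a higher cost than a smooth one.

```python
import numpy as np

def arc_length(waveform, dt):
    """Sum of straight-line segment lengths between consecutive samples."""
    dy = np.diff(waveform)
    return np.sum(np.sqrt(dt * dt + dy * dy))

t = np.linspace(0.0, 1.0, 500)
dt = t[1] - t[0]
smooth = np.sin(2.0 * np.pi * 5.0 * t)                  # low arc-length cost
jagged = smooth + 0.3 * np.random.default_rng(1).standard_normal(500)
```

In an iterative search over filter parameters, each candidate inverse filter produces a residual, `arc_length` (or its square) scores it, and the parameters yielding the minimum cost are retained, as the paragraph above describes.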
Use of this method results in a smoothness or continuity in the output parameters. When these parameters are used to construct a source-filter model synthesizer, the synthesized waveform sounds remarkably natural, without distortions due to discontinuities. A class of cost functions, based on the arc-length measure, can be used to implement the invention. Several members of this class are described in the following specification. Others will be apparent to those skilled in the art.