A difficult problem in audio signal synthesis, especially synthesis of musical instrument sounds, is modeling the time-varying spectrum of the synthesized audio signal. The spectrum generally changes with pitch and loudness. In the present invention, we describe methods and means for estimating the time-varying spectrum of an audio signal based on a conditional probability density function (PDF) of spectral coding vectors conditioned on pitch and loudness values. We also describe methods and means for synthesizing an output audio signal in response to an input audio signal by estimating a time-varying input spectrum based on a conditional PDF of input spectral coding vectors conditioned on input pitch and loudness values and for deriving a residual spectrum based on the difference between the estimated spectrum and the "true" spectrum of the input signal. The residual spectrum is then incorporated into the synthesis of the output audio signal.
In Continuous Probabilistic Transform for Voice Conversion, IEEE Transactions on Speech and Audio Processing, Volume 6 Number 2, March 1988, by Stylianou et al., a system for transforming a human voice recording so that it sounds like a different voice is described, in which a voiced speech signal is coded using time-varying harmonic amplitudes. A cross-covariance matrix of harmonic amplitudes is used to describe the relationship between the original voice spectrum and desired transformed voice spectrum. This cross-covariance matrix is used to transform the original harmonic amplitudes into new harmonic amplitudes. To generate the cross-covariance matrix speech recordings are collected for the original and transformed voice spectra. For example, if the object is to transform a male voice into a female voice then a number of phrases are recorded of a male and a female speaker uttering the same phrases. The recorded phrases are time-aligned and converted to harmonic amplitudes. Cross-correlations are computed between the male and female utterances of the same phrase. This is used to generate the cross-covariance matrix that provides a map from the male to the female spectra. The present invention is oriented more towards musical instrument sounds where the spectrum is correlated with pitch and loudness. This specification describes methods and means of transforming an input to an output spectrum without deriving a cross-covariance matrix. This is important since it means that time-aligned utterances of the same phrases do not need to be gathered.
U.S. Pat. No. 5,744,742, to Lindemann et al., teaches a music synthesis system wherein during a sustain portion of the tone, amplitude levels of an input amplitude envelope are used to select filter coefficient sets in a sustain codebook of filter coefficient sets arranged according to amplitude. The sustain codebook is selected from a collection of sustain codebooks according to the initial pitch of the tone. Interpolation between adjacent filter coefficient sets in the selected codebook is implemented as a function of particular amplitude envelope values. This system suffers from a lack of responsiveness of spectrum changes due to continuous changes in pitch since the codebook is selected according to initial pitch only. Also, the ad-hoc interpolation between adjacent filter coefficient sets is not based on a solid PDF model and so is particularly vulnerable to spectrum outliers and does not take into consideration the variance of filter coefficient sets associated with a particular pitch and amplitude level. Nor does the system consider the residual spectrum related to incorrect estimates of spectrum from pitch and amplitude. These defects in the system make it difficult to model rapidly changing spectra as a function of pitch and loudness, and so restrict the use of the system to sustain portions of a tone only. The attack and release portion of the tone are modeled by deterministic sequences of filter coefficients that do not respond to instantaneous pitch and loudness.