In one embodiment, the present invention relates to a method and apparatus for modifying an audio signal that employ table lookup to perform non-linear transformations of the short-time Fourier transform of the audio signal.
Reproduction and modification of audio signals have posed a significant challenge for many years. Early attempts to accurately reproduce audio signals had various drawbacks. For example, an early attempt at reproducing speech signals employed linear predictive (LP) modeling, described by J. Makhoul, "Linear Prediction: A Tutorial Review," Proc. IEEE, vol. 63, pp. 561-580, April 1975. In this approach, the speech production process is modeled as a linear time-varying, all-pole vocal tract filter driven by an excitation signal representing characteristics of the glottal waveform. However, LP modeling is inherently constrained by the assumption that the vocal tract may be modeled as an all-pole filter. Deviations of an actual vocal tract from this ideal result in an excitation signal without the purely pulse-like or noisy structure assumed in the excitation model. As a result, the reproduced speech has noticeable and objectionable distortions.
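The all-pole model described above can be illustrated with a short Python sketch that drives a fixed all-pole filter with an excitation sequence. The function name `all_pole_synthesize`, the coefficient list `a`, and the sign convention s(n) = e(n) + sum over k of a(k) s(n-k) are assumptions made for illustration only; an actual vocal tract model would also update the coefficients over time.

```python
def all_pole_synthesize(excitation, a):
    """Filter `excitation` through a fixed all-pole filter (illustrative sketch).

    Assumed convention: s[n] = e[n] + sum(a[k-1] * s[n-k] for k = 1..p),
    where p = len(a). Samples before n = 0 are taken to be zero.
    """
    out = []
    for n, e in enumerate(excitation):
        s = e
        for k, coeff in enumerate(a, start=1):
            if n - k >= 0:
                s += coeff * out[n - k]
        out.append(s)
    return out


# A single pulse through a one-pole filter decays geometrically.
pulse = [1.0, 0.0, 0.0, 0.0]
response = all_pole_synthesize(pulse, [0.5])
print(response)
```

When the actual vocal tract is not well described by such a filter, the excitation needed to reproduce the speech is no longer a clean pulse train or noise, which is the limitation noted above.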
Frequency-domain representations of audio signals, such as speech, overcome many of the drawbacks associated with linear predictive modeling. Frequency-domain representation of audio signals is based upon the observations that much of the speech information is frequency related and that speech production is an inherently non-stationary process. As discussed in the article by J. L. Flanagan and R. M. Golden, "Phase Vocoder," Bell Sys. Tech. J., vol. 45, pp. 1493-1509, 1966, a short-time Fourier transform (STFT) formulation of an audio signal may be employed to parameterize speech production information in a manner very similar to LP modeling. This is commonly referred to as the digital phase vocoder (DPV) and is capable of performing speech modifications without the constraints of LP modeling. However, the DPV is computationally intensive, limiting its usefulness in real-time applications.
To reduce the computational intensity of the DPV, another approach employs the discrete short-time Fourier transform (DSTFT), implemented using a Fast Fourier Transform (FFT) algorithm. This enables modeling of an audio signal as a discrete signal x(n) that can be reconstructed from a sequence X(k,m) of its windowed Discrete Fourier Transforms (DFTs) by applying an inverse Discrete Fourier Transform to each DFT and then properly weighting and overlap-adding the sequence of inverse DFTs: ##EQU1##
and L is the spacing between successive DFTs. It is also well known that modified versions of x(n) can be obtained by applying the above reconstruction formula to a sequence of modified DFTs. Owing to the success of the DSTFT in reducing computational complexity, many prior art methods have been employed to modify the differing audio information contained therein. For example, M. R. Portnoff, in "Time-Scale Modification of Speech Based on Short-Time Fourier Analysis," IEEE Trans. Acoustics, Speech, and Signal Proc., vol. ASSP-29, No. 3, pp. 374-390 (1981), describes a technique for reducing phase distortions that arise when employing the modified DSTFT.
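The analysis and overlap-add reconstruction described above may be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the reconstruction formula itself: the periodic Hann window, the normalization by the summed squared window, the naive O(N^2) DFT (used in place of an FFT to keep the sketch dependency-free), and all function names are hypothetical choices. The weighting actually used in a given system may differ.

```python
import cmath
import math


def dft(frame):
    """Naive DFT of one frame (an FFT would be used in practice)."""
    N = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]


def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    N = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]


def hann(N):
    """Periodic Hann window (an assumed window choice)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]


def analyze(x, window, L):
    """Windowed DFTs X(k, m), taken every L samples."""
    N = len(window)
    return [dft([x[start + n] * window[n] for n in range(N)])
            for start in range(0, len(x) - N + 1, L)]


def synthesize(frames, window, L, length):
    """Inverse-transform each DFT, weight by the window, and overlap-add.

    Dividing by the summed squared window is an assumed normalization.
    """
    N = len(window)
    y = [0.0] * length
    norm = [0.0] * length
    for m, X in enumerate(frames):
        segment = idft(X)
        for n in range(N):
            y[m * L + n] += segment[n] * window[n]
            norm[m * L + n] += window[n] ** 2
    return [y[i] / norm[i] if norm[i] > 1e-12 else 0.0 for i in range(length)]


# Round trip on a short sinusoid, then a trivially "modified" version.
x = [math.sin(2 * math.pi * 3 * n / 64) for n in range(64)]
window = hann(16)
frames = analyze(x, window, 4)
y = synthesize(frames, window, 4, len(x))
halved = synthesize([[0.5 * c for c in X] for X in frames], window, 4, len(x))
```

Here `y` reproduces `x` away from the signal edges, and scaling every DFT by 0.5 yields a signal at half the amplitude, a simple instance of obtaining a modified signal from modified DFTs.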
U.S. Pat. No. 4,856,068 to Quatieri, Jr. et al. describes an audio pre-processing method and apparatus to achieve a flattened time-domain envelope to satisfy peak power constraints. Specifically, an audio signal, representing a speech waveform, is processed before transmission to reduce the peak-to-RMS ratio of the waveform. The system estimates and removes natural phase dispersion in the frequency component of the speech signal. Artificial dispersion based on pulse compression techniques is then introduced with little change in speech quality. The new phase dispersion allocation serves to pre-process the waveform prior to dynamic range compression and clipping. In this fashion, deeper thresholding may be accomplished than would otherwise be the case on the original speech waveform.
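To make the peak-to-RMS ratio concrete, the following Python sketch compares a sum of harmonics with aligned phases against the same harmonics with a quadratic phase law. It is only an illustration of how phase dispersion affects the ratio and does not reproduce the method of the cited patent; the quadratic phase law, the harmonic count, and the function names are assumptions.

```python
import math


def peak_to_rms(signal):
    """Ratio of the peak absolute value to the RMS value."""
    peak = max(abs(v) for v in signal)
    rms = math.sqrt(sum(v * v for v in signal) / len(signal))
    return peak / rms


def harmonic_sum(phases, num_samples=256):
    """One period of sum_k cos(2*pi*k*n/num_samples + phases[k-1])."""
    return [
        sum(math.cos(2 * math.pi * k * n / num_samples + ph)
            for k, ph in enumerate(phases, start=1))
        for n in range(num_samples)
    ]


K = 16  # number of harmonics (assumed for illustration)

# All harmonics in phase: energy piles up into one sharp pulse per period.
aligned = harmonic_sum([0.0] * K)

# Quadratic (chirp-like) phase law, assumed here as a simple dispersion.
dispersed = harmonic_sum([-math.pi * k * (k - 1) / K for k in range(1, K + 1)])

print(peak_to_rms(aligned), peak_to_rms(dispersed))
```

Both signals have identical magnitude spectra and therefore the same RMS value, but the dispersed version has a lower peak, leaving more headroom before clipping.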
U.S. Pat. No. 4,885,790 to McAulay et al. describes an analysis/synthesis technique for processing an audio signal, such as a speech waveform, that characterizes the waveform by the amplitudes, frequencies, and phases of its component sine waves. These parameters are estimated from a short-time Fourier transform, with rapid changes in highly-resolved spectral components being tracked using the concept of "birth" and "death" of the underlying sine waves. The component values are interpolated from one frame to the next to yield a representation that is applied to a sine wave generator. The resulting synthetic waveform preserves the general waveform shape.
There exists a need, however, for computationally efficient approaches for selectively modifying a subportion of the information contained in a DSTFT representation of audio signals without substantially affecting the remaining audio information contained therein.