Time-scale modification (TSM) is an emerging topic in audio digital signal processing, driven by advances in low-cost, high-speed hardware that enable real-time processing on portable devices. Possible applications include intelligible audio during fast-forward playback, real-time music manipulation, and foreign-language training. Most time-scale modification algorithms can be classified as either frequency-domain or time-domain. Frequency-domain time-scale modification provides higher quality for polyphonic sounds, while time-domain time-scale modification is more suitable for narrow-band signals such as voice. Because of its lower computational cost, time-domain time-scale modification is the natural choice in resource-limited applications.
The basic operation of time-domain time-scale modification is to successively overlap and add audio frames; time scaling is achieved by changing the spacing between the frames. It is known in the art to calculate the exact overlap point based on a measure of similarity between the signals to be overlapped. This measure of similarity is generally based on cross-correlation.
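The frame-spacing idea can be illustrated with a minimal sketch. It assumes a Hann window, fixed hop sizes, and division by the summed window weights; none of these choices comes from the text above, and the names `ola_stretch`, `frame_len`, `analysis_hop` and `synthesis_hop` are hypothetical. The sketch deliberately omits the similarity-based overlap adjustment, so it shows only how changing the spacing changes the duration.

```python
import math


def hann(n):
    """Periodic Hann window of length n (an assumed window choice)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def ola_stretch(x, frame_len, analysis_hop, synthesis_hop):
    """Plain overlap-add: read frames every analysis_hop samples and
    write them every synthesis_hop samples. The duration changes by
    roughly synthesis_hop / analysis_hop."""
    if len(x) < frame_len:
        raise ValueError("input is shorter than one frame")
    window = hann(frame_len)
    n_frames = 1 + (len(x) - frame_len) // analysis_hop
    out_len = (n_frames - 1) * synthesis_hop + frame_len
    out = [0.0] * out_len
    weight = [0.0] * out_len
    for m in range(n_frames):
        read_pos = m * analysis_hop    # frame position in the input
        write_pos = m * synthesis_hop  # frame position in the output
        for i in range(frame_len):
            out[write_pos + i] += window[i] * x[read_pos + i]
            weight[write_pos + i] += window[i]
    # Divide by the summed window weights so overlapping frames
    # do not change the signal level.
    return [o / w if w > 1e-9 else 0.0 for o, w in zip(out, weight)]
```

For example, `ola_stretch(x, 200, 50, 100)` roughly doubles the duration of `x`. Because the frames are placed without regard to waveform alignment, periodic signals generally show discontinuities at the frame boundaries, which is the motivation for the similarity-based overlap point.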
Most time-domain time-scale modification algorithms are derived from the synchronous overlap-and-add (SOLA) method. The SOLA algorithm and its variations are based on successive overlap and addition of audio frames. For each overlap, the overlap point is adjusted by computing a measure of signal similarity between the overlapping regions at every candidate overlap position within a range bounded by minimum and maximum overlap points. The position of maximum similarity is selected. The signal similarity measure can be a full cross-correlation function or a simplified version of it. This similarity calculation accounts for about 80% or more of the total computation required by the algorithm.
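The overlap-point search described above might be sketched as follows. This is an illustrative sketch, not the reference SOLA formulation: it assumes a normalized cross-correlation as the similarity measure and a fixed comparison length, and the names `find_overlap_point`, `k_min`, `k_max` and `compare_len` are hypothetical.

```python
import math


def normalized_cross_correlation(a, b):
    """Similarity in [-1, 1] between two equal-length segments."""
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den if den > 0.0 else 0.0


def find_overlap_point(output, frame, k_min, k_max, compare_len):
    """Try every candidate position k in [k_min, k_max] and return the
    (position, similarity) pair where the start of `frame` best matches
    `output[k : k + compare_len]`."""
    reference = frame[:compare_len]
    best_k, best_score = k_min, float("-inf")
    for k in range(k_min, k_max + 1):
        segment = output[k:k + compare_len]
        if len(segment) < compare_len:
            break  # candidate region runs past the end of the output
        score = normalized_cross_correlation(segment, reference)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score
```

The nested loop (every candidate position, each evaluated over `compare_len` samples) makes the cost proportional to the search range times the comparison length, which is consistent with the similarity calculation dominating the total computation.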
Even though SOLA-based methods represent an attractive low-cost solution to the time-scale modification problem, their limitations stand out in the case of polyphonic music signals. Their intrinsic problem is that the audio signal is treated as a whole, without consideration for its individual frequency components, so an overlap point adjustment based on signal similarity cannot simultaneously generate smooth transitions for all of the frequency components of the signal.
A family of methods known as phase vocoders performs time-scale modification in the frequency domain. The input signal is analyzed in equally spaced, overlapping, windowed frames using a short-time discrete Fourier transform. Next, the phase difference for spectral peaks is calculated. This phase difference is the difference between the phase of the input signal and the phase of the time-scale-modified signal. An intrinsic sinusoidal model is generally used. The frequency of each spectral line is represented by the sum Ωk + ωik, where the carrier Ωk is 2πk/N and ωik is an instantaneous frequency modulator. Here, k is the spectral line number and N is the size of the short-time discrete Fourier transform. An estimate of ωik is obtained for each spectral line from the phase difference between two consecutive analysis frames. The process reconstructs an output signal from the analyzed frames using a short-time inverse discrete Fourier transform. The frames are overlapped by a different overlap factor to achieve the desired time scaling. The instantaneous frequency ωik is used to calculate the phase corresponding to each spectral line at the time-shifted instant.
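The frequency estimate for a single spectral line can be sketched as follows. This follows the common textbook phase-vocoder formulation and is not necessarily the exact procedure of any particular implementation. The Hann window, the direct single-bin DFT, and the names `windowed_dft_bin`, `princarg`, `instantaneous_frequency` and `advance_synthesis_phase` are assumptions made for illustration; the estimate is valid only while the wrapped phase deviation stays within half a cycle per hop.

```python
import cmath
import math


def windowed_dft_bin(frame, k):
    """Spectral line k of a Hann-windowed DFT, computed directly."""
    n = len(frame)
    return sum(
        (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
        * x * cmath.exp(-2j * math.pi * k * i / n)
        for i, x in enumerate(frame)
    )


def princarg(phase):
    """Wrap a phase value into the interval [-pi, pi)."""
    return (phase + math.pi) % (2.0 * math.pi) - math.pi


def instantaneous_frequency(prev_frame, next_frame, k, analysis_hop):
    """Estimate the frequency (radians per sample) of spectral line k
    from the phase difference between two consecutive analysis frames."""
    n = len(prev_frame)
    carrier = 2.0 * math.pi * k / n  # carrier term 2*pi*k/N
    dphi = (cmath.phase(windowed_dft_bin(next_frame, k))
            - cmath.phase(windowed_dft_bin(prev_frame, k)))
    # Remove the phase advance expected from the carrier, wrap the rest,
    # and convert it to a frequency deviation per sample.
    deviation = princarg(dphi - carrier * analysis_hop) / analysis_hop
    return carrier + deviation


def advance_synthesis_phase(prev_phase, inst_freq, synthesis_hop):
    """Phase of the spectral line at the time-shifted output instant."""
    return princarg(prev_phase + inst_freq * synthesis_hop)
```

As a usage example, for a sinusoid lying between lines 16 and 17 of a 256-point transform, `instantaneous_frequency(sig[:256], sig[64:320], 16, 64)` recovers a frequency close to the true value even though the line itself sits at the carrier frequency. Evaluating this for every line of every frame, in addition to the forward and inverse transforms, is part of the spectrum manipulation cost.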
Even though phase vocoders can potentially achieve higher quality than time-domain methods, a severe limitation is the large amount of computation required by the forward and inverse discrete Fourier transforms and by the spectrum manipulation process. Practical implementations on fixed-point processors have a computational cost up to 10 times higher than that of time-domain time-scale modification methods. In addition, maintaining phase coherence between frames is not an easy task and can be a source of artifacts.