Time-scale modification (TSM) is an emerging topic in digital audio signal processing, owing to advances in low-cost, high-speed hardware that enable real-time processing on portable devices. Possible applications include intelligible sound during fast-forward playback, real-time music manipulation, and foreign language training. Most time-scale modification algorithms can be classified as either frequency-domain time-scale modification (sometimes known as phase vocoders) or time-domain time-scale modification.
Frequency-domain time-scale modification is based upon reconstruction of a signal from its short-time discrete Fourier transform (ST-DFT), which maps overlapping windowed segments of the signal from the time domain to the frequency domain. Upon reconstruction, placing the windows at a spacing different from that used for analysis enables time compression or time expansion. To keep successive windows consistent, the phases of the spectral lines must be rotated according to an estimate of their instantaneous frequencies. Time-domain time-scale modification is similar but overlaps and adds signal segments directly in the time domain. Frequency-domain time-scale modification is generally believed to provide higher quality for polyphonic sounds than time-domain time-scale modification, which is believed to be more suitable for narrow-band signals such as voice. This advantage for polyphonic sounds is achieved at the expense of higher computational cost.
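As a rough illustration of the frequency-domain approach described above, the following Python sketch (assuming NumPy is available) analyzes the signal with one window spacing, estimates an instantaneous frequency for each spectral line, rotates the phases accordingly, and overlap-adds the frames at a different spacing. The function name `phase_vocoder`, the parameters `stretch`, `n_fft` and `hop_s`, the Hann window, and the squared-window normalization are illustrative assumptions and are not taken from the text.

```python
import numpy as np

def phase_vocoder(x, stretch, n_fft=1024, hop_s=256):
    """Sketch of frequency-domain TSM; stretch > 1 lengthens, < 1 shortens.

    All names and defaults here are hypothetical choices for illustration.
    """
    if len(x) < n_fft:
        raise ValueError("signal must be at least n_fft samples long")
    hop_a = max(1, int(round(hop_s / stretch)))       # analysis window spacing
    win = np.hanning(n_fft)
    # Centre frequency of each spectral line, in radians per sample.
    omega = 2 * np.pi * np.arange(n_fft // 2 + 1) / n_fft
    n_frames = 1 + (len(x) - n_fft) // hop_a
    y = np.zeros(n_fft + hop_s * (n_frames - 1))
    norm = np.zeros_like(y)
    prev_phase = None
    synth_phase = None
    for m in range(n_frames):
        frame = x[m * hop_a : m * hop_a + n_fft] * win
        spec = np.fft.rfft(frame)
        mag, phase = np.abs(spec), np.angle(spec)
        if m == 0:
            synth_phase = phase.copy()                # first frame is copied as-is
        else:
            # Deviation from the phase advance expected at the bin centre,
            # wrapped to [-pi, pi).
            dphi = phase - prev_phase - omega * hop_a
            dphi = (dphi + np.pi) % (2 * np.pi) - np.pi
            inst_freq = omega + dphi / hop_a          # instantaneous frequency estimate
            # Rotate each phase by the advance over the synthesis spacing.
            synth_phase = synth_phase + inst_freq * hop_s
        prev_phase = phase
        out = np.fft.irfft(mag * np.exp(1j * synth_phase), n=n_fft) * win
        start = m * hop_s
        y[start : start + n_fft] += out               # overlap-add at synthesis spacing
        norm[start : start + n_fft] += win ** 2
    # Compensate for the overlapping analysis and synthesis windows.
    return y / np.maximum(norm, 1e-8)
```

Under these assumptions, a call such as `phase_vocoder(x, 2.0)` would be expected to return a signal roughly twice as long as `x` at the original pitch. No phase locking is applied in this sketch, so it would be subject to the artifacts discussed in the next paragraph.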
Frequency-domain time-scale modification produces some characteristic artifacts in the reconstructed sound. These include reverberation and loss of sound presence: a speaker may appear farther from the microphone in the reconstructed sound than in the original audio. Some of these artifacts are believed to be introduced by a lack of phase coherence between neighboring spectral lines. The quality of frequency-domain time-scale modification can be significantly improved by repairing this phase incoherence, a technique called phase locking. A common technique seeks local spectral peaks, partitions the spectrum into regions dominated by these peaks, and then locks the phases of the spectral lines in each region to the phase of the peak. The locked phases are forced to keep the same relation to the peak phase that they had in the input spectrum before phase rotation. In rigid phase locking this relation is fixed. In scaled phase locking this relation is scaled by a proportionality factor. These methods generally eliminate reverberation but introduce additional artifacts that make the resultant sound seem artificial or synthetic. Some of this artificiality can be mitigated by control of the scaling factor, but the sound is still generally perceived as being of low overall quality.
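The peak-based locking step described above can be sketched in Python (assuming NumPy) as follows. The helper names `find_peaks` and `lock_phases`, the five-bin peak test, the midpoint rule for region boundaries, and the parameter `beta` are hypothetical choices made for illustration. In particular, scaling the wrapped phase difference by `beta` is only a simplified stand-in for scaled phase locking, whose exact formulation the text does not give.

```python
import numpy as np

def find_peaks(mag):
    """Bins that are the largest among their two neighbours on each side."""
    peaks = [k for k in range(2, len(mag) - 2)
             if mag[k] > 0 and mag[k] == np.max(mag[k - 2 : k + 3])]
    return np.array(peaks, dtype=int)

def lock_phases(spec, synth_phase, beta=1.0):
    """Lock the phases of each spectral region to the phase of its peak.

    spec        : complex analysis spectrum of the current frame
    synth_phase : phases after rotation (only the peak values are used)
    beta        : 1.0 keeps the analysis phase relation unchanged (rigid);
                  other values scale it (a simplified, assumed form)
    """
    mag, ana_phase = np.abs(spec), np.angle(spec)
    peaks = find_peaks(mag)
    if len(peaks) == 0:
        return np.array(synth_phase, dtype=float)
    # Assumed partition: each region ends midway between adjacent peaks.
    mids = (peaks[:-1] + peaks[1:] + 1) // 2
    bounds = np.concatenate(([0], mids, [len(mag)]))
    locked = np.empty(len(mag), dtype=float)
    for p, lo, hi in zip(peaks, bounds[:-1], bounds[1:]):
        # Phase of each line relative to its peak in the input spectrum,
        # wrapped to [-pi, pi).
        rel = ana_phase[lo:hi] - ana_phase[p]
        rel = (rel + np.pi) % (2 * np.pi) - np.pi
        # Re-impose that relation around the rotated phase of the peak.
        locked[lo:hi] = synth_phase[p] + beta * rel
    return locked
```

In a phase vocoder loop, a function of this kind would typically be applied to the rotated phases of each frame before the inverse transform, so that only the peak phases come from the instantaneous-frequency estimate and all other lines follow their peak.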