Conventional methods for time scaling digital audio signals fall into two broad categories: time-domain methods and frequency-domain methods. A sound waveform, especially that of a speech signal, generally exhibits local repetition of a certain shape. Each of these repeated waveforms has an almost identical spectrum and, thus, sounds very similar. Accordingly, such repetitions may be added or dropped without perceptibly changing the sound. This is generally the theoretical basis for time-domain time scaling processes. For example, such processes could identify two splicing points, between which the samples are dropped to compress the time scale or repeated to stretch it. The optimal splicing points have to be found jointly, because changing one point may lead to a different optimal location for the other point. The difficulty is that the number of possible combinations of two splicing points is often very large. Accordingly, exhaustive searches are not feasible for real-time processing because of their prohibitively high computational cost.
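The two-splicing-point idea described above can be illustrated with a minimal sketch. It is an assumption-laden illustration rather than any particular conventional implementation: the function names, the squared-difference cost, and the parameters `min_gap`, `max_gap`, and `win` are all hypothetical choices made for this example.

```python
import math

# Hypothetical helper names and cost function, chosen for illustration only.

def splice_cost(x, a, b, win):
    """Sum of squared differences between the windows starting at a and at b."""
    return sum((x[a + k] - x[b + k]) ** 2 for k in range(win))

def find_splice_points(x, min_gap, max_gap, win):
    """Exhaustively try every pair (a, b) with min_gap <= b - a <= max_gap.

    Returns the pair whose following windows match best, so that the
    waveform continues smoothly across the splice.
    """
    best = None
    for a in range(len(x) - win - min_gap + 1):
        last_b = min(a + max_gap, len(x) - win)
        for b in range(a + min_gap, last_b + 1):
            cost = splice_cost(x, a, b, win)
            if best is None or cost < best[0]:
                best = (cost, a, b)
    return best[1], best[2]

def compress(x, a, b):
    """Drop the samples in [a, b) to shorten the signal."""
    return x[:a] + x[b:]

def stretch(x, a, b):
    """Repeat the samples in [a, b) to lengthen the signal."""
    return x[:b] + x[a:]

# A synthetic tone with a period of 20 samples stands in for a voiced sound.
signal = [math.sin(2 * math.pi * n / 20) for n in range(200)]
a, b = find_splice_points(signal, min_gap=10, max_gap=30, win=16)
shorter = compress(signal, a, b)
longer = stretch(signal, a, b)
```

The nested loops evaluate the cost for every candidate pair, so the work grows with the number of start positions, the width of the allowed gap, and the window length. On real audio buffers this is the kind of cost that makes an exhaustive joint search impractical in real time.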
Frequency-domain methods can work by interpolating or extrapolating frequency samples. Since the signal is often available as PCM samples in the time domain, conventional frequency-domain methods first window the time-domain signal with a smooth window such as, for example, a raised cosine window. The windowed time-domain signal is then transformed into a frequency-domain representation by a discrete Fourier transform (DFT), typically computed with a fast Fourier transform (FFT) for speed. The desired frequency samples (according to the corresponding desired time scaling factors) are then derived from the computed frequency samples through interpolation or extrapolation, in which both magnitude and phase are handled.
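The windowing, transform, and interpolation steps described above can be sketched as follows. This is only a rough illustration under stated assumptions: the function names are hypothetical, the DFT is computed naively rather than with an FFT, and linear interpolation of magnitude and wrapped phase is one simple choice among many, not necessarily what any conventional method uses.

```python
import cmath
import math

# Hypothetical helper names; the interpolation rule is an assumed, simple choice.

def raised_cosine_window(n):
    """Periodic raised cosine (Hann) window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / n) for k in range(n)]

def dft(x):
    """Naive O(N^2) discrete Fourier transform; an FFT yields the same values faster."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def interpolate_bin(spectrum, pos):
    """Linearly interpolate magnitude and phase at a fractional bin position."""
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(spectrum) - 1)
    frac = pos - lo
    mag = (1 - frac) * abs(spectrum[lo]) + frac * abs(spectrum[hi])
    p0 = cmath.phase(spectrum[lo])
    p1 = cmath.phase(spectrum[hi])
    # Use the shortest angular path so the phase does not jump by 2*pi.
    dp = (p1 - p0 + math.pi) % (2 * math.pi) - math.pi
    return cmath.rect(mag, p0 + frac * dp)

# One frame of a tone centered on bin 4 of a 32-point transform.
n = 32
frame = [math.cos(2 * math.pi * 4 * t / n) for t in range(n)]
window = raised_cosine_window(n)
spectrum = dft([w * s for w, s in zip(window, frame)])
halfway = interpolate_bin(spectrum, 4.5)  # frequency sample between bins 4 and 5
```

Magnitude and phase are interpolated separately here because the text notes that both must be handled. Interpolating the raw complex values instead would be another assumption, and it can cancel energy when neighboring bins have opposing phase.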