Digital multimedia content is pervasive for both entertainment and work purposes. For entertainment and personal use, the proliferation of the Internet makes it possible for users to easily download digital music or music videos from the Internet and play them on their personal computers. For work use, many corporations make their internal training videos and other work-related content available on intranets. Thus, the volume of content available to a user is tremendous.
The volume of content can at times be overwhelming to a user. Often, the user will desire to consume the content at a speed different from the speed at which the content was created. As an analogy, a person may read text at different rates depending on the situation. For example, when reading a deep technical article, a person typically reads more slowly than when merely skimming a magazine. Moreover, reading rates differ between people.
Just as text is read at different reading rates, it is desirable to provide a user with the ability to vary the playback speed of a digital audio signal. In other words, a user can have the ability to speed up or slow down audio content based on the user's preferences. For example, it is desirable for a user to be able to slow down the playback speed of a digital audio signal when trying to transcribe the lyrics of a song or take notes on a training video. Alternatively, a user may want to speed up the slow sections of a presentation.
One of the simplest techniques for achieving variable speed playback is to play the audio signal at a sampling rate different from the rate at which it was captured. For example, an audio signal that was sampled at 16 kHz and is played back at 32 kHz achieves a factor of two (2×) speed-up. One problem with this technique, however, is that the pitch of the audio signal is distorted. A chipmunk-like effect is created when speeding up the signal, due to the increased pitch of the audio. Conversely, the pitch is lowered when slowing down the audio signal.
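The arithmetic behind this effect can be sketched in a few lines of Python. This is only an illustration: the function name `resampled_playback` and the 440 Hz test tone are assumptions introduced here, not part of the technique described above.

```python
def resampled_playback(num_samples, capture_rate_hz, playback_rate_hz, tone_hz):
    """Return (speed_factor, playback_seconds, perceived_tone_hz).

    Hypothetical helper: models what happens when samples captured at one
    rate are simply clocked out at another rate, with no other processing.
    """
    # Playing the same samples faster shortens the clip by this factor.
    speed_factor = playback_rate_hz / capture_rate_hz
    # Duration is the sample count divided by the playback clock.
    playback_seconds = num_samples / playback_rate_hz
    # Every frequency in the signal is scaled by the same factor,
    # which is the source of the "chipmunk" effect.
    perceived_tone_hz = tone_hz * speed_factor
    return speed_factor, playback_seconds, perceived_tone_hz


# One second of an (assumed) 440 Hz tone captured at 16 kHz, played at 32 kHz.
speed, seconds, pitch = resampled_playback(16000, 16000, 32000, 440.0)
print(speed, seconds, pitch)  # 2.0 0.5 880.0
```

The doubled frequency corresponds to a pitch shift of one octave, which is why the speed change and the pitch change cannot be separated with this approach.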
An improvement on the above technique is pitch-invariant variable speed playback. Pitch-invariant variable speed audio playback techniques change the playback speed of audio content without causing the pitch to change. The most basic of such techniques divides the audio into short frames, discards a portion of each frame, and connects the remaining portions. A frame is a group of consecutive audio samples of fixed length (such as 100 ms). For example, dropping 33 ms of each 100 ms frame yields approximately 1.5× compression. The remaining samples then are abutted. One problem with this basic technique is that it produces artifacts (such as audible “clicks”) and other forms of signal distortion. These artifacts and signal distortions are caused by discontinuities at the boundaries where samples are discarded and the remnants are abutted.
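A minimal sketch of the discard-and-abut approach follows. The function name `drop_and_abut` and the 16 kHz sampling rate used in the usage lines are assumptions made for illustration only.

```python
def drop_and_abut(samples, frame_len, drop_len):
    """Keep the first part of every frame, drop the rest, and abut the parts.

    Hypothetical helper: frame_len and drop_len are counts of samples.
    """
    if not 0 <= drop_len < frame_len:
        raise ValueError("drop_len must be in [0, frame_len)")
    keep_len = frame_len - drop_len
    output = []
    for start in range(0, len(samples), frame_len):
        # Retained samples are appended directly after the previous ones,
        # so the waveform generally jumps at each junction.
        output.extend(samples[start:start + keep_len])
    return output


# Assumed 16 kHz audio: a 100 ms frame is 1600 samples, 33 ms is 528 samples.
one_second = list(range(16000))           # placeholder sample values
shortened = drop_and_abut(one_second, 1600, 528)
print(len(one_second) / len(shortened))   # roughly 1.49
```

Because nothing aligns the waveform on either side of a junction, the sample values there can differ sharply, which is the discontinuity heard as a click.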
Instead of abutting the remaining samples, a technique called Overlap Add (OLA) overlaps the two frames at their junction and applies a windowing function or smoothing filter (such as a cross-fade) to the transition. OLA largely eliminates clicks in the output signal, but reverberations sometimes can still be heard.
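The overlap-and-cross-fade idea can be sketched as follows. This is a simplified illustration using a linear cross-fade; the function names `crossfade_join` and `ola_speedup` and their parameters are assumptions introduced here, and practical OLA implementations may use other window shapes.

```python
def crossfade_join(left, right, overlap):
    """Join two sample lists, linearly cross-fading over `overlap` samples."""
    if overlap < 1 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must fit inside both segments")
    split = len(left) - overlap
    blended = []
    for i in range(overlap):
        # Weight on the incoming segment rises across the overlap region.
        w = (i + 1) / (overlap + 1)
        blended.append((1.0 - w) * left[split + i] + w * right[i])
    return left[:split] + blended + right[overlap:]


def ola_speedup(samples, frame_len, keep_len, overlap):
    """Keep keep_len + overlap samples per frame and cross-fade the junctions."""
    output = []
    for start in range(0, len(samples), frame_len):
        segment = list(samples[start:start + keep_len + overlap])
        if not output:
            output = segment
        elif len(segment) >= overlap:
            output = crossfade_join(output, segment, overlap)
        else:
            break  # trailing fragment too short to overlap; discard it
    return output


# A constant signal should pass through unchanged apart from being shorter.
flat = ola_speedup([1.0] * 100, frame_len=10, keep_len=6, overlap=2)
print(len(flat))
```

Because the two weights always sum to one, the cross-fade removes the abrupt jump at each junction. It does not, however, align the waveforms being blended, which is consistent with the residual reverberation noted above.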
An improvement to the OLA technique is the Synchronized OLA (SOLA) technique. The SOLA technique shifts the beginning of a new audio frame over the end of the preceding frame to find the point of highest waveform similarity. This point is found by a cross-correlation computation. Once this point is found, the frames are overlapped, as in the OLA technique. The SOLA technique provides a locally optimal match between successive frames and mitigates the reverberations sometimes introduced by the OLA technique. Nevertheless, some artifacts still are noticeable when using the SOLA technique, especially at larger playback speed variations.
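The similarity search can be sketched as below. This is a simplified variant, stated as an assumption: it searches for the best offset inside the incoming frame using a normalized cross-correlation, rather than sliding the frame over the output, and then applies a linear cross-fade. The names `best_shift` and `sola_join` are hypothetical.

```python
import math


def best_shift(tail, frame, max_shift):
    """Return the offset into `frame` that best matches `tail`.

    Similarity is a normalized cross-correlation over len(tail) samples.
    """
    n = len(tail)
    tail_energy = sum(a * a for a in tail)
    best, best_score = 0, -math.inf
    for shift in range(max_shift + 1):
        window = frame[shift:shift + n]
        if len(window) < n:
            break  # not enough samples left in the frame
        dot = sum(a * b for a, b in zip(tail, window))
        norm = math.sqrt(tail_energy * sum(b * b for b in window))
        score = dot / norm if norm > 0.0 else 0.0
        if score > best_score:
            best, best_score = shift, score
    return best


def sola_join(output, frame, overlap, max_shift):
    """Align `frame` to the end of `output`, then cross-fade the overlap."""
    shift = best_shift(output[-overlap:], frame, max_shift)
    aligned = frame[shift:]
    split = len(output) - overlap
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight on the incoming frame
        blended.append((1.0 - w) * output[split + i] + w * aligned[i])
    return output[:split] + blended + aligned[overlap:]


# A sine with a 20-sample period; the new frame starts 5 samples "too early".
wave = [math.sin(2.0 * math.pi * k / 20.0) for k in range(52)]
joined = sola_join(wave[:25], wave[12:52], overlap=8, max_shift=19)
print(len(joined))
```

In this periodic example the search recovers the offset at which the two segments coincide, so the cross-fade blends nearly identical samples. Real audio is rarely this regular, which is one reason artifacts can remain at larger speed changes.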