Time scaling (e.g., time compression or expansion) of a digital audio signal changes the play rate of a recorded audio signal without altering the perceived pitch of the audio. Accordingly, a listener using a presentation system having time scaling capabilities can speed up the audio to more quickly receive information or slow down the audio to more slowly receive information, while the time scaling preserves the pitch of the original audio to make the information easier to listen to and understand. Ideally, a presentation system with time scaling capabilities should give the listener control of the play rate or time scale of a presentation so that the listener can select a rate that corresponds to the complexity of the information being presented and the amount of attention that the listener is devoting to the presentation.
FIG. 1A illustrates representations of a stereo audio signal using stereo audio data 100 and time-scaled stereo audio data 110. Stereo audio data 100 includes left input data 100L representing the left audio channel of the stereo audio and right input data 100R representing the right audio channel of the stereo audio. Similarly, time-scaled stereo audio data 110, which is generated from stereo audio data 100, includes left time-scaled audio data 110L and right time-scaled audio data 110R.
A conventional time scaling process for the stereo audio performs independent time scaling of the left and right channels. For the time scaling process, the samples of the left audio signal in left audio data 100L are partitioned into input frames IL1 to ILX, and the samples of the right audio signal in right audio data 100R are partitioned into input frames IR1 to IRX. The time scaling process generates left time-scaled output frames OL1 to OLX and right time-scaled output frames OR1 to ORX that respectively contain samples for the left and right channels of a time-scaled stereo audio signal. Generally, the ratio of the number m of samples in an input frame to the number n of samples in the corresponding output frame is equal to the time scale used in the time scaling process, and for a time scale greater than one, the time-scaled output frames OL1 to OLX and OR1 to ORX contain fewer samples than do the respective input frames IL1 to ILX and IR1 to IRX. For a time scale less than one, the time-scaled output frames OL1 to OLX and OR1 to ORX contain more samples than do the respective input frames IL1 to ILX and IR1 to IRX.
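The relationship between input frame size m, output frame size n, and the time scale can be illustrated with a short calculation. The following Python sketch is for illustration only; the function name output_frame_length and the rounding to the nearest whole sample are assumptions that do not appear in the description above.

```python
def output_frame_length(m, time_scale):
    """Return n, the number of samples in an output frame.

    The description states that time_scale = m / n, so n = m / time_scale.
    Rounding to a whole number of samples is an assumption of this sketch.
    """
    if m <= 0:
        raise ValueError("input frame length m must be positive")
    if time_scale <= 0:
        raise ValueError("time scale must be positive")
    return round(m / time_scale)


if __name__ == "__main__":
    m = 440  # hypothetical input frame size in samples
    for scale in (0.5, 1.0, 2.0):
        n = output_frame_length(m, scale)
        print(f"time scale {scale}: {m} input samples -> {n} output samples")
```

With an input frame of 440 samples, a time scale of 2.0 yields an output frame of 220 samples (time compression), and a time scale of 0.5 yields an output frame of 880 samples (time expansion).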
Some time scaling processes use time offsets that indicate portions of the input audio that are overlapped and combined to reduce or expand the number of samples in the output time-scaled audio data. For good sound quality when combining samples, this type of time scaling process typically searches for a matching block of samples, shifts one of the blocks in time to overlap the matching block, and then combines the matching blocks of samples. Such time scaling processes can be independently applied to the left and right channels of a stereo audio signal. As illustrated in FIG. 1B, for example, time offsets ΔTLi and ΔTRi from the beginnings of respective left and right buffers 120L and 120R uniquely identify blocks 125L and 125R best matching input frames ILi and IRi, respectively. Each best match block 125L or 125R can be arithmetically combined with the corresponding input frame ILi or IRi to generate modified samples for the output time-scaled data.
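The search-and-combine operation described above can be sketched for a single channel as follows. This Python sketch is illustrative only and rests on assumptions not stated in the description: the match criterion is taken to be the minimum sum of absolute differences, the arithmetic combination is taken to be a linear cross-fade, and the names find_best_offset and crossfade are hypothetical.

```python
def find_best_offset(buffer, frame):
    """Return the offset from the start of buffer of the block that best
    matches frame. The match criterion (minimum sum of absolute
    differences) is an assumption of this sketch."""
    m = len(frame)
    if m == 0 or len(buffer) < m:
        raise ValueError("buffer must hold at least one full frame")
    best_offset, best_cost = 0, float("inf")
    for offset in range(len(buffer) - m + 1):
        cost = sum(abs(buffer[offset + k] - frame[k]) for k in range(m))
        if cost < best_cost:
            best_offset, best_cost = offset, cost
    return best_offset


def crossfade(block, frame):
    """Combine a best-match block with an input frame of equal length.
    The linear cross-fade weights are an assumption of this sketch."""
    m = len(frame)
    if len(block) != m:
        raise ValueError("block and frame must have the same length")
    combined = []
    for k in range(m):
        w = k / (m - 1) if m > 1 else 1.0  # weight ramps from 0 to 1
        combined.append((1.0 - w) * block[k] + w * frame[k])
    return combined


if __name__ == "__main__":
    buffer = [0, 0, 1, 2, 3, 0, 0]   # hypothetical buffer contents
    frame = [1, 2, 3]                # hypothetical input frame
    offset = find_best_offset(buffer, frame)
    block = buffer[offset:offset + len(frame)]
    print(offset, crossfade(block, frame))
```

In this sketch the returned offset plays the role of a time offset such as ΔTLi or ΔTRi, measured in samples from the beginning of the buffer.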
As illustrated in FIG. 1B, time offsets ΔTLi and ΔTRi corresponding to the same frame number (i.e., the same time interval in the input stereo audio) can differ from each other because the offsets are determined independently for left and right audio data 100L and 100R. Generally, the difference in the time offsets for the left and right channels varies so that offset ΔTLi is shorter than offset ΔTRi for some frames (i.e., some values of frame index i) and offset ΔTRi is shorter than offset ΔTLi for other frames (i.e., other values of frame index i).
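The frame-to-frame variation in the difference between left and right offsets can be illustrated with a small example. The Python sketch below is illustrative only: the sample values are invented, the match criterion (minimum sum of absolute differences) is an assumption, and the names find_best_offset and offset_differences are hypothetical.

```python
def find_best_offset(buffer, frame):
    """Offset of the block in buffer that best matches frame, using a
    minimum sum of absolute differences (an assumed criterion)."""
    m = len(frame)
    best_offset, best_cost = 0, float("inf")
    for offset in range(len(buffer) - m + 1):
        cost = sum(abs(buffer[offset + k] - frame[k]) for k in range(m))
        if cost < best_cost:
            best_offset, best_cost = offset, cost
    return best_offset


def offset_differences(left, right):
    """For each frame index i, search the left and right channels
    independently and return the left offset minus the right offset.

    left and right are lists of (buffer, frame) pairs, one per frame.
    """
    diffs = []
    for (l_buf, l_frame), (r_buf, r_frame) in zip(left, right):
        d_left = find_best_offset(l_buf, l_frame)    # stands in for ΔTLi
        d_right = find_best_offset(r_buf, r_frame)   # stands in for ΔTRi
        diffs.append(d_left - d_right)
    return diffs


if __name__ == "__main__":
    # Invented data: the best match sits at a different position in each
    # channel, and the relative position changes from frame to frame.
    left = [
        ([0, 5, 9, 5, 0, 0, 0, 0], [5, 9, 5]),
        ([0, 0, 0, 0, 4, 8, 4, 0], [4, 8, 4]),
    ]
    right = [
        ([0, 0, 0, 5, 9, 5, 0, 0], [5, 9, 5]),
        ([0, 4, 8, 4, 0, 0, 0, 0], [4, 8, 4]),
    ]
    print(offset_differences(left, right))
```

For this invented data the left offset is shorter than the right offset for the first frame and longer for the second, so the difference changes sign between frames.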
For stereo audio generally, when matching sounds from the same source are played through left and right speakers, a listener perceives a small difference in timing of the matching sounds as a single sound emanating from a location between the left and right speakers. If the timing difference changes, the location of the source of the sound appears to move. In time-scaled stereo audio data, an artifact of the variations in offsets ΔTLi and ΔTRi with frame index i is an apparent oscillation or variation in the position of the source of audio being played. Similarly, variations in the offsets ΔTLi and ΔTRi can cause timing variations in the related sounds in different channels such as different instruments played through different channels. These artifacts annoy some listeners, and systems and methods for avoiding the variations in the apparent position of a sound source in a time-scaled stereo audio signal are sought.