1. Field of the Invention
This invention relates to the way a digital audio and video data stream is decoded and played back to the user of a display system. It is applicable to any data stream, whether the data stream is received from a communications channel, or from a storage device such as an optical disk player. It is particularly useful in multimedia applications.
2. Prior Art
Currently, all simultaneous audio/video (A/V) playback is accomplished at essentially the recorded speed. It is well known in the art how to speed up and slow down video, with the audio portion of a presentation blanked out. This is done in video disk players and video cassette recorders routinely. Since the video is encoded on a frame-by-frame basis, the rate of frame display is slowed down, and each frame is displayed on a display device for an extended period, each period extending over multiple refreshes of the display device. The audio in this situation must be blanked out because it would be distorted beyond recognition by pitch changes.
It is also well known in the art how to speed up and slow down audio by itself without significant distortion. The technique most commonly used is Time Domain Harmonic Scaling, or TDHS. In TDHS, a stream of audio is divided into pitch periods. The pitch periods are small enough so that there is a high degree of pitch similarity between adjacent intervals. When the audio stream is played back, pitch periods are added or deleted as many times as needed to produce the desired playback rate, with little perceptible distortion in the audio pitch. For a given desired speech rate C, defined as the ratio between the input signal length and the output signal length, a period of time T is defined in which the TDHS process is done once. If the audio is digitally encoded, T is also the time that it takes to play back an audio frame, where an audio frame consists of the samples collected in a fixed period of time, typically 1/30th of a second.
For expansion of the audio, an input signal of length T will produce an output signal of length T+P, where P is the pitch period. If T is given in P units, for C&lt;1.0:

C = T/(T+1)

and so:

T = C/(1-C)
Similarly, for audio compression (faster playback), C&gt;1.0, and therefore:

T = C/(C-1)
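These relations can be illustrated with a short computation (a sketch; the function and variable names are ours, and T is computed here in samples rather than in P units):

```python
def tdhs_interval(c, pitch_period):
    """Return the interval T (in samples) at which one TDHS step is
    performed, for speech rate c (input length / output length) and a
    pitch period given in samples."""
    if c < 1.0:
        # Expansion: C = T/(T+P)  =>  T = C*P/(1-C)
        return c * pitch_period / (1.0 - c)
    elif c > 1.0:
        # Compression: C = T/(T-P)  =>  T = C*P/(C-1)
        return c * pitch_period / (c - 1.0)
    else:
        return float('inf')  # C = 1.0: no scaling is needed

# Example: playing back at double speed (C = 2.0) with an 80-sample
# pitch period removes one pitch period every T = 160 input samples.
```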
Every T, a weighted average window is defined on two input segments residing one pitch period apart. The output signal is defined by the following formula:

S(t0+t) = S(t0+t)W(t) + S(t0+t+P)[1-W(t)]

where t0 marks the start of the window and W(t) is a weighting function that decreases from 1 to 0 over one pitch period.
The one pitch-length output segment is either added to the signal in between the two adjacent segments (for expansion) or replaces the two segments, effectively replacing two segments with one (for compression).
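One step of the process just described can be sketched as follows (a simplified illustration assuming a known, fixed pitch period given in samples; practical TDHS estimates the pitch adaptively, and a linear window stands in for W(t)):

```python
def tdhs_compress_step(signal, t0, pitch):
    """Replace the two pitch-length segments of `signal` starting at t0
    with one cross-faded segment, shortening it by one pitch period."""
    merged = []
    for t in range(pitch):
        w = 1.0 - t / pitch                   # W(t): falls from 1 toward 0
        a = signal[t0 + t]                    # S(t0 + t)
        b = signal[t0 + t + pitch]            # S(t0 + t + P)
        merged.append(a * w + b * (1.0 - w))  # weighted average of the two
    return signal[:t0] + merged + signal[t0 + 2 * pitch:]

def tdhs_expand_step(signal, t0, pitch):
    """Insert one cross-faded pitch-length segment between the two
    segments at t0, lengthening the signal by one pitch period."""
    inserted = []
    for t in range(pitch):
        w = 1.0 - t / pitch
        a = signal[t0 + t]
        b = signal[t0 + t + pitch]
        inserted.append(a * w + b * (1.0 - w))
    return signal[:t0 + pitch] + inserted + signal[t0 + pitch:]
```

Each compression step shortens the stream by exactly one pitch period and each expansion step lengthens it by one; repeating the step every T samples yields the overall rate C.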
FIG. 5A is a waveform diagram illustrating the compression process and FIG. 5B is a waveform diagram illustrating the expansion process. The transient period of the window is rather short to keep the compressed or expanded signal as close to the original signal as possible. However, the period must be long enough to eliminate discontinuities.
Time Domain Harmonic Scaling is explained in detail in the article "Time-Domain Algorithms for Harmonic Bandwidth Reduction and Time Scaling of Speech Signals," by D. Malah, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-27, pp. 121-133, 1979, which is incorporated herein by reference. Information on Time Domain Harmonic Scaling is also contained in U.S. Pat. No. 4,890,325 to Taniguchi et al. which is incorporated herein by reference.
The techniques described above are generally applicable to digital or analog systems. In analog A/V systems which operate only at recorded speeds, audio and video are synchronized because they are physically recorded together. In digital systems a master time clock is involved. The video and audio are digitized separately and then multiplexed together. Usually, the video and audio data streams are also independently compressed before they are combined, although it is possible to multiplex together uncompressed digital audio and video and compress the final digital signal later.
During playback in digital A/V systems, audio and video decoders require timing information. Where the audio and video streams are compressed, the decoder decompresses them and clocks each frame out to the next stage for playback using the timing information. If the streams are uncompressed, the decoders simply use the timing information to control audio and video buffers and send the frames to the next stage at the appropriate rate. In any case, the decoders must maintain synchronization between the audio and video within one video frame interval (usually 1/30th second) in order to ensure that a user perceives a synchronized A/V presentation.
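The synchronization constraint can be sketched as a simple check of presentation timestamps against a one-frame tolerance (illustrative only; the names and the resynchronization policy are our assumptions, not those of any particular decoder):

```python
FRAME_INTERVAL = 1.0 / 30.0  # one video frame interval, in seconds

def in_sync(audio_pts, video_pts):
    """Audio and video presentation timestamps (in seconds) are considered
    synchronized if they differ by less than one video frame interval."""
    return abs(audio_pts - video_pts) < FRAME_INTERVAL

def next_action(audio_pts, video_pts):
    """Decide what the video stage should do relative to the audio clock."""
    drift = video_pts - audio_pts
    if abs(drift) < FRAME_INTERVAL:
        return "present"   # within tolerance: play the frame as scheduled
    elif drift < 0:
        return "skip"      # video behind audio: drop a frame to catch up
    else:
        return "repeat"    # video ahead of audio: hold the current frame
```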
One well-known standard for synchronized recording and playback of compressed digital audio and video data streams is the so-called "MPEG" (Motion Picture Experts Group) standard. The latest version of the MPEG standard is published as ISO Committee Draft 11172-2, "Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbit/s," November, 1991, and is incorporated herein by reference.
As can be seen from the above discussion, the prior art includes systems for variable speed playback of audio alone, variable speed playback of video alone, and a way of recording and playing back compressed, synchronized digital audio and video data. What is needed is a system which uses all of these techniques to provide a way for a user who is playing back a digital A/V presentation to vary the speed of presentation and be presented with synchronized, high quality audio and video from a digital source. This would allow the user to cue the information based on either the audio or the video content, or both, and to slow down or speed up the rate of presentation and still perceive both the audio and the video.