The era of silent pictures was ushered out in the late 1920s with the arrival of motion pictures with sound. For viewers to enjoy these motion pictures, the video and sound tracks needed to be synchronized. In other words, when lips are seen to move on the screen and speech is heard, a viewer expects the two to match. In an ideal world, the video and audio match perfectly. However, the world is not ideal and, therefore, we desire to find ways to optimize the synchronization of video and audio in order to meet the viewer's expectations.
Generally, video and audio need to match to an accuracy of not much worse than 1/20 of a second in order to be acceptable for the viewer. Accuracy better than 1/60 of a second is nearly impossible on television because new pictures are displayed at that frequency and there is no way to show any movement of the lips until the next new picture. Accuracy worse than 1/10 of a second is usually noticeable by the viewer and accuracy of worse than 1/5 of a second is almost always noticeable.
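The accuracy figures above can be summarized in a short sketch. The function name `classify_av_offset` and the label strings are hypothetical and chosen only for illustration; only the four thresholds (1/60, 1/20, 1/10, and 1/5 of a second) come from the discussion above.

```python
# Thresholds from the discussion above, in seconds.
FRAME_PERIOD = 1 / 60              # television picture period; finer accuracy cannot be shown
ACCEPTABLE = 1 / 20                # roughly the limit of acceptable accuracy
USUALLY_NOTICEABLE = 1 / 10        # worse than this is usually noticed
ALMOST_ALWAYS_NOTICEABLE = 1 / 5   # worse than this is almost always noticed

def classify_av_offset(offset_seconds):
    """Map an audio/video offset (either sign) to an illustrative label."""
    error = abs(offset_seconds)
    if error <= FRAME_PERIOD:
        return "best achievable"
    if error <= ACCEPTABLE:
        return "acceptable"
    if error <= USUALLY_NOTICEABLE:
        return "borderline"
    if error <= ALMOST_ALWAYS_NOTICEABLE:
        return "usually noticeable"
    return "almost always noticeable"
```

The "borderline" band between 1/20 and 1/10 of a second is an interpretation of the phrase "not much worse than 1/20 of a second" rather than a figure stated in the text.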
Maintaining synchronization is generally not very difficult when the video data and the audio data are integrated and played using a single video/audio source. For example, a conventional video cassette recorder reads and plays both the video and audio tracks of a tape in a single integrated process. This process maintains synchronization of the video and audio tracks. In other words, when the tape is advanced, the video information is read and displayed on the screen at the same time the audio information is read and played through the speaker. This single sequence paradigm is shattered in the realm of digital video.
In digital video, audio data and video data can be separated and independently decoded, processed, and played. Furthermore, many computer users desire to view digital video while performing some other task or function within the computer, such as sending or receiving information from a computer network. The ability to independently perform these multimedia tasks while simultaneously performing other computer functions can be useful and typically requires a multitasking or multithreaded computing environment.
However, this ability also introduces additional multimedia synchronization problems. In particular, the mere separation of video data and audio data and their independent decoding/processing/playing functions makes it easier to get the video data and the audio data out of synchronization. This is analogous to separating two finely-toothed mechanical gears, independently manipulating each gear, and bringing them back together again in the hope that they will instantly mesh together. Thus, in both the multimedia data processing situation and the mechanical gears situation, one can easily envision problems putting the separated components back together.
Video compression techniques, such as a digital video compression standard established by the Moving Picture Experts Group (MPEG) under the International Organization for Standardization (ISO), allow large amounts of multimedia data to be stored within relatively small amounts of memory. This has been extremely useful in efforts to reduce the storage and transmission requirements of digital video where storage and bandwidth are at a premium. However, the use of such compression techniques requires the multimedia data to be decoded before it can be played. This is often a compute-intensive task. Furthermore, in multitasking or multithreaded computing environments, competing processes may steal away processing cycles of the central processor. As a result, the time needed to read, decode, process, and play the multimedia data will vary, so that the ability to synchronously present the multimedia data to the computer user becomes impaired. In summary, maintaining synchronization of audio data and video data can be problematic.
There are several ways to attempt to solve this problem. The speed of the audio data can be altered to match that of the video data. However, altering the speed of audio is difficult. Most current audio hardware does not support simple alterations in the rate for playing audio. Moreover, where altering the audio rate is possible, existing strategies for doing so also cause alterations to the sound which are typically unpleasant to the viewer (e.g., wavering alterations in musical pitch, dropping of meaningful consonants or syllables from speech, etc.). For this reason, the audio is generally taken as defining the standard of time and the video is made to keep pace with it.
Another way to solve this problem uses a brute force approach of merely increasing the performance of the hardware. If the computer system has a performance level that is high enough to keep pace with the compute-intensive decoding and playing of both audio data and video data at all times, synchronization of the audio and video can be maintained. Such a powerful computer system can finish decoding the video data and have time left before displaying the decoded video data at a due time synchronous to the due time of the audio data. This merely requires waiting for the right moment before displaying each frame of video data.
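The "wait for the right moment" behavior described above can be sketched as follows. This is a minimal illustration, not an implementation drawn from any particular system: the names `present_frames`, `audio_clock`, `show`, and `sleep` are hypothetical, and the sketch assumes that every frame has already been decoded before its due time.

```python
import time

def present_frames(frames, audio_clock, show, sleep=time.sleep):
    """Show each (due_time, frame) pair once the audio clock reaches due_time.

    frames      -- iterable of (due_time_seconds, decoded_frame) pairs
    audio_clock -- callable returning the current audio playback time, seconds
    show        -- callable that displays one decoded frame
    sleep       -- callable that waits the given number of seconds
    Assumes the machine is fast enough that frames are never decoded late.
    """
    for due_time, frame in frames:
        delay = due_time - audio_clock()
        if delay > 0:
            sleep(delay)   # decoded early: wait for the right moment
        show(frame)
```

Because the audio clock is passed in as a function, the video follows whatever time the audio defines, consistent with treating the audio as the standard of time.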
A technique of using a common software clock when playing an audio sequence of data synchronized to a video sequence of data can also be used to solve synchronization problems. This technique is the subject matter of PCT Patent Application No. WO 94/27234 entitled, "Multimedia Synchronization System," published on Nov. 24, 1994 (hereinafter the synchronization PCT application). In the synchronization PCT application, time-based audio and video sequences are described as being synchronized together where the video player is synchronized to the audio player. If the audio player speeds up, the video player follows by speeding up in a lockstep fashion.
However, merely using fast computer systems or common software clocks presupposes that the viewer has such a computer system and that there is always enough processing power to service both the audio player and the video player in time to present synchronous multimedia data to the viewer. In other words, if the computer system is not fast enough or some other competing process grabs the needed processing cycles, the computer system may still have problems maintaining synchronization.
Making the video play smoothly and fast enough is not trivial if the computer system is slow or under-powered, even without competing processes stealing precious compute cycles. Solving synchronization problems with under-powered computer systems has been attempted using inferior decoding methods and by simply dropping frames of video data altogether to maintain synchronization with the audio data. However, these solutions also pose problems for the viewer. When using an inferior decoding method, the video data is generally not completely decoded as a compromise for better performance. This typically results in a blurred or blocky displayed picture, which is less than desirable for the viewer. When merely dropping frames in an attempt to catch up and get back in synchronization with the audio data, the resulting picture viewed on the computer monitor is typically jerky in appearance. Either or both of these techniques are normally preferable to allowing the synchronization of audio and video to continue to drift off. However, the viewer is still stuck with either blurred video, a jerky appearance of the video, or both.
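The frame-dropping approach described above can be illustrated with a small simulation. This is a simplified sketch under stated assumptions: every frame is assumed to take the same fixed time to decode, dropping a frame is assumed to cost nothing, and the function name `play_with_frame_dropping` and its parameters are hypothetical. Times are in integer milliseconds, and the default tolerance of 50 ms corresponds to the 1/20 of a second figure given earlier.

```python
def play_with_frame_dropping(due_times_ms, decode_cost_ms, tolerance_ms=50):
    """Return the indices of the frames that are actually displayed.

    due_times_ms   -- presentation time of each frame on the audio clock
    decode_cost_ms -- assumed fixed time needed to decode one frame
    tolerance_ms   -- how late a frame may appear before it is dropped instead
    """
    clock_ms = 0                     # stands in for the audio clock
    displayed = []
    for index, due_ms in enumerate(due_times_ms):
        if clock_ms + decode_cost_ms > due_ms + tolerance_ms:
            continue                 # would finish too late: drop this frame
        clock_ms += decode_cost_ms   # spend the time decoding the frame
        clock_ms = max(clock_ms, due_ms)  # if early, wait until the frame is due
        displayed.append(index)
    return displayed
```

With frames due every 40 ms, a decode cost of 30 ms displays every frame, while a decode cost of 60 ms displays only some of them. The gaps in the returned indices are what the viewer perceives as jerky video.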
Additionally, where it takes a significant effort to decode the audio data, it is necessary to ensure that enough processor time is devoted to this audio process in the overall multimedia playing scheme in order to avoid audio breaks. Generally, the audio is decoded some time in advance so that there are typically a few seconds of decoded audio data held in an audio buffer, ready to be played by the sound system within the computer. If no further audio data is decoded for this length of time, then eventually the sound system runs out of decoded audio data in the buffer. As a result, the sound stops abruptly, right in the middle of whatever was playing, usually with a slight click or pop. When decoded audio data becomes available again within the audio buffer, the sound system resumes playing, again usually with a pop. Such pops and silences are intrusive, undesirable, and very unpleasant to the viewer.
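The audio break described above can be modeled as an underrun of the decoded-audio buffer. The following is a minimal sketch in which playback is divided into fixed ticks. The function name `count_audio_breaks`, the tick model, and the default values (2000 ms of initial buffer, 100 ms ticks) are assumptions made for illustration.

```python
def count_audio_breaks(decode_schedule_ms, buffered_ms=2000, tick_ms=100):
    """Count how many times the decoded-audio buffer runs dry during playback.

    decode_schedule_ms -- for each playback tick, the milliseconds of audio
                          the decoder managed to add during that tick
    buffered_ms        -- decoded audio already in the buffer at the start
    tick_ms            -- audio consumed by the sound system per tick
    """
    breaks = 0
    playing = True
    for decoded_ms in decode_schedule_ms:
        buffered_ms += decoded_ms
        if buffered_ms >= tick_ms:
            buffered_ms -= tick_ms   # the sound system plays one tick of audio
            playing = True           # sound continues or resumes
        else:
            if playing:
                breaks += 1          # buffer ran dry: sound stops abruptly
            playing = False
    return breaks
```

If the decoder keeps adding at least one tick of audio per tick, no break occurs. If it is starved for longer than the buffered amount, one break is counted for each time the sound stops.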
Therefore, there is a need for a system for maintaining the synchronization between audio and video data (1) while degrading the presented video as little as possible, (2) while avoiding breaks in the audio, (3) while minimizing the number of dropped video frames, and (4) that is adaptive to the apparent processing power of the system while avoiding jerky video appearances when adapting to the apparent processing power of the system.