The modern communications era has brought about a tremendous expansion of wireline and wireless networks. Computer networks, television networks, and telephony networks are experiencing an unprecedented technological expansion, fueled by consumer demand. Wireless and mobile networking technologies have addressed related consumer demands, while providing more flexibility and immediacy of information transfer.
This expansion of networks and growth of modes of communication has enabled the creation and delivery of increasingly complex digital videos, whether by downloading media content files or by streaming digital video files from a remote network device to a local network terminal. As the expanding bandwidth and reach of networks have allowed for increasingly complex digital videos and for the delivery of video content even to mobile device terminals, the capabilities of computer hardware and software, such as that used in personal computers and in mobile devices such as PDAs and cell phones, have necessarily increased to keep pace with the ever-increasing demands of modern digital video playback.
In spite of developments in hardware and software, video content can always be created that exceeds the capabilities of a given system, regardless of the software or hardware involved, the codecs used, or the resolution of the video content. When video content exceeds the capabilities of the system on which it is playing, the result is most often a loss of what is known as audio/video synchronization, or AV sync (also sometimes referred to as pacing). Ideally, when video content is played on a device, the audio and video tracks remain in synchronization with each other so that, for example, when a person in the video speaks, the audio track of the voice is synchronized with the video track of the person opening his mouth, commonly referred to as lip synchronization. However, when the complexity of the video exceeds the capabilities of the system on which it is playing, in minor cases lip synchronization may be lost, and the audio of the person's voice may play either slightly before or slightly after the video showing the person moving his mouth. In worse cases, where the requirements to play the video content exceed system capabilities by a greater margin, video playback may periodically freeze or hang altogether, while the audio track may or may not continue to play. Such losses of AV sync are detrimental to the viewer experience.
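The loss of synchronization described above can be thought of as drift between the video frame timestamps and a shared audio playback clock. The following sketch is illustrative only and is not part of this application; the function names and the tolerance value are hypothetical assumptions, chosen merely to make the drift concept concrete.

```python
# Illustrative sketch (hypothetical names and threshold): measuring the
# drift between a video frame's presentation timestamp and the audio clock.
LIP_SYNC_TOLERANCE_MS = 45.0  # assumed perceptual tolerance; not from this application

def av_drift_ms(video_pts_ms: float, audio_clock_ms: float) -> float:
    """Positive drift means the video lags the audio; negative means it leads."""
    return audio_clock_ms - video_pts_ms

def in_sync(video_pts_ms: float, audio_clock_ms: float) -> bool:
    """True while the drift stays within the assumed lip-sync tolerance."""
    return abs(av_drift_ms(video_pts_ms, audio_clock_ms)) <= LIP_SYNC_TOLERANCE_MS
```

Under this model, minor losses of lip synchronization correspond to small drift values, while freezes correspond to drift that grows without bound.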
In an attempt to maintain AV sync in situations where video content exceeds a system's capabilities, several algorithms have been proposed in the past wherein portions of the video and/or audio track, known as frames, are dropped. While dropping audio frames may aid in maintaining AV sync, the viewer experience is greatly diminished, as missing audio frames are almost always evident to a viewer. Dropping one or more video frames, on the other hand, is not necessarily noticeable to a viewer in most cases. For example, American television standards dictate the use of 30 video frames per second, while European standards specify 25 frames per second, and viewers are unable to discern any evident difference between the two playback standards. Consequently, in order to best maintain AV sync on a system when video content exceeds the system's capabilities, a desirable approach is to drop one or more video frames while not dropping any audio frames.
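The preference above can be sketched as a selection step that examines only video frame timing, leaving the audio track untouched. This is an illustrative sketch, not part of the application; the function name and the deadline representation are assumptions.

```python
# Hypothetical sketch: every audio frame is kept, and only video frames
# that arrive by their display deadline are selected for rendering.
def frames_to_render(deadlines_ms, arrivals_ms):
    """Return the indices of video frames that arrived in time to be shown.

    deadlines_ms: per-frame display deadlines, in milliseconds.
    arrivals_ms:  per-frame actual arrival times, in milliseconds.
    Late video frames are simply omitted (dropped); audio is never examined.
    """
    return [i for i, (deadline, arrival) in enumerate(zip(deadlines_ms, arrivals_ms))
            if arrival <= deadline]
```

For example, a frame that arrives after its deadline is skipped while the surrounding on-time frames are still rendered, which is typically imperceptible at ordinary frame rates.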
Dropping of video frames may occur at two levels: (1) after being decoded but before being postprocessed, or (2) before being decoded. In the first scenario, where video frames are dropped after being decoded but before being postprocessed, each time a frame is slightly late in its arrival at the postprocessor, it is not postprocessed and rendered, which in theory saves enough time for the next frame to be postprocessed and displayed on time. In the second scenario, frames arriving late at the decoder may be dropped before being decoded.
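The two drop points described above can be illustrated with a simplified per-frame decision. The sketch below is hypothetical and not part of the application; the field names, deadlines, and cost model are assumptions introduced solely for illustration.

```python
# Illustrative sketch (hypothetical names): a frame may be dropped
# (2) before decoding when it is already late at the decoder, or
# (1) after decoding but before postprocessing when it cannot render on time.
def process_frame(frame, now_ms):
    """frame: dict with assumed keys 'decode_deadline_ms', 'decode_cost_ms',
    and 'render_deadline_ms'. Returns what happens to the frame."""
    # Level (2): late at the decoder -- skip decoding entirely.
    if now_ms > frame["decode_deadline_ms"]:
        return "dropped_before_decode"
    # Stand-in for the actual decode step, which takes decode_cost_ms.
    finished_ms = now_ms + frame["decode_cost_ms"]
    # Level (1): decoded, but too late to postprocess and render on time.
    if finished_ms > frame["render_deadline_ms"]:
        return "dropped_before_postprocess"
    return "rendered"
```

Dropping before decoding saves the most work, while dropping before postprocessing salvages time only after the decode cost has already been paid.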
To better understand how, in the second scenario, video frames may be dropped at the decoder level before being decoded, it is first necessary to understand how video frames are classified. For purposes of simplicity, the written description of this application will classify video frames as one of two types, either key frames or non-key frames. Key frames are also known in the art as intra-frames or i-frames and are self-contained video frames containing all information necessary to fully render a single frame of video without referencing any previous or subsequent frames. Non-key frames, also known as predictive frames (p-frames) or bi-directional frames (b-frames), on the other hand, are not self-contained and include data that may reference one or more previous or subsequent frames. For example, during a short video segment, one object may gradually move over a sequence of frames while the rest of the background video content maintains the same positioning. Hence, non-key frames subsequent to the last key frame may contain only the data necessary to describe the movement in the position of the one moving object, without any data on how to fully render the remaining background of the video content. Thus, since video frames may refer to previous and/or subsequent video frames in a sequence, whenever a frame is dropped at the decoder level there may be visual consequences, as dropping a frame will cause any frames referring to the dropped frame to render incorrectly, since they are missing some part of the information necessary to fully render. The visual quality then diminishes rapidly when any frame is dropped, and can only be restored when a key frame is decoded and maintained as long as no subsequent non-key frames are dropped.
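The dependency structure described above can be modeled in a few lines. The sketch below is illustrative and not part of the application; it assumes a simplified stream in which each non-key frame references only earlier frames, and uses hypothetical names.

```python
# Hypothetical model of the dependency chain: once any frame is dropped,
# subsequent frames that reference it render incorrectly until the next
# key frame ('I') restores a self-contained picture.
def correctly_rendered(frame_types, dropped):
    """frame_types: sequence of 'I' (key) or 'P' (non-key) frames, in order.
    dropped: set of indices dropped at the decoder.
    Returns a list of booleans, True where the frame renders correctly."""
    result = []
    corrupted = False
    for i, frame_type in enumerate(frame_types):
        if frame_type == "I" and i not in dropped:
            corrupted = False  # a decoded key frame restores full quality
        elif i in dropped or corrupted:
            corrupted = True   # missing reference data propagates forward
        result.append(i not in dropped and not corrupted)
    return result
```

In this model, dropping a single non-key frame corrupts every following non-key frame, and quality recovers only at the next decoded key frame, matching the behavior described above.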
The most common prior solution to the diminishment of visual quality that results when a frame is dropped at the decoder level was to drop all consecutive frames until the next key frame once any non-key frame was late. Under this approach, display of incoming video frames stops once any non-key frame is dropped and does not resume until the arrival of the next key frame. If the video clip contains few key frames, a viewer may get the impression that playback has stopped or hung.
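The prior approach described above can be sketched as follows. This is an illustrative sketch, not the application's own method; the function name and stream representation are assumptions, and the handling of late key frames is left out because the description above addresses only late non-key frames.

```python
# Sketch of the prior approach: once any non-key frame is late, drop every
# subsequent frame until the next key frame ('I') resumes display.
def prior_art_drop_plan(frame_types, late):
    """frame_types: sequence of 'I' (key) or 'P' (non-key) frames, in order.
    late: set of indices of frames that arrive late.
    Returns the set of indices this approach would drop."""
    dropped = set()
    dropping = False
    for i, frame_type in enumerate(frame_types):
        if frame_type == "I":
            dropping = False  # a key frame resumes display
        if dropping or (i in late and frame_type != "I"):
            dropped.add(i)
            dropping = True
    return dropped
```

With sparse key frames, a single late non-key frame causes a long run of dropped frames, which is the apparent "stopped playback" effect noted above.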
Accordingly, it would be advantageous to provide an improved approach to maintaining AV synchronization that allows for adaptive decoding and dropping of video frames, so as to substantially maintain AV synchronization while causing as little negative impact to the viewer experience as possible by preventing the long pauses in video playback that may otherwise be caused by prior approaches to dropping sequences of non-key frames. Such an improved approach may extend the capabilities of a system beyond its comfort zone, allowing for optimal playback of video content in circumstances in which the system would otherwise be unable to handle the complexity of the video content.