Digital video and other digital image or content applications (such as, but not limited to, digital television (DTV) broadcasts, music broadcasts, digital movie production, video advertisements, and multi-person video conferencing) all use compressed digital video at their core. For example, compression technologies such as the ISO/IEC-standard Moving Picture Experts Group (MPEG) formats, Microsoft's VC-1, or Apple's QuickTime form the basis of the majority of the schemes in use today. Most of these technologies use compression schemes in which not all of the frames contain full pictures that can be decoded stand-alone. Some frames (for example, two frames out of thirty frames, in the case of a common MPEG-2 specification) contain full pictures, whereas other frames merely contain “differential” data, that is, information about the changes between complete pictures. This is how compression efficiency is typically achieved.
However, these types of encodings introduce some significant issues in certain common use cases. For example, consider the act of switching TV channels on a digital television network. In the case of analog television systems, this switching would be instantaneous. However, in the digital counterpart, depending on the instant in time, the new channel the user switches to may not yet have a complete frame available for the client (the television set-top box) to start decoding. The data might be differential in nature, and without a base reference (a complete frame of data) the decoder is unable to decode the data stream until a complete frame of data arrives. Depending on how the picture is encoded, the next complete frame of data may arrive in a few milliseconds or in a few seconds. Additional delays are introduced by the various processing steps that happen within the network. For example, tuning to a channel in a digital Internet Protocol network (IP network) incurs what are called network multicast join (Internet Group Management Protocol (“IGMP”) join) delays. Cumulatively, all of these delays add up and in certain situations become unacceptable: the user experiences a long delay in going from one channel to another.
In order to achieve higher compression efficiencies for video broadcasts, certain compression schemes and specifications allow “stand-alone frames” (sometimes referred to as key frames, as I-frames for Intra Coded frames, or as RAPs for Random Access Points), from which a complete image may be generated without prediction or interpolation from other frames, to be spaced farther apart in time. All other frames are predicted using forward prediction (called P-frames, where the “P” denotes prediction) or bidirectional prediction (called B-frames, where the “B” denotes bi-direction). In these schemes, specifications, and techniques, prediction is the key to achieving better compression; however, the resultant frames cannot be decoded independently. For example, the H.264 video codec used in MPEG-4 Part 10 implementations (see, for example, the MPEG-4 ISO/IEC 14496 specification, which is hereby incorporated by reference) allows I-frames or Random Access Points to be spaced as far as 2 to 8 seconds apart. The resultant digital video is “efficient” in terms of the number of bits used to represent the scenes. However, channel change latency becomes significantly longer, because it takes the decoder a longer period of time to obtain a standalone picture that does not require any frames from the past or future in order to start decoding. The resultant system becomes unacceptable in terms of user experience.
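By way of illustration only, the dependence of decoding on standalone frames can be sketched as follows. The 15-frame pattern `GOP` and the function name `frames_until_decodable` are assumptions introduced for this example and are not taken from any specification; the sketch simply counts how many predicted frames a decoder must skip after joining a repeating frame sequence before it reaches an I-frame.

```python
def frames_until_decodable(gop_pattern: str, join_index: int) -> int:
    """Count the frames a decoder must skip, after joining the stream at
    join_index, before it reaches a standalone I-frame.

    gop_pattern is one repetition of the frame-type sequence, in
    transmission order, using the letters I, P, and B.
    """
    if "I" not in gop_pattern:
        raise ValueError("pattern must contain at least one I-frame")
    n = len(gop_pattern)
    for skipped in range(n):
        # The pattern repeats, so wrap around with the modulo operator.
        if gop_pattern[(join_index + skipped) % n] == "I":
            return skipped
    raise AssertionError("unreachable: pattern contains an I-frame")


# Hypothetical repeating pattern: one I-frame in every 15 frames.
GOP = "IBBPBBPBBPBBPBB"

waits = [frames_until_decodable(GOP, i) for i in range(len(GOP))]
worst_wait = max(waits)               # client joined just after the I-frame
mean_wait = sum(waits) / len(waits)   # join instant uniformly distributed

print(f"worst case: {worst_wait} frames, mean: {mean_wait} frames")
```

Under this simplified model, lengthening the pattern (that is, spacing the I-frames farther apart) raises both the worst-case and the mean number of frames that must be skipped in direct proportion.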
A few conventional approaches have attempted to solve this problem, but these approaches have not been entirely successful. In particular, they have not been successful either in reducing the switching delays associated with changes in content feed selection for regular programming, or in meeting the switching requirements for the desired insertion of local, customized, or other advertisements or other content into the digital television stream.
The first and simplest of these conventional approaches has been to digitally encode the video with standalone frames (I-frames or RAPs) that are spaced closer together. For example, in the MPEG-2 specification used in DVB standards, the I-frames are spaced 500 milliseconds (msec) apart. Thus, when a decoder starts receiving data upon a new channel selection, the worst-case latency to begin decoding the stream is 500 msec plus the rest of the overhead (e.g., the IGMP join). The rest of the overhead may, for example, consist of delays in recognizing the remote control key press and the time it takes to issue an IGMP join, and may be as long as a few microseconds to several milliseconds. However, as mentioned before, this conventional scheme does not yield the most efficient streams in terms of the amount of data used to represent the scenes. For example, for a standard definition video signal, this scheme may yield a compression efficiency that produces a digital bit-stream of between 2.5 and 4.5 megabits/second.
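The latency arithmetic described above may be sketched, purely as an illustration, as follows. The function names and the overhead value `OVERHEAD_MS` are assumptions chosen for the example; the 500 msec spacing and the 2 to 8 second spacings are the figures discussed in this background. The mean-case figure further assumes that the instant of channel selection is uniformly distributed between two standalone frames.

```python
def worst_case_latency_ms(rap_spacing_ms: float, overhead_ms: float) -> float:
    """Worst case: the client joins just after a standalone frame has passed,
    so it waits a full spacing interval plus the remaining overhead."""
    return rap_spacing_ms + overhead_ms


def mean_latency_ms(rap_spacing_ms: float, overhead_ms: float) -> float:
    """Mean case, assuming the join instant is uniformly distributed
    between two consecutive standalone frames."""
    return rap_spacing_ms / 2.0 + overhead_ms


# Assumed, illustrative overhead (key-press recognition plus IGMP join).
OVERHEAD_MS = 5.0

for spacing_ms in (500.0, 2000.0, 8000.0):
    worst = worst_case_latency_ms(spacing_ms, OVERHEAD_MS)
    mean = mean_latency_ms(spacing_ms, OVERHEAD_MS)
    print(f"spacing {spacing_ms:6.0f} ms -> worst {worst:6.0f} ms, mean {mean:6.0f} ms")
```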
One strategy to overcome the latency is to decode the stream and re-encode it such that when a client joins a channel, the data always starts with a standalone frame. This scheme requires significant server processor resources, such as central processing unit (CPU) resources, for decoding and re-encoding, and introduces significant bandwidth overhead because more I-frames are generated and have to be transmitted (for example, the resultant data rate may go up by 50% to 100% or more compared to the original data stream). Another disadvantage of this scheme is that it is not compatible with features such as encryption, because decoding requires access to the encryption keys and re-encoding requires access to the original encryption scheme. Both of these requirements are impractical and introduce significant security issues.
An alternate strategy is a hybrid solution, in which a client that is interested in joining a new channel starts with a unicast stream first, to reduce the latency of the channel change, and then switches to a multicast stream at an appropriate point in the future. A unicast stream is a unique stream that is allocated to each individual user, whereas a multicast stream is a single shared stream that a number of subscribers to that channel join. The unicast stream needs to start with a complete I-frame or RAP so that the client can start decoding the video data without any delay. However, a unicast stream model requires substantial server and network resources and is not very cost effective. For example, in a very large network, each user would have to be assigned his or her own unique bandwidth.
However, for this conventional hybrid solution to work well, the client needs to buffer enough data so that the transition from unicast to multicast can take place without any artifacts. Artifacts such as jitter, frozen frames, or a black screen may typically result from transitioning at a non-RAP boundary or from a lack of data. Client buffer requirements may vary depending, for example, on the bit-rate of the content, the distance between two RAPs, or other factors. This implies that the unicast stream would need to be sent faster than the bit-rate of the channel in order to build up the buffer on the client. The burden of switching from the unicast stream to the multicast stream would fall on the client rather than on the server. This would require a smart client, whereas it is preferred to require only a thin or less intelligent client. Moreover, the data viewed by any client would always be delayed, and the maximum delay would roughly equal the time between two RAPs.
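The need to send the unicast stream faster than the channel bit-rate can be illustrated with a deliberately simplified model. The model assumes a constant bit-rate channel and a client that begins playback as soon as the unicast stream starts, so that the buffer grows only by the excess of the burst rate over the playback rate. The function names, the burst factor of 1.5, the 2-second buffer target, and the 4.0 megabits/second channel rate are hypothetical values chosen for the example.

```python
def burst_fill_time_s(target_buffer_s: float, burst_factor: float) -> float:
    """Seconds of unicast bursting needed to accumulate target_buffer_s
    seconds of content in the client buffer while playback is running.

    burst_factor is the unicast send rate divided by the channel bit-rate.
    Each second of bursting delivers burst_factor seconds of content and
    playback consumes one second, so the buffer gains (burst_factor - 1).
    """
    if burst_factor <= 1.0:
        raise ValueError("the buffer only grows if burst_factor exceeds 1")
    return target_buffer_s / (burst_factor - 1.0)


def burst_bandwidth_mbps(channel_bitrate_mbps: float, burst_factor: float) -> float:
    """Per-client unicast bandwidth required during the burst."""
    return channel_bitrate_mbps * burst_factor


# Hypothetical values: buffer one 2-second RAP interval of a 4.0 Mbit/s
# channel while sending at 1.5 times the channel bit-rate.
fill_time = burst_fill_time_s(target_buffer_s=2.0, burst_factor=1.5)
bandwidth = burst_bandwidth_mbps(channel_bitrate_mbps=4.0, burst_factor=1.5)

print(f"burst for {fill_time} s at {bandwidth} Mbit/s per client")
```

Under these assumptions, a lower burst factor reduces the per-client bandwidth but lengthens the period during which the client must remain on the unicast stream, which is one way of viewing the resource cost noted above.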
Therefore, these conventional attempts to solve this problem have not been entirely successful, and/or impose additional undesired requirements on the client, so that new methods, systems, and devices are needed to overcome the limitations of the prior art.