1. Field
This invention relates to a method and an apparatus for encoding and decoding scalable video data.
2. Background
Due to the explosive growth and great success of the Internet and wireless communication, as well as increasing demand for multimedia services, streaming media over the Internet and mobile/wireless channels has drawn tremendous attention. In heterogeneous Internet Protocol (IP) networks, video is provided by a server and can be streamed to one or more clients. Wired connections include dial-up, ISDN, cable, xDSL, fiber, LAN (local area network), WAN (wide area network) and others. The transmission mode can be either unicast or multicast. The variety of individual client devices, including PDA (personal digital assistant), laptop, desktop, set-top box, TV, HDTV (high-definition television), mobile phone and others, requires bitstreams of different bandwidths simultaneously for the same content. The connection bandwidth can vary quickly over time (from 9.6 kbps to 100 Mbps and above), and can change faster than a server can react.
Mobile/wireless communication is similar to the heterogeneous IP network. Transport of multimedia content over mobile/wireless channels is very challenging because these channels are often severely impaired by multi-path fading, shadowing, inter-symbol interference, and noise disturbances. Other factors, such as mobility and competing traffic, also cause bandwidth variations and loss. The channel noise and the number of users being served determine the time-varying property of channel environments. In addition to environmental conditions, the destination network can vary from second to third generation cellular networks to broadband data-only networks due to geographic location as well as mobile roaming. All these variables call for adaptive rate adjustment for multimedia content, even on the fly. Thus, successful transmission of video over heterogeneous wired/wireless networks requires efficient coding, as well as adaptability to varying network conditions, device characteristics, and user preferences, while also being resilient to losses.
To meet different user requirements and to adapt to channel variation, one could generate multiple independent versions of bitstreams, each meeting one class of constraints based on transmission bandwidth, user display and/or computational capability, but this is inefficient for server storage and multicast applications. In scalable coding, where a single macro-bitstream accommodating high-end users is built at the server, the bitstreams for low-end applications are embedded as subsets of the macro-bitstream. As such, a single bitstream can be adapted to diverse application environments by selectively transmitting sub-bitstreams. Another advantage of scalable coding is robust video transmission over error-prone channels. Error protection and error concealment are easier to handle: a more reliable transmission channel or stronger error protection can be applied to the base layer bits, which contain the most significant information.
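The selective transmission of sub-bitstreams described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the macro-bitstream is an ordered list of layers in which each layer depends on all layers before it; the `Layer` type, the `select_layers` function and the bit rates shown are hypothetical and are not taken from any standard or from this disclosure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Layer:
    name: str
    rate_kbps: float  # hypothetical bit rate this layer adds to the stream


def select_layers(layers: List[Layer], bandwidth_kbps: float) -> List[Layer]:
    """Return the longest prefix of layers whose cumulative rate fits the bandwidth.

    Layers are assumed to be ordered from base to highest enhancement, so a
    layer is only useful if every layer before it is also transmitted.
    """
    selected: List[Layer] = []
    total = 0.0
    for layer in layers:
        if total + layer.rate_kbps > bandwidth_kbps:
            break
        selected.append(layer)
        total += layer.rate_kbps
    return selected


# Hypothetical macro-bitstream: one base layer and two enhancement layers.
macro_bitstream = [Layer("base", 64), Layer("enh1", 192), Layer("enh2", 768)]

if __name__ == "__main__":
    for bandwidth in (50, 300, 2000):
        names = [layer.name for layer in select_layers(macro_bitstream, bandwidth)]
        print(bandwidth, "kbps:", names)
```

Under these assumptions a single stored bitstream serves every client class: the server, or an intermediate node, truncates the layer list to match each connection rather than storing one independent encoding per bandwidth.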
Hybrid coders such as MPEG-1, MPEG-2, MPEG-4 (collectively referred to as MPEG-x), H.261, H.262, H.263, and H.264 (collectively referred to as H.26x) support spatial, temporal and signal-to-noise ratio (SNR) scalability. In hybrid coding, temporal redundancy is removed by motion-compensated prediction (MCP). Video is typically divided into a series of groups of pictures (GOP), where each GOP begins with an intra-coded frame (I) followed by an arrangement of forward (and/or backward) predicted frames (P) and bi-directional predicted frames (B). Both P frames and B frames are inter-predicted frames employing MCP. A base layer can contain the most significant information of I frames, P frames or B frames at a lower quality level, and an enhancement layer can contain higher quality information of the same frames or additional temporal scaling frames not contained in the base layer. SNR scalability can be accomplished at a decoder by selectively omitting decoding of the higher quality data in the enhancement layer while decoding the base layer data. Depending on how the data is parsed between the base layer and the enhancement layer, decoding of the base layer plus enhancement layer data can introduce increased complexity and memory requirements. Increased computational complexity and increased memory requirements can be detrimental to the performance of power-limited and computationally limited devices such as PDAs, mobile phones and the like. What is desired is that decoding of the base layer plus the enhancement layers not significantly increase the computational complexity and memory requirements of such devices.
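The base layer/enhancement layer split for SNR scalability can be illustrated with a generic two-layer quantization sketch. This is an assumption made for illustration only, not the method of any particular standard or of this disclosure: the base layer holds coarsely quantized coefficients, and the enhancement layer holds a finer quantization of the remaining error. The function names and the step sizes (16 and 4) are hypothetical.

```python
from typing import List, Optional, Tuple


def encode(coeffs: List[int], base_step: int = 16,
           enh_step: int = 4) -> Tuple[List[int], List[int]]:
    """Split coefficients into a coarse base layer and a refinement layer."""
    # Base layer: coarse quantization of each coefficient.
    base = [round(c / base_step) for c in coeffs]
    # Enhancement layer: finer quantization of what the base layer missed.
    residual = [c - b * base_step for c, b in zip(coeffs, base)]
    enh = [round(r / enh_step) for r in residual]
    return base, enh


def decode(base: List[int], enh: Optional[List[int]] = None,
           base_step: int = 16, enh_step: int = 4) -> List[int]:
    """Reconstruct from the base layer alone, or from base plus enhancement."""
    recon = [b * base_step for b in base]
    if enh is not None:
        # Adding the refinement is the extra work a capable decoder takes on.
        recon = [r + e * enh_step for r, e in zip(recon, enh)]
    return recon


if __name__ == "__main__":
    coeffs = [100, -37, 5, 0]
    base, enh = encode(coeffs)
    print("base only:      ", decode(base))
    print("base + enhance: ", decode(base, enh))
```

A decoder that omits the `enh` argument reproduces the lower quality level from the base layer alone, while a decoder that supplies it obtains a closer reconstruction at the cost of the additional pass over the data.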