1. Field
This invention relates to a method and apparatus for encoding and decoding scalable video data with efficient reuse of base layer modules for construction of enhancement layer frames.
2. Background
Due to the explosive growth and great success of the Internet and wireless communication, as well as increasing demand for multimedia services, streaming media over the Internet and mobile/wireless channels has drawn tremendous attention. In heterogeneous Internet Protocol (IP) networks, video is provided by a server and can be streamed to one or more clients. Wired connections include dial-up, ISDN, cable, xDSL, fiber, LAN (local area network), WAN (wide area network) and others. The transmission mode can be either uni-cast or multi-cast. The variety of individual client devices, including PDA (personal digital assistant), laptop, desktop, set-top box, TV, HDTV (high-definition television), mobile phone and others, requires bitstreams of different bandwidths simultaneously for the same content. The connection bandwidth can vary quickly over time (from 9.6 kbps to 100 Mbps and above), and can change faster than a server can react.
Mobile/wireless communication is similar to the heterogeneous IP network. Transport of multimedia content over mobile/wireless channels is very challenging because these channels are often severely impaired by effects such as multi-path fading, shadowing, inter-symbol interference, and noise disturbances. Other factors, such as mobility and competing traffic, also cause bandwidth variations and loss. Factors such as channel noise and the number of users being served determine the time-varying property of channel environments. In addition to environmental conditions, the destination network can vary from second to third generation cellular networks to broadband data-only networks due to geographic location as well as mobile roaming. All these variables call for adaptive rate adjustment for multimedia content, even on the fly. Thus, successful transmission of video over heterogeneous wired/wireless networks requires efficient coding, as well as adaptability to varying network conditions, device characteristics, and user preferences, while also being resilient to losses.
To meet different user requirements and to adapt to channel variation, one could generate multiple independent versions of bitstreams, each meeting one class of constraints based on transmission bandwidth, user display and computational capability. However, this approach is not efficient for server storage or network capacity. In scalable coding, where a single macro-bitstream accommodating high-end users is built at the server, the bitstreams for low-end applications are embedded as subsets of the macro-bitstream. As such, a single bitstream can be adapted to diverse application environments by selectively transmitting sub-bitstreams. Another advantage of scalable coding is robust video transmission over error-prone channels. Error protection and error concealment can be easily handled. A more reliable transmission channel or stronger error protection can be applied to base-layer bits that contain the most significant information.
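The selective transmission of sub-bitstreams described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each packet of the macro-bitstream carries a layer index (0 for the base layer, higher values for enhancement layers) and that a layer is useful only if all lower layers are also sent. The names `Packet`, `extract_sub_bitstream` and `highest_layer_for_budget` are hypothetical and do not come from any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    layer: int      # assumed tag: 0 = base layer, higher = enhancement layer
    size_bits: int  # payload size of this packet


def extract_sub_bitstream(packets, max_layer):
    """Keep only the packets belonging to layers 0..max_layer, in order."""
    return [p for p in packets if p.layer <= max_layer]


def highest_layer_for_budget(packets, budget_bits):
    """Return the highest layer whose cumulative size fits the budget.

    Layers are added from the base layer upward; returns None if even
    the base layer does not fit.
    """
    total = 0
    best = None
    for layer in sorted({p.layer for p in packets}):
        total += sum(p.size_bits for p in packets if p.layer == layer)
        if total > budget_bits:
            break
        best = layer
    return best


# Illustrative macro-bitstream with one base and two enhancement layers.
macro = [Packet(0, 100), Packet(1, 50), Packet(2, 200),
         Packet(0, 100), Packet(1, 50)]

layer = highest_layer_for_budget(macro, budget_bits=350)
sub = extract_sub_bitstream(macro, layer)
```

With the illustrative sizes above, the base layer and the first enhancement layer together fit the assumed budget, so the second enhancement layer is simply not transmitted; the server stores only the one macro-bitstream.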
Hybrid coders such as MPEG-1, MPEG-2, MPEG-4 (collectively referred to as MPEG-x), H.261, H.262, H.263, and H.264 (collectively referred to as H.26x) provide spatial, temporal and signal-to-noise ratio (SNR) scalability. In hybrid coding, temporal redundancy is removed by motion-compensated prediction (MCP). A video is typically divided into a series of groups of pictures (GOP), where each GOP begins with an intra-coded frame (I) followed by an arrangement of forward predicted frames (P) and bidirectional predicted frames (B). Both P-frames and B-frames are inter-frames. The B-frame is the key to temporal scalability in most MPEG-like coders. However, some profiles, such as the MPEG-4 Simple profile and the H.264 Baseline profile, do not support B-frames.
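The role of the B-frame in temporal scalability can be shown with a short sketch. It assumes a GOP in which no frame is predicted from a B-frame, so that B-frames can be discarded without affecting the decoding of the remaining I- and P-frames; this holds for the classic I/P/B arrangement but is an assumption here, since some coders allow B-frames to serve as references. The function name `drop_b_frames` and the example GOP pattern are illustrative only.

```python
def drop_b_frames(gop):
    """Return the frames that remain when every B-frame is discarded.

    `gop` is a sequence of frame-type labels: "I", "P" or "B".
    """
    return [frame for frame in gop if frame != "B"]


# Illustrative GOP: one I-frame, then P-frames each preceded by two B-frames.
gop = list("IBBPBBPBBPBB")

base_layer = drop_b_frames(gop)

# Fraction of the original frame rate that the reduced stream retains.
frame_rate_ratio = len(base_layer) / len(gop)
```

For this pattern the reduced stream keeps four of the twelve frames, so it plays at one third of the original frame rate while remaining decodable under the stated assumption.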
In MPEG-4, profiles and levels provide a means of defining subsets of the syntax and semantics based on the decoder capabilities required to decode a particular bitstream. A profile is defined as a subset of the entire bitstream syntax. A level is a defined set of constraints imposed on parameters in the bitstream. For any given profile, levels generally correspond to decoder processing load and memory capability. Profiles and levels therefore specify restrictions on bitstreams and hence set the capabilities a decoder needs in order to decode them. In general, a decoder shall be deemed to be conformant to a given profile at a given level if it is able to properly decode all allowed values of all syntactic elements as specified by that profile at that level.
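The distinction between a profile (a subset of the syntax) and a level (limits on parameter values) can be sketched as a simple check. Everything in this sketch is hypothetical: the class names, the reduction of "syntax" to a set of frame types, and the numeric limits are chosen for illustration and are not taken from ISO/IEC 14496-2 or any other standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    allowed_frame_types: frozenset  # stands in for the permitted syntax subset


@dataclass(frozen=True)
class Level:
    name: str
    max_bitrate_bps: int   # illustrative parameter constraint
    max_frame_pixels: int  # illustrative parameter constraint


@dataclass(frozen=True)
class StreamInfo:
    frame_types: frozenset
    bitrate_bps: int
    frame_pixels: int


def is_decodable(stream, profile, level):
    """True if the stream uses only the profile's syntax and obeys the level."""
    within_profile = stream.frame_types <= profile.allowed_frame_types
    within_level = (stream.bitrate_bps <= level.max_bitrate_bps
                    and stream.frame_pixels <= level.max_frame_pixels)
    return within_profile and within_level


# Hypothetical profile without B-frames and a hypothetical level.
no_b_profile = Profile("no-B (hypothetical)", frozenset("IP"))
example_level = Level("example (hypothetical)", 384_000, 176 * 144)

ip_stream = StreamInfo(frozenset("IP"), 128_000, 176 * 144)
ipb_stream = StreamInfo(frozenset("IPB"), 128_000, 176 * 144)

ip_ok = is_decodable(ip_stream, no_b_profile, example_level)
ipb_ok = is_decodable(ipb_stream, no_b_profile, example_level)
```

In this sketch the stream containing B-frames is rejected by the profile check alone, even though it satisfies every level constraint.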
Evolutionary development, or migration, of modern microprocessor chipsets can be accomplished in an efficient manner when requirements can be met while keeping changes to software, firmware and hardware to a minimum. As discussed above, the MPEG-4 Simple profile and H.264 Baseline profile do not support B-frames for temporal scalability. Therefore, chipsets that were developed in conformance to these profiles may not support B-frames. With the increasing popularity of, and demand for, higher rate multimedia and the networks supporting it, an efficient migration path from the MPEG-4 Simple profile or H.264 Baseline profile to a profile offering temporal scalability with B-frames is needed. The MPEG-4 standard is described in ISO/IEC 14496-2. The H.264 standard is described in ISO/IEC 14496-10.