Modern media content distribution systems such as mobile video transmission systems are becoming increasingly popular. The underlying access networks are typically characterized by a varying connection quality and a wide range of terminal devices acting as recipients of the media content. The varying connection quality is, inter alia, a result of adaptive resource sharing mechanisms of these networks addressing the time varying data throughput requirements of a varying number of user terminal devices. As the terminal devices may range from mobile telephones with small screens and restricted processing power to high-end Personal Computers (PCs) with high-definition displays, the terminal devices will generally have different capabilities and requirements.
Bitstream scalability for media content is a desirable feature in such media content distribution systems. The need for scalability arises from graceful degradation transmission requirements and from adaptation requirements for spatial formats, bit rates or power, to name a few. To fulfill these requirements, it is beneficial to simultaneously transmit or store the media content in different spatial or temporal resolutions or qualities, which is the basis of bitstream scalability.
Scalable video coding (SVC) is one solution to the scalability needs posed by the characteristics of modern video transmission systems. The SVC standard as specified in Annex G of H.264/Advanced Video Coding (AVC) allows the construction of bitstreams that contain scaling sub-bitstreams conforming to H.264/AVC. H.264/AVC is a video compression standard equivalent to the Moving Picture Experts Group (MPEG)-4 AVC standard.
The SVC standard encompasses various scaling approaches. For temporal bitstream scalability, i.e., the generation of a sub-bitstream with a smaller temporal sampling rate than the original bitstream, complete access units are removed from the bitstream when deriving the sub-bitstream. In this case, high-level syntax and inter prediction reference pictures in the bitstream are constructed accordingly. For spatial and quality bitstream scalability, i.e., the generation of a sub-bitstream with lower spatial resolution or quality than the original bitstream, Network Abstraction Layer (NAL) units are removed from the bitstream when deriving the sub-bitstream. In this case, inter-layer prediction, i.e., the prediction of the higher spatial resolution or quality bitstream based on information contained in the lower spatial resolution or quality bitstream, is typically used for efficient encoding.
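The derivation of a sub-bitstream described above can be illustrated with a minimal Python sketch. It is a toy model, not an implementation of the SVC sub-bitstream extraction process: the `NalUnit` class and the `extract_sub_bitstream` function are hypothetical names introduced for illustration, and each NAL unit is assumed to carry a temporal, a spatial (dependency) and a quality layer identifier that can be compared against a target operating point.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NalUnit:
    """Toy stand-in for a NAL unit (hypothetical, for illustration only)."""
    access_unit: int    # index of the access unit the NAL unit belongs to
    temporal_id: int    # temporal layer
    dependency_id: int  # spatial layer (0 = base layer)
    quality_id: int     # quality layer (0 = base quality)


def extract_sub_bitstream(nal_units: List[NalUnit],
                          max_temporal_id: int,
                          max_dependency_id: int,
                          max_quality_id: int) -> List[NalUnit]:
    """Keep only the NAL units at or below the target operating point.

    Lowering max_temporal_id drops complete access units (temporal
    scalability); lowering max_dependency_id or max_quality_id drops
    individual NAL units within the remaining access units.
    """
    return [
        n for n in nal_units
        if n.temporal_id <= max_temporal_id
        and n.dependency_id <= max_dependency_id
        and n.quality_id <= max_quality_id
    ]


# Four access units with alternating temporal layers, each carrying one
# base-layer and one enhancement-layer NAL unit.
stream = [
    NalUnit(access_unit=au, temporal_id=au % 2, dependency_id=dep, quality_id=0)
    for au in range(4)
    for dep in (0, 1)
]

half_rate = extract_sub_bitstream(stream, 0, 1, 0)   # temporal scaling
base_layer = extract_sub_bitstream(stream, 1, 0, 0)  # spatial scaling
```

In this sketch, `half_rate` retains only access units 0 and 2 with both layers, while `base_layer` retains all four access units but only their base-layer NAL units.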
In the SVC standard, the lower spatial resolution or quality sub-bitstream is also referred to as Base Layer (BL) sub-bitstream, while the higher spatial resolution or quality sub-bitstream is also referred to as Enhancement Layer (EL) sub-bitstream. It should be noted that in scenarios with multiple sub-bitstreams of different higher spatial resolution or quality, two or more EL sub-bitstreams may be provided in total.
Each image of an SVC video image sequence is represented as a so-called “frame” (i.e., as an encoded representation of this image). Each SVC sub-bitstream is represented as a sequence of so-called “sub-frames”. Each SVC sub-frame constitutes either a full SVC frame or a fraction of an SVC frame. In other words, each SVC frame either is represented as a single data item (i.e., one BL “sub-frame” or one EL “sub-frame”) or is sub-divided into at least two separate data items, i.e., into one BL “sub-frame” containing only the BL information associated with the respective frame and (at least) one EL “sub-frame” containing the EL information associated with the respective frame. In the SVC bitstream, an EL sub-frame may temporally correspond to a certain BL sub-frame. If only the BL sub-frames are decoded, then the video content can be rendered at a base resolution or quality (e.g., at Quarter Video Graphics Array, or QVGA, resolution). If, on the other hand, both the BL and the EL sub-frames are decoded, then the video content can be rendered at a higher resolution or quality (e.g., at VGA resolution).
The SVC file format for storing the sub-frames of the BL and EL sub-bitstreams is derived from the MPEG-4 file format. That is, each SVC media file is divided into a media data container and a track container (also called movie container). The media data container is used to store the sub-frames of the BL sub-bitstream in so-called “BL samples” and, in optional EL samples, (at least) the sub-frames of one or more EL sub-bitstreams. The track container, on the other hand, specifies one or more media tracks, with each media track representing one media stream. Each media track contains references to a sequence of samples stored in the media data container (e.g., in the form of a time-to-sample table).
Using this SVC file format, access to different SVC layers and combinations of SVC layers can be indicated by use of multiple (i.e., at least two) media tracks. For instance, if the SVC encoded video content comprises a BL sub-bitstream and one EL sub-bitstream, then the media file would comprise a first media track representing the BL sub-bitstream only (“BL track”), while a second media track would represent both the BL and the EL sub-bitstream (“EL track”).
There exist two strategies to store the SVC encoded video information in the media data container, which can be referred to as the “efficient strategy” and the “inefficient strategy”. According to both strategies, the BL sub-frames are stored as distinct BL samples in the media data container, and these samples are referenced by the BL track. The inefficient strategy and the efficient strategy differ in the way the EL samples containing the EL sub-frames are organized in relation to the corresponding BL sub-frames.
According to the inefficient strategy, BL sub-frames and EL sub-frames corresponding to a specific point in time (i.e., constituting a specific frame) will both be stored in the same EL sample. These samples are then referenced in their specific sequence in the EL track. Since the BL sub-frames referenced in the BL track are additionally stored in distinct BL samples, they have in fact to be replicated within the media data container. This replication results in an inefficiency as regards memory usage.
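The memory overhead of the inefficient strategy can be made concrete with a small Python sketch. It is a simplified model under stated assumptions: the `MediaFile` class and the functions `store_inefficient` and `stored_bytes` are hypothetical, samples are modeled as plain byte strings, and each track is reduced to a list of sample indices.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MediaFile:
    """Toy media file (hypothetical, for illustration only)."""
    # Media data container: the stored samples.
    media_data: List[bytes] = field(default_factory=list)
    # Track container: track name -> indices of the referenced samples.
    tracks: Dict[str, List[int]] = field(default_factory=dict)


def store_inefficient(bl_sub_frames: List[bytes],
                      el_sub_frames: List[bytes]) -> MediaFile:
    """Store each BL sub-frame twice: once as a distinct BL sample and
    once inside the EL sample of the same frame."""
    media_file = MediaFile(tracks={"BL": [], "EL": []})
    for bl, el in zip(bl_sub_frames, el_sub_frames):
        media_file.media_data.append(bl)        # distinct BL sample
        media_file.tracks["BL"].append(len(media_file.media_data) - 1)
        media_file.media_data.append(bl + el)   # EL sample with BL copy
        media_file.tracks["EL"].append(len(media_file.media_data) - 1)
    return media_file


def stored_bytes(media_file: MediaFile) -> int:
    """Total size of all samples in the media data container."""
    return sum(len(sample) for sample in media_file.media_data)


bl_frames = [b"B0", b"B1"]     # 2 bytes per BL sub-frame
el_frames = [b"E0-", b"E1-"]   # 3 bytes per EL sub-frame
media_file = store_inefficient(bl_frames, el_frames)

payload = sum(map(len, bl_frames)) + sum(map(len, el_frames))
overhead = stored_bytes(media_file) - payload
```

In this model the overhead equals the total size of the BL sub-frames, since every BL sub-frame is stored exactly one additional time.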
If the efficient strategy is used, then the BL sub-frames are not replicated to be encapsulated together with EL sub-frames in the corresponding EL samples referenced in the EL track. Rather, each sample containing an EL sub-frame is provided with a track reference index in a so-called “extractor”. The track reference index in the extractor refers to the BL track and thus allows identification of the BL sample in the media data container that contains the BL sub-frame temporally associated with the specific EL sub-frame. As in the inefficient strategy, the samples containing the EL sub-frames are referenced in the EL track. The associated BL sub-frames are thus determined by dereferencing the extractors in the EL samples. To this end, additional meta information (the so-called “‘scal’ type track references”) is stored in the track container. By reading the track reference index included in the extractor and looking up the associated BL track reference using the meta information, the BL track can be determined. Then, in a next step, the BL sample containing the temporally associated BL sub-frame can be found using the time-to-sample table of the BL track.
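The dereferencing steps described above can be sketched in Python as follows. This is a simplified model and not the actual file format syntax: the classes `Extractor`, `ElSample` and `Track` and the function `resolve_frame` are hypothetical, the time-to-sample table is reduced to a mapping from decoding time to sample index, and the track reference index is assumed to be 1-based.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


@dataclass
class Extractor:
    # Index into the 'scal' track references of the EL track
    # (assumed to be 1-based in this sketch).
    track_ref_index: int


@dataclass
class ElSample:
    extractor: Extractor   # points at the BL track instead of copying BL data
    el_sub_frame: bytes


@dataclass
class Track:
    # Simplified time-to-sample table: decoding time -> index of the
    # sample in the media data container.
    time_to_sample: Dict[int, int]
    # 'scal' type track references: IDs of the referenced tracks.
    scal_refs: List[int] = field(default_factory=list)


Sample = Union[bytes, ElSample]


def resolve_frame(media_data: List[Sample],
                  tracks: Dict[int, Track],
                  el_track_id: int,
                  time: int) -> Tuple[bytes, bytes]:
    """Return the (BL sub-frame, EL sub-frame) pair for one point in time."""
    el_track = tracks[el_track_id]
    el_sample = media_data[el_track.time_to_sample[time]]
    assert isinstance(el_sample, ElSample)
    # Step 1: read the track reference index from the extractor and look
    # up the referenced BL track in the 'scal' track references.
    bl_track_id = el_track.scal_refs[el_sample.extractor.track_ref_index - 1]
    # Step 2: find the temporally associated BL sample using the
    # time-to-sample table of the BL track.
    bl_sample = media_data[tracks[bl_track_id].time_to_sample[time]]
    assert isinstance(bl_sample, bytes)
    return bl_sample, el_sample.el_sub_frame


# Two frames at times 0 and 40; BL samples are stored once and only
# referenced from the EL samples.
media_data: List[Sample] = [
    b"BL0", ElSample(Extractor(1), b"EL0"),
    b"BL1", ElSample(Extractor(1), b"EL1"),
]
tracks = {
    1: Track(time_to_sample={0: 0, 40: 2}),                 # BL track
    2: Track(time_to_sample={0: 1, 40: 3}, scal_refs=[1]),  # EL track
}
frame = resolve_frame(media_data, tracks, el_track_id=2, time=40)
```

Note that `resolve_frame` must insert or read the extractor inside the EL sample, which presupposes that the EL sample is accessible in unencrypted form.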
In order to control the consumption of media content such as SVC encoded video content, Digital Rights Management (DRM) techniques may be employed. Generally, DRM techniques rely on encrypting the media content to control its consumption. For this reason, it has been investigated whether media content encryption would also be feasible to protect SVC encoded media content. In this context, it has been found that problems arise in case the decryption key is not available to (or not usable or processable by) the media content recipient at the time the encrypted media content is received.
Specifically, the receiving terminal device cannot store the encrypted EL sub-bitstream using the efficient strategy described above since the terminal device cannot (or at least not efficiently) add the extractors to the encrypted EL sub-frames to construct the EL samples. The problems are even more pronounced in case the BL and EL sub-bitstreams are encrypted using different encryption keys. In such a situation, the inefficient strategy cannot (or at least not efficiently) be employed either, since different decryption keys would be required to decrypt one EL sample.