High Efficiency Video Coding (HEVC) is a video coding standard being developed in Joint Collaborative Team-Video Coding (JCT-VC). JCT-VC is a collaborative project between Moving Picture Experts Group (MPEG) and International Telecommunication Union-Telecommunication Standardization Sector (ITU-T). Currently, an HEVC Model (HM) is defined that includes a number of tools and is considerably more efficient than H.264/Advanced Video Coding (AVC).
HEVC is a block-based hybrid video codec that uses both inter prediction (prediction from previously coded pictures) and intra prediction (prediction from previously coded pixels in the same picture). Each picture is divided into square treeblocks (corresponding to macroblocks in H.264/AVC) that can be of size 4×4, 8×8, 16×16, 32×32 or 64×64 pixels. A variable CtbSize denotes the size of a treeblock, expressed as the number of pixels along one dimension, i.e. 4, 8, 16, 32 or 64.
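As an illustration only, the relationship between the picture dimensions, CtbSize and the resulting grid of treeblocks can be sketched as follows. The function names are hypothetical, and the sketch assumes that a picture whose dimensions are not a multiple of CtbSize is covered by partially filled treeblocks at its right and bottom edges.

```python
def treeblock_grid(pic_width, pic_height, ctb_size):
    """Return (columns, rows) of treeblocks needed to cover the picture.

    Assumption: partially covered treeblocks at the right and bottom
    picture edges are counted as whole treeblocks.
    """
    if ctb_size not in (4, 8, 16, 32, 64):
        raise ValueError("CtbSize must be 4, 8, 16, 32 or 64")
    cols = -(-pic_width // ctb_size)   # integer ceiling division
    rows = -(-pic_height // ctb_size)
    return cols, rows


def treeblock_address(x, y, ctb_size, cols):
    """Raster-scan address of the treeblock that contains pixel (x, y)."""
    return (y // ctb_size) * cols + (x // ctb_size)


cols, rows = treeblock_grid(1920, 1080, 64)
print(cols, rows)                            # 30 17
print(treeblock_address(64, 64, 64, cols))   # 31
```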
Regular slices are similar to those in H.264/AVC. Each regular slice is encapsulated in its own Network Abstraction Layer (NAL) unit, and in-picture prediction (intra sample prediction, motion information prediction, coding mode prediction) and entropy coding dependency across slice boundaries are disabled. Thus a regular slice can be reconstructed independently from other regular slices within the same picture. Since the treeblock, which is a basic unit in HEVC, can be relatively large, e.g. 64×64, a concept of “fine granularity slices” is included in HEVC to allow for Maximum Transmission Unit (MTU) size matching through slice boundaries within a treeblock, as a special form of regular slices. The slice granularity is signaled in a picture parameter set, whereas the address of a fine granularity slice is still signaled in a slice header.
The regular slice is the only tool that can be used for parallelization in H.264/AVC. Parallelization implies that parts of a single picture can be encoded and decoded in parallel, as illustrated in FIG. 1, where slices are used for threaded decoding. Parallelization based on regular slices does not require much inter-processor or inter-core communication. However, for the same reason, regular slices can require some coding overhead due to the bit cost of the slice header and the lack of prediction across slice boundaries. Further, regular slices (in contrast to some of the other tools mentioned below) also serve as the key mechanism for bitstream partitioning to match MTU size requirements, because regular slices are independent within a picture and each regular slice is encapsulated in its own NAL unit. In many cases, the goal of parallelization and the goal of MTU size matching place contradictory demands on the slice layout in a picture. The realization of this situation led to the development of the parallelization tools mentioned below.
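The MTU size matching role of regular slices can be illustrated with a minimal sketch. It assumes that the coded size of each treeblock is already known and that every slice carries a fixed header cost; the function name, the greedy strategy and the byte counts are hypothetical and are not taken from the HEVC specification.

```python
def pack_into_slices(treeblock_bytes, mtu, slice_header_bytes):
    """Greedily group consecutive treeblocks into slices that fit the MTU.

    treeblock_bytes: coded size in bytes of each treeblock, in coding order.
    Returns a list of slices, each a list of treeblock indices.
    """
    slices, current, used = [], [], slice_header_bytes
    for index, size in enumerate(treeblock_bytes):
        if slice_header_bytes + size > mtu:
            raise ValueError("treeblock %d does not fit in one slice" % index)
        if current and used + size > mtu:
            # Close the current slice and start a new one (new header cost).
            slices.append(current)
            current, used = [], slice_header_bytes
        current.append(index)
        used += size
    if current:
        slices.append(current)
    return slices


print(pack_into_slices([400, 400, 400, 900], mtu=1000, slice_header_bytes=50))
# [[0, 1], [2], [3]]
```

A slice layout chosen this way follows the coded sizes of the treeblocks rather than an even split of the picture, which is one way to see why MTU size matching and parallelization can pull the slice layout in different directions.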
In wavefront parallel processing (WPP), the picture is partitioned into single rows of treeblocks. Entropy decoding and prediction are allowed to use data from treeblocks in other partitions. Parallel processing is possible through parallel decoding of rows of treeblocks, where the start of the decoding of each row is delayed by two treeblocks relative to the row above, so as to ensure that data related to the treeblock above and to the right of the subject treeblock is available before the subject treeblock is decoded. Using this staggered start (which appears like a wavefront when represented graphically, as illustrated in FIG. 2), parallelization is possible with up to as many processors/cores as the picture contains treeblock rows. Because in-picture prediction is permitted between neighboring treeblock rows within a picture, the inter-processor/inter-core communication required to enable in-picture prediction can be substantial. WPP partitioning does not produce additional NAL units compared to when it is not applied; thus, WPP cannot be used for MTU size matching. A wavefront segment contains exactly one line of treeblocks.
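The staggered start can be sketched with a simple timing model. The sketch assumes, for illustration only, that every treeblock takes one time step to decode and that one thread is available per treeblock row; the function names are hypothetical.

```python
def wpp_start_step(row, col, delay=2):
    """Earliest time step at which treeblock (row, col) can be decoded,
    when each row starts `delay` treeblocks after the row above it."""
    return col + delay * row


def wpp_schedule(rows, cols, delay=2):
    """Group treeblocks by the time step in which they are decoded."""
    steps = {}
    for r in range(rows):
        for c in range(cols):
            steps.setdefault(wpp_start_step(r, c, delay), []).append((r, c))
    return [steps[t] for t in sorted(steps)]


for t, blocks in enumerate(wpp_schedule(rows=2, cols=3)):
    print(t, blocks)
# 0 [(0, 0)]
# 1 [(0, 1)]
# 2 [(0, 2), (1, 0)]
# 3 [(1, 1)]
# 4 [(1, 2)]
```

With a delay of two treeblocks, the treeblock above and to the right, at (row - 1, col + 1), is always scheduled one step before the subject treeblock at (row, col), which matches the availability requirement described above.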
Tiles define horizontal and vertical boundaries that partition a picture into tile columns and rows. This implies that the tiles in HEVC divide a picture into areas with a defined width and height, as illustrated in FIG. 3. Each tile consists of an integer number of treeblocks that are processed in raster scan order. The tiles themselves are processed in raster scan order throughout the picture. The exact tile configuration or tile information (number of tiles, width and height of each tile, etc.) can be signaled in a sequence parameter set (SPS) and in a picture parameter set (PPS). The tile information contains the width, height and position of each tile in a picture. This means that if the coordinates of a block are known, the tile to which the block belongs is also known.
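The mapping from block coordinates to a tile can be sketched as follows. The sketch assumes that the tile information is available as a list of tile column widths and a list of tile row heights, both in units of treeblocks, and that tiles are numbered in raster scan order; the function and parameter names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def tile_of_treeblock(ctb_x, ctb_y, col_widths, row_heights):
    """Return the raster-scan index of the tile containing treeblock
    (ctb_x, ctb_y), given tile column widths and row heights in treeblocks."""
    col_ends = list(accumulate(col_widths))    # exclusive right edge of each tile column
    row_ends = list(accumulate(row_heights))   # exclusive bottom edge of each tile row
    if not (0 <= ctb_x < col_ends[-1] and 0 <= ctb_y < row_ends[-1]):
        raise ValueError("treeblock lies outside the picture")
    tile_col = bisect_right(col_ends, ctb_x)
    tile_row = bisect_right(row_ends, ctb_y)
    return tile_row * len(col_widths) + tile_col


# Picture of 30 x 17 treeblocks split into 3 tile columns and 2 tile rows.
print(tile_of_treeblock(10, 8, [10, 10, 10], [8, 9]))   # 4
```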
For simplicity, restrictions on the application of the different picture partitioning schemes are specified in HEVC. Tiles and WPP may not be applied at the same time. Furthermore, for each slice and tile, either or both of the following conditions must be fulfilled: 1) all coded treeblocks in a slice belong to the same tile; 2) all coded treeblocks in a tile belong to the same slice.
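One possible reading of the slice/tile restriction can be sketched as a check over a per-treeblock assignment. The sketch assumes that the restriction is evaluated for every slice and tile that share at least one treeblock, and that the partitioning is given as two lists mapping each treeblock to a slice identifier and a tile identifier; these data structures and the function name are hypothetical.

```python
def partitioning_allowed(slice_of, tile_of):
    """Check that, for every slice and tile sharing a treeblock, the slice
    lies within a single tile or the tile lies within a single slice."""
    tiles_in_slice, slices_in_tile = {}, {}
    for s, t in zip(slice_of, tile_of):
        tiles_in_slice.setdefault(s, set()).add(t)
        slices_in_tile.setdefault(t, set()).add(s)
    for s, t in set(zip(slice_of, tile_of)):
        slice_within_one_tile = len(tiles_in_slice[s]) == 1   # condition 1
        tile_within_one_slice = len(slices_in_tile[t]) == 1   # condition 2
        if not (slice_within_one_tile or tile_within_one_slice):
            return False
    return True


# Two slices inside one tile: allowed (condition 1).
print(partitioning_allowed([0, 0, 1, 1], [0, 0, 0, 0]))   # True
# A slice spanning two tiles, one of which is shared with another slice: not allowed.
print(partitioning_allowed([0, 0, 1, 1], [0, 1, 1, 2]))   # False
```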
The Sequence Parameter Set (SPS) holds information that is valid for an entire coded video sequence. Specifically, it holds the syntax elements profile_idc and level_idc, which are used to indicate which HEVC profile and HEVC level a bitstream conforms to. The HEVC profiles and the HEVC levels specify restrictions on bitstreams and hence limits on the capabilities needed to decode the bitstreams. The HEVC profiles and the HEVC levels may also be used to indicate interoperability points between individual decoder implementations. The HEVC level enforces restrictions on the bitstream, for example on the picture size (denoted MaxLumaFS, expressed in luma samples) and the sample rate (denoted MaxLumaPR, expressed in luma samples per second), as well as on the maximum bit rate (denoted MaxBR, expressed in bits per second) and the maximum coded picture buffer size (denoted Max CPB size, expressed in bits).
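How level limits of this kind constrain a bitstream can be sketched as a simple comparison. The numeric limits below are placeholders chosen for the example and are not values from the HEVC specification; the class and function names are hypothetical, and the coded picture buffer limit is left out of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelLimits:
    max_luma_fs: int   # MaxLumaFS, luma samples per picture
    max_luma_pr: int   # MaxLumaPR, luma samples per second
    max_br: int        # MaxBR, bits per second


def within_level(width, height, frame_rate, bit_rate, limits):
    """Return True if picture size, sample rate and bit rate respect the limits."""
    luma_fs = width * height
    luma_pr = luma_fs * frame_rate
    return (luma_fs <= limits.max_luma_fs
            and luma_pr <= limits.max_luma_pr
            and bit_rate <= limits.max_br)


# Placeholder limits, NOT taken from any HEVC level table.
example_level = LevelLimits(max_luma_fs=2_100_000,
                            max_luma_pr=63_000_000,
                            max_br=10_000_000)

print(within_level(1920, 1080, 30, 8_000_000, example_level))   # True
print(within_level(1920, 1080, 60, 8_000_000, example_level))   # False
```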
The Picture Parameter Set (PPS) holds information that is valid for some (or all) pictures in a coded video sequence. The syntax element tiles_or_entropy_coding_sync_idc controls the usage of wavefronts and tiles, and it is required to have the same value in all PPSs that are active in the same coded video sequence.
Moreover, both HEVC and H.264 define a video usability information (VUI) syntax structure that can be present in a sequence parameter set and that contains parameters that do not affect the decoding process, i.e. do not affect the pixel values. Supplemental Enhancement Information (SEI) is another structure that can be present in any access unit and that contains information that does not affect the decoding process.
Hence, as mentioned above, compared to H.264/AVC, HEVC provides better possibilities for parallelization. Specifically tiles and WPP are tools developed for parallelization purposes. Both were originally designed for encoder parallelization but they may also be used for decoder parallelization.
When tiles are used for encoder parallelism, the encoder first chooses a tile partitioning. Since tile boundaries break all predictions between the tiles, the encoder can assign the encoding of multiple tiles to multiple threads. As soon as there are at least two tiles, multi-threaded encoding is possible.
Accordingly, in this context, the fact that a number of threads can be used implies that the actual workload of the encoding/decoding process can be divided into separate “processes” that are performed independently of each other, i.e. in parallel in separate threads, as shown in FIG. 3.
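The division of the workload into independent per-tile tasks can be sketched with a thread pool. The function encode_tile below is a stand-in, not a real encoder, and all names are hypothetical; the sketch only shows that tiles can be handed to separate threads and that the results can be concatenated in tile order afterwards.

```python
from concurrent.futures import ThreadPoolExecutor


def encode_tile(tile_samples):
    """Stand-in for a real tile encoder (hypothetical): packs samples into bytes."""
    return bytes(tile_samples)


def encode_picture_in_parallel(tiles, max_workers=4):
    """Encode each tile in its own task and join the results in tile order.

    Returns the combined bitstream and the byte size of each tile substream.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        substreams = list(pool.map(encode_tile, tiles))   # map() preserves input order
    return b"".join(substreams), [len(s) for s in substreams]


bitstream, sizes = encode_picture_in_parallel([[1, 2], [3], [4, 5, 6]])
print(bitstream, sizes)   # b'\x01\x02\x03\x04\x05\x06' [2, 1, 3]
```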
HEVC defines two types of entry points for parallel decoding. Entry points can be used by a decoder to find the position in the bitstream where the bits for a tile or substream start. The first type is entry point offsets. These are listed in the slice header and indicate the starting points of one or more tiles that are contained in the slice. The second type is entry point markers, which separate tiles in the bitstream. An entry point marker is a specific codeword (start code) that cannot occur anywhere else in the bitstream.
Thus, for decoder parallelism to work, there need to be entry points in the bitstream. For parallel encoding, entry points are not needed; the encoder can simply stitch the bitstream together after the encoding of the tiles/substreams is complete. However, the decoder needs to know where each tile starts in the bitstream in order to decode in parallel. If an encoder only wants to encode in parallel but does not want to enable parallel decoding, it can omit the entry points, but if it also wants to enable parallel decoding, it must insert entry points.
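The use of entry point offsets by a decoder can be sketched as splitting a slice payload into substreams. The sketch assumes, for illustration only, that each offset gives the byte length of the preceding substream and that the last substream runs to the end of the payload; this convention and the function name are assumptions and are not taken from the HEVC syntax.

```python
def split_by_entry_points(payload, entry_point_offsets):
    """Split a slice payload into substreams using entry point offsets.

    Assumed convention: offset k is the byte length of substream k, and the
    final substream extends to the end of the payload.
    """
    starts = [0]
    for offset in entry_point_offsets:
        starts.append(starts[-1] + offset)
    if starts[-1] > len(payload):
        raise ValueError("entry point lies beyond the end of the payload")
    ends = starts[1:] + [len(payload)]
    return [payload[s:e] for s, e in zip(starts, ends)]


print(split_by_entry_points(b"abcdef", [2, 1]))   # [b'ab', b'c', b'def']
```

Each returned substream can then be handed to a separate decoding thread, since its starting position is known without parsing the preceding substreams.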
There are different ways of establishing a multimedia session that includes HEVC video.
Dynamic Adaptive Streaming over HTTP (DASH) is an adaptive bitrate streaming technology in which a multimedia file is partitioned into one or more segments and delivered to a client using HTTP. A media presentation description (MPD) describes segment information (timing, URL, and media characteristics such as video resolution and bit rates). Segments can contain any media data; however, the specification provides specific guidance and formats for use with two types of containers: MPEG-4 file format and MPEG-2 Transport Stream.
DASH is audio/video codec agnostic. One or more representations (i.e., versions at different resolutions or bit rates) of multimedia files are typically available, and selection can be made based on network conditions, device capabilities and user preferences, enabling adaptive bitrate streaming.
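The selection among representations can be sketched as follows. The dictionary keys, the function name and the selection policy (highest bandwidth that fits, falling back to the lowest bandwidth when nothing fits) are assumptions made for the example; DASH itself does not mandate a particular client selection algorithm.

```python
def select_representation(representations, available_bps, max_height=None):
    """Pick the highest-bandwidth representation that fits the measured
    throughput and, optionally, the device's maximum video height."""
    candidates = [
        r for r in representations
        if r["bandwidth"] <= available_bps
        and (max_height is None or r["height"] <= max_height)
    ]
    if not candidates:
        # Nothing fits: fall back to the cheapest representation.
        return min(representations, key=lambda r: r["bandwidth"])
    return max(candidates, key=lambda r: r["bandwidth"])


reps = [
    {"id": "low", "bandwidth": 500_000, "height": 360},
    {"id": "mid", "bandwidth": 2_000_000, "height": 720},
    {"id": "high", "bandwidth": 6_000_000, "height": 1080},
]
print(select_representation(reps, available_bps=3_000_000)["id"])   # mid
```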
“Offer/Answer Model with the Session Description Protocol (SDP)” defines a mechanism by which two entities can make use of the Session Description Protocol (SDP) to arrive at a common view of a multimedia session between them. In the model, one participant offers the other a description of the desired session from their perspective, and the other participant answers with the desired session from their perspective. This offer/answer model is most useful in unicast sessions where information from both participants is needed for the complete view of the session. The offer/answer model is used by protocols like the Session Initiation Protocol (SIP).
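The exchange can be sketched in a highly simplified form in which each description is reduced to a list of media format names. The function name and the format strings are hypothetical, and real SDP descriptions carry considerably more information than this sketch models.

```python
def build_answer(offered_formats, supported_formats):
    """Return the offered formats that the answerer also supports,
    keeping the offerer's order. An empty list means no common format."""
    supported = set(supported_formats)
    return [fmt for fmt in offered_formats if fmt in supported]


offer = ["H265", "H264", "VP8"]             # offerer's view of the session
answer = build_answer(offer, ["H264", "H265"])
print(answer)                               # ['H265', 'H264']
```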