Digital video capabilities can be incorporated into a wide range of devices, including digital televisions, digital direct broadcast systems, wireless broadcast systems, personal digital assistants (PDAs), laptop or desktop computers, video cameras, digital recording devices, video gaming devices, video game consoles, cellular or satellite radio telephones, and the like. Digital video devices may implement video compression techniques, such as those described in standards like MPEG-2 and MPEG-4, both available from the International Organization for Standardization (“ISO”), 1, ch. de la Voie-Creuse, Case postale 56, CH-1211 Geneva 20, Switzerland, or www.iso.org, or ITU-T H.264/MPEG-4, Part 10, Advanced Video Coding (“AVC”), available from the International Telecommunication Union (“ITU”), Place des Nations, CH-1211 Geneva 20, Switzerland, or www.itu.int, each of which is incorporated herein by reference in its entirety, or according to other standard or non-standard specifications, to encode and/or decode digital video information efficiently. Still other compression techniques may be developed in the future or are presently under development. For example, a new video compression standard known as HEVC/H.265 is under development in the JCT-VC committee. The HEVC/H.265 working draft is set out in Wiegand et al., “WD3: Working Draft 3 of High-Efficiency Video Coding, JCTVC-E603”, March 2011, henceforth referred to as “WD3” and incorporated herein by reference in its entirety.
A video encoder can receive uncoded video information for processing in any suitable format, which may be a digital format conforming to ITU-R BT.601 (available from the International Telecommunication Union, Place des Nations, CH-1211 Geneva 20, Switzerland, www.itu.int, and which is incorporated herein by reference in its entirety) or some other digital format. The uncoded video may be organized both spatially, into pixel values arranged in one or more two-dimensional matrices, and temporally, into a series of uncoded pictures, with each uncoded picture comprising one or more of the above-mentioned two-dimensional matrices of pixel values. Further, each pixel may comprise a number of separate components used to represent color in digital format. One common format for uncoded video that is input to a video encoder has, for each group of four pixels, four luminance samples, which contain information regarding the brightness/lightness or darkness of the pixels, and two chrominance samples, which contain color information (e.g., YCrCb 4:2:0).
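As a rough illustration of the 4:2:0 layout described above, the following sketch counts the samples in one uncoded picture. The function name `samples_420`, the restriction to even picture dimensions, and the assumption of 8 bits per sample are illustrative only and are not taken from any standard.

```python
def samples_420(width, height):
    """Return (luma, chroma) sample counts for one 4:2:0 picture.

    Assumes even width and height (a simplification for this sketch).
    """
    if width % 2 or height % 2:
        raise ValueError("width and height must be even for this sketch")
    luma = width * height                        # one luminance sample per pixel
    chroma = 2 * (width // 2) * (height // 2)    # two chrominance planes, each subsampled 2x2
    return luma, chroma


luma, chroma = samples_420(1280, 720)
# Four luminance and two chrominance samples per group of four pixels,
# i.e. 1.5 samples per pixel on average.
bytes_per_picture = luma + chroma  # assuming 8 bits per sample
```

Under these assumptions, every group of four pixels contributes six samples, so the uncoded picture occupies 1.5 bytes per pixel.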
One function of video encoders is to translate (more generally “transform”) uncoded pictures into a bitstream, packet stream, NAL unit stream, or other suitable transmission format (all referred to as “bitstream” henceforth), with goals such as reducing the amount of redundancy encoded into the bitstream to thereby increase transmission rates, increasing the resilience of the bitstream to suppress bit errors or packet erasures that may occur during transmission (collectively known as “error resilience”), or other application-specific goals. Embodiments of the present invention provide for at least one of the removal or reduction of redundancy, the increase in error resilience, and implementability of video encoders and/or associated decoders in parallel processing architectures.
One function of a video decoder is to receive as its input coded video in the form of a bitstream that may have been produced by a video encoder conforming to the same video compression standard. The video decoder then translates (more generally “transforms”) the received coded bitstream into uncoded video information that may be displayed, stored, or otherwise handled.
Both video encoders and video decoders may be implemented using hardware and/or software configurations, including combinations of both hardware and software. Implementations of either or both may include the use of programmable hardware components such as general purpose central processing units (CPUs), such as found in personal computers (PCs), embedded processors, graphic card processors, digital signal processors (DSPs), field programmable gate arrays (FPGAs), or others. To implement at least parts of the video encoding or decoding, instructions may be needed, and those instructions may be stored and distributed using one or more non-transitory computer readable media. Computer readable media choices include compact disc read-only memory (CD-ROM), digital videodisc read-only memory (DVD-ROM), memory stick, embedded ROM, or others.
Video compression and decompression refer to certain operations performed in a video encoder and/or decoder. A video decoder may perform all, or a subset of, the inverse operations of the encoding operations. Unless otherwise noted, techniques of video encoding described herein are intended also to encompass the inverse of the described video encoding techniques (namely associated video decoding techniques).
The uncompressed, digital representation of video can be viewed as a sample stream, wherein the samples can be processed by the video display in scan order. One type of boundary often occurring in this sample stream is the boundary between pictures in the sample stream. Many video compression standards recognize this boundary and often divide the coded bitstream at these boundaries, for example, through the insertion of a picture header or other metadata at the beginning of each uncoded picture. Other boundaries that may occur in the sample stream include slice and tile boundaries, which may occur within an uncoded picture, as described below.
Prediction in video coding can occur at many levels.
One level is referred to henceforth as the “entropy coding level” and the prediction at that level is referred to as “encoding prediction”. At this level, the decoding of an entropy coded symbol may require the successful decoding of previous entropy coded symbols. All or nearly all current video compression standards break the encoding prediction at both the picture and the slice level. That is, at the detection of a picture or slice header in the bitstream (or equivalent), the entropy coding related states used in the entropy coding are reset to an initialization state. One example of breaking encoding prediction is the reset of CABAC states in ITU-T Rec. H.264.
Further, there can be coding mechanisms that do not fall into the common understanding of entropy coding related prediction, as defined above, but which are still related to the reconstruction control information associated with the bitstream, rather than pixel values. As an example, even some older standards such as the ITU-T Rec. H.261 standard allow coding of motion vectors as relative to one or more previously coded motion vectors. The detection of a group-of-blocks (GOB), slice or picture header resets this prediction vector to (0, 0).
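The reset of the motion vector predictor described above can be sketched as follows. This is a deliberately simplified model, not the H.261 decoding process: the input format (a flag marking the first unit after a header, plus a coded difference) and the function name `reconstruct_motion_vectors` are assumptions made for illustration.

```python
def reconstruct_motion_vectors(units):
    """Rebuild absolute motion vectors from coded differences.

    units: list of (starts_segment, (dx, dy)) tuples, where starts_segment
    is True for the first unit after a GOB, slice, or picture header
    (hypothetical input format).
    """
    predictor = (0, 0)
    vectors = []
    for starts_segment, (dx, dy) in units:
        if starts_segment:
            predictor = (0, 0)      # header detected: prediction vector is reset
        mv = (predictor[0] + dx, predictor[1] + dy)
        vectors.append(mv)
        predictor = mv              # the next difference is relative to this vector
    return vectors


coded = [(True, (2, 1)), (False, (1, 0)), (True, (-1, 3)), (False, (0, -1))]
vectors = reconstruct_motion_vectors(coded)
```

Because the predictor is reset at the third unit, the vectors of the second segment can be reconstructed even if the first segment is lost.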
There are also prediction mechanisms that span multiple pictures. For example, motion compensation can use (possibly motion compensated) pixel values from one or more reference pictures for prediction. This type of prediction is broken through the macroblock type (or equivalent). For example, intra macroblocks do not generally use prediction from reference pictures, whereas inter macroblocks may. Intra and Inter slices, in this sense, are simply accumulations of macroblocks belonging to those different macroblock types.
There are also prediction levels that include prediction based on pixel values that have already been reconstructed during the reconstruction process of the picture being encoded. One example is intra prediction mechanisms, such as the ones described in Annex I of ITU-T Rec. H.263. (Similar mechanisms are available in other video coding standards as well.)
In addition to prediction mechanisms, several video coding standards specify filters for performing in-loop filtering. One example is the in-loop filter specified in Annex J of ITU-T Rec. H.263.
For some applications, it may be advantageous to segment the picture being encoded into smaller data blocks, which segmenting can occur prior to, or during, the encoding. Two use cases for which picture segmentation may be advantageous are described below.
The first such use case involves parallel processing. Previously, standard definition video (e.g., 720×480 or 720×576 pixels) was the largest format in widespread commercial use. More recently, HD (up to 1920×1080 pixels) formats as well as 4k (4096×2048 pixels), 8k (8192×4096 pixels), and still larger formats are emerging and finding use in a variety of application spaces. Despite the increase in affordable computing power over the years, as a result of the very large picture sizes associated with some of these newer and larger formats, it is often advantageous to apply parallel processing to the encoding and decoding processes. Parallel encoding and decoding may occur, for example, at the instruction level (e.g., using SIMD), in a pipeline where several video coding units may be processed at different stages simultaneously, or on a large structure basis where collections of video coding sub units are processed by separate computing engines as separate entities (e.g., a multi-core general purpose processor). The last form of parallel processing can require picture segmentation.
The second such use case involves picture segmentation so as to create a bitstream suitable for efficient transport over packet networks. Codecs whose coded video is transported over IP and other packet networks can be subject to a maximum transmission unit (“MTU”) size constraint. It is sometimes advantageous for the coded slice size to be such that the resulting packet containing the coded slice is as close to the MTU size as possible without exceeding that size, so as to keep the payload/packetization overhead ratio high, while avoiding fragmentation (and the resulting higher loss probability) by the network.
The MTU size differs widely from network to network. For example, the MTU size of many Internet connections may be set by the smallest MTU size of network infrastructure commonly used on the Internet, which often corresponds to limitations in Ethernet and may be roughly 1500 bytes.
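The MTU matching described above can be sketched as a greedy grouping of consecutive macroblocks into slices. This is a minimal sketch under stated assumptions: the coded size of each macroblock is taken as known in advance, the 40-byte per-packet overhead is a hypothetical value, and `pack_slices` is an illustrative name rather than part of any standard or library.

```python
def pack_slices(mb_sizes, mtu=1500, overhead=40):
    """Greedily group consecutive macroblocks into slices that fit one packet.

    mb_sizes: coded size of each macroblock in bytes (assumed known).
    overhead: per-packet header bytes (hypothetical value).
    Returns a list of slices, each a list of macroblock indices.
    """
    budget = mtu - overhead
    slices, current, used = [], [], 0
    for index, size in enumerate(mb_sizes):
        if current and used + size > budget:
            slices.append(current)      # close the slice before it would overflow
            current, used = [], 0
        current.append(index)           # an oversized macroblock still gets its own slice
        used += size
    if current:
        slices.append(current)
    return slices


slices = pack_slices([600, 500, 400, 900, 300, 1460, 100])
```

Each slice is filled as close to the payload budget as the macroblock sizes allow, which keeps the ratio of payload to packetization overhead high without requiring fragmentation by the network.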
The number of bits in a coded picture depends on many factors such as the source picture's dimensions, the desired quality, the complexity of the content in terms of suitability for prediction, and other factors. However, even at moderate quality settings and content complexity, for sequences of HD resolution and above, the size of an average coded picture easily exceeds the MTU size. For example, a video conferencing encoder can require about 2 Mbit/s to encode a 720p60 video sequence. This results in an average coded picture size of roughly 33,333 bits or 4,167 bytes, which is considerably more than the 1500 bytes of the Internet's MTU size. At higher resolutions, the average picture size increases to values significantly above the Internet's MTU size. Assuming a similar compression ratio as in the 720p60 example above, a 4096×2048 (4k) video at 60 fps (4kp60) may require over 300,000 bits, or 25 MTU-sized packets, for each coded video picture.
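The arithmetic behind these figures can be reproduced with the short sketch below. It assumes that 720p means 1280×720 pixels, that the coded picture size scales linearly with pixel count at a constant compression ratio, and that packet header overhead is ignored; the variable names are illustrative.

```python
MTU_BYTES = 1500  # approximate Internet MTU size used in the text


def average_picture_bits(bitrate_bps, fps):
    """Average coded picture size in bits for a given bit rate and frame rate."""
    return bitrate_bps / fps


bits_720p60 = average_picture_bits(2_000_000, 60)   # about 33,333 bits
bytes_720p60 = bits_720p60 / 8                      # about 4,167 bytes

# Scale by pixel count, assuming the same compression ratio at 4096x2048.
scale = (4096 * 2048) / (1280 * 720)
bits_4kp60 = bits_720p60 * scale                    # a little over 300,000 bits
packets_4kp60 = bits_4kp60 / 8 / MTU_BYTES          # a little over 25 MTU-sized payloads
```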
In many previous video coding standards (for example, up to and including WD3), a picture segment (or, at least, one form of a picture segment) is known as a “slice”. In the following description, any kind of (e.g., video coding based) picture fragmentation that breaks at least one form of in-picture prediction, in-loop filtering, or other coding mechanism, may be referred to generally as a “slice”. As such, structures such as the Group Of Blocks (“GOB”) in ITU-T Rec. H.261 or ITU-T Rec. H.263 (available from the ITU; see above for H.264), and slices in H.264 or the MPEG family of standards, may each constitute a “slice” as this term is used herein throughout. However, fragmentation units of RFC3984 or data partitions of H.264 may not constitute a “slice”, as this term is used herein throughout, because they subdivide the bitstream of a coded picture and do not break in-picture prediction, in-loop filtering or another coding mechanism.
Referring to FIG. 1, shown is an example 100 of picture segmentation using slices. A picture 101 is broken into two scan order slices 102, 103. The slice boundary is shown as a boldface line 104. The first macroblock 105 of the second slice 103 has address 11. The corresponding bitstream 106 for transmitting the picture 101, for example, when generated using the H.264 standard, can contain one or more parameter sets 107 that do not contain information about the slice boundaries, followed by the slice headers 108, 110 and slice data 109, 111 of the two slices 102, 103. The slice header 110 of the second slice 103 is shown enlarged. The dimensions of the uncoded slice 103, for example, are determined by a decoder from a combination of at least two factors. First, the slice header 110 contains the address of the first macroblock 105 of slice 103. Second, the end of the slice is determined, for example, by the detection of a new slice header in the bitstream or, in the depicted example, by the end of the coded picture in the bitstream 112, i.e., after macroblock 24. All macroblocks between the first macroblock and the end of the slice make up the slice. It is noted that scan order modifications, such as Flexible Macroblock Ordering of H.264, can change the number of macroblocks in the slice by creating gaps.
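The two factors described above can be combined in a short sketch that derives the extent of each slice from the first-macroblock addresses carried in the slice headers. The sketch assumes plain scan order (no Flexible Macroblock Ordering), and the function name `slice_extents` is illustrative. The start address of the first slice (1) is also an assumption, since the text does not state it.

```python
def slice_extents(first_mb_addresses, last_mb_in_picture):
    """Map each slice to its (first, last) macroblock address, inclusive.

    first_mb_addresses: address carried in each slice header, in bitstream order.
    last_mb_in_picture: address of the final macroblock of the coded picture.
    Assumes plain scan order (no Flexible Macroblock Ordering).
    """
    extents = []
    for i, first in enumerate(first_mb_addresses):
        if i + 1 < len(first_mb_addresses):
            last = first_mb_addresses[i + 1] - 1   # ends where the next slice header begins
        else:
            last = last_mb_in_picture              # ends with the coded picture
        extents.append((first, last))
    return extents


# FIG. 1 example: the second slice starts at macroblock 11 and the picture
# ends after macroblock 24. The first slice's start address is assumed to be 1.
extents = slice_extents([1, 11], 24)
```

Note that the parameter sets play no role in this derivation, consistent with the statement that they do not contain information about the slice boundaries.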
One advantage of using slices over media-unaware segmentation mechanisms, such as, for example, those provided by IP at the routing layer, is that slices are at least to a certain extent independently decodeable (as discussed below in more detail), by breaking certain types of prediction at the boundaries between slices. The loss of one slice therefore does not necessarily render the other slices of a coded picture unusable or un-decodeable. Depending on the implementation of a fragmentation mechanism, the loss of a fragment, in contrast, may well render many other fragments unusable because fragmentation, as this term is used herein throughout, does not break any form(s) of prediction.
WD4 (B. Bross et al., “WD4: Working Draft 4 of High-Efficiency Video Coding”, available from http://wftp3.itu.int/av-arch/jctvc-site/2011_07_F_Torino/) is a draft specification relating to a digital video coding standard in development, which may be referred to as High Efficiency Video Coding (HEVC) or H.265. In addition to slices, WD4 also includes a picture segmentation mechanism known as “Tiles”. According to WD4, a source picture can be divided into rectangular units called tiles, such that each pixel of the source picture is part of a tile (other constraints may also apply). A tile is, therefore, a rectangular part of a picture. Tile boundaries are determined by coordinates available in high-level syntax structures, which are known in WD4 as parameter sets. Tiles are described in more detail below.
With the possible exception of inter picture prediction, each of the in-picture prediction mechanisms or coding mechanisms described above may be broken by the decoding of a picture header (or equivalent, such as the decoding of a slice with a frame number different from the previous slice). Whether those prediction mechanisms are broken across slice or tile boundaries depends on the video compression standard, and the type of slice in use.
In H.264, slices may be independently decodeable with respect to motion vector prediction, intra prediction, CA-VLC and CABAC states, and other aspects of the H.264 standard. Only inter picture prediction (including import of pixel data outside of the slice boundaries through motion compensation) is allowed across slice boundaries. While this decoding independence increases error resilience, disallowing the aforementioned prediction across slice boundaries reduces coding efficiency.
In H.263, a video encoder has more flexibility in selecting which prediction mechanisms are broken through the use of slices or GOBs with non-empty GOB headers. For example, there is a bit included in the picture header, selectable when Annex R is in use, which signals to the decoder that no prediction or filtering at all occurs across slice/GOB (with non-empty headers) boundaries. Certain prediction mechanisms, such as motion vector prediction, are broken across GOBs with non-empty headers and across slice boundaries, regardless of the state of Annex R. Others are controlled by Annex R. For example, if the bit is not set, motion vectors may point outside the spatial area co-located with the current slice/GOB with non-empty header in the reference picture(s), thereby potentially “importing” sample values that are used for motion compensation into the current slice from an area that is not inside of the geometric area of the slice/GOB in the reference picture. Further, unless Annex R is active, loop filtering may incorporate sample values outside of the slice/GOB. Similarly, there is another bit in the picture header that enables or disables intra prediction.
However, in most standards, the decision to break in-picture prediction is made at least at picture granularity, and in some cases at sequence granularity. In other words, using H.263 as an example, it is not possible to mix, in a given picture, slices that have the deblocking filter enabled with slices that have it disabled, nor is it possible to enable or disable intra prediction at the slice level.
As already described, picture segmentation allows breaking a picture into spatial areas smaller than a whole picture. While the most common applications for picture segmentation, as described, appear to be MTU size matching and parallelization, picture segmentation can also be used for many other purposes, including those that adapt the segment size and shape to the content. Region of interest coding is one of several examples. In such cases, it is possible that certain parts of a picture can be more efficiently coded than others (in the sense that spending a lower number of bits for encoding yields a comparable visual experience) when different coding tools, including different prediction mechanisms, are applied. For example, some content may benefit from deblocking filtering and may not respond well to intra prediction, whereas other content in the same picture may better be coded without deblocking filtering, but could benefit from intra prediction. A third type of content may best be coded with both deblocking filtering and intra prediction enabled. All of this content can be located in the same picture when the picture is tiled, which occurs, for example, in interview situations, or in video conferencing.
One shortcoming of the existing mechanisms for prediction breaking at segment boundaries is that the enablement and/or disablement of the prediction breaking is generally hard-coded into the existing video coding standards, thereby making it difficult or impossible to selectively break prediction mechanisms at segment boundaries based, for example, on the characteristics of the content to be encoded.
A need therefore exists for an improved method and system to enable or disable, on a per slice basis, prediction and in-loop filtering mechanisms individually, or as a group. Accordingly, a solution that addresses, at least in part, the above and other shortcomings is desired.
Further, a need exists to enable or disable, on a per picture (or group of pictures, sequence, etc.) basis, prediction mechanisms and/or in-loop filtering mechanisms across header-less (or equivalent) picture segment boundaries (such as tile boundaries), individually or as a group. Accordingly, a solution that addresses, at least in part, the above and other shortcomings is desired.