Digital video capabilities can be incorporated into a wide range of devices, including digital televisions, digital direct broadcast systems, wireless broadcast systems, personal digital assistants (PDAs), laptop or desktop computers, video cameras, digital recording devices, video gaming devices, video game consoles, cellular or satellite radio telephones, and the like. Digital video devices may implement video compression techniques, such as those described in standards like MPEG-2 and MPEG-4, both available from the International Organization for Standardization (“ISO”), 1, ch. de la Voie-Creuse, Case postale 56, CH-1211 Geneva 20, Switzerland, or www.iso.org, or ITU-T H.264/MPEG-4, Part 10, Advanced Video Coding (“AVC”), available from the International Telecommunication Union (“ITU”), Place des Nations, CH-1211 Geneva 20, Switzerland, or www.itu.int, each of which is incorporated herein by reference in its entirety, or according to other standard or non-standard specifications, to encode and/or decode digital video information efficiently.
A video encoder can receive uncoded video information for processing in any suitable format, which may be a digital format conforming to ITU-R BT.601 (available from the International Telecommunication Union, Place des Nations, 1211 Geneva 20, Switzerland, www.itu.int, and which is incorporated herein by reference in its entirety) or in some other digital format. The uncoded video may be organized both spatially into pixel values arranged in one or more two-dimensional matrices and temporally into a series of uncoded pictures, with each uncoded picture comprising one or more of the above-mentioned two-dimensional matrices of pixel values. Further, each pixel may comprise a number of separate components used to represent color in digital format. One common format for uncoded video that is input to a video encoder has, for each group of four pixels, four luminance samples which contain information regarding the brightness/lightness or darkness of the pixels, and two chrominance samples which contain color information (e.g., YCrCb 4:2:0).
One function of video encoders is to translate (more generally “transform”) uncoded pictures into a bitstream, packet stream, NAL unit stream, or other suitable transmission format (all referred to as “bitstream” henceforth), with goals such as reducing the amount of redundancy encoded into the bitstream to thereby increase transmission rates, increasing the resilience of the bitstream to suppress bit errors or packet erasures that may occur during transmission (collectively known as “error resilience”), or other application-specific goals. Embodiments of the present invention provide for at least one of the removal or reduction of redundancy, the increase in error resilience, and implementability of video encoders and/or associated decoders in parallel processing architectures.
One function of a video decoder is to receive as its input coded video in the form of a bitstream that may have been produced by a video encoder conforming to the same video compression standard. The video decoder then translates (more generally “transforms”) the received coded bitstream into uncoded video information that may be displayed, stored, or otherwise handled.
Both video encoders and video decoders may be implemented using hardware and/or software configurations, including combinations of both hardware and software. Implementations of either or both may include the use of programmable hardware components such as general purpose central processing units (CPUs), such as those found in personal computers (PCs), embedded processors, graphic card processors, digital signal processors (DSPs), field programmable gate arrays (FPGAs), or others. To implement at least parts of the video encoding or decoding, instructions may be needed, and those instructions may be stored and distributed using one or more non-transitory computer readable media. Computer readable media choices include compact disc read-only memory (CD-ROM), digital videodisc read-only memory (DVD-ROM), memory stick, embedded ROM, or others.
In the following, certain systems, methods and/or aspects relating in at least one broad aspect to video compression and decompression, i.e., the operations performed in a video encoder and/or decoder, will be described. A video decoder may perform all, or a subset of, the inverse operations of the encoding operations. Unless otherwise noted, techniques of video encoding described herein are intended also to encompass the inverse of the described video encoding techniques (namely associated video decoding).
The uncompressed, digital representation of video can be viewed as a sample stream, wherein the samples can be processed by the video display in scan order. One type of boundary often occurring in this sample stream is the boundary between pictures in the sample stream. Many video compression standards recognize this boundary and often divide the coded bitstream at these boundaries, for example through the insertion of a picture header or other metadata at the beginning of each coded picture.
For some applications, it may be advantageous to segment the coded picture into smaller data blocks, which segmenting can occur prior to, or during, the encoding. Two use cases for which picture segmentation may be advantageous are described below.
The first such use case involves parallel processing. Previously, standard definition video (e.g., 720×480 or 720×576 pixels) was the largest format in widespread commercial use. More recently, HD (up to 1920×1080 pixels) formats as well as 4k (4096×2048 pixels), 8k (8192×4096 pixels), and still larger formats are emerging and finding use in a variety of application spaces. Despite the increase in affordable computing power over the years, as a result of the very large picture sizes associated with some of these newer and larger formats, it is often advantageous to apply parallel processing to the encoding and decoding processes. Parallel encoding and decoding may occur at the instruction level (e.g., using SIMD), in a pipeline where several video coding units may be processed at different stages simultaneously, or on a large structure basis where collections of video coding sub units are processed by separate computing engines as separate entities (e.g., a multi-core general purpose processor). The last form of parallel processing requires picture segmentation.
The second such use case involves picture segmentation so as to create a bitstream suitable for efficient transport over packet networks. Codecs whose coded video is transported over IP or other packet network protocols can be subject to a maximum transmission unit (“MTU”) size constraint. It is sometimes advantageous for the coded slice size to be such that the resulting packet containing the coded slice is as close to the MTU size as possible without exceeding that size, so as to keep the payload/packetization overhead ratio high, while avoiding fragmentation (and the resulting higher loss probability) by the network.
The MTU size differs widely from network to network. For example, the MTU size of many Internet connections may be set by the smallest MTU size of network infrastructure commonly used on the Internet, which often corresponds to limitations in Ethernet and may be roughly 1500 bytes.
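The MTU matching described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the coded size of each macroblock is known in advance and does not change with slice membership (in practice, starting a new slice breaks prediction and alters those sizes). The function name `pack_slices`, the 60-byte header overhead, and the sample bit counts are hypothetical and are not taken from any standard.

```python
def pack_slices(mb_bits, mtu_bytes=1500, overhead_bytes=60):
    """Greedily group macroblocks (in scan order) into slices so that each
    slice, plus an assumed per-packet header overhead, fits within the MTU.

    mb_bits        -- hypothetical coded size of each macroblock, in bits
    mtu_bytes      -- MTU size in bytes (about 1500 for Ethernet-limited paths)
    overhead_bytes -- assumed transport plus slice header overhead per packet
    """
    budget_bits = (mtu_bytes - overhead_bytes) * 8
    slices = []        # each slice is a list of macroblock indices
    current = []       # macroblocks collected for the slice being built
    used_bits = 0
    for index, bits in enumerate(mb_bits):
        if bits > budget_bits:
            raise ValueError("a single macroblock exceeds the packet budget")
        # Close the current slice if adding this macroblock would overflow.
        if current and used_bits + bits > budget_bits:
            slices.append(current)
            current, used_bits = [], 0
        current.append(index)
        used_bits += bits
    if current:
        slices.append(current)
    return slices


# Illustrative input only: five macroblocks with made-up coded sizes.
example_slices = pack_slices([4000, 5000, 3000, 6000, 2000])
print(example_slices)
```

With the made-up sizes above, the budget is (1500 - 60) × 8 = 11,520 bits, so the first two macroblocks form one slice and the remaining three form a second slice. Each packet is filled as far as possible without exceeding the MTU.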
The number of bits in a coded picture depends on many factors such as the source picture's dimensions, the desired quality, the complexity of the content in terms of suitability for prediction, the coding efficiency of the video coding standard, and other factors. However, even at moderate quality settings and content complexity, for sequences of HD resolution and above, the size of an average coded picture easily exceeds the MTU size. For example, a video conferencing encoder can require about 2 Mbits/sec to encode a 720p60 video sequence. This results in an average coded picture size of roughly 33000 bits or 4125 bytes, which is considerably more than the approximately 1500 bytes of the Internet's MTU size. At higher resolutions, the average picture size increases to values significantly above the Internet's MTU size. Assuming a similar compression ratio as in the 720p60 example above, a 4096×2048 (4k) video at 60 fps (4kp60) may require over 300,000 bits, or 25 MTU-sized packets for each coded video picture.
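The arithmetic in the preceding paragraph can be reproduced with a short sketch. It assumes that 720p denotes 1280×720 pixels, that the bits per picture scale linearly with the pixel count (the "similar compression ratio" assumption stated above), and that packet header overhead is ignored. The function and variable names are illustrative only.

```python
def average_picture_bits(bitrate_bps, frames_per_second):
    """Average coded picture size in bits for a given bitrate and frame rate."""
    return bitrate_bps / frames_per_second


MTU_BYTES = 1500  # approximate Internet MTU size cited in the text

# 720p60 at about 2 Mbit/s
bits_720p60 = average_picture_bits(2_000_000, 60)
bytes_720p60 = bits_720p60 / 8

# Scale to 4096x2048 at 60 fps, assuming bits grow linearly with pixel count.
pixel_ratio = (4096 * 2048) / (1280 * 720)
bits_4kp60 = bits_720p60 * pixel_ratio
mtu_payloads_4kp60 = bits_4kp60 / 8 / MTU_BYTES

print(f"720p60: {bits_720p60:.0f} bits, {bytes_720p60:.0f} bytes per picture")
print(f"4kp60:  {bits_4kp60:.0f} bits, {mtu_payloads_4kp60:.1f} MTU-sized payloads")
```

Under these assumptions, the 720p60 picture averages a little over 33,000 bits and the 4kp60 picture a little over 300,000 bits, or roughly 25 MTU-sized payloads. This is consistent with the approximate figures given above.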
In many video coding standards, a picture segment (or, at least, one form of a picture segment) is known as a “slice”. In the following description, any kind of (e.g., video coding standard based) coded picture fragmentation that breaks any form of in-picture prediction or other coding mechanism may be referred to generally as a “slice”. As such, structures such as the Group Of Blocks (“GOB”) in ITU-T Rec. H.261 or ITU-T Rec. H.263 (available from the ITU; see above for H.264), or slices in H.264 or the MPEG family of standards, may each constitute a “slice” as this term is used herein throughout. However, fragmentation units of RFC3984 or data partitions of H.264 may not constitute a “slice”, as this term is used herein throughout, even if they subdivide the bitstream of a coded picture into smaller data blocks, because they do not break in-picture prediction or another coding mechanism.
One advantage of using slices over media unaware segmentation mechanisms, such as, for example, those provided by IP at the routing layer, is that slices are at least to a certain extent independently decodable (as discussed below in more detail). The loss of one slice therefore does not necessarily render the other slices of a coded picture unusable or un-decodeable. Depending on the implementation of a fragmentation mechanism, the loss of a fragment, in contrast, may well render many other fragments unusable.
Many or all in-picture prediction mechanisms or coding mechanisms may be broken by the decoding of a picture header (or equivalent). Whether those prediction mechanisms are broken also by the detection of a slice header may depend on the video compression standard, and the type of slice in use.
In H.264, individual video pictures may be segmented into one or more slices, thereby accommodating applications requiring or otherwise utilizing pictures that are partitioned as part of the encoding/decoding process. Slices in H.264 may be independently decodable with respect to motion vector prediction, intra prediction, CAVLC and CABAC states, and other aspects of the H.264 standard. While this decoding independence may realize increases in error resilience, disallowing the aforementioned prediction across slice boundaries may tend to reduce coding efficiency.
In H.263, a video encoder has more flexibility in selecting which prediction mechanisms are broken through the use of slices or GOBs with non-empty GOB headers. For example, there is a bit included in the picture header, selectable when Annex R is in use, which signals to the decoder that no prediction at all occurs across slice/GOB boundaries. If the bit is not set, though, motion vectors may point outside of the current slice, thereby potentially “importing” sample values that are used for motion compensation within the current slice. Further, loop filtering may incorporate sample values outside of the slice.
In most or all existing video coding standards, with the possible exception of flexible macroblock ordering (“FMO”) used as part of H.264, macroblocks within slices are ordered in raster scan order. Consequently, when video sequences with large picture sizes are partitioned into slices that encompass only a relatively small percentage of all macroblocks in the picture, the slices tend to be elongated when viewed spatially.
FIG. 1 shows an example picture 100 which is broken into slices in accordance with the prior art. Example picture 100 has a matrix 101 of 6×4 macroblocks, their boundaries indicated through hairlines. The picture 100 is divided into two slices 102, 103, with slice boundary 104 between the two slices 102, 103 indicated by a bold line. The first slice 102 contains 10 macroblocks in scan order, specifically, macroblock 1 through 10. The second slice 103 contains the remaining 14 macroblocks in the matrix 101 (i.e., macroblocks 11 through 24). The numerals in the macroblocks (e.g., numeral ‘11’ in macroblock 105) are the addresses of the macroblocks according to scan order.
The bitstream 106 represents the coded picture corresponding to picture 100, and can include one or more parameter sets 107 as an example of a high level syntax structure, which can include syntax elements relevant to more than one of the coded slices of the picture 100. The parameter set(s) 107 can be followed by one or more slices, each such slice comprising a corresponding slice header 108, 110, and corresponding slice data 109, 111, respectively. Accordingly, in this example, slice header 108 may be associated with slice data 109 and may correspond to slice 102 in matrix 101, while slice header 110 may be associated with slice data 111 and may correspond to slice 103. The slice headers 108, 110 may include information such as the address of the first macroblock of that respective slice, according to scan order. For example, the second slice 103 when coded into bitstream 106 starts with slice header 110 that includes a first macroblock address of ‘11’, which designates the address of macroblock 105.
As can be seen in FIG. 1, slices 102 and 103 are somewhat elongated in the sense that each of slices 102 and 103 spans more macroblocks horizontally (i.e., 6 macroblocks) than vertically (i.e., 2 to 3 macroblocks). Elongated slices such as slices 102 and 103 tend to contain diverse picture content as a result of the large distance from end to end horizontally. Further, elongated slices tend to have low ratios of slice area to slice perimeter/boundary. The combination of slices containing diverse picture content with relatively low area to perimeter/boundary ratios can be disadvantageous from a coding efficiency perspective when compared with a slice that encompasses a more squared area of a picture, such as squares or other geometric figures close to a square. Slices with this geometric property may henceforth be called “compact” slices within this description.
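The relationship between the first macroblock address carried in each slice header and the resulting slice layout of FIG. 1 can be sketched as follows. This is an illustrative reconstruction only: the function name `slice_map` is hypothetical, and macroblock addresses are taken to be 1-based in raster scan order, as in the figure.

```python
def slice_map(width_mbs, height_mbs, first_mb_addresses):
    """Return a 2-D grid giving, for each macroblock, the index of the slice
    that contains it, when slices cover consecutive macroblocks in raster
    scan order and each slice is identified by its first macroblock address.
    """
    starts = sorted(first_mb_addresses)
    grid = []
    for row in range(height_mbs):
        line = []
        for col in range(width_mbs):
            address = row * width_mbs + col + 1  # 1-based scan-order address
            # The macroblock belongs to the last slice that starts at or
            # before its address.
            slice_index = sum(1 for start in starts if start <= address) - 1
            line.append(slice_index)
        grid.append(line)
    return grid


# FIG. 1: a 6x4 macroblock picture, slices starting at macroblocks 1 and 11.
fig1 = slice_map(6, 4, [1, 11])
for line in fig1:
    print(line)
```

For the FIG. 1 parameters, the first row and the first four macroblocks of the second row map to slice index 0 (10 macroblocks), and the remaining 14 macroblocks map to slice index 1. The slice boundary therefore falls in the middle of the second macroblock row.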
Also, many entropy coding tools that have two-dimensional properties, such as the coding of motion vectors or intra prediction modes, may be optimized for squared picture aspect ratios. For example, in H.264, the coding of a horizontal motion vector of a given length costs roughly the same number of bits as the coding of a vertical motion vector of the same length. Consequently, these coding tools may yield a better compression for compact slices than for “elongated” slices, such as slices 102 and 103 shown in FIG. 1.
Improved coding efficiency for compact slices may further arise from the fact that homogeneous content, which is more likely to be found in a compact slice, may be more efficiently encoded as compared with the relatively diverse content that is more likely to be found in an elongated slice. As a general though not necessarily absolute rule, picture content is more likely to be homogeneous in a compact slice because the spatial distance from the center to the boundaries of the slice is less, on average, for a compact slice than for an elongated slice. Further, having a higher slice area to slice boundary ratio for compact slices means that fewer prediction mechanisms may generally be broken in a given picture, thereby resulting in higher coding efficiency.
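The area to boundary ratio argument can be made concrete with a small sketch. It considers only rectangular slices measured in macroblock units, which is a simplification (scan-order slices such as those of FIG. 1 are generally not rectangular). The function name and the 16-macroblock example are illustrative assumptions.

```python
def area_to_perimeter(width_mbs, height_mbs):
    """Ratio of slice area (in macroblocks) to slice perimeter (in macroblock
    edges) for a rectangular slice of the given dimensions."""
    area = width_mbs * height_mbs
    perimeter = 2 * (width_mbs + height_mbs)
    return area / perimeter


# Two hypothetical slices that each contain 16 macroblocks.
elongated_ratio = area_to_perimeter(16, 1)  # one macroblock row, 16 wide
compact_ratio = area_to_perimeter(4, 4)     # 4x4 square

print(f"elongated 16x1: {elongated_ratio:.2f}")
print(f"compact 4x4:    {compact_ratio:.2f}")
```

For the same number of macroblocks, the square slice has a boundary of 16 macroblock edges against 34 for the single-row slice. Fewer macroblocks in the square slice therefore lie next to a boundary across which prediction is broken.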
In H.264, FMO allows the video encoder to effectively produce rectangular slices by defining rectangular slice groups. FMO is a highly generalized coding tool that was designed to address several issues encountered in video coding. However, from a practical standpoint, FMO tends to be perceived as having a relatively high degree of implementation complexity, resulting in somewhat limited adoption as an aspect of standard video compression. A simpler coding tool that may realize improved coding efficiency, as well as parallel encoding and decoding, may address or ameliorate one or more of the complexity issues associated with a full FMO implementation.
The issue of elongated slices may also appear in an extreme case in many MPEG-2 based encoding schemes. For example, in MPEG-2 encoding, it is often the case that each single row of macroblocks within a picture is encoded into a slice, thereby effectively breaking any in-picture prediction mechanisms in the vertical dimension within the picture.
Rectangular slice mode is one of two sub-modes specified in Annex K of H.263, another being “scan order slice mode”, which has properties similar to the slices of H.264 discussed above. Rectangular slices as provided for in H.263 may offer one or more of the earlier described advantages that compact slices provide. However, H.263 requires that the dimensions (specifically the width) of each slice be conveyed in its corresponding header, which leads to coding inefficiency, for example, in applications in which the slice sizes in the horizontal dimension do not change from picture to picture. In addition, Annex K of H.263 does not specify a minimum slice width that would effectively prevent vertically elongated slices from being used. Vertically elongated slices may introduce implementation difficulties and would not in every case provide the desired coding efficiency advantages that, for the reasons discussed above for horizontally elongated slices, may be provided through use of more compact slices.
Constraining the slice to have a rectangular shape can also be disadvantageous in certain cases. First, rectangular slices may perform sub-optimally in applications for which the bitstreams use transport protocols subject to an MTU. For example, packets may be fragmented if the number of bits within a given packet exceeds the MTU limit imposed on the bitstream, which can be undesirable from at least network performance and error resilience perspectives. Conversely, if the number of bits within a given packet is far below the MTU limit, then the ratio of the number of bits in the transport and slice headers becomes relatively large as compared with the number of bits in the packet payload, thereby leading to coding inefficiencies. Requiring slices to be rectangular in shape limits the encoder's ability to precisely control the number of bits in the coded slice so as to avoid the above-mentioned disadvantages.
Second, rectangular slices may perform sub-optimally in applications that utilize parallel encoding and/or decoding. When encoding and/or decoding in parallel, it is typically advantageous to partition a picture into different parts such that each part of the picture requires approximately the same amount of computational power to encode. By partitioning the picture in this way, each part of the picture may therefore be encoded with nearly the same latency to thereby reduce or minimize lag between the encoding times of different parts of the picture. An encoder constrained to use rectangular slices may not be able to precisely control the amount of CPU capacity required to encode and/or decode each slice and thereby avoid this potential disadvantage.
In order to facilitate parallel decoding of slices belonging to the same coded picture, a decoder will generally assign coded picture segments to the various processors, processor cores, or other independently operating decoding mechanisms made available to the decoder for parallel decoding. Without the use of FMO, this was a generally difficult, in some cases extremely difficult, task for previous video coding standards to handle, as those previous standards would allow too much flexibility in the bit stream generation. For example, in H.264, it is possible that one picture may be coded in a single slice and another picture into dozens of slices within the same bitstream. If parallelization occurs at the slice level, when a picture is coded in a single slice, the processor assigned to decode that picture will need to be provisioned to handle its decoding in full. As a result, without imposing restrictions outside of the video coding standard, there may be comparatively little advantage realized by building parallel decoders if each decoding processor will need to be provisioned to be capable of handling a whole picture in any event.
The slice coding used in many MPEG-2 encoders is widely viewed to be the result of an agreement to utilize an informal Cable Labs specification that suggested a one slice per macroblock row segmentation scheme. Widespread acceptance of this informal specification was eventually gained. While there may have been value in such a segmentation scheme when the first MPEG-2 products became available, around 1995, today the various restrictions associated with the historical specification may significantly limit coding efficiency, although parallelization of decoding of (at least SD-coded) pictures has been a relative non-issue for at least a decade.
A need therefore exists for an improved method and system for picture segmentation that addresses, ameliorates or otherwise provides a useful alternative to the existing shortcomings of video encoders both in terms of MTU size matching and parallel decoding. Accordingly, a solution that addresses, at least in part, the above and other shortcomings is desired.