The present disclosure relates to an image decoding device and image decoding method for decoding a coded stream that has been coded by predictive processing.
As Internet content delivery technologies have advanced and spread in recent years through smartphones, smart TVs, and various other mobile communications devices, a huge number of Internet users are now provided with video of ever higher definition and image quality. Meanwhile, there is growing concern about a surge in communications traffic and a critical shortage of broadcast bands caused by this continual improvement in definition and image quality. To cope with this surge in traffic and shortage of broadcast bands, the ITU-T (International Telecommunication Union Telecommunication Standardization Sector) issued in January 2013 a recommendation for the HEVC (High Efficiency Video Coding) standard as international standard H.265. According to the H.265 standard, video of the same quality can be compressed and transmitted at only about half the data size required by the H.264 (MPEG-4 AVC) standard. The H.265 standard has therefore attracted a lot of attention lately as a viable solution to these problems. As for details of the H.265 standard, see, for example, ‘SERIES H: AUDIOVISUAL AND MULTIMEDIA SYSTEMS Infrastructure of audiovisual services—Coding of moving video,’ [online]. Recommendation ITU-T H.265, 04/2013, [retrieved on Mar. 17, 2014]. Retrieved from the Internet:
<URL:http://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-H.265-201304-I!!PDF-E&type=items>.
According to the H.265 standard, the size of a coding unit block is variable, unlike the conventional coding standard H.264. An image coder that adopts this technique may also perform coding on the basis of a block, of which the size is even larger than that (16 pixels×16 pixels) of a macroblock as a conventional coding unit, and therefore, is able to code a high-definition image appropriately.
An exemplary configuration for a picture and an exemplary format for a coded stream according to the H.265 standard will now be described with reference to FIG. 22. As shown in FIG. 22A, a coding unit (CU) is defined as a coding data unit. Just like a macroblock in the conventional image coding standard, this coding unit is a data unit on the basis of which the predictive coding mode can be switched between intra-predictive coding, in which an intra-picture prediction is carried out, and inter-predictive coding, which involves motion compensation. This coding unit is defined as the most basic coding block.
According to the Main Profile of the H.265 standard, this coding unit may have a size of 8×8 pixels, 16×16 pixels, 32×32 pixels, or 64×64 pixels.
According to the H.265 standard, each picture is coded on the basis of a pixel block called “CTU (coding tree unit),” which is the largest coding unit. Unlike a macroblock (of 16×16 pixels) according to the H.264 or MPEG-2 standard, the CTU does not have a fixed size; its size may be selected while a sequence is being coded.
According to the Main Profile of the H.265 standard, the largest coding unit is defined to be a block consisting of 64×64 pixels. Furthermore, a single picture may be coded on the basis of a slice comprised of multiple CTUs. FIG. 22A illustrates an example in which a single picture is comprised of a single slice.
Note that a series of coding processing steps of the intra- or inter-predictive coding are performed on the basis of a CU, which is obtained by means of a recursive quadtree division of a single CTU.
As far as the intra-predictive coding and inter-predictive coding are concerned, each CU is supposed to be coded with the CU subdivided into multiple blocks called “prediction units (PUs).”
Meanwhile, the frequency transformation and quantization of a predictive differential signal are performed on the basis of a block called “transform unit (TU),” which is a frequency transformation unit.
FIG. 22B illustrates an exemplary format for a coded stream. A coded stream is generally comprised of a sequence header, a picture header, a slice header, and slice data. In an image coded stream compliant with the H.265 standard, for example, a start code (hereinafter referred to as “SC”) indicating the beginning of a header is added to each header.
The sequence header represents header information with respect to a sequence indicating a set of pictures. The picture header represents header information with respect to a single picture. The slice header represents header information with respect to slice data. The slice data is comprised of CU layer data representing a plurality of CTUs and a plurality of CUs. According to the H.265 standard, the sequence header is also called a “sequence parameter set (SPS)” and the picture header is also called a “picture parameter set (PPS).”
FIG. 23 illustrates an example in which a CTU is divided into CUs, PUs, and then TUs. Since the CTU is the largest CU before being subjected to the quadtree division, the CTU is regarded as forming CU Layer 0. Every time a coding unit is subjected to a quadtree division, a deeper layer is formed, so that the division may proceed recursively into CU Layer 1, CU Layer 2, and so on.
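The recursive quadtree division described above can be sketched as follows. This is only an illustrative sketch, not the decoding procedure of the H.265 standard: the function name split_ctu, the callback should_split, and the example rule are hypothetical, and the minimum CU size of 8 is taken from the Main Profile sizes listed earlier.

```python
from typing import Callable, List, Tuple

# A leaf CU is described by (x, y, size, layer). All names are illustrative.
Block = Tuple[int, int, int, int]


def split_ctu(x: int, y: int, size: int,
              should_split: Callable[[int, int, int, int], bool],
              min_size: int = 8, layer: int = 0) -> List[Block]:
    """Return the leaf CUs obtained by recursive quadtree division of a CTU."""
    if size > min_size and should_split(x, y, size, layer):
        half = size // 2
        leaves: List[Block] = []
        # Visit the four quadrants: top-left, top-right, bottom-left, bottom-right.
        for dy in (0, half):
            for dx in (0, half):
                leaves += split_ctu(x + dx, y + dy, half,
                                    should_split, min_size, layer + 1)
        return leaves
    return [(x, y, size, layer)]


def example_rule(x: int, y: int, size: int, layer: int) -> bool:
    """Hypothetical rule: split the CTU once, then split only its top-left CU."""
    return layer == 0 or (layer == 1 and x == 0 and y == 0)


cus = split_ctu(0, 0, 64, example_rule)
for cu in cus:
    print(cu)
```

With this example rule, a 64×64 CTU (CU Layer 0) yields four 16×16 CUs in CU Layer 2 and three 32×32 CUs in CU Layer 1, which together cover the whole CTU.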
Just as a CTU is divided into four CUs, each TU may also be subjected to a quadtree division recursively inside the CU.
Each PU is defined by division in a single prediction mode (which will be hereinafter referred to as a “PU division mode” defined by PartMode) with respect to a CU that cannot be divided any further. For example, if a CU consisting of 32×32 pixels is divided in Part_N×2N PU division mode, the CU consisting of 32×32 pixels is divided into two PUs each consisting of 16×32 pixels.
In the case of intra prediction, the PU division mode may be selected from two PU division modes Part_2N×2N and Part_N×N. In the case of inter prediction, on the other hand, the PU division mode may be selected from eight division modes in total, namely, four PU division modes Part_2N×2N, Part_2N×N, Part_N×2N, and Part_N×N, each using blocks of the same size, and four division modes Part_2N×nU, Part_2N×nD, Part_nL×2N, and Part_nR×2N, each using two asymmetric blocks of different sizes (which are called “asymmetric motion partitions (AMPs)”).
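The geometry of the eight PU division modes can be illustrated with a small lookup. This is a sketch under stated assumptions: the function name pu_sizes and the ASCII mode keys (e.g., "PART_2NxnU") are chosen here for illustration, sizes are given as (width, height), and the restrictions that the standard places on which modes may be used for a given CU are not modeled.

```python
from typing import Dict, List, Tuple


def pu_sizes(cu_size: int, part_mode: str) -> List[Tuple[int, int]]:
    """Return the (width, height) of each PU for a square CU of 2N x 2N pixels."""
    n = cu_size // 2   # N: half of the CU side
    q = cu_size // 4   # quarter of the CU side, used by the asymmetric modes
    table: Dict[str, List[Tuple[int, int]]] = {
        # Four modes using blocks of the same size
        "PART_2Nx2N": [(cu_size, cu_size)],
        "PART_2NxN": [(cu_size, n)] * 2,
        "PART_Nx2N": [(n, cu_size)] * 2,
        "PART_NxN": [(n, n)] * 4,
        # Four asymmetric motion partition (AMP) modes
        "PART_2NxnU": [(cu_size, q), (cu_size, cu_size - q)],
        "PART_2NxnD": [(cu_size, cu_size - q), (cu_size, q)],
        "PART_nLx2N": [(q, cu_size), (cu_size - q, cu_size)],
        "PART_nRx2N": [(cu_size - q, cu_size), (q, cu_size)],
    }
    return table[part_mode]


print(pu_sizes(32, "PART_Nx2N"))   # two PUs of 16x32 pixels
print(pu_sizes(64, "PART_2NxnU"))  # one 64x16 PU and one 64x48 PU
```

In every mode the PU areas add up to the area of the CU, which is a convenient sanity check on the table.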
Note that each transform unit TU may be subjected to the quadtree division recursively independently of the PU division. Each transform unit TU may be comprised of N×N transform coefficients representing frequency components with respect to a predictive differential image (where N may be 4, 8, 16, or 32, for example).
FIGS. 24A and 24B illustrate an exemplary format for a coded stream representing its CU layer data and its underlying layer data according to the H.265 standard.
FIG. 24A illustrates configurations for a CU, a PU, and a TU. In the example illustrated in FIG. 24A, the CU and PU are each configured as a single block of 64×64 pixels, and the TU is configured as four blocks, each consisting of 32×32 pixels.
FIG. 24B illustrates an exemplary format for a coded stream representing its CU layer data and its underlying layer data according to the H.265 standard. Note that only reference signs to be used in the following description are shown in FIG. 24B. As for details, see the H.265 standard.
In FIG. 24B, the coding unit layer data corresponding to a single coding unit is comprised of a CU division flag and CU data (coding unit data). A CU division flag of “1” indicates that the given coding unit is divided into four. A CU division flag of “0” indicates that the given coding unit is not divided into four.
In the example illustrated in FIG. 24B, the coding unit consisting of 64×64 pixels is not divided, i.e., the CU division flag is “0.” Furthermore, the CU data is comprised of a CU type, PU data representing a motion vector or an intra-picture prediction mode, and TU layer data 0 made up of transform units including coefficients. The size of the prediction unit is determined by the CU type.
The PU data includes not only the motion vector or intra-picture prediction mode but also a flag representing a reference picture (which will be hereinafter referred to as a “reference index”) and information required to make inter prediction as well. The TU layer data 0 represents TU Layer 0 indicating a layer of the highest order, and is comprised of a TU division flag and TU Layer Data 1 just like the CU data.
Just like the CU division flag, a TU division flag of “1” indicates that the given transform unit is divided into four, while a TU division flag of “0” indicates that the given transform unit is not divided into four.
The TU Layer Data 1 is comprised of a TU division flag with respect to TU0, TU data (TU0), a TU division flag with respect to TU1, TU data (TU1), a TU division flag with respect to TU2, TU data (TU2), a TU division flag with respect to TU3, and TU data (TU3). Note that in the example illustrated in FIG. 24B, the TU division flag in the TU Layer Data 1 is “0.”
In this case, the TU division flag with respect to any transform unit other than TU0 does not appear until the TU data of the preceding transform unit has been decoded (e.g., the TU division flag with respect to TU1 appears only after TU0 has been decoded). Thus, it can be seen that the size of each TU is not fixed.
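The way TU division flags determine TU sizes can be sketched with a simplified parser. This is a deliberately reduced model, not the transform tree syntax of the H.265 standard: the function name parse_tu_tree is hypothetical, only division flags are read (coefficients are omitted), and a flag is assumed to be present for every transform unit larger than min_size. The TU division flag of TU Layer 0 is assumed to be “1” here, consistent with the four 32×32 TUs of FIG. 24A.

```python
from typing import Iterator, List, Tuple


def parse_tu_tree(flags: Iterator[int], x: int, y: int, size: int,
                  min_size: int = 4) -> List[Tuple[int, int, int]]:
    """Consume TU division flags depth-first and return leaf TUs as (x, y, size)."""
    # Simplifying assumption: no flag is read once the minimum size is reached.
    split = next(flags) if size > min_size else 0
    if not split:
        return [(x, y, size)]
    half = size // 2
    tus: List[Tuple[int, int, int]] = []
    # The flag of each later TU is reached only after the earlier TU is parsed.
    for dy in (0, half):
        for dx in (0, half):
            tus += parse_tu_tree(flags, x + dx, y + dy, half, min_size)
    return tus


# One flag of "1" for TU Layer 0, then a flag of "0" for each of TU0 to TU3.
tus = parse_tu_tree(iter([1, 0, 0, 0, 0]), 0, 0, 64)
print(tus)
```

Because the flags are consumed sequentially, the size of a later transform unit is known only after the earlier ones have been parsed.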
FIG. 25 illustrates PU configurations which are selectable on a CU configuration basis in the inter-prediction mode according to the H.265 standard. For example, in the case of a 64×64 CU, a 64×64 PU, 64×32 PUs, 32×64 PUs, 32×32 PUs, a 64×16 PU and a 64×48 PU, or a 16×64 PU and a 48×64 PU may be selected according to the PartMode.
Then, on a prediction unit basis, a motion vector and a reference index (i.e., the flag representing a reference picture) are specified in the case of inter prediction, and an intra-picture prediction mode is specified in the case of intra prediction.
FIG. 26 illustrates TU configurations which are selectable according to the H.265 standard. Specifically, these TUs are configured as a 32×32 TU, a 16×16 TU, an 8×8 TU, and a 4×4 TU, respectively, and each have a square configuration.
In the case of inter prediction, a reference image needs to be obtained from the reference picture specified by the motion vector.
FIGS. 27A and 27B generally illustrate how to perform motion compensation processing. As shown in FIGS. 27A and 27B, the motion compensation processing is performed to generate a predicted image by extracting a part of a previously decoded picture, which is specified by a motion vector decoded from a coded stream and a reference index, and then subjecting that part of the picture to a filter operation. In the case of the H.265 standard, the filter operation of the motion compensation processing is carried out with a filter with eight TAPs at maximum.
For example, if an 8 TAP filter is used for a reference picture of a prediction unit to be predicted with a size of 64×64 pixels (i.e., a 64×64 PU), then 7 pixels are added both vertically and horizontally to the 64×64 pixels as shown in FIG. 27A. Specifically, from the prediction unit to be predicted, of which the origin is located at an integral position specified by the motion vector, three pixels are added to the left, four pixels are added to the right, three pixels are added to the top, and four pixels are added to the bottom. Thus, the reference image extracted from the reference picture consists of 71×71 pixels.
FIG. 27B illustrates a situation where the prediction unit to be predicted has a size of 16×16 pixels. If an 8 TAP filter is used, seven pixels are added both vertically and horizontally as in the case of the 64×64 PU, and the reference image extracted from the reference picture consists of 23×23 pixels. Note that if the motion vector specifies an integral position, the reference image of the prediction unit does not have to be subjected to any filter processing. Thus, the size of the reference image required may be the same as that of the prediction unit.
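The size of the reference image follows directly from the number of filter taps, as the following sketch shows. The function name reference_block_size and the flag fractional are illustrative assumptions; a single flag is used for simplicity, so the case in which only one component of the motion vector is fractional is not modeled.

```python
from typing import Tuple


def reference_block_size(pu_w: int, pu_h: int, taps: int = 8,
                         fractional: bool = True) -> Tuple[int, int]:
    """Return (width, height) of the reference image needed for one PU.

    A filter with `taps` taps needs taps - 1 extra pixels per direction
    (3 before and 4 after the block for an 8 TAP filter). If the motion
    vector points at an integral position, no extra pixels are needed.
    """
    extra = taps - 1 if fractional else 0
    return pu_w + extra, pu_h + extra


print(reference_block_size(64, 64))                    # 64x64 PU, 8 TAP filter
print(reference_block_size(16, 16))                    # 16x16 PU, 8 TAP filter
print(reference_block_size(16, 16, fractional=False))  # integral motion vector
```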
According to the H.264 standard, the prediction may be performed at most on a macroblock basis. Thus, for a prediction unit consisting of 16×16 pixels (i.e., 256 pixels), the reference image to be obtained consists of at most 23×23 pixels (i.e., 529 pixels). According to the H.265 standard, however, the reference image may consist of at most 71×71 pixels (i.e., 5041 pixels) with respect to a prediction unit consisting of 64×64 pixels (i.e., 4096 pixels). That is to say, according to the H.265 standard, the size of the data required for a single prediction unit increases approximately 9.5-fold (5041 pixels versus 529 pixels). In addition, to obtain the reference image from an external memory (e.g., from an external SDRAM), the external memory bus is occupied for approximately 9.5 times as long as under the H.264 standard. As a result, systems performing various types of processing other than decoding will be affected significantly. For example, the image output processing for display and other types of processing may fail, which is a problem.
This problem may be overcome by dividing each prediction unit into multiple units of the smallest size, such as 4×4 pixels or 8×8 pixels, as in Japanese Unexamined Patent Publication No. 2006-311526, although that publication adopts this processing for the purpose of fixing the size of a motion compensation block at a single size. Nevertheless, the smaller the size of the divided units, the larger the total number of pixels required for the filter processing. According to the H.265 standard, for example, the seven pixels that the 8 TAP filter adds both vertically and horizontally become larger relative to the size of each divided unit, so the ratio of the number of pixels required for the filter to the size of the prediction unit increases. This significantly affects the external memory bandwidth, and eventually causes a performance failure.
For example, to apply an 8 TAP filter to a prediction unit of 16×16 pixels, the reference image needs to consist of 23×23 pixels (i.e., 529 pixels). However, if the prediction unit of 16×16 pixels is divided into sixteen 4×4 pixel blocks, a reference image consisting of sixteen 11×11 pixel (121 pixels) blocks is required to apply an 8 TAP filter. Consequently, the size of the reference image required becomes 1936 pixels (=121 pixels×16), which is approximately 3.7 times as large as the 529 pixels required for the prediction unit yet to be divided. As a result, the external memory bandwidth will be affected significantly, which is not beneficial.
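The arithmetic behind this comparison can be checked with a short sketch. The function name reference_pixels is hypothetical, and the sketch assumes that each divided block fetches its own reference image independently, with no sharing of overlapping pixels between neighboring blocks.

```python
from typing import Optional


def reference_pixels(pu_w: int, pu_h: int,
                     block_w: Optional[int] = None,
                     block_h: Optional[int] = None,
                     taps: int = 8) -> int:
    """Total reference pixels fetched when a PU is processed in blocks.

    If no block size is given, the PU is processed as a single block.
    Each block needs (block_w + taps - 1) x (block_h + taps - 1) pixels.
    """
    block_w = pu_w if block_w is None else block_w
    block_h = pu_h if block_h is None else block_h
    n_blocks = (pu_w // block_w) * (pu_h // block_h)
    return n_blocks * (block_w + taps - 1) * (block_h + taps - 1)


whole = reference_pixels(16, 16)          # one 23x23 reference image
divided = reference_pixels(16, 16, 4, 4)  # sixteen 11x11 reference images
print(whole, divided, round(divided / whole, 2))
```

The same function also reproduces the 5041 pixels needed for an undivided 64×64 prediction unit, so it can be used to compare other division sizes under the same assumption.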
Alternatively, the problem may also be overcome by dividing the prediction unit PU along the edges of the transform unit TU as in PCT International Application Publication No. 2013/076888. However, if the TU has a size of 32×32 pixels, a reference image consisting of 39×39 pixels (i.e., 1521 pixels) needs to be obtained for a prediction unit of the same size, i.e., 32×32 pixels. Compared to the reference image with a size of 23×23 pixels (529 pixels) with respect to the prediction unit consisting of 16×16 pixels, the data size increases approximately threefold. Consequently, the external memory bus will be occupied for a much longer time, systems performing various types of processing other than decoding will be affected significantly, and the image output processing for display, in particular, and other types of processing may fail, which is a problem.
Furthermore, if the prediction unit PU is divided along the edges of the transform unit TU as in PCT International Application Publication No. 2013/076888, the prediction unit decoding processing depends on the size of the transform unit. Thus, the prediction unit decoding processing cannot be started until the size of the transform unit is determined. As a result, the prediction unit decoding processing is delayed, which is also a problem. Note that according to the H.265 standard, the size of the transform unit cannot be determined until the TU layer decoding processing advances to the layer of the lowest order.
In addition, in a situation where the prediction unit is divided along the edges of the transform unit as in PCT International Application Publication No. 2013/076888, if the prediction unit is divided into TUs of an even smaller size (e.g., 4×4 TUs), then the prediction unit also needs to be subjected to prediction processing depending on the transform unit. Thus, the decoding processing performance of the prediction processing deteriorates with respect to a prediction unit of a larger size than the transform unit, which is also a problem.
The percentage of the external memory bus occupied while the reference image is being obtained may be reduced if a motion compensation circuit is divided according to the smallest size of the prediction unit as in the known art described above.
However, the smaller the size of the units divided, the larger the number of pixels required to perform the filter processing during the prediction processing, thus causing an increase in the bandwidth of the external memory, which is a problem. Such an increase in the bandwidth of the external memory affects the overall system performance including image output processing, which leads to a failure of the application.
Conversely, the larger the prediction unit, the larger the area of the circuit performing the prediction processing, which is not beneficial, either. Furthermore, if the prediction processing is performed with the prediction unit divided and adjusted to the size of the transform unit, the prediction processing cannot be started until the size of the transform unit is determined, which is also a problem. That is to say, the prediction processing is delayed and cannot be done quickly enough.
In view of the foregoing background, it is therefore an object of the present disclosure to provide an image decoding device and image decoding method allowing a coded stream, which has been coded by subjecting a prediction unit to prediction processing, to be decoded quickly enough without occupying the external memory bus or increasing the bandwidth of the external memory and while reducing the area of the circuit on the chip.