Generally, video image data includes a large amount of data. Thus, a device for handling video image data compresses the video image data by encoding the video image data, when sending the video image data to another device or when storing the video image data in a storage device.
Representative standard technologies that are widely used for encoding video images include MPEG-2 (Moving Picture Experts Group phase 2), MPEG-4, and MPEG-4 AVC/H.264 (H.264 MPEG-4 Advanced Video Coding), developed at ISO/IEC (International Organization for Standardization/International Electrotechnical Commission).
The standard encoding technologies described above include an inter encoding method, which encodes a picture by using information of the picture that is the encoding target and information of pictures before and after the encoding target, and an intra encoding method, which encodes a picture by using only information of the picture that is the encoding target.
Generally, the encoding amount of pictures or blocks that have been encoded by the inter encoding method is smaller than the encoding amount of pictures or blocks that have been encoded by the intra encoding method. Therefore, according to the selected encoding mode, the encoding amount of pictures becomes disproportionate within the same sequence. Similarly, according to the selected encoding mode, the encoding amount of blocks becomes disproportionate within the same picture.
Therefore, in order to transmit a data stream including encoded video images at a constant transmission rate even if the encoding amount varies over time, the transmission source device is provided with a transmitting buffer for the data stream, and the transmission destination device is provided with a receiving buffer for the data stream.
A delay caused by these buffers (hereinafter, “buffer delay”) is the main factor causing a delay from when each picture is input to the encoding device until each picture is displayed by a decoding device (hereinafter, “codec delay”). The codec delay includes a decoding delay, which is a delay relevant to decoding, and a display delay, which is a delay relevant to display (output).
By reducing the size of the buffer, the buffer delay and the codec delay are reduced. However, as the size of the buffer decreases, the degree of freedom in allocating the encoding amount for each picture decreases. Consequently, the image quality of a reproduced video image deteriorates. The degree of freedom in allocating the encoding amount means the extent to which the encoding amount is allowed to vary.
MPEG-2 and MPEG-4 AVC/H.264 respectively specify VBV (Video Buffering Verifier) and CPB (Coded Picture Buffer), which model the operations of a receiving buffer in an ideal decoding device.
A video image encoding device controls the encoding amount so that the receiving buffer of an ideal decoding device does not overflow or underflow. An ideal decoding device is specified to perform instantaneous decoding, where the time taken for a decoding process is zero. For example, there is a technology for controlling a video image encoding device relevant to VBV (see Patent Document 1).
The video image encoding device controls the encoding amount to ensure that data of a picture to be decoded is stored in the receiving buffer at the time when the ideal decoding device decodes the picture, so that the receiving buffer of the ideal decoding device does not overflow or underflow.
The receiving buffer underflows when, although the video image encoding device transmits the stream at a constant transmission rate, the encoding amount of a picture is so large that transmission of the data used for decoding the picture is not completed by the time at which the video image decoding device is to decode and display the picture. That is to say, underflow of the receiving buffer means that data used for decoding a picture is not present in the receiving buffer of the decoding device. In this case, it is not possible for the video image decoding device to perform the decoding process, and therefore a frame skip occurs.
In order to perform a decoding process without causing the receiving buffer to underflow, the video image decoding device displays a picture after delaying a stream by a predetermined length of time from the receiving time.
As described above, an ideal decoding device is specified so that the decoding process is completed instantaneously, with a processing time of zero. Therefore, assuming that the time of inputting the “i”th picture (hereinafter, also expressed as “P(i)”) to the video image encoding device is t(i) and the time of decoding P(i) in the ideal decoding device is dt(i), it is possible to display this picture at the same time as the decode time, i.e., at dt(i).
For all pictures, the display period of the picture {t(i+1)−t(i)} and the decode interval {dt(i+1)−dt(i)} are equal, and therefore the decode time is expressed as dt(i)=t(i)+dly, i.e., it is delayed by a fixed time dly from the input time t(i). Accordingly, the video image encoding device has to complete transmitting the data used for decoding P(i) to the receiving buffer of the video image decoding device by the time dt(i).
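The fixed-delay relation can be checked in one step. The following short derivation is only a sketch; it uses the symbols defined above and assumes that dly denotes the difference dt(0)−t(0) for the first picture.

```latex
\begin{aligned}
dt(i+1) - dt(i) &= t(i+1) - t(i) \quad \text{for all } i \\
\Rightarrow\; dt(i+1) - t(i+1) &= dt(i) - t(i) = \dots = dt(0) - t(0) = dly \\
\Rightarrow\; dt(i) &= t(i) + dly
\end{aligned}
```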
FIG. 1 illustrates an example of the transition of the buffer occupancy amount of the receiving buffer according to the conventional technology. In the example of FIG. 1, the horizontal axis indicates the time and the vertical axis indicates the buffer occupancy amount of the receiving buffer. A line 10 indicated by a solid line indicates the buffer occupancy amount at each time point.
In the receiving buffer, the buffer occupancy amount is recovered at a predetermined transmission rate, and data used for decoding a picture at the decode time of each picture is extracted from the buffer. In the example of FIG. 1, data of P(i) starts to be input to the receiving buffer at a time at(i), and the last data of the P(i) is input at a time ft(i). The ideal decoding device completes decoding P(i) at a time dt(i), and it is possible to display P(i) at the time dt(i).
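The behavior of the receiving buffer described for FIG. 1 can be sketched as a small simulation. This is an illustrative model only, not the buffer model of any standard: the function name simulate_receiving_buffer and its parameters are hypothetical, the sender is assumed to transmit the pictures back to back from time zero at the constant rate, and buffer overflow is ignored.

```python
from fractions import Fraction

def simulate_receiving_buffer(picture_bits, rate, frame_period, dly):
    """Return (ft, dt, underflow) for each picture P(i) of an ideal decoder.

    picture_bits: list of b(i), the coded size of each picture in bits (ints)
    rate:         constant transmission rate R in bits per time unit (int)
    frame_period: t(i+1) - t(i), with t(0) assumed to be 0
    dly:          fixed delay between input time t(i) and decode time dt(i)
    """
    results = []
    at = Fraction(0)  # arrival time at(i) of the first bit of the current picture
    for i, bits in enumerate(picture_bits):
        ft = at + Fraction(bits, rate)   # arrival time ft(i) of the last bit
        dt = i * frame_period + dly      # decode time dt(i) = t(i) + dly
        results.append((ft, dt, ft > dt))  # underflow if data is late
        at = ft                          # assumption: P(i+1) follows immediately
    return results
```

With picture sizes of 2000, 1000, and 3000 bits, a rate of 1000 bits per time unit, and a frame period of 1, a delay dly of 2 leaves the third picture late, whereas a delay of 4 avoids underflow at the cost of a longer buffer delay.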
The ideal decoding device performs instantaneous decoding, while an actual video image decoding device takes a predetermined length of time to perform a decoding process. Generally, the decoding process time for one picture is shorter than the display period of a picture; however, the actual video image decoding device takes an amount of time close to the display period of a picture for performing the decoding process.
The data of P(i) is input to the receiving buffer from the time at(i) to the time ft(i). However, the time at which the data used for decoding each block arrives between at(i) and ft(i) is not guaranteed. Therefore, the actual video image decoding device starts the process of decoding P(i) from the time ft(i). Accordingly, assuming that the maximum processing time to be taken for decoding one picture is ct, it is only possible to ensure that the actual video image decoding device completes the decoding process by the time ft(i)+ct.
The video image encoding device ensures that the data used for decoding P(i) arrives at the receiving buffer by the time dt(i), i.e., it is ensured that ft(i)≦dt(i) is satisfied. Thus, when ft(i) is at the latest time, ft(i) becomes the same as dt(i).
In this case, the time at which completion of the decoding process of the entire P(i) is ensured is dt(i)+ct. To display all pictures at equal intervals, the video image decoding device is to delay the display times of the respective pictures by at least a time ct with respect to the ideal decoding device.
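The reasoning above can be illustrated with a short calculation. The helper below is a hypothetical sketch: it returns the smallest uniform offset that has to be added to the ideal decode times dt(i) so that every picture has finished decoding (by ft(i)+ct) before it is displayed. Whenever some picture has ft(i)=dt(i), the result is exactly ct.

```python
def min_display_offset(ft, dt, ct):
    """Smallest uniform offset d such that dt[i] + d >= ft[i] + ct for all i.

    ft: arrival times of the last bit of each picture
    dt: ideal (instantaneous) decode times of each picture
    ct: maximum processing time for decoding one picture
    """
    return max(f + ct - d for f, d in zip(ft, dt))
```

Because the decoding device does not know in advance whether ft(i) will reach dt(i), it has to provision for the full offset ct in this model.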
In VBV of MPEG-2 and CPB of MPEG-4 AVC/H.264, the difference between the arrival time of each encoded picture in the video image decoding device and the display time of each encoded picture that has been decoded is expressed as (ft(i)−at(i)+ct). That is to say, it is difficult to achieve a codec delay of less than the time ct, where the codec delay extends from when each picture is input to the encoding device to when the picture is output at the decoding device. Because the time ct is usually the processing time for one picture, it is difficult to achieve a codec delay of less than the processing time for one picture.
Patent Document 1: Japanese Laid-Open Patent Publication No. 2003-179938
Non-patent Document 1: JCTVC-H1003, “High-Efficiency Video Coding (HEVC) text specification draft 6”, Joint Collaborative Team on Video Coding of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, February 2012
Non-patent Document 2: MPEG-2 Test Model 5, ISO-IEC/JTC1/SC29/WG11/N0400, April 1993 (http://www.mpeg.org/MPEG/MSSG/tm5/)
In the conventional technology, it is difficult to make the codec delay become less than the processing time for one picture. However, there is the following method for making the codec delay become less than the processing time for one picture. In this method, each block in a picture is assigned to one of N groups, and a decode start time is assigned to each group. A group is, for example, one block line. A block line expresses a line of blocks in the horizontal direction of the picture.
If the amount of information generated in each group is made uniform, the difference in the decode start time of continuous groups matches the processing time for each group, and the time ct becomes the processing time ct/N of each group. Thus, as a result, it is possible to decrease the codec delay to the processing time for each group.
FIG. 2 illustrates an example where the codec delay is made to be less than one picture time by group division. A graph line 17 in FIG. 2 expresses the time transition of the buffer occupancy amount of the conventional method. Meanwhile, a graph line 15 in FIG. 2 expresses the time transition of the buffer occupancy amount according to group division.
According to the group division method, the decode start time dgt(i, n) of the “n” th group of P(i) (hereinafter, also expressed as G(i, n)) is defined, and the buffer occupancy amount is decreased. Each group is decoded by taking the group decode time ct/N indicated by the reference numeral 16 starting from the corresponding decode start time. Therefore, the delay in the display possible time (the time during which display is possible) of each group is reduced.
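The timing of the group division method can be sketched as follows. This is a minimal illustration under stated assumptions: the decode start times dgt(i, n) are assumed to be spaced evenly between dt(i−1) and dt(i) with dgt(i, N)=dt(i), which matches the uniform case described here but is not taken from any standard, and both function names are hypothetical.

```python
from fractions import Fraction

def group_decode_start_times(dt_prev, dt_cur, n_groups):
    """Assumed dgt(i, n) for n = 1..N, evenly spaced so that dgt(i, N) = dt(i)."""
    span = dt_cur - dt_prev
    return [dt_prev + span * Fraction(n, n_groups)
            for n in range(1, n_groups + 1)]

def group_display_possible_times(dt_prev, dt_cur, n_groups, ct):
    """Time at which each group has been decoded: dgt(i, n) + ct/N."""
    per_group = Fraction(ct) / n_groups   # group decode time ct/N
    return [start + per_group
            for start in group_decode_start_times(dt_prev, dt_cur, n_groups)]
```

For dt(i−1)=0, dt(i)=4, N=4, and ct=4, the last group becomes displayable at time 5, compared with dt(i)+ct=8 when decoding is scheduled in units of pictures.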
In the group division method, the amount of information generated in each group is substantially equal, and therefore the codec delay is reduced to the time per group. The codec delay takes its maximum value in a case where the information generation amount of the blocks in a group is significantly disproportionate. However, under actual circumstances, the disproportion in the generated information amount of the blocks in a group is reduced by appropriate rate control. In this case, it is theoretically possible to further reduce the codec delay, but this is difficult to achieve by the group division method. The reason for this is described with reference to FIGS. 3 through 6.
FIG. 3 illustrates operations of a receiving buffer of the video image decoding device. In the example of FIG. 3, the cumulative value of the amount of encoded data arriving at the receiving buffer, and the cumulative value of the encoded data consumed by a decoding process are used to express the operations of a receiving buffer.
A graph line 20 in FIG. 3 expresses the cumulative value of the amount of encoded data arriving at the receiving buffer. The encoded data is transmitted from the video image encoding device to the video image decoding device by a fixed rate R. In the example of FIG. 3, the first bit arrives at the receiving buffer of the video image decoding device at a time “at(0)”, which is zero.
A graph line 21 in FIG. 3 expresses the cumulative value of encoded data consumed by an instantaneous decoding process in units of pictures. After the initial delay dly, the “i” th picture P(i) (i=0, . . . ) is sequentially subjected to instantaneous decoding at dt(i). The difference dt(i+1)−dt(i) in the instantaneous decode time between two continuous pictures is fixed. The encoding information amount of P(i) is expressed by b(i).
at(i) and ft(i) express the times at which the first bit in the encoded data of P(i) and the last bit in the encoded data of P(i) arrive at the video image decoding device, respectively. In order to prevent the receiving buffer of the video image decoding device from underflowing, all encoded data of P(i) is to arrive by dt(i). That is to say, dt(i)≧ft(i) and dt(i−1)≧at(i) are to be satisfied.
The occupancy of the receiving buffer at each time corresponds to the difference between the graph line 20 and the graph line 21 at each time. For example, the occupancy of the receiving buffer after instantaneous decoding of P(0) at time dt(0) is the bit amount indicated by a reference numeral 25.
FIG. 4 illustrates the operation of the receiving buffer focusing on one P(i). FIG. 4 is illustrated by enlarging part of FIG. 3.
Particularly, the example of FIG. 4 illustrates a case where instantaneous decoding is performed in units of pictures, the receiving buffer of the video image decoding device does not underflow, and at(i) and ft(i) are the latest times, i.e., dt(i)=ft(i) and dt(i−1)=at(i). In the example of FIG. 4, the number of groups N is 4, the number of blocks and the generated information amount are uniform across the groups, and the interval dgt(i, n+1)−dgt(i, n) between the decode start times of consecutive groups is uniform.
A graph line 30 in FIG. 4 expresses the cumulative value of the amount of encoded data arriving at the receiving buffer of the video image decoding device. A graph line 31 expresses the cumulative value of the encoded data consumed by instantaneous decoding in units of pictures.
A graph line 32 expresses the cumulative value of the encoded data consumed by instantaneous decoding in the “n” th group G(i, n) of P(i) at dgt(i, n).
In the group division method, it is assumed that the amounts of generated information in the respective groups are averaged in the picture. That is to say, the total sum of the amounts of generated information in the blocks of each group of P(i) is b(i)/N, where b(i) is the amount of generated information in P(i).
The minimum value of the amount of generated information in the blocks in the groups of P(i) is zero, and the maximum value is b(i)/N. In a case where the blocks in P(i) are instantaneously decoded at equal intervals from dt(i−1) to dt(i), a graph line f(t) expressing the cumulative value of the consumed encoded data is present inside square areas indicated by reference numerals 35 through 38.
When the amounts of generated information in the blocks are equal, f(t) is a straight line (matching graph line 30) joining the bottom left vertex and the top right vertex of each of the square areas indicated by reference numerals 35 through 38. When a bit amount of the entire group is generated at the leading block, f(t) is a line connecting the left edge and the top edge of each of the square areas. The latter case corresponds to the maximum delay in terms of buffer delay.
In the example of FIG. 4, between the times of dt(i−1) to dt(i), the bits of the blocks in P(i) arrive at the receiving buffer. The arrival time g(x) of the “x” th bit (x=[1, b(i)]) is expressed by the following formula.
$$g(x) = dt(i-1) + \bigl(dt(i) - dt(i-1)\bigr) \cdot \frac{x}{b(i)} \qquad \text{(Formula 1)}$$
In view of the operations of an actual video image decoding device, a case where the blocks in P(i) are instantaneously decoded at equal intervals from dt(i−1) to dt(i) is considered. Assuming that the total number of blocks in the picture is M, the ideal instantaneous decode time p(i, m) of the “m” th block in P(i) is expressed by the following formula.
$$p(i, m) = dt(i-1) + \bigl(dt(i) - dt(i-1)\bigr) \cdot \frac{m}{M} \qquad \text{(Formula 2)}$$
Depending on the shape of f(t), f(t) may be above the graph line 30. That is to say, f(p(i, m))<g(f(p(i, m))) is satisfied; in this case, not all bits used for decoding the block have reached the receiving buffer of the video image decoding device, and underflow occurs. When the blocks have an equal number of bits, f(p(i, m))=g(f(p(i, m))) is satisfied and underflow does not occur, but this is the worst case in terms of buffer delay.
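The block-level check can be sketched in code. The sketch below is an interpretation, not a definitive implementation: it treats underflow at the “m”th block as the case where the block decode time p(i, m) falls before the arrival time of the last bit consumed up to that block, i.e., g applied to the cumulative bit count. All function names are hypothetical, and bit counts are assumed to be integers.

```python
from fractions import Fraction
from itertools import accumulate

def arrival_time(x, dt_prev, dt_cur, b):
    """Formula 1: arrival time g(x) of the x-th bit of P(i), out of b bits."""
    return dt_prev + (dt_cur - dt_prev) * Fraction(x, b)

def block_decode_time(m, dt_prev, dt_cur, num_blocks):
    """Formula 2: ideal instantaneous decode time p(i, m) of the m-th block."""
    return dt_prev + (dt_cur - dt_prev) * Fraction(m, num_blocks)

def underflowing_blocks(block_bits, dt_prev, dt_cur):
    """Return the 1-based indices of blocks whose bits arrive too late."""
    b = sum(block_bits)
    num_blocks = len(block_bits)
    late = []
    # cum is the cumulative number of bits consumed once block m is decoded
    for m, cum in enumerate(accumulate(block_bits), start=1):
        needed_by = block_decode_time(m, dt_prev, dt_cur, num_blocks)
        if needed_by < arrival_time(cum, dt_prev, dt_cur, b):
            late.append(m)
    return late
```

With four blocks between dt(i−1)=0 and dt(i)=4, equal block sizes give no late block, a front-loaded distribution such as 25, 5, 5, 5 bits makes the first three blocks late, and a back-loaded distribution such as 5, 5, 5, 25 bits gives no late block.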
When a bit amount of the entire group is generated at the leading block, the arrival time of all bits used for decoding the leading block is delayed by dgt(i, n+1)−dgt(i, n).
In the group division method, the shape of f(t) is not known to the video image decoding device. Therefore, underflow has to be avoided even if the bit arrival delay of the leading block of G(i, n) is the maximum value dgt(i, n)−dgt(i, n−1). Thus, the instantaneous decode times of all blocks in G(i, n) are to be delayed to dgt(i, n). That is to say, the decode start time of the leading block in P(i) is dgt(i, 1). Thus, the first problem with the conventional technology is that it is not possible to further reduce the codec delay.
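The dependence of the needed delay on the bit distribution can be illustrated numerically. The function below is a hypothetical sketch built on Formula 1 and Formula 2: for a given per-block bit distribution, it returns the smallest common delay of the block decode times that avoids underflow. The decoding device cannot compute this value in practice because, as stated above, it does not know the distribution.

```python
from fractions import Fraction
from itertools import accumulate

def required_decode_delay(block_bits, dt_prev, dt_cur):
    """Smallest common delay of all block decode times that avoids underflow."""
    total = sum(block_bits)
    num_blocks = len(block_bits)
    span = dt_cur - dt_prev
    worst = Fraction(0)
    for m, cum in enumerate(accumulate(block_bits), start=1):
        decode_time = dt_prev + span * Fraction(m, num_blocks)  # Formula 2
        arrival = dt_prev + span * Fraction(cum, total)         # Formula 1
        worst = max(worst, arrival - decode_time)
    return worst
```

For N=4 groups of four blocks each between dt(i−1)=0 and dt(i)=4, putting all 8 bits of each group in its leading block requires a delay of 3/4, which stays within the group interval of 1, whereas 2 bits in every block requires no delay. The group division method has to provision for the full group interval in either case.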
Furthermore, in the conventional technology, it is assumed that it is possible to instantaneously display the picture after decoding by a decode time ct/N. However, in Non-patent Document 1, an encoding method referred to as tiles is used, by which the picture may be divided not only horizontally but also vertically. Thus, even after decoding by a decode time ct/N, there may be cases where it is not possible to instantaneously display the picture. An example where it is not possible to instantaneously display the picture is described with reference to FIG. 5.
FIG. 5 illustrates an example where instantaneous display of an image is not possible. In Non-patent Document 1, the areas of a picture, which are obtained by dividing the picture not only horizontally but also vertically, are referred to as tiles. In the example of FIG. 5, the picture is divided into four tiles.
In the order of top left, top right, bottom left, and bottom right, the tiles are referred to as tile 0 (t40), tile 1 (t41), tile 2 (t42), and tile 3 (t43), and the tiles are processed in this order.
Furthermore, inside each tile, there are several groups including plural blocks. In the example of FIG. 5, groups 0 through 3 are indicated by s41 through s44. In this case, the decoding is performed in the order of groups, which is a scan order or a decoding order as indicated by reference numerals sc41 to sc42.
Unlike the decoding order, the display order may be a raster scan depending on the display. In this case, the order is as indicated by the reference numeral sc43. In this case, even if the decoding process for a group is completed, it may not be possible to instantaneously display the picture.
For example, immediately after decoding the group 0 (s41), the CTBs (coding tree blocks) in the left half of the upper stage of the picture, which are included in the tile 0 (t40), e.g., a block b41 and a block b42, belong to the group 0 (s41) and are thus displayable. However, the CTBs in the right half of the upper stage of the picture, which are included in the tile 1 (t41), e.g., a block b45 and a block b46, belong to the group 2 (s43), have not been decoded, and are thus not displayable.
When the display is performed by raster scan, the picture is displayed in order from the left edge of the screen to the right edge of the screen. Therefore, to display the top stage of the picture, the blocks belonging to the group 2 (s43) also have to be displayed. Accordingly, the display has to wait until the group 2 (s43) has been decoded and becomes displayable.
The time taken for the decoding of group 2 (s43) to be completed is the time taken to decode all blocks through which sc41 and sc42 pass in the scan order.
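The mismatch between the tile decoding order and a raster-scan display order can be sketched as follows. The layout here is a hypothetical grid of blocks divided into equally sized tiles, with tiles decoded in raster order and blocks inside each tile also decoded in raster order; it does not reproduce the exact grouping of FIG. 5, and the function names are illustrative.

```python
def tile_decode_order(rows, cols, tile_rows, tile_cols):
    """Map each block (row, col) to its position in the tile decoding order."""
    order = {}
    index = 0
    for tile_top in range(0, rows, tile_rows):
        for tile_left in range(0, cols, tile_cols):
            # blocks inside one tile are decoded in raster order
            for r in range(tile_top, min(tile_top + tile_rows, rows)):
                for c in range(tile_left, min(tile_left + tile_cols, cols)):
                    order[(r, c)] = index
                    index += 1
    return order

def blocks_decoded_before_row_display(order, row, cols):
    """Number of blocks that must be decoded before a full row is displayable."""
    return max(order[(row, c)] for c in range(cols)) + 1
```

For a picture of 4 by 4 blocks divided into four tiles of 2 by 2 blocks, the top row becomes displayable only after 6 blocks have been decoded, whereas without tiles (a single tile covering the picture) it is displayable after 4 blocks.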
In the group division method, decoding may be performed quickly, but the displayable time is not taken into consideration. Thus, the second problem with the conventional technology is that, in order to ensure that a picture is displayed, the display has to wait for the time for one picture.
Furthermore, Non-patent Document 1 defines an operation when the bit amount to be used for decoding a picture is larger than the bit amount that may be accumulated in a buffer, in a case where the picture is more complex.
FIG. 6 illustrates an operation when the bit amount to be used for decoding a picture is larger than the bit amount that may be accumulated in a buffer. The video image encoding device adjusts the encoding amount so that the accumulation of rate R indicated by a predetermined rate 51 in a graph 50 in FIG. 6 does not exceed the accumulation 52 of the drawn out bit amount of the picture.
However, when the picture is complex, the bit amount accumulated in the buffer is not enough for encoding, and there are cases where underflow occurs. An example is the case of a graph 53 in FIG. 6.
When underflow occurs, as indicated by a graph 54 in FIG. 6, the decoding device does not start decoding at the original decode time dt(0) of the picture, but executes decoding at the time dt′ when bits used for decoding are received at the buffer.
Generally, the display timing of a delayed picture is the timing dt(1), which is when the next picture is supposed to be displayed. For the picture that is supposed to be displayed at the time dt(1), decoding is performed but displaying is skipped.
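The picture-level behavior on underflow can be sketched as a simple schedule. This is a simplified, hypothetical model of the behavior described above: a picture whose data arrives after its decode time is decoded on arrival and displayed at the next display timing, and the picture originally due at that timing is decoded but its display is skipped. Delays of more than one display period are not modeled.

```python
def display_schedule(ft, dt):
    """Return the display time of each picture, or None if display is skipped.

    ft: arrival time of the last bit of each picture
    dt: nominal decode (and display) time of each picture
    """
    schedule = []
    skip_next = False
    for i in range(len(dt)):
        if skip_next:
            schedule.append(None)        # decoded, but its display slot was taken
            skip_next = False
        elif ft[i] <= dt[i]:
            schedule.append(dt[i])       # no underflow: displayed on time
        elif i + 1 < len(dt) and ft[i] <= dt[i + 1]:
            schedule.append(dt[i + 1])   # delayed picture uses the next timing
            skip_next = True
        else:
            schedule.append(None)        # too late even for the next timing
    return schedule
```

For example, with dt = [2, 4, 6] and ft = [3, 4, 5], the first picture is displayed at time 4, the second picture is skipped, and the third picture is displayed at time 6.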
The third problem with the conventional technology is that Non-patent Document 1 does not clearly define the operation when underflow occurs in units of groups.