Recently, with the arrival of the age of multimedia, in which audio, video and the like are handled integrally, existing information media such as newspapers, journals, TVs, radios and telephones, through which information is conveyed to people, have come under the scope of multimedia. Generally speaking, multimedia refers to a representation in which not only characters but also graphics, audio and especially images and the like are associated together. However, in order to include the aforementioned existing information media in the scope of multimedia, it is a prerequisite to represent such information in digital form.
However, when the amount of information carried by each of the aforementioned information media is estimated as an amount of digital information, one character requires 1 to 2 bytes, whereas audio requires more than 64 Kbits per second (telephone quality), and a moving picture requires more than 100 Mbits per second (present television reception quality). It is therefore not realistic for the information media to handle such an enormous amount of information directly in digital form. For example, although videophones are already in actual use via the Integrated Services Digital Network (ISDN), which offers a transmission speed of 64 Kbits/s to 1.5 Mbits/s, it is impossible to transmit television images or images taken by cameras directly through ISDN.
This therefore requires information compression techniques; for instance, videophones employ moving picture compression techniques compliant with the H.261 and H.263 standards recommended by the International Telecommunication Union-Telecommunication Standardization Sector (ITU-T). Likewise, according to the information compression techniques compliant with the MPEG-1 standard, image information as well as audio information can be stored on an ordinary music Compact Disc (CD).
Here, Moving Picture Experts Group (MPEG) refers to a family of international standards for the compression of moving picture signals standardized by the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC). MPEG-1 is a standard for compressing moving picture signals down to 1.5 Mbps, that is, for compressing the information of TV signals approximately down to a hundredth. Since the transmission rate within the scope of the MPEG-1 standard is limited to about 1.5 Mbps to achieve middle-quality pictures, MPEG-2, which was standardized with a view to meeting the requirements for high-quality pictures, allows transmission of moving picture signals at a rate of 2 to 15 Mbps to achieve the quality of TV broadcasting. Subsequently, the working group in charge of the standardization of MPEG-1 and MPEG-2 (ISO/IEC JTC1/SC29/WG11) standardized MPEG-4, which achieves a compression rate beyond those of MPEG-1 and MPEG-2, further enables encoding and decoding on a per-object basis, and realizes new functions required in the era of multimedia. The standardization of MPEG-4 originally aimed at an encoding method for low bit rates; however, the aim has since been extended to a more versatile coding of moving pictures at high bit rates, including interlaced pictures.
Furthermore, MPEG-4 AVC and ITU-T H.264 have been standardized since 2003 as a next-generation picture coding scheme with a higher compression rate, jointly developed by ISO/IEC and ITU-T (for example, refer to Non-Patent Reference 1). Currently, regarding the H.264 standard, a draft of a revised standard compliant with a High Profile suited for High Definition (HD) pictures has been developed. Whereas DVD players that reproduce movies and the like are widely known as an application of the MPEG-2 moving picture compression technology, the H.264 compression standard is scheduled to be adopted for players using a Blu-ray Disc ROM (BD-ROM), and its format standard is currently being developed.
In general, in the coding of a moving picture, the amount of information is compressed by reducing redundancy in the temporal and spatial directions. Inter-picture prediction coding, which aims at reducing the temporal redundancy, estimates a motion and generates a predictive picture on a block-by-block basis with reference to prior and subsequent pictures, and then codes the differential value between the obtained predictive picture and the current picture to be coded. Here, a “picture” is a term representing a single screen: it represents a frame when used for a progressive picture, whereas it represents a frame or fields when used for an interlaced picture. An interlaced picture is a picture in which a single frame consists of two fields captured at different times. For coding and decoding an interlaced picture, three ways are possible: processing a single frame as a frame, as two fields, or as a frame/field structure that is switched depending on a block in the frame.
A picture to which intra-picture prediction coding is performed without reference pictures is referred to as an “I-picture”. A picture to which inter-picture prediction coding is performed with reference to only a single picture is referred to as a “P-picture”. A picture to which inter-picture prediction coding is performed by referring simultaneously to two pictures is referred to as a “B-picture”. The B-picture can refer to two pictures, selected as an arbitrary combination from the pictures displayed either before or after the current picture to be coded. Whereas the reference pictures can be specified for each block, which is the fundamental unit of coding and decoding, they are distinguished as a first reference picture and a second reference picture: the first reference picture is the reference picture described first in a coded bit stream, and the second reference picture is the reference picture described after the first reference picture in the coded bit stream. Note that, as a condition for coding and decoding these I-pictures, P-pictures and B-pictures, the reference pictures need to be already coded and decoded.
Motion compensation inter-picture prediction coding is used for coding the P-picture or the B-picture. Motion compensation inter-picture prediction coding is a coding method which applies motion compensation to inter-picture prediction coding. Motion compensation is a method of reducing the amount of data while increasing prediction precision by estimating an amount of motion (hereinafter referred to as a motion vector) of each part in a picture, rather than simply predicting a picture from the pixel values of a reference frame, and by performing prediction in consideration of the estimated amount of motion. For example, the amount of data is reduced by estimating a motion vector of the current picture to be coded, and coding the predictive difference between the current picture and the predicted value shifted by the estimated motion vector. Since this method requires information about the motion vector at the time of decoding, the motion vector is also coded and recorded or transmitted.
The motion vector is estimated on a macroblock basis. Specifically, a motion vector is estimated by fixing a macroblock of the current picture to be coded, moving a candidate block within a search range of the reference picture, and finding the position of the reference block that best approximates the fixed macroblock.
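The block-matching search described above can be sketched as follows. This is a minimal illustration; the function name, the use of the sum of absolute differences (SAD) as the matching cost, and the exhaustive full search are assumptions made for the sake of example, not requirements of any particular standard.

```python
import numpy as np

def estimate_motion_vector(current, reference, block_xy, block_size=16, search_range=8):
    """Find the motion vector that best matches one macroblock of the
    current picture against a search window in the reference picture,
    using the sum of absolute differences (SAD) as the matching cost."""
    bx, by = block_xy
    cur_block = current[by:by + block_size, bx:bx + block_size].astype(int)
    best_cost, best_mv = float("inf"), (0, 0)
    h, w = reference.shape
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + block_size > w or y + block_size > h:
                continue  # candidate block falls outside the reference picture
            cand = reference[y:y + block_size, x:x + block_size].astype(int)
            cost = int(np.abs(cur_block - cand).sum())
            if cost < best_cost:
                best_cost, best_mv = cost, (dx, dy)
    return best_mv, best_cost
```

In an actual coder, the residual between the macroblock and the block at the returned displacement would then be transformed, quantized, and coded together with the motion vector.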
FIG. 1A and FIG. 1B are diagrams, each of which shows a configuration of a conventional MPEG-2 stream.
As shown in FIG. 1A and FIG. 1B, the MPEG-2 stream has the following hierarchical structure. A stream is made up of a plurality of Groups of Pictures (GOP). By setting each GOP as a basic unit for coding, a moving picture can be edited and randomly accessed. Each GOP is made up of plural pictures, each of which is one of an I picture, a P picture and a B picture. In addition, each stream, GOP and picture includes a synchronous signal (sync) indicating a boundary between the respective units and a header that is data commonly included in the respective units.
FIG. 2 is a diagram showing a configuration of another conventional stream.
This stream is compliant with H.264/MPEG-4 AVC, whose standardization has been developed by the Joint Video Team (JVT) established in cooperation between ITU-T and ISO/IEC. In JVT, there is no concept of a header; instead, common data called a parameter set PS is placed at the head of the stream. Furthermore, while there is no concept corresponding to a GOP, a randomly-accessible unit corresponding to a GOP can be structured by dividing the data at special pictures which can be decoded without relying on other pictures. Such a unit is called a random access unit RAU. The parameter sets PS include a picture parameter set PPS, which is data corresponding to the header of each picture, and a sequence parameter set SPS, which corresponds to the header of a unit equal to or greater than a GOP in MPEG-2. Each picture carries identifiers indicating the picture parameter set PPS and the sequence parameter set SPS to which the picture refers. Specifically, plural mutually different picture parameter sets PPS and sequence parameter sets SPS are each coded only once, and each current picture indicates with an identifier which of the parameter sets it refers to; thus the excess coding caused by repeatedly coding the same parameter sets (headers) for each picture is omitted, and the compression rate is increased. A picture number PN is an identification number for identifying a picture. Here, the picture number PN is a number indicating the display order of a picture, and is different from the PictureNumber indicating a decoding order as disclosed in Non-Patent Reference 1. The sequence parameter set SPS includes the maximum number of reference-available pictures, the picture size, and the like. The picture parameter set PPS includes the type of variable length coding (switching between Huffman coding and arithmetic coding), the initial value of the quantization step, the number of reference pictures, and the like.
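The identifier chain by which a picture refers to its parameter sets can be illustrated with the following sketch. The container types and field names here are hypothetical simplifications introduced for illustration, not actual H.264 syntax elements.

```python
from dataclasses import dataclass

# Hypothetical, simplified parameter-set containers (not H.264 syntax).
@dataclass
class SequenceParameterSet:
    sps_id: int
    max_num_ref_frames: int  # maximum number of reference-available pictures
    width: int               # picture size
    height: int

@dataclass
class PictureParameterSet:
    pps_id: int
    sps_id: int              # each PPS in turn refers to one SPS
    entropy_coding: str      # type of variable length coding
    init_qp: int             # initial value of the quantization step

def resolve_parameter_sets(pic_pps_id, pps_table, sps_table):
    """Follow the identifier chain picture -> PPS -> SPS: each coded
    picture carries only the id of the parameter set it refers to, so
    each distinct parameter set needs to be coded only once."""
    pps = pps_table[pic_pps_id]
    sps = sps_table[pps.sps_id]
    return pps, sps
```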
FIG. 3A and FIG. 3B are diagrams for explaining a reference state between GOPs used for the conventional MPEG-2 and the like.
FIG. 3A shows a predictive configuration among pictures in a Closed GOP. In this diagram, the diagonally shaded pictures are pictures to be referred to by other pictures, and each picture is arranged in display order. In the Closed GOP configuration, B pictures (B6 and B7), which are displayed before the display time of an Instantaneous Decoding Refresh (IDR) picture, can be predictive-coded with reference only to the IDR picture and cannot refer to pictures that belong to a different GOP. In addition, FIG. 3B shows a predictive configuration among pictures in an Open GOP. In this diagram, the diagonally shaded pictures are pictures to be referred to by other pictures, as in the case of FIG. 3A, and each picture is arranged in display order. In the Open GOP configuration, B pictures (B6 and B7), which are displayed temporally before an I picture (I8), can be predictive-coded with reference to pictures in the same GOP3 and to a P picture (P5) in GOP4, which is positioned immediately before GOP3.
In MPEG-2, P pictures (P2, P5, P11, and P14) can be predictive-coded with reference to only the I picture or P picture displayed temporally immediately before the current P picture. Furthermore, B pictures (B0, B1, B3, B4, B6, B7, B9, B10, B12 and B13) can be predictive-coded with reference to an I picture or P picture displayed temporally immediately before the current B picture and an I picture or P picture displayed temporally immediately after the current B picture, and their arrangement order in a stream is predetermined.
On the other hand, the H.264 standard introduces a very flexible predictive structure among pictures in order to significantly increase coding efficiency (the compression rate). Specifically, a P picture is not restricted to referring to only the one picture displayed immediately before it. A different reference picture for the P picture can be selected for each coded block from among I pictures, P pictures or B pictures regardless of the display order of those pictures, as long as those pictures have been decoded and are managed in a buffer for reference pictures. Similarly, a B picture is not restricted to referring to only the one picture displayed immediately before it and the one displayed immediately after it: a different set of two pictures can be selected for each coded block from among I pictures, P pictures or B pictures regardless of the display order of those pictures.
In the BD-ROM format standard, pictures can form random access units RAU even in the case where the pictures are arranged in a stream having the Open GOP structure, as in the conventional structure. However, the following restrictions, for example, are set for predictive coding among pictures under the H.264 standard.
(1) B pictures (B6 and B7), which are displayed temporally immediately before an I picture designated as a random access reproduction start point, may refer to pictures displayed temporally before and after themselves. Therefore, these B pictures are not displayed at the time of random access reproduction.
(2) A picture displayed temporally after the I picture designated as a random access reproduction start point must not refer to a picture displayed temporally before said I picture.
FIG. 4 is a block diagram of an image decoding apparatus which realizes a conventional image decoding method.
The image decoding apparatus shown in FIG. 4 includes a variable length decoding unit 901, a motion compensation unit 902, a picture memory 903, an adding unit 904 and a conversion unit 905.
The variable length decoding unit 901 decodes a stream Str, and outputs a quantized value Qco, a relative index Ind, a picture type Pty, and a motion vector MV. The quantized value Qco is inputted into the conversion unit 905, the relative index Ind into the picture memory 903, and the motion vector MV into the motion compensation unit 902.
The conversion unit 905 inverse-quantizes the quantized value Qco so as to reconstruct a frequency coefficient, further inverse-frequency-transforms the frequency coefficient into a pixel differential value, and outputs the result to the adding unit 904.
The adding unit 904 adds the pixel differential value and the predictive image outputted from the motion compensation unit 902, and generates a decoded picture Vout. The generated decoded picture Vout is stored into the picture memory 903. Here, since plural pictures can be used as reference pictures, each block requires a reference number (the relative index Ind) for specifying the identification number of the picture to be referred to. Accordingly, by establishing a correspondence between the relative index Ind and the picture number of each picture stored in the picture memory 903, a reference picture can be specified based on the relative index Ind.
The picture memory 903 holds a reference picture list, in which decoded pictures used as references are stored. The reference picture list includes a Short Term Reference List (STRL), in which reference pictures are managed on a First-In First-Out (FIFO) basis, and a Long Term Reference List (LTRL), into which explicitly specified reference pictures are stored and from which they are deleted by explicit commands.
The motion compensation unit 902 extracts an image region optimum for a predictive image from the decoded pictures stored in the picture memory 903, based on the motion vector MV and the relative index Ind obtained through the aforementioned processing. The motion compensation unit 902 then generates the predictive image and outputs it to the adding unit 904.
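The lookup of a reference picture from a block's relative index Ind can be sketched as follows. This minimal model keeps only a FIFO short-term list and omits long-term management; the class and method names are illustrative, not taken from the apparatus described above.

```python
from collections import deque

class PictureMemory:
    """Minimal sketch of a decoded-picture store with a FIFO short-term
    reference list (STRL); long-term (LTRL) handling is omitted."""

    def __init__(self, max_refs):
        self.max_refs = max_refs
        self.short_term = deque()   # most recently decoded picture first

    def store(self, picture_number, picture):
        """Store a newly decoded picture as a reference (FIFO management)."""
        self.short_term.appendleft((picture_number, picture))
        if len(self.short_term) > self.max_refs:
            self.short_term.pop()   # oldest reference is dropped

    def reference(self, relative_index):
        """Map a block's relative index Ind to a stored reference picture:
        index 0 is the most recently decoded reference."""
        return self.short_term[relative_index]
```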
Next, a method of generating the picture number PN, which is attached to each picture for specifying a picture based on the relative index Ind, shall be explained. Here, two of the three picture number PN generation methods defined by H.264 are summarized as examples.
(Picture Number Generation Method Example 1)
FIG. 5 is an explanatory diagram of a picture number generation method example 1.
Hereinafter, the picture number generation method example 1 shall be explained with reference to FIG. 5. In the generation method example 1, a picture number PN is generated by adding an offset value Msb, managed as a variable in the decoding processing, and a count value Isb, attached to each picture as a difference value from the offset value Msb.
In FIG. 5, each picture Pic is one of an I, P, and B picture, and the pictures are arranged in display order on the screen. In decoding order, however, it is assumed that the pictures are decoded in the order of IDR, . . . , P3, B1, B2, P6, B4, B5, I9, B7, B8, and so on. Msb1 and Msb2 indicate values of the offset value Msb, where Msb1=96 and Msb2=112.
As shown at the head of the picture Pic arrangement, an IDR picture is placed, as an I picture to be decoded first in an access unit AU, at the start point of a stream or at a place where there is no reference relationship between a current GOP to be decoded and the GOP positioned immediately before it. First, when the IDR picture is decoded, the offset value Msb is initialized to 0 and the count value Isb is also 0; therefore, the picture number PN of said IDR picture becomes 0. Next, when the count value Isb is 3 for the next P picture, Msb+Isb=0+3=3 is obtained; therefore, the picture number PN for said P picture is 3.
Furthermore, when the count value Isb is 1 for the subsequent B picture, 0+1=1 is obtained so that a picture number PN for said B picture is 1. In addition, when the count value Isb is 2 for the next subsequent B picture, 0+2=2 is obtained so that a picture number PN for that B picture is 2.
Here, in order to prevent an increase in the number of bits of the count value Isb and a resulting decrease in coding efficiency, when the count value Isb reaches a predetermined value (a maximum count value Lsb) while the aforementioned operations are repeated, the maximum count value Lsb is added to the offset value Msb, and the count value Isb is managed so as not to exceed the maximum count value Lsb. For example, in the case where the maximum count value Lsb=16, when the count value Isb has reached the maximum count value Lsb six times, the offset value Msb has been updated to the value of 0+16×6=96.
Then, when the picture P3 is decoded, in the case where the offset value Msb has been updated to 96 and the count value Isb equals 12, the picture number PN of the picture P3 is obtained as 96+12=108. Similarly, for the B1, B2, P6, B4 and B5 pictures (in decoding order), when the respective count values Isb are 10, 11, 15, 13 and 14, the picture numbers PN are respectively 106, 107, 111, 109 and 110.
Furthermore, for the pictures I9, B7, B8, B10 and subsequent pictures, the count value Isb would exceed the maximum count value Lsb of 16 if it continued to be increased. Therefore, the count value Isb is kept below 16 by updating the offset value to Msb2=Msb1+maximum count value Lsb=96+16=112. Accordingly, for the pictures I9, B7, B8 and B10, the respective count values Isb are 2, 0, 1 and 3, and the respective picture numbers PN are 114, 112, 113 and 115.
As described above, a picture number PN is generated using the count value Isb attached to each picture and the offset value Msb updated and managed in the decoding processing, and the display order on the screen and the reference pictures can be managed using the generated picture number PN.
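The generation procedure of example 1 can be sketched as the following function. The half-range comparison used to detect a wrap-around of the count value Isb is a simplification introduced for illustration, and the function name and parameters are assumptions, not taken from the standard.

```python
def picture_numbers_method1(isb_values, initial_msb=0, prev_isb=0, max_lsb=16):
    """Sketch of generation method example 1: PN = Msb + Isb, where the
    offset Msb is advanced by max_lsb whenever the per-picture count Isb
    wraps around. Wrap detection by half-range comparison is a
    simplification of the actual derivation in the standard."""
    numbers = []
    msb = initial_msb
    for isb in isb_values:
        if prev_isb - isb >= max_lsb // 2:
            msb += max_lsb          # count wrapped forward: advance offset
        elif isb - prev_isb > max_lsb // 2:
            msb -= max_lsb          # count wrapped backward: rewind offset
        numbers.append(msb + isb)
        prev_isb = isb
    return numbers
```

Feeding the count values of FIG. 5 in decoding order reproduces the picture numbers worked out above, including the update of the offset from Msb1=96 to Msb2=112 at the picture I9.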
(Picture Number Generation Method Example 2)
FIG. 6 is an explanatory diagram of the picture number generation method example 2.
In the picture number generation method example 2 shown in FIG. 6, a picture number PN is generated by adding an offset value FNO, managed as a variable in the decoding processing, to a frame number (a count value) fn attached to each picture, and doubling the sum. It should be noted that, in the case of an unreferenced picture, the picture number PN is generated by subtracting 1 from the doubled value.
FNO1 and FNO2 are values of offset values FNO. In the example shown in FIG. 6, FNO1 is 96 and FNO2 is 112.
First, when an IDR picture is decoded, the offset value FNO is initialized to 0, the frame number fn becomes 0, and the picture number PN for said IDR picture also becomes 0. When the frame number fn is 1 for the next B picture, 2×(offset value FNO+frame number fn)=2×(0+1)=2 is obtained, and the picture number PN for said B picture becomes 2.
Furthermore, when the frame number fn is 2 for the subsequent B picture, 2×(0+2)=4 is obtained, and the picture number PN for that B picture becomes 4. When the frame number fn is 3 for the next subsequent P picture, 2×(0+3)=6 is obtained and the picture number PN for said P picture becomes 6.
In the case where the frame number fn of a current picture is smaller than the frame number of a picture which is decoded temporally immediately before the current picture, the offset value FNO is updated by adding the maximum frame number MFN to the offset value FNO.
This is a mechanism for reducing the amount of coded bits by restricting the available value range of the frame number to a predetermined value or smaller, while still allowing a large picture number PN to be indicated. In FIG. 6, for example, the offset value FNO is FNO1 after the frame number fn of a current picture has become smaller than the frame number of the picture decoded temporally immediately before it six times, and the offset value is updated from FNO1 to FNO2 when this situation occurs for the seventh time.
For example, when a picture B1 in the picture Pic is decoded, if the offset value FNO is updated to 96 and the frame number fn is 10, the picture number PN of the picture B1 is obtained by 2×(96+10)=212. Similarly, in the case where pictures B2, P3, B4, B5 and P6 (in decoding order) have respective frame numbers fn of 11, 12, 13, 14 and 15, their picture numbers PN are respectively 214, 216, 218, 220 and 222.
Furthermore, when a picture B7 is decoded, if the frame number fn is 0, which is smaller than 15, the frame number of the picture positioned immediately before the picture B7, the offset value FNO is updated to FNO=FNO1+maximum frame number MFN=96+16=112. Accordingly, the picture number PN of the picture B7 becomes 2×(112+0)=224. Similarly, when the pictures B8, I9 and B10 have respective frame numbers fn of 1, 2 and 3, respective picture numbers PN of 226, 228 and 230 are obtained.
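Method example 2 can likewise be sketched as follows. The wrap detection, which advances the offset whenever the frame number decreases, follows the description above; the function name and parameters are illustrative assumptions.

```python
def picture_numbers_method2(frame_numbers, referenced=None, initial_fno=0,
                            max_frame_num=16):
    """Sketch of generation method example 2: PN = 2 * (FNO + fn), where
    the offset FNO grows by max_frame_num whenever the frame number fn
    decreases relative to the previously decoded picture. For a picture
    that is not referenced, 1 is subtracted from the doubled value."""
    if referenced is None:
        referenced = [True] * len(frame_numbers)
    numbers, fno, prev_fn = [], initial_fno, 0
    for fn, ref in zip(frame_numbers, referenced):
        if fn < prev_fn:
            fno += max_frame_num    # fn wrapped around: advance the offset
        pn = 2 * (fno + fn)
        if not ref:
            pn -= 1                 # unreferenced picture
        numbers.append(pn)
        prev_fn = fn
    return numbers
```

Feeding the frame numbers of FIG. 6 in decoding order reproduces the picture numbers worked out above, including the update from FNO1=96 to FNO2=112 at the picture B7.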
Thus, as in the picture number generation method example 1, the picture number PN can be generated using the value of the frame number fn attached to each picture and the offset value FNO updated and managed in the decoding processing. Therefore, the reference pictures can be managed using the generated picture number. It should be noted that, in the picture number generation method example 2, the decoding order is the same as the display order.
    Non-Patent Reference 1: ISO/IEC 14496-10, International Standard: “Information technology—Coding of audio-visual objects—Part 10: Advanced video coding” (Dec. 1, 2003), pp. 82-100.