Recently, the multi-media era has arrived, in which sound, pictures and other data are integrated into a single medium, and conventional information media used as communication tools, such as newspapers, magazines, TV, radio and telephone, are regarded as targets of multi-media. Generally, multi-media is a form of simultaneous representation of not only characters but also graphics, sound, and especially pictures. In order to handle the above-described conventional information media as multi-media, it is a requisite to represent the information digitally.
However, it is unrealistic to directly process the huge amount of information carried by the above-described conventional information media in digital form. When the data amount of each information medium is calculated as a digital data amount, the data amount per character is 1 to 2 bytes, while that of sound per second is not less than 64 kbits (telephone speech quality) and that of moving pictures per second is not less than 100 Mbits (present TV receiving quality). For example, a TV telephone has already become commercially practical thanks to the Integrated Services Digital Network (ISDN) with a transmission speed of 64 kbps to 1.5 Mbps, but it is impossible to transmit the moving pictures of a TV camera as they are using ISDN.
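As a rough check of the figures above, the following Python sketch computes how much compression the quoted rates imply. The helper name `required_compression_ratio` and the constant names are illustrative only, and the rates are the lower bounds quoted in the text.

```python
# Rates quoted in the text, in bits per second (lower bounds).
TV_VIDEO_BPS = 100_000_000   # moving pictures at present TV receiving quality
ISDN_MIN_BPS = 64_000        # slowest ISDN transmission speed mentioned
ISDN_MAX_BPS = 1_500_000     # fastest ISDN transmission speed mentioned


def required_compression_ratio(source_bps: int, channel_bps: int) -> float:
    """Return how many times the source must shrink to fit the channel."""
    return source_bps / channel_bps


# Ratio needed to carry TV-quality video over the fastest and slowest ISDN rates.
ratio_fast = required_compression_ratio(TV_VIDEO_BPS, ISDN_MAX_BPS)
ratio_slow = required_compression_ratio(TV_VIDEO_BPS, ISDN_MIN_BPS)
```

Under these assumptions, even the fastest ISDN rate requires the video to be reduced by a factor of more than 60, which is why uncompressed transmission is ruled out.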
That is why information compression techniques are necessary. For example, the moving picture compression standards H.261 and H.263, which are recommended by the International Telecommunication Union-Telecommunication Standardization Sector (ITU-T), are used for TV telephones. Also, with the information compression technique of the MPEG-1 standard, it becomes possible to store image information, together with sound information, on a normal CD (Compact Disc) for music.
Here, Moving Picture Experts Group (MPEG) is an international standard for digitally compressing moving picture signals, and has been standardized by the ISO/IEC (the International Organization for Standardization/International Electrotechnical Commission). MPEG-1 is the standard for compressing moving picture signals down to 1.5 Mbps, that is, for compressing TV signal information to about one hundredth. Since the quality targeted by the MPEG-1 standard is a medium level that can be realized at a transmission rate of about 1.5 Mbps, MPEG-2 was standardized in order to satisfy the need for higher picture quality, and it compresses moving picture signals to 2 to 15 Mbps. At present, the work group (ISO/IEC JTC1/SC29/WG11), which standardized MPEG-1 and MPEG-2, has standardized MPEG-4 with a higher compression rate. The MPEG-4 standard (i) achieves a compression rate higher than those of the MPEG-1 standard and the MPEG-2 standard, (ii) enables coding, decoding and performing operations on an object-by-object basis, and (iii) realizes new functions necessary in this multimedia era. The initial object of the MPEG-4 standard was to standardize a coding method for pictures with low bit rates, but the object was extended to a general-purpose coding method that also handles interlace pictures with high bit rates. After that, ISO/IEC and ITU-T have jointly standardized MPEG-4 AVC (Advanced Video Coding) as a next-generation picture coding method with a high compression rate. This is expected to be used for next-generation optical disc related apparatuses or in broadcasting for mobile terminals.
Generally, in coding moving pictures, the information amount is compressed by reducing temporal and spatial redundancies. In inter picture prediction coding, which aims to reduce temporal redundancies, motion estimation and prediction picture generation are performed on a block-by-block basis with reference to a forward picture or a backward picture, and coding is performed on the differential value between the obtained prediction picture and the picture to be coded. Here, the term “picture” represents a single image. In a progressive picture, a picture means a frame, but in an interlace picture, it means a frame or a field. An “interlace picture” described here means a frame composed of two fields with a slight time lag. In the coding and decoding processes of interlace pictures, it is possible to process a frame as it is, to process it as two fields, or to select frame-based or field-based processing for each block in a frame.
A picture on which intra prediction coding is performed without referring to any reference picture is called an Intra Coded Picture (I picture). A picture on which inter prediction coding is performed referring to only one reference picture is called a Predictive Coded Picture (P picture). A picture on which inter prediction coding is performed referring to two reference pictures simultaneously is called a Bi-predictive Coded Picture (B picture). A B picture can refer to two pictures selected as an arbitrary combination of forward pictures and backward pictures in display time. Such two reference pictures can be specified on a block-by-block basis, the block being a basic unit of coding and decoding. Those reference pictures are distinguished from each other as follows: the reference picture described earlier in the coded bit stream is called the first reference picture, and the other reference picture described later is called the second reference picture. Note that such reference pictures must have already been coded or decoded in order to code or decode P pictures and B pictures.
Motion compensation inter prediction coding is used for coding P pictures and B pictures. Motion compensation inter prediction coding is an inter prediction coding method in which motion compensation is applied. Motion compensation is a method for improving prediction precision and reducing the data amount by estimating the motion amount (called a motion vector hereafter) of each block of a picture and by performing prediction coding that takes the motion vector into account. For example, the data amount is reduced by estimating the motion vectors of the picture to be coded and by coding each prediction residual between the prediction value, shifted by the amount of the motion vector, and the current picture to be coded. In this method, since motion vector information is needed in decoding, motion vectors are also coded, and recorded or transmitted.
Motion vectors are estimated on a macro block by macro block basis. More specifically, a motion vector is estimated by fixing the macro block of the picture to be coded (the standard block), moving a macro block of the reference picture (the reference block) within the search range, and finding the location of the reference block that is closest to the standard block.
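The search described above can be sketched as an exhaustive block-matching loop. The text does not say how "closest" is measured; this sketch assumes the sum of absolute differences (SAD), a commonly used cost. The names `sad` and `estimate_motion_vector` and the list-of-rows picture representation are hypothetical.

```python
from typing import List, Tuple

Picture = List[List[int]]  # rows of sample values (assumed representation)


def sad(cur: Picture, ref: Picture, cx: int, cy: int,
        rx: int, ry: int, size: int) -> int:
    """Sum of absolute differences between two size x size blocks."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(cur[cy + dy][cx + dx] - ref[ry + dy][rx + dx])
    return total


def estimate_motion_vector(cur: Picture, ref: Picture, cx: int, cy: int,
                           size: int, search_range: int) -> Tuple[int, int]:
    """Full search: keep the block at (cx, cy) in the current picture fixed,
    slide a candidate block over the reference picture, and return the
    displacement (mx, my) with the smallest SAD."""
    height, width = len(ref), len(ref[0])
    best_mv = (0, 0)
    best_cost = None
    for my in range(-search_range, search_range + 1):
        for mx in range(-search_range, search_range + 1):
            rx, ry = cx + mx, cy + my
            # Skip candidates that fall outside the reference picture.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (mx, my)
    return best_mv
```

A full search is the simplest choice and is shown only for clarity; practical encoders typically use faster search strategies.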
FIGS. 1A and 1B are structural diagrams of a conventional MPEG-2 stream. As shown in FIG. 1B, an MPEG-2 stream has the hierarchical structure described below. A stream is composed of one or more Groups of Pictures (each called a GOP hereafter). The use of a GOP as a basic unit in coding processing enables editing a moving picture or performing a random access. A GOP is made up of I pictures, P pictures and B pictures. A stream, a GOP and a picture each further include a synchronous signal (sync) indicating a border between units and a header indicating the data common to the unit, the units here being a stream, a GOP and a picture respectively.
FIGS. 2A and 2B respectively show examples of how inter picture prediction coding is performed in MPEG-2. The diagonally-shaded pictures in the figures are those pictures that are referred to by other pictures. As shown in FIG. 2A, in prediction coding in MPEG-2, P pictures (P0, P6, P9, P12 and P15) can refer to only a single picture, namely the immediately forward I picture or P picture in display time. Also, B pictures (B1, B2, B4, B5, B7, B8, B10, B11, B13, B14, B16, B17, B19, and B20) can refer to two pictures, namely a combination of the immediately forward I picture or P picture and the immediately backward I picture or P picture. Further, the order in which pictures are placed in a stream is predetermined. I pictures and P pictures are placed in the order of display time, and each B picture is placed immediately after the I picture or P picture that is displayed immediately after that B picture. As a structural example of a GOP, as shown in FIG. 2B, pictures from I3 to B14 are grouped into a single GOP.
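The placement rule can be illustrated with a short sketch that converts display order into stream order. The function name `display_to_stream_order` is hypothetical, pictures are represented only by labels such as "I3" or "B4", and the handling of B pictures that have no later I or P picture is an assumption made so the sketch terminates cleanly.

```python
from typing import List


def display_to_stream_order(display_order: List[str]) -> List[str]:
    """Reorder picture labels from display order to stream order.

    I and P pictures keep their display order; each run of B pictures is
    emitted right after the I or P picture displayed immediately after it.
    """
    stream: List[str] = []
    pending_b: List[str] = []
    for name in display_order:
        if name.startswith("B"):
            # Hold B pictures until their backward reference is emitted.
            pending_b.append(name)
        else:
            stream.append(name)
            stream.extend(pending_b)
            pending_b = []
    # Assumption: trailing B pictures with no later I/P are appended as-is.
    stream.extend(pending_b)
    return stream
```

For the display sequence B1, B2, I3, B4, B5, P6, this yields I3, B1, B2, P6, B4, B5, which matches a GOP that starts with I3 in the stream.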
FIG. 3A is a structural diagram of an MPEG-4 AVC stream. There is no concept equivalent to a GOP in the MPEG-4 AVC. However, since it is possible to construct a randomly-accessible unit equivalent to a GOP by segmenting data on the basis of a special picture that can be decoded without depending on other pictures, the unit will be called RAU (Random Access Unit) hereafter. In other words, a random access unit RAU is a coded picture group starting with an intra coded picture that can be decoded without depending on any picture.
Next, the access unit (simply called AU hereafter), which is a basic unit in handling a stream, will be described below. An AU is the unit for storing coded data equivalent to one picture, and includes a parameter set PS, slice data and the like. There are two types of parameter sets PS. One of them is a picture parameter set PPS (simply called PPS hereafter), which is data equivalent to the header of each picture. The other is a sequence parameter set SPS (simply called SPS hereafter), which is equivalent to the header included in a unit of a GOP or more in MPEG-2. An SPS includes the maximum number of reference pictures, a picture size and the like. On the other hand, a PPS includes a variable length coding type, an initial value of the quantization step, the number of reference pictures and the like. Each picture is assigned an identifier indicating which of the above-described PPS and SPS is referred to. Also, a frame number FN, which is the identification number for identifying a picture, is included in slice data. Note that a sequence starts with a special picture at which all the statuses needed for decoding are reset, as will be described below, and it is made up of a group of pictures that starts with a special picture and ends with the picture that is placed immediately before the next special picture.
There are two types of I pictures in MPEG-4 AVC. They are an Instantaneous Decoding Refresh (IDR) picture and the rest. An IDR picture is an I picture such that all the pictures placed after the IDR picture in the decoding order can be decoded without referring to pictures placed before the IDR picture in the decoding order; in other words, it is the I picture at which the statuses needed for decoding are reset. An IDR picture corresponds to the top I picture of an MPEG-2 closed GOP. A sequence in MPEG-4 AVC starts with an IDR picture. In the case of an I picture that is not an IDR picture, a picture placed after the I picture in the decoding order may refer to a picture placed before the I picture in the decoding order. The respective picture types are defined as follows. An IDR picture and an I picture are pictures that are composed of only I slices. A P picture is a picture that may be composed of P slices and I slices. A B picture is a picture that may be composed of B slices, P slices and I slices. Note that the slices of an IDR picture are stored in a NAL unit whose type is different from that of the NAL unit where the slices of a non-IDR picture are stored. Here, a NAL unit is a sub-picture unit.
An AU in MPEG-4 AVC can include not only the data necessary for decoding but also supplemental information and border information of the AU. Such supplemental information is called Supplemental Enhancement Information (SEI), and it is unnecessary for decoding of slice data. All the data such as a parameter set PS, slice data and a SEI are stored in Network Abstraction Layer (NAL) units, that is, NALUs. A NAL unit is composed of a header and a payload. The header includes a field indicating the type of data to be stored (called NAL unit type hereafter). The values of NAL unit types are defined respectively for the types of data such as a slice or a SEI. Referring to the value of a NAL unit type enables identifying the type of data stored in the NAL unit. The header of a NAL unit also includes a field called nal_ref_idc. The nal_ref_idc field is defined as a 2-bit field and takes a value of 0, or of 1 or more, depending on the type of the NAL unit. For example, the NAL unit of an SPS or a PPS takes 1 or more. In the case of the NAL unit of a slice, a slice to be referred to by other slices takes 1 or more, while a slice not to be referred to takes 0. Also, the NAL unit of a SEI always takes 0.
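A minimal sketch of reading these header fields follows. The text only states that nal_ref_idc is a 2-bit field; the exact bit positions used here (one forbidden zero bit, two bits of nal_ref_idc, five bits of NAL unit type, packed into the first byte) follow the H.264 NAL unit header layout and are not taken from this description. The names `NalHeader` and `parse_nal_header` are hypothetical.

```python
from typing import NamedTuple


class NalHeader(NamedTuple):
    forbidden_zero_bit: int  # 1 bit
    nal_ref_idc: int         # 2 bits: 0 means "not used for reference"
    nal_unit_type: int       # 5 bits: identifies slice, SEI, SPS, PPS, ...


def parse_nal_header(first_byte: int) -> NalHeader:
    """Split the first byte of a NAL unit into its three header fields."""
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("header byte must be in the range 0..255")
    return NalHeader(
        forbidden_zero_bit=(first_byte >> 7) & 0x01,
        nal_ref_idc=(first_byte >> 5) & 0x03,
        nal_unit_type=first_byte & 0x1F,
    )


def is_reference(header: NalHeader) -> bool:
    """A NAL unit with nal_ref_idc of 1 or more may be referred to."""
    return header.nal_ref_idc >= 1
```

For instance, a header byte of 0x06 parses to nal_ref_idc 0, which is consistent with the rule that the NAL unit of a SEI always takes 0.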
One or more SEI messages can be stored in the NAL unit of a SEI. A SEI message is composed of a header and a payload, and the type of information to be stored in the payload is identified by the type of a SEI message indicated in the header. Decoding an AU means decoding the slice data in an AU, and displaying an AU means displaying the decoding result of the slice data in the AU hereafter.
Here, since a NAL unit does not include information for identifying a NAL unit border, it is possible to add border information to the top of each NAL unit at the time of storing a NAL unit as part of an AU. In handling an MPEG-4 AVC stream in an MPEG-2 Transport Stream (TS) or an MPEG-2 Program Stream (PS), a start code prefix, which is the 3 bytes 0x000001, is added to the top of each NAL unit. Also, it is defined that a NAL unit indicating an AU border must be inserted at the top of an AU in an MPEG-2 TS or PS, such a NAL unit being called an Access Unit Delimiter.
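The following sketch shows how the start code prefix and the Access Unit Delimiter could be used to split a byte stream into NAL units and then into AUs. It is a simplification under stated assumptions: it ignores emulation prevention bytes and any extra zero bytes before a start code, and it assumes the Access Unit Delimiter is NAL unit type 9, as in H.264. The function names are hypothetical.

```python
from typing import List

START_CODE = b"\x00\x00\x01"  # 3-byte start code prefix
AUD_TYPE = 9                  # assumed NAL unit type of the Access Unit Delimiter


def split_nal_units(stream: bytes) -> List[bytes]:
    """Return the NAL units found between start code prefixes."""
    units: List[bytes] = []
    pos = stream.find(START_CODE)
    while pos != -1:
        start = pos + len(START_CODE)
        nxt = stream.find(START_CODE, start)
        end = len(stream) if nxt == -1 else nxt
        units.append(stream[start:end])
        pos = nxt
    return units


def group_into_access_units(nal_units: List[bytes]) -> List[List[bytes]]:
    """Start a new AU at every Access Unit Delimiter."""
    access_units: List[List[bytes]] = []
    for nal in nal_units:
        if not nal:
            continue  # skip empty units produced by back-to-back start codes
        nal_unit_type = nal[0] & 0x1F
        if nal_unit_type == AUD_TYPE or not access_units:
            access_units.append([])
        access_units[-1].append(nal)
    return access_units
```

Because every AU begins with a delimiter, a receiver can find AU borders by inspecting only the first byte after each start code, without parsing slice data.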
Conventionally, various techniques related to moving picture coding like this have been proposed (for example, refer to Patent Document 1).
Patent Document 1: Japanese Laid-Open Patent No. 2003-18549 publication.
FIG. 4 is a block diagram of a conventional moving picture coding apparatus.
The moving picture coding apparatus 1 is an apparatus that converts an input video signal Vin, through compression coding, into a bit stream such as a variable length coded stream, and outputs the resulting coded stream Str. The moving picture coding apparatus 1 includes a prediction structure determination unit PTYPE, a motion vector estimation unit ME, a motion compensation unit MC, a subtraction unit Sub, an orthogonal transform unit T, a quantization unit Q, an inverse quantization unit IQ, an inverse orthogonal transform unit IT, an addition unit Add, a picture memory PicMem, a switch SW and a variable length coding unit VLC.
The input video signal Vin is inputted into the subtraction unit Sub and the motion vector estimation unit ME. The subtraction unit Sub calculates the differential value between the input video signal Vin and the prediction picture, and outputs it to the orthogonal transform unit T. The orthogonal transform unit T converts the differential value into a frequency coefficient, and outputs it to the quantization unit Q. The quantization unit Q performs quantization on the inputted frequency coefficient, and outputs a quantization value Qcoef to the variable length coding unit VLC.
The inverse quantization unit IQ performs inverse quantization on the quantization value Qcoef to reconstruct the frequency coefficient, and outputs it to the inverse orthogonal transform unit IT. The inverse orthogonal transform unit IT performs inverse frequency transform to transform the frequency coefficient into a pixel differential value, and outputs it to the addition unit Add. The addition unit Add adds the pixel differential value to the prediction picture to be outputted from the motion compensation unit MC to make a decoded picture. The switch SW is turned ON when storage of the decoded picture is instructed, and the decoded picture is stored in the picture memory PicMem.
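The data flow through Sub, T, Q, IQ, IT and Add can be sketched for a single 4-sample block. The text does not specify the orthogonal transform or the quantizer; this sketch assumes a 4-point Hadamard transform as a stand-in and a uniform quantizer with step `qstep`. All function names are hypothetical.

```python
from typing import List, Tuple

# 4-point Hadamard matrix (assumed stand-in transform); H times H^T equals 4*I.
H = [
    [1, 1, 1, 1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [1, -1, 1, -1],
]


def transform(x: List[float]) -> List[float]:
    """Forward transform (unit T): residual samples to coefficients."""
    return [sum(h * v for h, v in zip(row, x)) for row in H]


def inverse_transform(c: List[float]) -> List[float]:
    """Inverse transform (unit IT): coefficients back to residual samples."""
    return [sum(H[k][n] * c[k] for k in range(4)) / 4 for n in range(4)]


def encode_block(cur: List[int], pred: List[int],
                 qstep: int) -> Tuple[List[int], List[float]]:
    """Return (Qcoef, decoded block) for one block, mirroring FIG. 4."""
    residual = [c - p for c, p in zip(cur, pred)]           # Sub
    coef = transform(residual)                              # T
    qcoef = [round(c / qstep) for c in coef]                # Q
    recon_coef = [q * qstep for q in qcoef]                 # IQ
    recon_residual = inverse_transform(recon_coef)          # IT
    decoded = [p + r for p, r in zip(pred, recon_residual)]  # Add
    return qcoef, decoded
```

The decoded block, not the original input, is what would be stored in the picture memory, so that the encoder predicts from the same pictures the decoder will reconstruct.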
On the other hand, the motion vector estimation unit ME, in which an input video signal Vin is inputted on a macro block by macro block basis, searches the decoded picture stored in the picture memory PicMem, and estimates the picture area that is closest to the input picture signal, and consequently determines the motion vector MV indicating the position. Motion vector estimation is performed on a block-by-block basis, the block being a segmented part of a macro block. Since plural pictures can be used as reference pictures at this time, identification numbers for specifying pictures to be referred to (relative indexes) are needed on a block-by-block basis. It becomes possible to specify reference pictures by calculating the picture numbers indicated by the relative indexes, such picture numbers being assigned to the respective pictures in a picture memory PicMem.
The motion compensation unit MC selects the picture area that is optimum as a prediction picture from the decoded pictures stored in the picture memory PicMem.
In the case where a random access unit start picture RAUin indicates that a random access unit RAU starts with the current picture, the prediction structure determination unit PTYPE uses the picture type Ptype to instruct the motion vector estimation unit ME and the motion compensation unit MC to perform intra picture coding on the target picture as a randomly-accessible special picture, and instructs the variable length coding unit VLC to code the picture type Ptype.
The variable length coding unit VLC performs variable length coding on the quantization value Qcoef, the relative index Index, the picture type Ptype and the motion vector MV to make a coded stream Str.
FIG. 5 is a block diagram of a conventional moving picture decoding apparatus 2. This moving picture decoding apparatus 2 includes a variable length decoding unit VLD, a picture memory PicMem, a motion compensation unit MC, an addition unit Add, an inverse orthogonal transform unit IT and an inverse quantization unit IQ. Note that, in the figure, the processing units that perform the same operations as the processing units of the conventional moving picture coding apparatus shown in the block diagram of FIG. 4 are assigned the same reference signs, and descriptions of them will be omitted.
The variable length decoding unit VLD decodes a coded stream Str, and outputs the quantization value Qcoef, the relative index Index, the picture type Ptype and the motion vector MV. The quantization value Qcoef, the relative index Index and the motion vector MV are inputted into the picture memory PicMem, the motion compensation unit MC and the inverse quantization unit IQ, and then decoding processing is performed on them. These operations have already been described for the conventional moving picture coding apparatus using the block diagram of FIG. 4.
A random access unit RAU shows that decoding can be performed starting with the top AU in the random access unit. However, as a conventional MPEG-4 AVC stream allows very flexible prediction structures, a storage apparatus having an optical disc or a hard disc cannot obtain information for determining the AUs to be decoded or displayed at the time of variable-speed playback or reverse playback.
FIGS. 6A and 6B are examples of the prediction structures of AUs. Here, a picture is stored in each AU. FIG. 6A is the prediction structure of AUs used in an MPEG-2 stream. The diagonally-shaded pictures in the figure are pictures that are referred to by other AUs. In MPEG-2, the AUs of P pictures (P4 and P7) can perform prediction coding referring only to a single AU, namely the AU of the immediately forward I picture or P picture in display time. Also, the AUs of B pictures (B1, B2, B3, B5 and B6) can perform prediction coding referring only to two AUs, namely a combination of the AUs of the immediately forward I picture or P picture and the immediately backward I picture or P picture in display time. Further, the order in which pictures are placed in a stream is predetermined as follows: the AUs of an I picture and P pictures are placed in the order of display time; and the AU of each B picture is placed immediately after the AU of the I picture or P picture that is displayed immediately after that B picture. Consequently, decoding can be performed in the following three ways: (1) all the pictures are decoded; (2) only the AUs of an I picture and P pictures are decoded and displayed; and (3) only the AU of an I picture is decoded and displayed. Therefore, the following three types of playback can easily be performed: (1) normal playback, (2) medium-speed playback, and (3) high-speed playback.
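Because the MPEG-2 rule depends only on the picture type, AU selection for each playback mode reduces to a simple filter. The sketch below illustrates this; the function name, the mode names and the label "I0" for the I picture are hypothetical.

```python
from typing import List

# Picture types kept for each playback mode (mode names are illustrative).
KEEP_BY_MODE = {
    "normal": "IPB",  # (1) all pictures
    "medium": "IP",   # (2) I and P pictures only
    "high": "I",      # (3) I pictures only
}


def select_aus_mpeg2(stream_order: List[str], mode: str) -> List[str]:
    """Pick the AUs to decode and display, using only the picture type,
    which is the first character of each label in this sketch."""
    keep = KEEP_BY_MODE[mode]
    return [au for au in stream_order if au[0] in keep]
```

With a stream order such as I0, P4, B1, B2, B3, P7, B5, B6, the medium-speed mode keeps I0, P4 and P7, and no inspection of reference relationships is needed.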
In the MPEG-4 AVC, prediction where the AU of a B picture refers to the AU of a B picture can be performed. FIG. 6B is an example of prediction structure in an MPEG-4 AVC stream, and the AUs of B pictures (B1 and B3) refer to the AU (B2) of the B picture. In this example, the following four types of decoding or display can be realized: (1) all the pictures are decoded; (2) only AUs, of an I picture, P pictures and B pictures, which are referred to are decoded and displayed; (3) only AUs of an I picture and P pictures are decoded and displayed; (4) only the AU of an I picture is decoded and displayed.
In addition, in the MPEG-4 AVC, the AU of a P picture can refer to the AU of a B picture. As shown in FIG. 7, the AU of a P picture (P7) can refer to the AU of a B picture (B2). In this case, the AU of a P picture (P7) can be decoded only after the AU of a B picture (B2) is decoded. Therefore, the following three types of decoding or display can be realized: (1) all the pictures are decoded; (2) only AUs, of an I picture, P pictures and B pictures, which are referred to are decoded and displayed; (3) only the AU of an I picture is decoded and displayed.
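When references are this flexible, the set of AUs that must be decoded has to be derived from the reference relationships themselves. The sketch below computes that set as a transitive closure over a reference table. Only the P7 to B2 reference comes from the text; the other entries in `EXAMPLE_REFS`, the label "I0" and the function name are hypothetical.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical reference table: AU label -> AUs it refers to.
# Only the P7 -> B2 edge is taken from the description of FIG. 7.
EXAMPLE_REFS: Dict[str, List[str]] = {
    "I0": [],
    "P4": ["I0"],
    "B2": ["I0", "P4"],
    "P7": ["B2"],
}


def aus_to_decode(targets: Iterable[str],
                  refs: Dict[str, List[str]]) -> Set[str]:
    """Return every AU that must be decoded to display the target AUs,
    following reference relationships transitively."""
    needed: Set[str] = set()
    stack = list(targets)
    while stack:
        au = stack.pop()
        if au in needed:
            continue
        needed.add(au)
        # AUs missing from the table are treated as having no references.
        stack.extend(refs.get(au, ()))
    return needed
```

In this example, displaying only P7 still requires decoding B2, so a rule based on picture type alone would select an undecodable set of AUs.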
In this way, as various prediction structures are allowed in the MPEG-4 AVC, analysis of slice data and judgment of the prediction structure must be made in order to know the reference relationship between AUs. This entails a problem that AUs to be decoded or displayed cannot be determined based on a rule that is predetermined depending on a playback speed at the time of performing jump-in playback, variable-speed playback and reverse playback, unlike in the case of the MPEG-2.