In the age of multimedia, which integrally handles audio, video, and other pixel values, existing information media, i.e. newspapers, magazines, television, radio, telephone, and other means through which information is conveyed to people, have recently come to be included in the scope of multimedia. In general, multimedia refers to representing not only characters but also graphics, voices, and especially pictures and the like together in association with one another. However, in order to include the aforementioned existing information media in the scope of multimedia, it becomes absolutely necessary to represent such information in digital form.
However, when the amount of information contained in each of the aforementioned information media is calculated as an amount of digital information, the amount of information per character is 1 to 2 bytes in the case of characters, whereas the amount of information required is 64 Kbit/s or more in the case of voices (telephone quality), and 100 Mbit/s or more in the case of a moving picture (current television reception quality). Thus, it is not realistic for the aforementioned information media to handle such an enormous amount of information as it is in digital form. For example, although video phones are already in actual use over Integrated Services Digital Network (ISDN), which offers a transmission speed of 64 Kbit/s to 1.5 Mbit/s, it is not possible to transmit video from televisions and cameras directly through ISDN.
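The figures above can be checked with a short calculation. The following is a minimal sketch, not a definitive model: the telephone-quality parameters (8 kHz sampling, 8 bits per sample) and the video parameters (720x480 pixels, 30 frames per second, 12 bits per pixel on average) are assumptions chosen for illustration and do not appear in the text above.

```python
def raw_bitrate(samples_per_second: int, bits_per_sample: int) -> int:
    """Uncompressed bit rate in bits per second."""
    return samples_per_second * bits_per_sample


# Assumed telephone-quality voice: 8 kHz sampling, 8 bits per sample.
voice_bps = raw_bitrate(8_000, 8)

# Assumed standard-definition video: 720x480 pixels, 30 frames per second,
# 12 bits per pixel on average (8-bit samples with 4:2:0 chroma subsampling).
video_bps = raw_bitrate(720 * 480 * 30, 12)

ISDN_MAX_BPS = 1_500_000  # upper ISDN rate quoted in the text

print(f"voice: {voice_bps / 1e3:.0f} kbit/s")        # voice: 64 kbit/s
print(f"video: {video_bps / 1e6:.1f} Mbit/s")        # video: 124.4 Mbit/s
print(f"video / ISDN: {video_bps / ISDN_MAX_BPS:.0f}x")
```

Under these assumptions, uncompressed video exceeds the fastest ISDN rate by nearly two orders of magnitude, which is the gap that compression has to close.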
Against this backdrop, information compression techniques have become required, and moving picture compression techniques compliant with H.261 and H.263 standards recommended by ITU-T (International Telecommunication Union-Telecommunication Standardization Sector) are employed for video phones, for example. Moreover, according to an information compression technique compliant with the MPEG-1 standard, it is possible to store picture information into an ordinary music CD (compact disc) together with audio information.
Here, MPEG (Moving Picture Experts Group) is an international standard on compression of moving picture signals standardized by ISO/IEC (International Organization for Standardization/International Electrotechnical Commission), and MPEG-1 is a standard for compressing television signal information to approximately one hundredth so that a moving picture signal can be transmitted at a rate of 1.5 Mbit/s. Furthermore, since the transmission speed achieved by the MPEG-1 standard is a middle-quality speed of about 1.5 Mbit/s, MPEG-2, which was standardized with a view to satisfying requirements for further improved picture quality, allows data transmission equivalent in quality to television broadcasting, in which a moving picture signal is transmitted at a rate of 2 to 15 Mbit/s. Moreover, MPEG-4 was standardized by the working group (ISO/IEC JTC1/SC29/WG11) which promoted the standardization of MPEG-1 and MPEG-2. MPEG-4, which provides a higher compression ratio than that of MPEG-1 and MPEG-2 and which enables object-based coding/decoding/operation, is capable of providing new functionality required in this age of multimedia. At the beginning stage of standardization, MPEG-4 aimed at providing a low bit rate coding method, but it has been extended as a standard supporting more general coding that handles interlaced images as well as high bit rate coding. Currently, an effort is being made jointly by ISO/IEC and ITU-T to standardize MPEG-4 AVC and ITU-T H.264 as picture coding methods of the next generation that offer a higher compression ratio.
In general, in coding of a moving picture, the amount of information is compressed by reducing redundancies in temporal and spatial directions. Therefore, in inter picture prediction coding aiming at reducing temporal redundancies, motion estimation and the generation of a predictive image are carried out on a block-by-block basis with reference to forward or backward picture(s), and coding is then performed on the differential value between the obtained predictive image and an image in the current picture to be coded. Here, “picture” is a term denoting one image. In the case of a progressive image, “picture” means a frame, whereas it means a frame or fields in the case of an interlaced image. Here, “interlaced image” is an image of a frame composed of two fields which are separated in capture time. In coding and decoding of an interlaced image, it is possible to handle one frame as (1) a frame as it is, (2) two fields, or (3) a frame structure or a field structure on a per-block basis within the frame.
A picture to be coded using intra picture prediction without reference to any pictures shall be referred to as an I picture. A picture to be coded using inter picture prediction with reference to only one picture shall be referred to as a P picture. And, a picture to be coded using inter picture prediction with reference to two pictures at the same time shall be referred to as a B picture. It is possible for a B picture to refer to two pictures which can be arbitrarily combined from forward/backward pictures in display order. Reference images (reference pictures) can be determined for each block serving as a basic coding/decoding unit. Distinction shall be made between such reference pictures by calling a reference picture to be described earlier in a coded bitstream as a first reference picture, and by calling a reference picture to be described later in the bitstream as a second reference picture. Note that as a condition for coding and decoding these types of pictures, pictures used for reference are required to be already coded and decoded.
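The condition that reference pictures must already be coded and decoded means that the order of pictures in the bitstream can differ from display order. The sketch below illustrates this under a simplifying assumption that is not stated in the text: each B picture refers to the nearest preceding and the nearest following I or P picture in display order. The function name `coding_order` is hypothetical.

```python
from typing import List


def coding_order(display_types: List[str]) -> List[int]:
    """Return display indices in an order where references come first.

    Assumes every B picture refers to the nearest I/P picture before it
    and the nearest I/P picture after it in display order.
    """
    order: List[int] = []
    pending_b: List[int] = []
    for idx, picture_type in enumerate(display_types):
        if picture_type == "B":
            # A B picture has to wait for its backward reference.
            pending_b.append(idx)
        else:
            # Emit the I or P picture, then the B pictures that needed it.
            order.append(idx)
            order.extend(pending_b)
            pending_b.clear()
    # Trailing B pictures with no later I/P picture are emitted last.
    order.extend(pending_b)
    return order


display = ["I", "B", "B", "P", "B", "B", "P"]
print(coding_order(display))  # [0, 3, 1, 2, 6, 4, 5]
```

In this example the P picture at display position 3 is coded before the two B pictures at positions 1 and 2, so that both of their reference pictures are available when they are coded or decoded.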
P pictures and B pictures are coded using motion compensated inter picture prediction. Coding by use of motion compensated inter picture prediction is a coding method that employs motion compensation in inter picture prediction coding. Unlike a method for performing prediction simply based on pixel values in a reference picture, motion compensation is a technique capable of improving prediction accuracy as well as reducing the amount of data by estimating the amount of motion (hereinafter referred to as “motion vector”) of each part within a picture and further by performing prediction in consideration of such amount of motion. For example, it is possible to reduce the amount of data through motion compensation by estimating motion vectors of the current picture to be coded and then by coding prediction residuals between the current picture to be coded and prediction values obtained by shifting by the amount of the respective motion vectors. In this technique, motion vectors are also recorded or transmitted in coded form, since motion vector information is required at the time of decoding.
Motion vectors are estimated on a per-macroblock basis. More specifically, a macroblock shall be previously fixed in the current picture to be coded, so as to estimate motion vectors by finding the position of the most similar reference block of such macroblock within the search area in a reference picture.
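The search described above can be sketched as an exhaustive block-matching search. This is only an illustration under stated assumptions: the similarity measure (sum of absolute differences), the integer-pixel search, and the names `sad` and `estimate_motion` are choices made for this example and are not taken from the text.

```python
from typing import List, Tuple

Frame = List[List[int]]  # rows of pixel values


def sad(cur: Frame, ref: Frame, cx: int, cy: int, rx: int, ry: int, n: int) -> int:
    """Sum of absolute differences between two n x n blocks."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(n)
        for i in range(n)
    )


def estimate_motion(
    cur: Frame, ref: Frame, cx: int, cy: int, n: int, search: int
) -> Tuple[Tuple[int, int], int]:
    """Find the displacement (dx, dy) of the best-matching reference block.

    The block at (cx, cy) in the current picture is compared with every
    block inside a +/- `search` pixel window of the reference picture.
    """
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = (0, 0), sad(cur, ref, cx, cy, cx, cy, n)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue  # candidate block lies outside the reference picture
            cost = sad(cur, ref, cx, cy, rx, ry, n)
            if cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


# Synthetic 16x16 reference picture and a current picture whose content
# has moved 2 pixels to the right and 1 pixel down.
ref = [[x * 5 + y * 11 for x in range(16)] for y in range(16)]
cur = [
    [ref[y - 1][x - 2] if y >= 1 and x >= 2 else 0 for x in range(16)]
    for y in range(16)
]
mv, cost = estimate_motion(cur, ref, cx=4, cy=4, n=4, search=3)
print(mv, cost)  # (-2, -1) 0
```

The resulting vector points from the block position in the current picture back to the matching area in the reference picture, which is why it is negative for content that moved right and down.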
FIG. 1 is a diagram illustrating an example data structure of a bitstream. As FIG. 1 shows, the bitstream has a hierarchical structure such as below. The bitstream (Stream) is formed of more than one group of pictures (GOP). By using GOPs as basic coding units, it becomes possible to edit a moving picture as well as to make a random access. Each GOP is made up of plural pictures, each of which is one of I picture, P picture, and B picture. Each picture is further made up of plural slices. Each slice, which is a strip-shaped area within each picture, is made up of plural macroblocks. Moreover, each stream, GOP, picture, and slice includes a synchronization signal (sync) for indicating the ending point of each unit and a header (header) which is data common to said each unit.
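The hierarchy described with reference to FIG. 1 can be modeled as nested containers. The following is a minimal sketch; the class and field names are hypothetical, headers are represented as plain dictionaries, and synchronization signals are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Macroblock:
    data: bytes = b""


@dataclass
class Slice:
    header: Dict[str, int] = field(default_factory=dict)
    macroblocks: List[Macroblock] = field(default_factory=list)


@dataclass
class Picture:
    picture_type: str  # "I", "P" or "B"
    header: Dict[str, int] = field(default_factory=dict)
    slices: List[Slice] = field(default_factory=list)


@dataclass
class GOP:
    header: Dict[str, int] = field(default_factory=dict)
    pictures: List[Picture] = field(default_factory=list)


@dataclass
class Stream:
    header: Dict[str, int] = field(default_factory=dict)
    gops: List[GOP] = field(default_factory=list)


def count_macroblocks(stream: Stream) -> int:
    """Walk Stream -> GOP -> Picture -> Slice and count the macroblocks."""
    return sum(
        len(s.macroblocks)
        for gop in stream.gops
        for picture in gop.pictures
        for s in picture.slices
    )


# One GOP with an I picture and a P picture, two slices of three macroblocks each.
def make_picture(picture_type: str) -> Picture:
    slices = [Slice(macroblocks=[Macroblock() for _ in range(3)]) for _ in range(2)]
    return Picture(picture_type=picture_type, slices=slices)


stream = Stream(gops=[GOP(pictures=[make_picture("I"), make_picture("P")])])
print(count_macroblocks(stream))  # 12
```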
Note that when data is carried not in a bitstream that is a sequence of streams, but in a packet and the like that is a piecemeal unit, the header and the data portion, which is the other part than the header, may be carried separately. In such case, the header and the data portion shall not be incorporated into the same bitstream, as shown in FIG. 1. In the case of a packet, however, even when the header and the data portion are not transmitted contiguously, it is simply that the header corresponding to the data portion is carried in another packet. Therefore, even when the header and the data portion are not incorporated into the same bitstream, the concept of a coded bitstream described with reference to FIG. 1 is also applicable to packets.
FIG. 2 is a block diagram showing the construction of an existing picture coding apparatus. In this drawing, a picture coding apparatus 1 is an apparatus for performing compression coding on an input picture signal Vin, so as to output a coded picture signal Str which has been coded into a bitstream by performing variable length coding and the like. Such picture coding apparatus 1 is comprised of a motion estimation unit ME, a motion compensation unit MC, a subtraction unit Sub, an orthogonal transformation unit T, a quantization unit Q, an inverse quantization unit IQ, an inverse orthogonal transformation unit IT, an addition unit Add, a picture memory PicMem, a switch SW, and a variable length coding unit VLC.
The picture signal Vin is inputted to the subtraction unit Sub and the motion estimation unit ME. The subtraction unit Sub calculates, as a prediction error, a difference between each image in the input picture signal Vin and each predictive image on a block-by-block basis, and outputs the calculated prediction error to the orthogonal transformation unit T. The orthogonal transformation unit T performs orthogonal transformation on the prediction error to transform it into frequency coefficients, and outputs such frequency coefficients to the quantization unit Q. The quantization unit Q quantizes such inputted frequency coefficients, and outputs the quantized values Qcoef to the variable length coding unit VLC.
The inverse quantization unit IQ performs inverse quantization on the quantized values Qcoef so as to turn them into the frequency coefficients, and outputs such frequency coefficients to the inverse orthogonal transformation unit IT. The inverse orthogonal transformation unit IT performs inverse frequency transformation on the frequency coefficients so as to transform them into a prediction error, and outputs such prediction error to the addition unit Add. The addition unit Add adds each prediction error and each predictive image outputted from the motion compensation unit MC, so as to form a decoded image. The switch SW turns ON when it is indicated that such decoded image should be stored, and such decoded image is then stored into the picture memory PicMem.
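The path from the subtraction unit Sub through T, Q, IQ, and IT to the addition unit Add can be sketched for a single block as follows. This is an illustration under assumptions that are not in the text: the orthogonal transformation is taken to be an orthonormal DCT-II, quantization is uniform with a single step size, and all function names are hypothetical.

```python
import math
from typing import List, Tuple

Block = List[List[float]]


def dct_matrix(n: int) -> Block:
    """Orthonormal DCT-II matrix; row k holds the k-th basis vector."""
    return [
        [
            math.sqrt((1 if k == 0 else 2) / n)
            * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        ]
        for k in range(n)
    ]


def matmul(a: Block, b: Block) -> Block:
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a: Block) -> Block:
    return [list(row) for row in zip(*a)]


def forward_transform(block: Block) -> Block:  # unit T
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def inverse_transform(coef: Block) -> Block:  # unit IT
    c = dct_matrix(len(coef))
    return matmul(matmul(transpose(c), coef), c)


def code_block(
    current: Block, predicted: Block, step: float
) -> Tuple[List[List[int]], Block]:
    """Return (Qcoef, decoded block) for one block."""
    # Sub: prediction error between the input block and the predictive block.
    residual = [[c - p for c, p in zip(cr, pr)] for cr, pr in zip(current, predicted)]
    # T then Q: frequency coefficients, then quantized values Qcoef.
    qcoef = [[round(v / step) for v in row] for row in forward_transform(residual)]
    # IQ then IT: the same reconstruction that a decoder would perform.
    recon = inverse_transform([[q * step for q in row] for row in qcoef])
    # Add: reconstructed prediction error plus the predictive block.
    decoded = [[p + r for p, r in zip(pr, rr)] for pr, rr in zip(predicted, recon)]
    return qcoef, decoded


predicted = [[100.0] * 4 for _ in range(4)]
current = [[108.0] * 4 for _ in range(4)]
qcoef, decoded = code_block(current, predicted, step=2.0)
print(qcoef[0])  # [16, 0, 0, 0]: a flat prediction error needs only one coefficient
```

Because the coder reconstructs the block from the quantized values rather than from the original input, the picture it stores for later reference matches what a decoder would produce from the same data.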
Meanwhile, the motion estimation unit ME, which receives the picture signal Vin on a macroblock basis, detects an image area closest to such input picture signal Vin from among the decoded pictures stored in the picture memory PicMem, and determines motion vector(s) MV indicating the position of such area. Motion vectors are estimated for each block, which is obtained by further dividing a macroblock. When this is done, it is possible to use more than one picture as reference pictures. A reference picture used for estimating a motion vector shall be identified by an identification number (reference index Index). The picture numbers of the respective pictures stored in the picture memory PicMem are associated with reference indices Index.
The motion compensation unit MC reads out an optimum picture as a predictive picture from among the decoded pictures stored in the picture memory PicMem, based on the motion vectors detected in the above processing and the reference indices Index.
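The readout performed by the motion compensation unit MC can be sketched as a lookup followed by a displaced block copy. This is a simplified illustration: the picture memory is modeled as a dictionary keyed by picture number, the association between reference indices and picture numbers as a list, and only integer-pixel motion vectors are handled. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Frame = List[List[int]]


def predict_block(
    pic_mem: Dict[int, Frame],
    index_to_picture_number: List[int],
    ref_index: int,
    mv: Tuple[int, int],
    x: int,
    y: int,
    n: int,
) -> Frame:
    """Fetch the n x n predictive block for the block at (x, y)."""
    # Reference index -> picture number -> decoded picture in PicMem.
    ref = pic_mem[index_to_picture_number[ref_index]]
    dx, dy = mv
    rx, ry = x + dx, y + dy
    if rx < 0 or ry < 0 or ry + n > len(ref) or rx + n > len(ref[0]):
        raise ValueError("motion vector points outside the reference picture")
    return [row[rx : rx + n] for row in ref[ry : ry + n]]


# Two decoded pictures, stored under picture numbers 7 and 9.
pic_mem = {
    7: [[x + 10 * y for x in range(8)] for y in range(8)],
    9: [[0] * 8 for _ in range(8)],
}
index_to_picture_number = [9, 7]  # reference index 0 -> picture 9, 1 -> picture 7

block = predict_block(pic_mem, index_to_picture_number, ref_index=1, mv=(-1, 2), x=2, y=2, n=2)
print(block)  # [[41, 42], [51, 52]]
```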
The variable length coding unit VLC performs variable length coding on each of the quantized values Qcoef, reference indices Index, and motion vectors MV so as to output them as a coded stream Str.
FIG. 3 is a block diagram showing the construction of an existing picture decoding apparatus. In this drawing, units that operate in the same manner as that of the units in the picture coding apparatus shown in FIG. 2 are assigned the same numbers, and descriptions thereof are omitted.
The variable length decoding unit VLD decodes the coded stream Str into quantized values Qcoef, reference indices Index, and motion vectors MV. Those quantized values Qcoef, reference indices Index, and motion vectors MV are inputted into the picture memory PicMem, the motion compensation unit MC, and the inverse quantization unit IQ, where decoding processing is performed. Processing to be performed in such decoding processing is equivalent to that performed in the existing picture coding apparatus shown in FIG. 2.
(Non-patent document) ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC, Joint Final Committee Draft of Joint Video Specification (2002-8-10).
However, with the existing picture coding apparatus, it is difficult to achieve a high compression ratio for all images containing many pixels and for all images of a variety of contents. Such existing picture coding apparatus is thus required to be capable of improving image quality as well as offering a high compression ratio.
To be more specific, the existing picture coding apparatus uses a fixed-size block as the unit for performing orthogonal transformation (orthogonal transformation size). This makes it difficult to achieve a high compression ratio for a moving picture signal including pictures with a variety of contents, such as high- and low-resolution pictures as well as pictures with many and few variations in brightness and colors. The reason is that the orthogonal transformation size is 8×8 pixels in the case of MPEG-1, MPEG-2, and MPEG-4, for example, whereas the orthogonal transformation size is 4×4 pixels in the case of MPEG-4 AVC, i.e. ITU-T H.264. On that point, since, compared with a low-resolution image, pixels in a high-resolution image (e.g. HDTV) are more strongly correlated with one another and the density among pixels of a display device (e.g. CRT) is higher, it is deemed desirable to use a larger orthogonal transformation size for a high-resolution image. Moreover, it is also desirable in many cases that a larger orthogonal transformation size be used for content with a smaller number of high frequency components, whereas a smaller orthogonal transformation size be used for content with a larger number of high frequency components.