In the age of multimedia which integrally handles audio, video and other pixel values, existing information media, i.e., newspaper, magazine, television, radio, telephone and other means through which information is conveyed to people, have recently come to be included in the scope of multimedia. Generally, multimedia refers to something that is represented by associating not only characters, but also graphics, audio, and especially pictures and the like together. However, in order to include the aforementioned existing information media into the scope of multimedia, it appears as a prerequisite to represent such information in digital form.
However, when calculating the amount of information contained in each of the aforementioned information media as the amount of digital information, while the amount of information per character is 1 to 2 bytes in the case of characters, the amount of information to be required is 64 Kbits per second in the case of audio (telephone quality), and 100 Mbits per second in the case of moving pictures (current television reception quality). Therefore, it is not realistic for the aforementioned information media to handle such an enormous amount of information as it is in digital form. For example, although video phones are already in the actual use by using Integrated Services Digital Network (ISDN) which offers a transmission speed of 64 Kbits/s to 1.5 Mbits/s, it is not practical to transmit video of televisions and cameras directly through ISDN.
Against this backdrop, information compression techniques have become required, and moving picture compression techniques compliant with H.261 and H.263 standards recommended by ITU-T (International Telecommunication Union—Telecommunication Standardization Sector) are employed for video phones, for example. Moreover, according to information compression techniques compliant with the MPEG-1 standard, it is possible to store picture information into an ordinary music CD (compact disc) together with sound information.
Here, MPEG (Moving Picture Experts Group) is an international standard on compression of moving picture signals standardized by ISO/IEC (International Organization for Standardization, International Electrotechnical Commission), and MPEG-1 is a standard for compressing television signal information approximately into one hundredth so that moving picture signals can be transmitted at a rate of 1.5 Mbit/s. Furthermore, since a transmission speed achieved by the MPEG-1 standard is a middle-quality speed of about 1.5 Mbit/s, MPEG-2, which was standardized with a view to satisfying requirements for further improved picture quality, allows data transmission equivalent in quality to television broadcasting through which moving picture signals are transmitted at a rate of 2 to 15 Mbit/s. Moreover, MPEG-4 was standardized by the working group (ISO/IEC JTC1/SC29/WG11) which promoted the standardization of MPEG-1 and MPEG-2. MPEG-4, which provides a higher compression ratio than that of MPEG-1 and MPEG-2 and which enables an object-based coding/decoding/operation, is capable of providing a new functionality required in this age of multimedia. At the beginning stage of standardization, MPEG-4 aimed at providing a low bit rate coding method, but it has been extended as a standard supporting more general coding that handles interlaced images as well as high bit rate coding. Currently, an effort has been made jointly by ISO/IEC and ITU-T for standardizing MPEG-4 AVC and ITU-T H.264 as picture coding methods of the next generation that offer a higher compression ratio. As of August 2002, a committee draft (CD) is issued for a picture coding method of the next generation.
In general, in coding of a moving picture, the amount of information is compressed by reducing redundancies in temporal and spatial directions. Therefore, in inter picture prediction coding aiming at reducing temporal redundancies, motion estimation and generation of a predicative image are carried out on a block-by-block basis with reference to forward or backward picture(s), and coding is then performed on the difference value between the obtained predictive image and an image in the current picture to be coded. Here, “picture” is a term denoting one image. In the case of a progressive image, “picture” means a frame, whereas it means a frame or fields in the case of an interlaced image. Here, “interlaced image” is an image of a frame composed of two fields which are separated in capture time. In coding and decoding of an interlaced image, it is possible to handle one frame as a frame as it is, as two fields, or as a frame structure or a field structure on a per-block basis within the frame.
A picture to be coded using intra picture prediction without reference to any pictures shall be referred to as an I picture. A picture to be coded using inter picture prediction with reference to only one picture shall be referred to as a P picture. And, a picture to be coded using inter picture prediction with reference to two pictures at the same time shall be referred to as a B picture. It is possible for a B picture to refer to two pictures which can be arbitrarily combined from forward/backward pictures in display order. Reference images (reference pictures) can be determined for each block serving as a basic coding/decoding unit. Distinction shall be made between such reference pictures by calling a reference picture to be described earlier in a coded bitstream as a first reference picture, and by calling a reference picture to be described later in the bitstream as a second reference picture. Note that as a condition for coding and decoding these types of pictures, pictures used for reference are required to be already coded and decoded.
P pictures and B pictures are coded using motion compensated inter picture prediction. Coding by use of motion compensated inter picture prediction is a coding method that employs motion compensation in inter picture prediction coding. Unlike a method for performing prediction simply based on pixel values in a reference picture, motion estimation is a technique capable of improving prediction accuracy as well as reducing the amount of data by estimating the amount of motion (hereinafter referred to as “motion vector”) of each part within a picture and further by performing prediction in consideration of such amount of motion. For example, it is possible to reduce the amount of data through motion compensation by estimating motion vectors of the current picture to be coded and then by coding prediction residuals between prediction values obtained by shifting only the amount of the respective motion vectors and the current picture to be coded. In this technique, motion vectors are also recorded or transmitted in coded form, since motion vector information is required at the time of decoding.
Motion vectors are estimated on a per-macroblock basis. More specifically, a macroblock shall be previously fixed in the current picture to be coded, so as to estimate motion vectors by finding the position of the most similar reference block of such fixed macroblock within the search area in a reference picture.
FIG. 1 is a diagram illustrating an example data structure of a bitstream. As FIG. 1 shows, the bitstream has a hierarchical structure such as below. The bitstream (Stream) is formed of more than one group of pictures (GOP). By using GOPs as basic coding units, it becomes possible to edit a moving picture as well as to make a random access. Each GOP is made up of plural pictures, each of which is one of I picture, P picture, and B picture. Each picture is further made up of plural slices. Each slice, which is a strip-shaped area within each picture, is made up of plural macroblocks. Moreover, each stream, GOP, picture, and slice includes a synchronization signal (sync) for indicating the ending point of each unit and a header (header) which is data common to said each unit.
Note that when data is carried not in a bitstream being a sequence of streams, but in a packet and the like being a piecemeal unit, the header and the data portion, which is the other part than the header, may be carried separately. In such a case, the header and the data portion shall not be incorporated into the same bitstream, as shown in FIG. 1. In the case of a packet, however, even when the header and the data portion are not transmitted contiguously, it is simply that the header corresponding to the data portion is carried in another packet. Therefore, even when the header and the data portion are not incorporated into the same bitstream, the concept of a coded bitstream described with reference to FIG. 1 is also applicable to packets.
Generally speaking, the human sense of vision is more sensitive to the low frequency components than to the high frequency components. Furthermore, since the energy of the low frequency components in a picture signal is larger than that of the high frequency components, picture coding is performed in order from the low frequency components to the high frequency components. As a result, the number of bits required for coding the low frequency components is larger than that required for the high frequency components.
In view of the above points, the existing coding methods use larger quantization steps for the high frequency components than for the low frequency components when quantizing transformation coefficients, which are obtained by orthogonal transformation, of the respective frequencies. This technique has made it possible for the conventional coding methods to achieve a large increase in compression ratio with a small loss of picture quality from the standpoint of viewers.
Meanwhile, since quantization step sizes of the high frequency components with regard to the low frequency components depend on picture signal, a technique for changing the sizes of quantization steps for the respective frequency components on a picture-by-picture basis has been conventionally employed. A quantization matrix is used to derive quantization steps of the respective frequency components. FIG. 2 shows an example quantization matrix. In this drawing, the upper left component is a direct current component, whereas rightward components are horizontal high frequency components and downward components are vertical high frequency components. The quantization matrix in FIG. 2 also indicates that a larger quantization step is applied to a larger value. Usually, it is possible to use different quantization matrices for each picture, and the matrix to be used is described in each picture header. Therefore, even if the same quantization matrix is used for all the pictures, it is described in each picture header and carried one by one.
Meanwhile, current MPEG-4 AVC does not include quantization matrix as in MPEG-2 and MPEG-4. This results in difficulty in achieving optimal subjective quality in the current MPEG-4 AVC coding scheme and other schemes using uniform quantization in all DCT or DCT-like coefficients. When such quantization matrix scheme is introduced, we have to allow the current provision of MPEG-4 AVC or other standards to carry the quantization matrices, in consideration of compatibility with the existing standards.
Additionally, because of the coding efficiency improvement, MPEG-4 AVC has been able to provide the potential to be used in various application domains. The versatility warrants the use of different sets of quantization matrices for different applications; different sets of quantization matrices for different color channels, etc. Encoders can select different quantization matrices depending on application or image to be coded. Because of that, we must develop an efficient quantization matrix definition and loading protocol to facilitate the flexible yet effective transmission of quantization matrix information.