1. Field of the Invention
The present invention relates to an image coding apparatus and an image coding method for generating a coded bit stream by coding images, and an image decoding apparatus and an image decoding method for receiving a coded bit stream and for decoding an image signal contained in the coded bit stream, and more particularly to an image coding apparatus and an image coding method according to MPEG-4 for coding images on an object-by-object basis, and to an image decoding apparatus and an image decoding method according to MPEG-4 for decoding a coded bit stream generated by coding images on an object-by-object basis.
2. Background Art
Conventionally, MPEG-4 (Moving Picture Experts Group Phase-4), which is currently being standardized by ISO/IEC JTC1/SC29/WG11, is known as a method of coding or decoding an image signal, for example.
MPEG-4 is a method that treats a moving picture sequence as a collection of moving image objects that take arbitrary shapes in time and space, and carries out coding and decoding on the basis of the individual moving image objects.
FIG. 1 shows a video data structure according to the MPEG-4 standard.
In MPEG-4, a moving image object including a time axis is referred to as a Video Object (VO), components of the VO are each referred to as a Video Object Layer (VOL), components of the VOL are each called a Group of Video Object Planes (GOV), and the image data that represents the momentary state of the GOV and forms a unit of coding is called a Video Object Plane (VOP). For example, the VO corresponds to the individual talkers and the background in a videoconference scene, the VOL is a unit of the talkers and the background with a particular temporal and spatial resolution, and the VOP is the momentary image data of the VOLs (corresponding to frames). The GOV is a data structure consisting of a plurality of VOPs, which is used as a unit of editing and random access, and is not necessarily required in coding.
FIG. 2 shows a concrete example of the VOPs. It shows two VOPs (VOP1 represents a man, and VOP2 represents a picture on a wall). Each VOP consists of texture data representing color gradation levels and geometric data representing the shape of the VOP. The texture data consists of an 8-bit luminance signal and color difference signals (sub-sampled to ½ the size of the luminance signal in both the horizontal and vertical directions). The geometric data is binary matrix data that assigns “1” to the inside of the VOP and “0” to the outside thereof, and has the same image size as the luminance signal (although in practice the geometric data has an 8-bit width per pixel, with the inside of the VOP assigned “255” and the outside assigned “0”, it is assumed in the following, for convenience, that they are assigned the binary values “1” and “0”).
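The relationship between the texture data and the geometric data described above can be sketched as a small data structure. This is only an illustrative model: the names `Vop`, `chroma_size`, `binarize_shape` and `blank_vop` are hypothetical, and the sketch assumes even image dimensions so that the ½ sub-sampling divides exactly.

```python
from dataclasses import dataclass
from typing import List, Tuple

Plane = List[List[int]]  # rows of 8-bit samples


@dataclass
class Vop:
    """Toy model of one VOP: texture planes plus geometric (shape) data."""
    luma: Plane      # 8-bit luminance, height x width
    chroma_u: Plane  # color difference plane, half size in each direction
    chroma_v: Plane  # color difference plane, half size in each direction
    shape: Plane     # same size as luma; 255 inside the VOP, 0 outside


def chroma_size(width: int, height: int) -> Tuple[int, int]:
    """Size of a color difference plane sub-sampled by 1/2 in both directions.

    Assumption of this sketch: width and height are even.
    """
    return width // 2, height // 2


def binarize_shape(shape: Plane) -> Plane:
    """Map the 8-bit shape values (255 / 0) to the binary values 1 / 0."""
    return [[1 if sample == 255 else 0 for sample in row] for row in shape]


def blank_vop(width: int, height: int) -> Vop:
    """Build a VOP whose planes are all zero, with consistent plane sizes."""
    c_width, c_height = chroma_size(width, height)
    return Vop(
        luma=[[0] * width for _ in range(height)],
        chroma_u=[[0] * c_width for _ in range(c_height)],
        chroma_v=[[0] * c_width for _ in range(c_height)],
        shape=[[0] * width for _ in range(height)],
    )
```

For a rectangular, time-invariant VOP the `shape` plane would simply be omitted, which corresponds to the case in which only the texture data is coded.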
In the moving picture representation based on the VOPs, a conventional frame image is obtained by placing the plurality of VOPs in position in a picture. If the shape of the VOP is rectangular and time-invariant, the VOP becomes synonymous with the frame. In this case, the geometric data is absent, and only the texture data is coded.
FIG. 3 shows an example of a conventional coded bit stream. A bit string called a start code is placed at the initial positions of the VO, VOL, GOV and VOP headers and of the VOP data. The start code is a unique word (a bit string that can be interpreted uniquely) for indicating the beginning of the individual header information and VOP data information. The individual header information contains information required for decoding the data in its own layer and in the layers below it, and information representing the attributes of the layer. For example, the VOL header information contains information required for decoding the VOPs constituting the VOL. The VOP data consists of the image data divided into macroblocks, each of which is a unit block to be coded. Although the VOP data as shown in FIG. 3 does not usually include the start code, the start code can be added to every set of a plurality of macroblocks. The VOP header information contains coding type information as to whether the VOP is intra coded or inter coded. The intra coding refers to a coding mode that codes the VOP to be coded using only information about the VOP itself without using the information associated with other VOPs. In contrast, the inter coding refers to a coding mode that codes the information on the VOP using the information associated with previous and following VOPs.
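The role of the start code as a unique word can be illustrated by a minimal scanner that locates layer boundaries in a byte string. The three-byte prefix and the value bytes used below are assumptions made for illustration and are not taken from this description; the function names are likewise hypothetical. The sketch also assumes that the prefix never occurs inside the coded data.

```python
from typing import List, Tuple

# Assumed for illustration: a start code is a 3-byte prefix plus one value byte.
START_CODE_PREFIX = b"\x00\x00\x01"


def classify(value: int) -> str:
    """Map an (assumed) start code value byte to the layer it introduces."""
    if 0x00 <= value <= 0x1F:
        return "VO"
    if 0x20 <= value <= 0x2F:
        return "VOL"
    if value == 0xB3:
        return "GOV"
    if value == 0xB6:
        return "VOP"
    return "other"


def find_start_codes(stream: bytes) -> List[Tuple[int, str]]:
    """Return (byte offset, layer name) for every start code in the stream."""
    found = []
    pos = stream.find(START_CODE_PREFIX)
    # Stop if the prefix is not followed by a value byte.
    while pos != -1 and pos + 3 < len(stream):
        found.append((pos, classify(stream[pos + 3])))
        pos = stream.find(START_CODE_PREFIX, pos + 3)
    return found


# A synthetic stream: VO, VOL, GOV and two VOPs with dummy payload bytes.
demo_stream = (
    b"\x00\x00\x01\x00"
    + b"\x00\x00\x01\x20" + b"\xAA"
    + b"\x00\x00\x01\xB3"
    + b"\x00\x00\x01\xB6" + b"\x12\x34"
    + b"\x00\x00\x01\xB6" + b"\x56"
)
demo_layers = [name for _, name in find_start_codes(demo_stream)]
```

Locating a start code only identifies where a header begins. The coding type of each VOP still has to be read from the VOP header that follows the start code.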
With the foregoing structure, the conventional image coding apparatus and image decoding apparatus can identify the coding mode of the VOP data only after analyzing the coding type information contained in the VOP header information in the coded bit stream. As a result, even if the coding side codes all the VOP data in a unit of the object such as the VOL or GOV using only the intra coding, the decoding side must analyze the header information of the individual VOPs to identify the coding mode applied to the VOPs.
Therefore, even if the coding side codes all the VOP data in a unit of the object such as the VOL or GOV using only the intra coding, when the decoding side attempts instantaneous access to a VOP at a desired time, or carries out “frame skip control” for decimating the image signal to be coded in accordance with the load of a decoder, it cannot identify the desired VOP to be accessed or the VOP to be decoded in the frame skip control until it recognizes the predictive structure and time information of the coded bit stream by analyzing the coded data of all the VOPs. This presents a problem of making the decoding processing difficult and prolonging the decoding time.
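The cost described above can be illustrated with a toy model in which each VOP header carries only a time and a coding type. The `VopHeader` record and the `find_access_point` function are hypothetical simplifications; an actual VOP header carries more information, and the sketch does not assume that the VOPs are stored in time order.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class VopHeader:
    time: int    # display time in arbitrary units (hypothetical field)
    intra: bool  # True if the VOP is intra coded, False if inter coded


def find_access_point(
    headers: List[VopHeader], target_time: int
) -> Tuple[Optional[int], int]:
    """Find the latest intra coded VOP at or before target_time.

    Returns (index of that VOP or None, number of headers analyzed).
    Without layer-level information, every VOP header has to be analyzed
    to learn its coding type and time.
    """
    best: Optional[int] = None
    analyzed = 0
    for index, header in enumerate(headers):
        analyzed += 1  # the coding type is known only after this analysis
        if not header.intra or header.time > target_time:
            continue
        if best is None or header.time > headers[best].time:
            best = index
    return best, analyzed


# Ten VOPs, all intra coded: the decoder still analyzes all ten headers.
all_intra = [VopHeader(time=t, intra=True) for t in range(10)]
access_index, headers_analyzed = find_access_point(all_intra, target_time=7)
```

In this model the number of headers analyzed equals the number of VOPs in the stream even when every VOP is intra coded, which corresponds to the prolonged decoding processing noted above.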