1. Field of the Invention
The present invention relates to an image processing apparatus and a method therefor. More specifically, the present invention relates to an image processing apparatus for encoding and decoding image data and to a method of encoding and decoding the same.
2. Related Background Art
JPEG (Joint Photographic Experts Group), H.261, and its improved successor MPEG (Moving Picture Experts Group) exist as international standards for the encoding of sound and image data. To handle integrated sound and images in the current multimedia age, MPEG was improved into MPEG1, and MPEG1 underwent further improvement into MPEG2, both of which are currently in widespread use.
MPEG2 is a standard for moving picture encoding which was developed in response to demands for high image quality. Specifically:
(1) it can be used for applications ranging from communications to broadcasting, in addition to stored media data,
(2) it can be used for images with much higher quality than standard television, with possible extension to High Definition Television (HDTV),
(3) unlike MPEG1 and H.261, which can only be used with non-interlaced image data, MPEG2 can be used to encode interlaced images,
(4) it possesses scalability, and
(5) an MPEG2 decoder is able to process an MPEG1 bit stream; in other words, it is downwardly compatible.
Of the five characteristics listed, item (4), scalability, in particular is new to MPEG2. It is roughly classified into three types, spatial scalability, temporal scalability, and signal-to-noise ratio (SNR) scalability, which are outlined below.
Spatial Scalability
FIG. 1 shows an outline of spatial scalability encoding. The base layer has a low spatial resolution, while the enhancement layer has a high spatial resolution.
The base layer is provided by spatially sub-sampling the original image at a fixed ratio, lowering the spatial resolution (image quality) and reducing the encoding volume per frame. In other words, it is a layer with a lower spatial-resolution image quality and less code amount. Encoding takes place using inter-frame prediction encoding within the base layer. This means that the image can be decoded from the base layer alone.
On the other hand, the enhancement layer has a high image quality in terms of spatial resolution and a large code amount. The base layer image data is up-sampled (averaging, for example, is used to add a pixel between pixels of the low resolution image, creating a high resolution image) to generate an expanded base layer of the same size as the enhancement layer. Encoding takes place using not only predictions from images within the enhancement layer, but also predictions taken from the up-sampled expanded image. Therefore it is not possible to decode the image from the enhancement layer alone.
By decoding image data of the enhancement layer, encoded as described above, an image with the same spatial size as the original image is obtained, the image quality depending upon the rate of compression.
The use of spatial scalability allows two image sequences to be efficiently encoded, as compared to encoding and sending each image separately.
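As an illustration of the layering described above, the following sketch (hypothetical Python with NumPy, not part of any MPEG2 implementation; the function names and the simple decimation/repeat filters are assumptions for illustration) shows how a base layer might be obtained by sub-sampling and then up-sampled to serve as a prediction for the enhancement layer:

```python
import numpy as np

def make_base_layer(frame, ratio=2):
    """Spatially sub-sample the original frame at a fixed ratio
    (simple decimation; a real encoder would low-pass filter first)."""
    return frame[::ratio, ::ratio]

def upsample_base(base, ratio=2):
    """Expand the base layer back to the enhancement-layer size
    (nearest-neighbour repeat here for simplicity; MPEG2 defines
    its own interpolation filter)."""
    return np.repeat(np.repeat(base, ratio, axis=0), ratio, axis=1)

frame = np.arange(16, dtype=np.float32).reshape(4, 4)  # toy 4x4 "image"
base = make_base_layer(frame)       # 2x2 low-resolution base layer
expanded = upsample_base(base)      # 4x4 prediction for the enhancement layer
residual = frame - expanded         # the enhancement layer encodes this difference
```

Decoding the base layer alone yields the small image; adding the decoded enhancement-layer residual to the up-sampled base layer recovers the full-size image.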
Temporal Scalability
FIG. 2 shows an outline of temporal scalability encoding. The base layer has a small temporal resolution, while the enhancement layer has a large temporal resolution.
The base layer has a temporal resolution (frame rate) provided by thinning out the original image sequence on a frame basis at a constant rate, thereby lowering the temporal resolution and reducing the amount of encoded data to be transmitted. In other words, it is a layer with a lower image quality in terms of temporal resolution and less code amount. Encoding takes place using inter-frame prediction encoding within the base layer. This means that the image can be decoded from the base layer alone.
On the other hand, the enhancement layer has a high image quality in terms of temporal resolution and a large code amount. Encoding takes place using prediction not only from I, P, and B pictures within the enhancement layer, but also from the base layer image data. Therefore it is not possible to decode the image from the enhancement layer alone.
By decoding image data of the enhancement layer, encoded as described above, an image with the same frame rate as the original image is obtained, the image quality depending upon the rate of compression.
Temporal scalability allows, for example, a 30 Hz non-interlaced image and a 60 Hz non-interlaced image to be sent efficiently at the same time.
Temporal scalability is currently not in use. It is part of a future expansion of MPEG2 (treated as “reserved”).
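The frame-thinning scheme described above can be sketched as follows (hypothetical Python for illustration only; the function names and the 2:1 thinning ratio are assumptions, and real pictures would of course be image data rather than integers):

```python
def thin_frames(frames, keep_every=2):
    """Base layer: keep every Nth picture, lowering the frame rate."""
    return frames[::keep_every]

def enhancement_frames(frames, keep_every=2):
    """Enhancement layer: the pictures dropped from the base layer;
    these are predicted both from neighbouring enhancement-layer
    pictures and from the decoded base-layer pictures."""
    return [f for i, f in enumerate(frames) if i % keep_every != 0]

frames = list(range(8))           # stand-ins for 8 consecutive pictures
base = thin_frames(frames)        # e.g. the 30 Hz layer
enh = enhancement_frames(frames)  # together with base, restores 60 Hz
```

Decoding the base layer alone yields the lower frame rate; interleaving both layers restores the original frame rate.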
SNR Scalability
FIG. 3 shows an outline of SNR scalability encoding.
The layer having a low image quality is referred to as a base layer, whereas the layer having a high image quality is referred to as an enhancement layer.
The base layer is provided, in the process of encoding (compressing) the original image data, for example by dividing it into blocks, DCT transforming (into DC and AC coefficients), quantizing, and variable-length encoding, by compressing the original image at a relatively high compression rate (coarse quantization step size) to result in less code amount. That is, the base layer is a layer with a low image quality in terms of S/N and less code amount. In this base layer, encoding is carried out using MPEG1 or MPEG2 (with predictive encoding) applied to each frame.
On the other hand, the enhancement layer has a higher image quality and larger code amount than the base layer. The enhancement layer is provided by decoding the encoded image of the base layer, subtracting the decoded image from the original image, and intra-frame encoding only the subtraction result at a relatively low compression rate (with a quantization step size smaller than in the base layer). This enhancement-layer encoding takes place entirely within the frame (field); no inter-frame (inter-field) prediction encoding is used.
Using SNR scalability allows two types of images with differing picture quality to be encoded or decoded efficiently at the same time.
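The base/residual arrangement described above can be illustrated with a minimal quantization sketch (hypothetical Python with NumPy; uniform scalar quantization and the step sizes shown are assumptions for illustration, not the MPEG2 quantizer):

```python
import numpy as np

def quantize(block, step):
    """Uniform quantization; a larger step means higher compression
    and lower S/N."""
    return np.round(block / step)

def dequantize(q, step):
    return q * step

block = np.array([10.0, 23.0, -7.0, 4.0])  # toy transform coefficients

# Base layer: coarse quantization step -> low image quality, small code amount.
base_q = quantize(block, step=8.0)
base_rec = dequantize(base_q, step=8.0)

# Enhancement layer: the residual, encoded with a finer quantization step.
residual = block - base_rec
enh_q = quantize(residual, step=2.0)
enh_rec = dequantize(enh_q, step=2.0)

full_rec = base_rec + enh_rec  # decoding both layers yields the higher-SNR image
```

Decoding the base layer alone gives the coarse reconstruction; adding the decoded residual reduces the quantization error, i.e., raises the SNR.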
However, previous encoding devices have not provided an option to freely select the size of the base layer image in spatial scalability. The image size of the base layer is determined by the relationship between the enhancement layer and the base layer, and hence cannot be varied freely.
In addition, temporal scalability devices have faced similar limitations: the base layer frame rate is determined uniquely as a function of the enhancement layer, and could not be freely selected.
Therefore, previous encoding devices have not allowed factors governing the code amount, such as image size and frame rate, to be selected when using the scalability function. No factor directly related to the condition of the decoding device or the transmission lines on the output side could be selected.
In other words, when encoded image data is output from an encoding device employing spatial scalability or SNR scalability to a decoding device (receiving side), the image quality choices are limited to:
1) a low quality image decoded from the base layer only, or
2) a high quality image provided by decoding both the base layer and the enhancement layer.
Accordingly, there is no opportunity to select image quality (decoding speed) in accordance with the capabilities of the decoding device or the needs of an individual user, a problem which has not been addressed previously.
In addition, recent advances have taken place in the imaging field related to object encoding. MPEG4, currently being advanced as an imaging technology standard, is a good example. MPEG4 splits one image into a background and several objects which exist in front of that background, and then encodes each of the different parts independently. Object encoding enjoys many benefits.
If the background is a relatively static environment and only some of the objects in the foreground are in motion, then the background and all objects that do not move need not be re-encoded; only the moving objects are re-encoded. The code amount generated by re-encoding, that is, the code amount generated in encoding the next image frame, is greatly reduced, and transmission of a very high quality image at a low transfer rate can be attained.
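The selective re-encoding described above can be sketched as a simple change test per object (hypothetical Python; the function name, the dictionary representation of object state, and the example objects are all assumptions for illustration):

```python
def objects_to_reencode(prev_objects, curr_objects):
    """Only objects whose state changed since the previous frame need
    re-encoding; the static background and unmoved objects are skipped."""
    return [name for name in curr_objects
            if curr_objects[name] != prev_objects.get(name)]

prev = {"background": "static", "ball": (10, 20), "tree": (5, 5)}
curr = {"background": "static", "ball": (12, 22), "tree": (5, 5)}
# Only "ball" has moved, so only "ball" is re-encoded for the next frame.
```

With many static objects, the per-frame code amount is dominated by the few objects that actually change.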
In addition, computer graphics (CG) can be used to provide an object image. In this case, the encoder only needs to encode the CG mesh (position and shape change) data, further contributing to the slimming down of the transfer code amount.
On the decoder side, the mesh data can be used to construct the image through computation and to incorporate the constructed image into a picture. Using face animation as an example of CG, the eyes, nose, and other object data and their shape-change information, received from the encoder, are used by the decoder to operate on the characteristic data; the updating operation to incorporate the new data into the image is then carried out, thereby forming the animation.
Until now, when decoding encoded image data at an image display terminal, the hierarchical degree at which the decoding process takes place has been fixed. For that reason, there has been no way to select or change the hierarchy of the object to be displayed. Accordingly, processing has not matched the processing capabilities of the terminal, and optimal decoding that makes full use of the capabilities of the decoder, in relation to encoded image data changing with time from the encoder, has not been possible.
In addition, encoding and decoding of CG data has generally been considered a process best handled in software rather than hardware, and there are many examples of such software processes. Therefore, if the number of objects within one frame of an image increases, the load on the decoder rapidly increases; in particular, if the objects are face animation or similar CG data, the software load (operation volume, operation time) grows large.
A face object visual standard is defined for the encoding of face images in CG. In MPEG4, a face definition parameter (FDP), defining the shape and texture of the facial image, and a face animation parameter (FAP), used to express the motions of the face, eyebrows, eyelids, eyes, nose, lips, teeth, tongue, cheeks, chin, etc., are used as standards.
A face animation is made by processing the FDP and FAP data and combining the results, creating a larger load for the decoder than decoding encoded natural image data. Insufficient decoder performance may lead to obstacles such as the inability to decode, which can in turn lead to image quality problems such as object freeze and incomplete rendering.