The present invention pertains to the field of art for digital video processing and relates particularly to a video coding device for encoding video data at a high efficiency and a video decoding device for decoding coded data prepared by said video coding device at a high efficiency.
There has been proposed a video coding method which is capable of encoding a specified area to be of a higher image quality than that of other areas.
A video coding method described in the reference ISO/IEC JTC1/SC29/WG11 MPEG95/030 selects a specified area (hereinafter referred to as the selected area) and encodes it to have a higher image quality by controlling quantizer step sizes and time resolution.
Another conventional method includes an area-selecting portion intended to select a specified area of a video image. When selecting, e.g., a face area of a video image on the display of a video telephone, it is possible to select the area by using a method described in the reference "Real-time auto face-tracking system" (The Institute of Image Electronics Engineers of Japan, Previewing Report of Society Meeting, 93-04-04, pp. 13-16, 1993).
An area-position-and-shape coding portion encodes the position and shape of a selected area. An arbitrary shape may be encoded by using, e.g., chain codes. The coded position and shape are assembled into coded data and transferred or accumulated by a coded-data integrating portion.
A coded-parameter adjusting portion adjusts a variety of parameters usable for controlling image quality or data amount in video encoding so that the area-position-and-shape coding portion may encode a selected area to get a higher image quality than that of other areas.
A parameter coding portion encodes the adjusted parameters. The coded parameters are assembled into coded data and transferred or accumulated by the coded-data integrating portion. A video coding portion encodes input video data by using these parameters and a combination of conventional coding methods such as motion-compensated prediction, orthogonal transformation, quantization and variable-length coding. The coded video data is assembled into coded data by the coded-data integrating portion, and the coded data is then transferred or accumulated.
The selected area is thus encoded to have a higher image quality than that of other areas.
As mentioned above, the conventional art improves the quality of a selected-area image by allocating more bits to it through adjustment of parameters such as quantizer step sizes, spatial resolution and time resolution. The conventional art, however, has the problem that it cannot obtain a specified-area image by decoding only a part of the coded data and/or obtain a decoded area image of relatively low quality, because the selected area and the other areas are included in the same group of coded data. Recently, many studies have been made on hierarchical structures of coded data, but they have not succeeded in creating a system that allows the selection of a specified area.
There has been studied a video coding method which is adapted to synthesize different kinds of video sequences.
A paper "Image coding using hierarchical representation and multiple templates" appeared in Technical Report of IEICE (Institute of Electronics Information and Communication Engineers) IE94-159, pp. 99-106, 1995, describes such an image synthesizing method that combines a video-sequence being a background video and a part-video-sequence being a foreground video (e.g., a figure image or a fish image cut-out by using the chromakey technique) to produce a new sequence.
In a conventional method, a first video-sequence is assumed to be a background video and a second video-sequence is assumed to be a part video. An alpha plane is weight data used when synthesizing a part image with a background image in a moving picture (video) sequence; an example is an image made of pixels weighted with values from 1 to 0. The alpha-plane data is assumed to be 1 inside a part and 0 outside it. The alpha data may take a value between 0 and 1 in the boundary portion between a part and its outside, in order to indicate a mixed state of pixel values in the boundary portion or the transparency of a transparent substance such as glass.
In the conventional method, a first video-coding portion encodes the first video-sequence and a second video-coding portion encodes the second video-sequence according to an international standard video-coding system, e.g., MPEG or H.261. An alpha-plane coding portion encodes the alpha plane. In the above-mentioned paper, this portion uses the techniques of vector quantization and the Haar transform. A coded-data integrating portion (not shown) integrates the coded data received from the coding portions and accumulates or transmits the integrated coded data.
In the decoding device of the conventional method, a coded-data disassembling portion (not shown) disassembles coded data into the coded data of the first video-sequence, the coded data of the second video-sequence and the coded data of the alpha plane, which are then decoded respectively by a first video-decoding portion, a second video-decoding portion and an alpha-plane decoding portion. The two decoded sequences are synthesized according to weighted mean values by a first weighting portion, a second weighting portion and an adder. The first video-sequence and the second video-sequence are combined according to the following equation:
$$f(x,y,t) = (1 - \alpha(x,y,t))\, f_1(x,y,t) + \alpha(x,y,t)\, f_2(x,y,t)$$
In the equation, $(x,y)$ represents the coordinates of an intraframe pixel position, $t$ denotes a frame time, $f_1(x,y,t)$ represents a pixel value of the first video sequence, $f_2(x,y,t)$ represents a pixel value of the second video sequence, $f(x,y,t)$ represents a pixel value of the synthesized video sequence and $\alpha(x,y,t)$ represents the alpha-plane data. Namely, the first weighting portion uses $1 - \alpha(x,y,t)$ as a weight while the second weighting portion uses $\alpha(x,y,t)$ as a weight. As mentioned above, the conventional method produces a large amount of coded data because it must encode the alpha-plane data.
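The weighted-mean synthesis above can be illustrated with a minimal sketch. The function names and the sample pixel values are assumptions made for illustration only; frames are represented as plain nested lists of pixel values for a single frame time.

```python
def blend_pixel(f1, f2, alpha):
    """Weighted mean of a background pixel f1 and a part pixel f2."""
    return (1.0 - alpha) * f1 + alpha * f2


def blend_frame(background, part, alpha_plane):
    """Apply the weighted mean to every pixel of one frame."""
    return [
        [blend_pixel(b, p, a) for b, p, a in zip(b_row, p_row, a_row)]
        for b_row, p_row, a_row in zip(background, part, alpha_plane)
    ]


# Hypothetical one-row frame: alpha is 0 outside the part, 1 inside it,
# and 0.5 on the boundary pixel.
background = [[100.0, 100.0, 100.0]]
part = [[200.0, 200.0, 200.0]]
alpha_plane = [[0.0, 0.5, 1.0]]

synthesized = blend_frame(background, part, alpha_plane)
print(synthesized)  # [[100.0, 150.0, 200.0]]
```

The boundary pixel receives an equal mixture of both sequences, which is the mixed state that a non-binary alpha value is meant to express.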
To avoid this problem, the amount of information might be reduced by binarizing the alpha-plane data, but this is accompanied by a visual defect: a jagged, tooth-like line appears at the boundary between a part image and the background as a result of the discontinuous change of pixel values there.
A paper "Temporal Scalability based on image content" (ISO/IEC JTC1/SC29/WG11 MPEG95/211, (1995)) describes a technique for preparing a new video-sequence by synthesizing a part-video sequence of a high frame rate with a video-sequence of a low frame rate. This system encodes a lower-layer frame at a low frame rate by predictive coding and encodes only a selected area of an upper-layer frame at a high frame rate by predictive coding. The upper layer does not encode a frame already coded at the lower layer and instead uses a copy of the decoded image of the lower layer. The selected area may be considered to be a noteworthy part of the image, e.g., a human figure.
In a conventional method, at the coding side, an input video-sequence is thinned by a first thinning portion and a second thinning portion, and the thinned video-sequences with reduced frame rates are then transferred to an upper-layer coding portion and a lower-layer coding portion, respectively. The upper-layer coding portion has a frame rate higher than that of the lower-layer coding portion.
The lower-layer coding portion encodes a whole image of each frame in the received video-sequence by using an international standard video-coding method such as MPEG, H.261 and so on. The lower-layer coding portion also prepares decoded frames which are used for prediction coding and, at the same time, are inputted into a synthesizing portion.
In a code-amount control portion of a conventional coding portion, a coding portion encodes video frames by using a method or a combination of methods such as motion-compensated prediction, orthogonal transformation, quantization, variable-length coding and so on. A quantization-width (step-size) determining portion determines the quantization width (step size) to be used in the coding portion. A coded-data amount determining portion calculates the accumulated amount of generated coded data. Generally, the quantization width is increased when the amount of coded data grows too large and decreased when it falls too low, so that the amount of coded data is kept from rising or falling.
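The feedback between the accumulated amount of coded data and the quantization width can be sketched as follows. This is only an illustrative control rule under stated assumptions: the function name, the fixed adjustment step, the target of 1000 bits per frame and the width range of 1 to 31 are all hypothetical and are not taken from the conventional device.

```python
def next_quantization_width(width, generated_bits, target_bits,
                            step=1, min_width=1, max_width=31):
    """Return the quantization width to use for the next coding unit.

    The width is raised when more data than the target has been generated
    (coarser quantization, fewer bits) and lowered in the opposite case.
    """
    if generated_bits > target_bits:
        width += step
    elif generated_bits < target_bits:
        width -= step
    # Keep the width inside the assumed legal range.
    return max(min_width, min(max_width, width))


# Hypothetical run: the accumulated amount of coded data is compared with
# an accumulated target after each frame.
width = 10
accumulated = 0
history = []
for frame_number, frame_bits in enumerate([1500, 1400, 600, 500], start=1):
    accumulated += frame_bits
    width = next_quantization_width(width, accumulated, 1000 * frame_number)
    history.append(width)

print(history)  # [11, 12, 13, 13]
```

The width keeps rising while the accumulated amount exceeds the accumulated target and stops changing once the two are equal.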
The upper-layer coding portion encodes only a selected part of each frame in a received video-sequence on the basis of area information by using an international standard video-coding method such as MPEG, H.261 and so on. However, frames encoded at the lower-layer coding portion are not encoded by the upper-layer coding portion. The area information is information indicating a selected area of, e.g., an image of a human figure in each video frame; it is a binarized image taking 1 in the selected area and 0 outside the selected area. The upper-layer coding portion also prepares decoded selected areas of each frame, which are transferred to the synthesizing portion.
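The division of frames between the two layers can be sketched as below. The thinning intervals (every frame for the upper layer, every fourth frame for the lower layer) and the function names are assumptions chosen for illustration; the only rules taken from the description are that the upper layer runs at the higher frame rate and that frames coded at the lower layer are not coded again at the upper layer.

```python
def thin(frame_numbers, interval):
    """Keep every `interval`-th frame, reducing the frame rate."""
    return [n for n in frame_numbers if n % interval == 0]


def assign_layers(frame_count, upper_interval, lower_interval):
    """Return (lower-layer frames, upper-layer frames) as frame numbers."""
    frames = range(frame_count)
    lower = thin(frames, lower_interval)
    upper_input = thin(frames, upper_interval)
    lower_set = set(lower)
    # Frames already coded at the lower layer are skipped by the upper layer.
    upper = [n for n in upper_input if n not in lower_set]
    return lower, upper


lower_frames, upper_frames = assign_layers(9, upper_interval=1, lower_interval=4)
print(lower_frames)  # [0, 4, 8]
print(upper_frames)  # [1, 2, 3, 5, 6, 7]
```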
An area-information coding portion encodes the area information by using 8-directional quantizing codes. The 8-directional quantizing code is a numeric code indicating the direction between successive points along a contour, and it is usually used for representing digital graphics.
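A minimal sketch of an 8-directional code is given below. The numbering of the eight directions (0 for a step in the positive x direction, then proceeding in 45-degree increments) is an assumed convention, and the function names and the sample contour are hypothetical.

```python
# Assumed numbering: code k corresponds to the step DIRECTIONS[k] = (dx, dy).
DIRECTIONS = [(1, 0), (1, 1), (0, 1), (-1, 1),
              (-1, 0), (-1, -1), (0, -1), (1, -1)]


def encode_chain(points):
    """Encode a contour as its start point plus one code per step."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        step = (x1 - x0, y1 - y0)
        if step not in DIRECTIONS:
            raise ValueError("successive points must be 8-connected neighbours")
        codes.append(DIRECTIONS.index(step))
    return points[0], codes


def decode_chain(start, codes):
    """Rebuild the contour points from the start point and the codes."""
    points = [start]
    x, y = start
    for code in codes:
        dx, dy = DIRECTIONS[code]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


contour = [(0, 0), (1, 0), (2, 1), (2, 2), (1, 2), (0, 1), (0, 0)]
start, codes = encode_chain(contour)
print(codes)  # [0, 1, 2, 4, 5, 6]
```

Each step costs one code, so the amount of coded area information grows with the length of the contour, and a complicated shape needs correspondingly more codes.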
A synthesizing portion outputs a decoded lower-layer video-frame when the frame to be synthesized has been encoded by the lower-layer coding portion. When the frame to be synthesized has not been encoded at the lower-layer coding portion, the synthesizing portion outputs a decoded video-frame that is generated by using two decoded lower-layer frames, which stand before and after the lacking lower-layer frame, and one decoded upper-layer frame to be synthesized. The two lower-layer frames thus stand before and after the upper-layer frame. The synthesized video-frame is inputted into the upper-layer coding portion to be used therein for predictive coding. The image processing in the synthesizing portion is as follows:
An interpolating image is first prepared for two lower-layer frames. A decoded image of the lower-layer at time $t$ is expressed as $B(x,y,t)$, where $x$ and $y$ are co-ordinates defining the position of a pixel in a space. When the two decoded images of the lower-layer are located at time $t_1$ and $t_2$ and the decoded image of the upper-layer is located at $t_3$ ($t_1 < t_3 < t_2$), the interpolating image $I(x,y,t_3)$ of time $t_3$ is calculated according to the following equation (1):
$$I(x,y,t_3) = \frac{(t_2 - t_3)\, B(x,y,t_1) + (t_3 - t_1)\, B(x,y,t_2)}{t_2 - t_1} \tag{1}$$
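Equation (1) is a linear interpolation in time between the two lower-layer frames. A minimal sketch follows; the function name and the sample values are assumptions for illustration, and frames are plain nested lists of pixel values.

```python
def interpolate_frames(b1, b2, t1, t2, t3):
    """Interpolating image at time t3 from lower-layer frames at t1 and t2."""
    if not t1 < t3 < t2:
        raise ValueError("t3 must lie strictly between t1 and t2")
    w1 = (t2 - t3) / (t2 - t1)  # weight of the earlier frame
    w2 = (t3 - t1) / (t2 - t1)  # weight of the later frame
    return [
        [w1 * p1 + w2 * p2 for p1, p2 in zip(row1, row2)]
        for row1, row2 in zip(b1, b2)
    ]


# Hypothetical lower-layer frames at t1 = 0 and t2 = 4; upper layer at t3 = 1.
frame_t1 = [[0.0, 100.0]]
frame_t2 = [[40.0, 20.0]]
interpolated = interpolate_frames(frame_t1, frame_t2, t1=0, t2=4, t3=1)
print(interpolated)  # [[10.0, 80.0]]
```

Because t3 is closer to t1 than to t2, the earlier frame receives the larger weight (0.75 against 0.25).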
The decoded image $E$ of the upper layer is then synthesized with the obtained interpolating image $I$ by using synthesizing weight information $W(x,y,t)$ prepared from area information. A synthesized image $S$ is defined according to the following equation (2):
$$S(x,y,t) = [1 - W(x,y,t)]\, I(x,y,t) + E(x,y,t)\, W(x,y,t) \tag{2}$$
The area information M(x,y,t) is a binarized image taking 1 in a selected area and 0 outside the selected area. The weight information W(x,y,t) can be obtained by processing the above-mentioned binarized image several times with a low-pass filter. Namely, the weight information W(x,y,t) takes 1 within the selected area, 0 outside the selected area and a value between 0 and 1 at the boundary of the selected area.
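The preparation of the weight information and its use in the synthesis can be sketched on a single image row. The 3-tap averaging filter, the number of passes, the edge handling and all names and sample values are assumptions; the description only states that a low-pass filter is applied several times to the binarized area information.

```python
def box_filter(row):
    """One pass of an assumed 3-tap averaging filter with replicated edges."""
    last = len(row) - 1
    return [
        (row[max(i - 1, 0)] + row[i] + row[min(i + 1, last)]) / 3.0
        for i in range(len(row))
    ]


def weight_from_mask(mask_row, passes=2):
    """Smooth a binary mask row into weights between 0 and 1."""
    weights = [float(m) for m in mask_row]
    for _ in range(passes):
        weights = box_filter(weights)
    return weights


def synthesize(interp_row, upper_row, weights):
    """S = (1 - W) * I + W * E, applied pixel by pixel."""
    return [(1.0 - w) * i + w * e
            for i, e, w in zip(interp_row, upper_row, weights)]


mask = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0]   # binarized area information
weights = weight_from_mask(mask)
interpolated = [10.0] * len(mask)           # hypothetical interpolating image
upper = [200.0] * len(mask)                 # hypothetical upper-layer image
synthesized = synthesize(interpolated, upper, weights)
```

In this sketch the weight stays at 1 in the middle of the selected area, stays at 0 far from it, and takes intermediate values near the boundary, so the synthesized row changes gradually from the interpolating image to the upper-layer image. Each additional filter pass visits every pixel again, which is where the processing cost of this approach comes from.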
The coded data prepared by the lower-layer coding portion, the upper-layer coding portion and the area information coding portion is integrated by an integrating portion (not shown) and then is transmitted or accumulated.
In the decoding side of the conventional system, a coded-data disassembling portion (not shown) separates coded data into lower-layer coded data, upper-layer coded data and area-information coded data. These coded data are decoded respectively by a lower-layer decoding portion, an upper-layer decoding portion and an area-information decoding portion.
A synthesizing portion of the decoding side is similar in construction to the synthesizing portion of the coding side. It synthesizes an image by using a decoded lower-layer image and a decoded upper-layer image according to the same method as described for the coding side. The synthesized video frame is displayed on a display screen and, at the same time, is inputted into the upper-layer decoding portion to be used for prediction there.
The above-described decoding device decodes both the lower-layer and the upper-layer frames, but a decoding device consisting of only a lower-layer decoding portion is also applicable, omitting the upper-layer decoding portion and the synthesizing portion. This simplified decoding device can reproduce a part of the coded data.
Problems to be solved by the present invention:
(1) As mentioned above, the conventional art obtains an output image from two lower-layer decoded images and one upper-layer decoded image by first preparing an interpolating image of the two lower-layer frames. It therefore encounters the problem that the output image may be considerably deteriorated, with a large distortion occurring around the selected area, if the position of the selected area changes with time.
The above-mentioned problem is described as follows:
Images A and C are two decoded lower-layer frames and an image B is a decoded upper-layer frame. The images are displayed in the time order A, B and C. Because the selected area moves, an interpolating image determined from the images A and C shows two selected areas overlapped with each other. The image B is further synthesized with the interpolating image by using weight information, so that the output image has three selected areas overlapped with each other. The two selected areas of the lower-layer images appear like afterimages around the selected-area image of the upper layer, thereby considerably deteriorating the quality of the image. Because the lower-layer frames are normal and only the synthesized frames have the above-mentioned distortion, the video sequence may be displayed with a periodic flicker-like distortion that considerably impairs the image quality.
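The afterimage effect can be reproduced on a single image row. All values below are hypothetical: the background is 0, the selected area has the value 100, it occupies a different position in each of the images A, B and C, the image B lies midway in time between A and C, and binary weights are used for simplicity.

```python
def interpolate_row(a, c, t1, t2, t3):
    """Linear interpolation in time between two lower-layer rows."""
    w1 = (t2 - t3) / (t2 - t1)
    w2 = (t3 - t1) / (t2 - t1)
    return [w1 * pa + w2 * pc for pa, pc in zip(a, c)]


def synthesize_row(interp, upper, weights):
    """S = (1 - W) * I + W * E, applied pixel by pixel."""
    return [(1.0 - w) * i + w * e for i, e, w in zip(interp, upper, weights)]


image_a = [100.0, 100.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # area on the left
image_c = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0, 100.0]   # area on the right
image_b = [0.0, 0.0, 0.0, 100.0, 100.0, 0.0, 0.0, 0.0]   # area in the middle
weights = [0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]       # area information of B

interp = interpolate_row(image_a, image_c, t1=0, t2=2, t3=1)
output = synthesize_row(interp, image_b, weights)
print(output)  # [50.0, 50.0, 0.0, 100.0, 100.0, 0.0, 50.0, 50.0]
```

The output row contains the correct selected area in the middle together with two half-intensity copies at the positions it occupied in A and C, which correspond to the afterimages described above.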
(2) The conventional art uses eight-directional quantizing codes for encoding area information. When the area information is encoded at a low bit rate or represents an area of complicated shape, the amount of coded area information increases and takes up a large portion of the total amount of coded data, which may cause deterioration of the image quality.
(3) The conventional art obtains weight information by making the area information pass through a low-pass filter several times. This increases the amount of processing operations.
(4) The conventional art uses a predictive coding method. However, predictive coding of the lower-layer frames may cause a large distortion if a scene change occurs in a video sequence. The distortion of any lower-layer frame may propagate to the related upper-layer images, resulting in prolonged distortion of the video.
(5) According to the conventional art, each lower-layer frame is encoded by using an international standard video-coding method (e.g., MPEG or H.261), so that the selected-area image differs little in quality from the other areas. In each upper-layer frame, on the contrary, only the selected area is encoded, at a high quality, so that the quality of the selected-area image may vary with time. This is perceived as a flicker-like distortion, which is a problem.