The present invention relates to an image data encoding/decoding method and apparatus and, more particularly, to an encoding/decoding method and apparatus which can decode encoded image data in units of pixels.
As a conventional method of generating a three-dimensional image at an arbitrary viewpoint position, a method of expressing a three-dimensional object by a plurality of small triangular planes (polygons), and computing the luminance values of respective polygons on the basis of a given viewpoint is well known.
However, as the shape of a three-dimensional object becomes complicated, it becomes harder to express the object using polygons. In such a case, even when smaller polygons are used, visual disturbance cannot be eliminated. Moreover, as polygons become smaller, the data volume for expressing the object and the computation volume required for generating a three-dimensional image at an arbitrary viewpoint increase.
On the other hand, as a method of generating a three-dimensional image in which the data volume does not depend on the object shape, a method using a ray space is known. This method generates and displays an image at an arbitrary viewpoint position using a group of images actually captured at a plurality of viewpoint positions, and is based on the ray space concept that defines a three-dimensional object as a set of light rays propagating in a three-dimensional space.
According to this concept, since an image of a three-dimensional object viewed from an arbitrary viewpoint is generated by computing the luminance values of pixels that form a visible area of the three-dimensional object, the computation volume upon expressing the object depends only on the number of pixels that express the visible area and does not depend on its shape. Since the shape can be expressed by pixels, an image of even an object with a complicated shape can be accurately reconstructed. In addition, since actually captured images are used, a virtual space with high reality, which cannot be obtained by a method based on three-dimensional geometric models, can be expressed.
The concept of a ray space will be further explained below. In a three-dimensional space, light rays coming from a light source, light rays reflected by objects, and the like exist. A light ray that passes through a given point in the three-dimensional space is uniquely defined by five variables that express its position (x, y, z) and direction (θ, φ). If a function that represents the light intensity of this light ray is defined as f, light ray group data in the three-dimensional space can be expressed by f(x, y, z, θ, φ). Furthermore, if a change over time in light ray group data is taken into consideration, that group data is expressed by f(x, y, z, θ, φ, t), i.e., the light ray group in the three-dimensional space is described as a six-dimensional space. This space is called a “ray space”.
A light ray group that passes through a plane (reference plane) Z=0 for t=0 will be examined. If a horizontal plane (X-Z plane) perpendicular to the Y-axis is considered, and disparity in the vertical direction is ignored (y=0), a real space is expressed as shown in FIG. 13 for respective values of φ. A light ray group coming from the reference plane is described by f(x, θ) using two variables, i.e., position x and angle θ. Therefore, a light ray group that passes a given point P(X, 0, Z) in the real space satisfies, for each φ:
X = x + Z tan θ  (1)
If a variable u = tan θ is defined, equation (1) is rewritten as:
X = x + uZ  (2)
Therefore, in the ray space, a single light ray in the real space is mapped onto one point, and the light intensity, i.e., color information, is recorded there. Also, as can be seen from equation (2), a light ray group that passes through a certain point in the real space is mapped onto a straight line in the x-u space.
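As an illustration, equation (2) can be exercised in a few lines of code. The following is a minimal sketch (all names hypothetical, not part of the claimed method): it maps a light ray passing through a real-space point P(X, 0, Z) with direction θ to its (x, u) ray-space coordinates, and the sampled points confirm that rays through one point lie on a straight line satisfying X = x + uZ.

```python
import math

def ray_to_xu(X, Z, theta):
    """Map the light ray through real-space point P(X, 0, Z) with
    direction theta to its (x, u) ray-space coordinates (equation (2))."""
    u = math.tan(theta)      # u = tan(theta)
    x = X - u * Z            # rearranged from X = x + uZ
    return x, u

# Rays through a single point trace a straight line in the x-u space:
points = [ray_to_xu(1.0, 2.0, t) for t in (-0.3, 0.0, 0.3)]
```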
FIG. 14 shows the state wherein light rays observed at a viewpoint position P(X, 0, Z) in the real space, and light rays observed from other viewpoint positions are mapped in the x-u space. Note that the x-u space forms a partial space of the aforementioned five-dimensional ray space. In this manner, when an image is captured from a sufficiently large number of viewpoints, the x-u space can be densely filled with data.
In order to accurately reconstruct an image at an arbitrary viewpoint position from this ray space, the y-axis direction, i.e., a dimension in the vertical direction, is required. However, in this case, ray space data must form at least a four-dimensional space x-y-θ-φ, and has a very large data size. Hence, conventionally, only the x-u space as a partial space of the ray space is considered. Furthermore, it is very redundant to provide color information to the entire coordinate system of the ray space. This is because, even when only the x-u space is considered, pixel information in the y-axis direction is still required to reconstruct an image, so a three-dimensional ray space must be formed and the light intensity of each light ray must be recorded there. To overcome this problem, a method of obtaining luminance values from multi-viewpoint images (images captured from a plurality of different viewpoint positions) loaded onto a memory by making ray space computations for all pixels of the image to be reconstructed has been proposed. Note that a ray space computation is a computation, made based on equation (2) in the x-u space, that reconstructs an image at an arbitrary viewpoint position on the basis of multi-viewpoint images.
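The per-pixel ray space computation described above can be reduced to a simplified sketch. Assuming, for illustration only, that the captured views are described solely by their x positions on the reference plane and that each image is a single scan line (all function and parameter names are hypothetical), a lookup based on equation (2) might look like:

```python
import math

def reconstruct_scanline(viewpoint, camera_x, images, width, fov):
    """Build one scan line of the view at `viewpoint` = (X, Z): for each
    pixel column, compute (x, u) from equation (2) and copy the pixel
    from the captured image whose reference-plane position is closest."""
    X, Z = viewpoint
    line = []
    for col in range(width):
        theta = (col / (width - 1) - 0.5) * fov    # viewing angle of column
        u = math.tan(theta)
        x = X - u * Z                              # from X = x + uZ
        nearest = min(range(len(camera_x)), key=lambda i: abs(camera_x[i] - x))
        line.append(images[nearest][col])
    return line
```

Note that this lookup runs once per pixel of the output image, which is exactly the repeated computation the prior art must perform for every scan line.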
However, in the prior art, since the x-u space considers only disparity in the x-axis direction (horizontal direction), an identical ray space computation must be repeated for all scan lines in the y-axis direction. In order to generate and display an image at an arbitrary viewpoint position in real time in correspondence with motions of the operator, high-speed ray space computations are required. In order to implement such computations, operations for randomly accessing multi-viewpoint images and reading pixel data must be done. That is, high-speed random access to multi-viewpoint images is required. Hence, in the aforementioned example, the x-u space and multi-viewpoint images are loaded onto the memory in advance.
In this fashion, conventionally, upon generating and displaying an image at an arbitrary viewpoint position, an identical ray space computation must be repeated, and a work memory having a very large size must be used. The large number of computations required to obtain pixel data often impairs real-time operation. Also, since ray space data that describe an object have a huge data size and all such data must be loaded onto the memory, the number of objects that can be expressed in a three-dimensional virtual environment using the ray space is limited. In order to display an image in real time, repetitive computations must be avoided, and in order to lay out many objects described using ray space data in a three-dimensional virtual environment, the work memory size occupied by the ray space data must be minimized.
For this reason, as described in, e.g., Japanese Patent Laid-Open No. 10-111951, a method of encoding multi-viewpoint data, which are captured to generate an arbitrary viewpoint image of a given three-dimensional object, to reduce its data size has been proposed.
In order to generate three-dimensional object image data at an arbitrary viewpoint using the ray space theory, multi-viewpoint image data obtained by sensing that object through 360° are required. For example, a three-dimensional object is placed on a turntable, and its image is captured every time the object is rotated by a predetermined angle in the horizontal direction, thus preparing multi-viewpoint image data for 360°. In order to generate data when the object is viewed from above and below, multi-viewpoint images are also captured by rotating the object in the vertical direction.
Hence, the smaller the predetermined angle, the higher the correlation between successively captured images. The method of Japanese Patent Laid-Open No. 10-111951 exploits this high correlation: reference images, which are not encoded, are periodically set among the multi-viewpoint image data before compression (encoding), and each of the remaining image data is encoded as pointer values indicating the pixels having the closest pixel values in the two reference images that have strong correlation (close image sensing angles) to that image data, thus reducing the total size of the image data.
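A minimal sketch of such a pointer-based scheme follows (illustrative only: images are flattened to lists of pixel values, and each pixel is matched against the two bracketing reference images; all names are hypothetical):

```python
def encode_to_pointers(image, ref_a, ref_b):
    """Encode each pixel of `image` as a pointer (reference id, index) to
    the pixel with the closest value in one of the two reference images."""
    refs = (ref_a, ref_b)
    pointers = []
    for p in image:
        # search both reference images for the nearest pixel value
        best = min(((r, i) for r in range(2) for i in range(len(refs[r]))),
                   key=lambda t: abs(refs[t[0]][t[1]] - p))
        pointers.append(best)
    return pointers
```

The weakness noted below is visible here: the reference images themselves remain raw, so the savings depend entirely on how few references are needed.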
However, in the encoding method described in Japanese Patent Laid-Open No. 10-111951, since each reference image is raw data which is not encoded, if many reference images are set, the image data size reduction effect becomes low. Conversely, when the number of reference images is reduced, the number of image data having low correlation to the reference images increases, and the quality of a decoded image deteriorates. Hence, in such a case, the practical data size reduction effect is not so high.
On the other hand, since successive image data having strong correlation can be considered as moving image data like a television signal or video signal, a known moving image data encoding method may be applied. However, since MPEG, a standard moving image data encoding method, encodes and decodes image data in units of blocks, it cannot be directly applied to image generation based on the ray space theory, which must extract decoding results from different multi-viewpoint images in units of pixels at high speed.
The present invention has been made in consideration of the conventional problems, and has as its object to provide an image data encoding/decoding method and apparatus, which can efficiently encode multi-viewpoint image data obtained by sensing a single object from many viewpoints, and can obtain decoded data in units of pixels at high speed.
More specifically, the gist of the present invention lies in a method of encoding an image data group including a plurality of image data, comprising: the reference image selection step of selecting a predetermined number of reference image(s) from the plurality of image data; the orthogonal transform step of computing an orthogonal transform of each of image data other than the reference images in units of blocks each having a predetermined size; the step of selecting a preset number of data as data to be encoded from the data obtained after the orthogonal transforms are computed; and the encoding step of fixed-length encoding and outputting the data to be encoded as encoded image data.
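Under the assumption, for illustration only, that the orthogonal transform is a 1-D DCT-II and that fixed-length encoding means quantizing each retained coefficient to the same bit width, the encoding steps above might be sketched as follows (this is not the claimed implementation; all names are hypothetical):

```python
import math

def dct(block):
    """1-D DCT-II: a simple orthogonal transform of one block."""
    N = len(block)
    out = []
    for k in range(N):
        w = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
        out.append(w * sum(block[n] * math.cos(math.pi * (n + 0.5) * k / N)
                           for n in range(N)))
    return out

def encode_block(block, keep=3, bits=8):
    """Keep the first `keep` coefficients (the preset number of data to be
    encoded) and quantize each one to the same `bits`-bit width, i.e.
    fixed-length encoding."""
    coeffs = dct(block)[:keep]
    peak = max(1.0, max(abs(c) for c in coeffs))
    scale = (1 << (bits - 1)) - 1          # e.g. 127 for 8-bit codes
    codes = [round(c / peak * scale) for c in coeffs]
    return codes, peak
```

Because every retained coefficient occupies the same number of bits, the position of any block's codes in the output stream can be computed directly, which is what later enables random access in units of pixels.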
Another gist of the present invention lies in an image data decoding method for decoding encoded image data in units of fixed-length encoding for the data on the basis of the encoded image data which have undergone fixed-length encoding after orthogonal transform in units of blocks, and inverse transform formulas of the orthogonal transform, which are prepared in advance in units of pixels in each block, comprising: the first decoding step of decoding the fixed-length encoded data; the number determination step of determining the number of coefficients used in the formulas from coefficients of the orthogonal transform obtained in the first decoding step; and the second decoding step of decoding pixel data by applying the number of coefficients determined in the number determination step to the inverse transform formulas.
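Assuming again, purely for illustration, a 1-D DCT-II and uniform fixed-length quantized coefficients, the decoding side might be sketched as below: an inverse-transform formula prepared per pixel position, and a number determination step that drops insignificant trailing coefficients before that formula is evaluated (hypothetical names throughout):

```python
import math

def idct_pixel(coeffs, n, N=8):
    """Inverse-transform formula prepared for one pixel position n:
    reconstructs a single pixel without decoding the whole block."""
    total = 0.0
    for k, c in enumerate(coeffs):
        w = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
        total += w * c * math.cos(math.pi * (n + 0.5) * k / N)
    return total

def decode_pixel(codes, peak, n, bits=8):
    """First decoding step: dequantize the fixed-length codes.
    Number determination step: drop trailing zero coefficients.
    Second decoding step: evaluate the per-pixel inverse formula."""
    scale = (1 << (bits - 1)) - 1
    coeffs = [c / scale * peak for c in codes]
    while coeffs and coeffs[-1] == 0.0:    # use only significant coefficients
        coeffs.pop()
    return idct_pixel(coeffs, n)
```

Evaluating the inverse formula for one pixel position, rather than inverse-transforming the entire block, is what allows decoded data to be obtained in units of pixels at high speed.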
Still another gist of the present invention lies in an apparatus for encoding an image data group including a plurality of image data, comprising: reference image selection means for selecting a predetermined number of reference image(s) from the plurality of image data; orthogonal transform means for computing an orthogonal transform of each of image data other than the reference images in units of blocks each having a predetermined size; selection means for selecting a preset number of data as data to be encoded from the data obtained after the orthogonal transforms are computed; and encoding means for fixed-length encoding and outputting the data to be encoded as encoded image data.
Still another gist of the present invention lies in an image data decoding apparatus for decoding encoded image data in units of fixed-length encoding for the data on the basis of the encoded image data which have undergone fixed-length encoding after orthogonal transform in units of blocks, and inverse transform formulas of the orthogonal transform, which are prepared in advance in units of pixels in each block, comprising: first decoding means for decoding the fixed-length encoded data; number determination means for determining the number of coefficients used in the formulas from coefficients of the orthogonal transform obtained by the first decoding means; and second decoding means for decoding pixel data by applying the number of coefficients determined by the number determination means to the inverse transform formulas.
Still another gist of the present invention lies in a virtual image generation apparatus comprising: table generation means for mapping in a ray space a plurality of pixels included in a predetermined area of each of a plurality of image data obtained by sensing an identical object from different viewpoints, and generating a table indicating a correspondence between coordinates in the ray space and pixel positions in the image data; reference image selection means for selecting a predetermined number of reference image(s) from the plurality of image data; orthogonal transform means for computing an orthogonal transform of each of image data other than the reference images in units of blocks each having a predetermined size; formula generation means for generating an inverse formula of the orthogonal transform in units of pixels that form the block; encoding means for fixed-length encoding the image data that have undergone the orthogonal transform and outputting the transformed image data as encoded image data; light ray conversion means for converting the object into a light ray group on the basis of externally supplied data indicating a viewpoint position and direction; pixel position detection means for detecting a pixel position of each of light rays included in the converted light ray group in the corresponding image data with reference to the table; first decoding means for decoding the fixed-length encoded data corresponding to the pixel position detected by the pixel position detection means; number determination means for determining the number of coefficients used in the formulas from coefficients of the orthogonal transform obtained by the first decoding means; second decoding means for decoding pixel data by applying the number of coefficients determined by the number determination means to the inverse transform formulas; and image generation means for generating an image of the object viewed from the viewpoint position and direction on the basis of the decoded pixel data.
Still another gist of the present invention lies in a mixed reality space presentation system having: viewpoint position information acquisition means for acquiring a viewpoint position and direction of a user; and display means for presenting to the user a mixed reality space obtained by mixing a real space and a virtual space image, wherein an image of an object viewed from the viewpoint position and direction of the user is generated using a virtual image generation apparatus, and is displayed on the display means.
Still another gist of the present invention lies in a storage medium which stores, as a program that can be executed by a computer apparatus, an image data encoding method and/or image data decoding method according to the present invention.
Still another gist of the present invention lies in an image encoding apparatus comprising: selection means for selecting a predetermined image as a reference image from a plurality of continuous images in accordance with a predetermined relationship; block matching means for performing block matching between the reference image and an image other than the reference image; first data generation means for generating first data made up of identification data representing the reference image used in matching by said block matching means and position data representing, for a block of the image, a positional relationship with the matching block in the reference image; second data generation means for generating second data representing a difference between the reference image and the image; and encoding means for quantizing the second data into a fixed-length code and outputting the fixed-length code.
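The block matching step in the apparatus above can be illustrated with a small sketch (1-D signals stand in for 2-D blocks, and the sum of absolute differences is one common matching criterion; all names are hypothetical, not the claimed means):

```python
def best_match(block, reference, block_size):
    """Slide a window over a 1-D reference signal and return the offset
    with the smallest sum of absolute differences (SAD) to `block`."""
    best_off, best_sad = 0, float('inf')
    for off in range(len(reference) - block_size + 1):
        sad = sum(abs(block[i] - reference[off + i]) for i in range(block_size))
        if sad < best_sad:
            best_off, best_sad = off, sad
    return best_off, best_sad
```

The returned offset corresponds to the position data of the first data, and the residual at that offset corresponds to the difference carried by the second data.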