The present invention relates to coding and decoding apparatus and method for recording a moving picture signal on a recording medium such as an optical disc or a magnetic tape and reproducing it for display on a display device. The present invention may be used in video conference systems, video telephone systems, broadcast equipment, multimedia database retrieval systems, and the like in such a manner that a moving picture signal is transmitted from a transmission side to a reception side via a transmission line and received and displayed on the reception side. The present invention may also be used for editing and recording a moving picture signal.
In a video conference system or a video telephone system in which a moving picture signal is transmitted to a remote place, to efficiently utilize a transmission line, an image signal is compressed/coded by utilizing line correlation or frame correlation of the video signal. In recent years, with improvement in computer processing, moving picture information terminals using a computer have become widespread. In such systems, information is transmitted to remote locations via a transmission line such as a network. In this case, to efficiently utilize the transmission line, a signal to be transmitted such as an image, sound, or computer data is transmitted after being compressed/coded. On a terminal side (reception side), the compressed/coded signal that has been transmitted is decoded by a predetermined decoding method corresponding to the encoding method into an original image, sound, or computer data, which is output by a display device, speakers, or the like of the terminal. Previously, the transmitted image signal or the like was merely output, as it is, on a display device. But in information terminals using a computer, a plurality of images, sounds, or computer data can be handled or displayed in a two-dimensional or three-dimensional space after being subjected to a given conversion process. This type of process can be realized in such a manner that information of a two-dimensional or three-dimensional space is described by a given method on a transmission side, and the terminal side (reception side) executes a conversion process on an image signal or the like according to the description.
A typical example for describing spatial information is VRML (Virtual Reality Modeling Language), which has been standardized by ISO-IEC/JTC1/SC24. The latest version VRML 2.0 is described in IS14772. VRML is a language for describing a three-dimensional space and defines data for describing attributes, shapes, etc. of a three-dimensional space. Such data is called a node. To describe a three-dimensional space, it is necessary to describe in advance how to combine the nodes. Each node includes data indicating color, texture, etc., data indicating polygon shapes, and other information.
In information terminals using a computer, a given object is generated by CG (computer graphics) according to a description of the above-mentioned VRML using polygons etc. With VRML, it is possible to attach a texture to a three-dimensional object that has been generated in this manner and that has been composed of polygons. A node called “Texture” is defined for still pictures and a node called “MovieTexture” is defined for moving pictures. Information (a file name, display start time or end time, etc.) on the texture to be attached is described in these nodes. Referring to FIG. 23, a texture attachment process (hereinafter referred to as a texture mapping process, where appropriate) will be described.
FIG. 23 shows an example of the configuration of a texture mapping apparatus. As shown in FIG. 23, a memory group 200 includes a texture memory 200a, a gray scale memory 200b, and a three-dimensional object memory 200c. The texture memory 200a stores texture information that is input externally. The gray scale memory 200b and the three-dimensional object memory 200c store, respectively, key data indicating the degree of penetration/transparency of the texture and three-dimensional object information, both of which are also input externally. The three-dimensional object information is necessary for generation of polygons and is related to illumination. A rendering circuit 201 generates a three-dimensional object by generating polygons based on the three-dimensional object information that is stored in the three-dimensional object memory 200c of the memory group 200. Further, based on the three-dimensional object data, the rendering circuit 201 reads out the texture information and the key data indicating the degree of penetration/transparency of the texture from the memories 200a and 200b, respectively, and executes a superimposition process on the texture and a corresponding background image by referring to the key data. The key data indicates the degree of penetration of the texture at a corresponding position, that is, the transparency of an object at the corresponding position.
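The superimposition performed by the rendering circuit 201 can be pictured as a per-pixel blend of texture and background controlled by the key data. The following is only a minimal sketch under stated assumptions: the function name `superimpose` is hypothetical, pixels are single luminance values, and the key is assumed to lie between 0.0 and 1.0 with 1.0 meaning a fully opaque texture. The description does not fix the numeric convention of the key data.

```python
def superimpose(texture, background, key):
    """Blend one row of texture pixels over background pixels.

    Assumption: key values lie in [0.0, 1.0]; 1.0 shows only the texture,
    0.0 lets the background show through completely.
    """
    if not (len(texture) == len(background) == len(key)):
        raise ValueError("all rows must have the same length")
    return [k * t + (1.0 - k) * b for t, b, k in zip(texture, background, key)]


row = superimpose([200, 200, 200], [100, 100, 100], [1.0, 0.5, 0.0])
print(row)  # [200.0, 150.0, 100.0]
```

For a moving-picture texture, the same blend would simply be repeated for every frame.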
A two-dimensional conversion circuit 202 outputs a two-dimensional image signal that is obtained by mapping the three-dimensional object that has been generated by the rendering circuit 201 to a two-dimensional plane based on view point information that is supplied externally. Where the texture is a moving picture, the above process is executed on a frame-by-frame basis.
With VRML, it is possible to handle, as texture information, data that has been compressed according to JPEG (Joint Photographic Experts Group) which is typically used in high-efficiency coding of a still picture, MPEG (Moving Picture Experts Group) for high-efficiency coding of a moving picture, or the like. Where an image so compressed is used as texture, the texture (image) is decoded by a decoding process corresponding to an encoding scheme. The decoded image is stored in the texture memory 200a of the memory group 200 and subjected to a process similar to the above process.
The rendering circuit 201 attaches the texture information that is stored in the texture memory 200a to an object at a given position regardless of the format of an image and whether the image is a moving picture or a still picture. Therefore, the texture that can be attached to a certain polygon is stored in one memory. In transmitting three-dimensional object information, it is necessary to transmit three-dimensional coordinates of each vertex. Real number data of 32 bits is needed for each coordinate component. Real number data of 32 bits or more is also needed for such attributes as reflection of each three-dimensional object. The amount of information to be transmitted is therefore enormous, and it increases further when a complex three-dimensional object or a moving picture is transmitted. Consequently, in transmitting such three-dimensional information or texture information via a transmission line, it is necessary to compress the information to improve the transmission efficiency.
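As a rough illustration of the data volume discussed above, the following sketch computes the uncompressed size of the vertex coordinates alone. The function name and the example vertex count are hypothetical; only the figure of 32 bits per coordinate component is taken from the description.

```python
def vertex_coordinate_bytes(num_vertices, bits_per_component=32, components=3):
    """Uncompressed size in bytes of the vertex coordinates of one object."""
    return num_vertices * components * bits_per_component // 8


# Hypothetical object with 10,000 vertices: 10,000 * 3 * 32 bits.
print(vertex_coordinate_bytes(10_000))  # 120000
```

Attributes such as reflection, and any texture data, would add to this figure.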
A typical example of high-efficiency coding (compression) schemes for a moving picture is the MPEG (Moving Picture Experts Group; moving picture coding for storage) scheme, which is discussed in ISO-IEC/JTC1/SC2/WG11 and was proposed as a standard. MPEG employs a hybrid scheme that is a combination of motion-compensation predictive coding and DCT (discrete cosine transform) coding. To accommodate various applications and functions, MPEG defines several profiles (classification of functions) and levels (quantities such as an image size). The most basic item is a main level of a main profile (MP@ML).
An example of configuration of an encoder (image signal coding apparatus) of MP@ML of the MPEG scheme will be described with reference to FIG. 24. An input image signal is first input to a frame memory 1, and then coded in a predetermined order. The image data to be coded is input to a motion vector detection circuit (ME) 2 on a macroblock basis. The motion vector detection circuit 2 processes image data of each frame as an I-picture, a P-picture, or a B-picture in accordance with a predetermined sequence. That is, it is predetermined whether images of respective frames that are input sequentially are processed as I, P, and B-pictures (for instance, they are processed in the order of I, B, P, B, P, . . . , B, P).
The motion vector detection circuit 2 performs motion compensation by referring to a predetermined reference frame and detects its motion vector. The motion compensation (interframe prediction) has three prediction modes, that is, forward prediction, backward prediction, and bidirectional prediction. Only forward prediction is available as a P-picture prediction mode, whereas all three modes are available as B-picture prediction modes. The motion vector detection circuit 2 selects the prediction mode that minimizes the prediction error and generates a corresponding prediction vector.
The resulting prediction error is compared with, for instance, the variance of a macroblock to be coded. If the variance of the macroblock is smaller than the prediction error, no prediction is performed on the macroblock and intraframe coding is performed. In this case, the prediction mode is intra-image prediction (intra). A motion vector detected by the motion vector detection circuit 2 and the above-mentioned prediction mode are input to a variable-length coding circuit 6 and a motion compensation circuit (MC) 12. The motion compensation circuit 12 generates prediction image data based on a given motion vector and inputs it to operation circuits 3 and 10. The operation circuit 3 calculates difference data indicating a difference between the value of the macroblock to be coded and the value of the prediction image data and outputs a calculation result to a DCT circuit 4. In the case of an intra-macroblock mode, the operation circuit 3 outputs, as it is, the macroblock data to be coded to the DCT circuit 4.
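The mode decision described above can be sketched as follows. This is an illustration only, not the circuit's actual implementation: the function names are hypothetical, a block is a flat list of pixel values, the prediction error is taken as a sum of absolute differences, and the "variance" is approximated by the sum of absolute deviations from the block mean so that both quantities are on the same scale.

```python
def variance_measure(block):
    """Sum of absolute deviations from the block mean (stand-in for variance)."""
    mean = sum(block) / len(block)
    return sum(abs(p - mean) for p in block)


def prediction_error(block, prediction):
    """Sum of absolute differences between a block and its prediction."""
    return sum(abs(p - q) for p, q in zip(block, prediction))


def choose_mode(block, candidates):
    """Pick the candidate mode with the smallest error, or fall back to intra.

    candidates maps a mode name such as 'forward' to a predicted block.
    """
    mode, pred = min(candidates.items(),
                     key=lambda item: prediction_error(block, item[1]))
    err = prediction_error(block, pred)
    intra = variance_measure(block)
    if intra < err:
        return "intra", intra   # prediction does not pay off
    return mode, err


print(choose_mode([10, 12, 11, 13],
                  {"forward": [10, 12, 11, 12], "backward": [0, 0, 0, 0]}))
# ('forward', 1)
print(choose_mode([10, 10, 10, 10], {"forward": [0, 0, 0, 0]}))
# ('intra', 0.0)
```

For a P-picture, `candidates` would contain only the forward prediction; for a B-picture it could contain all three modes.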
The DCT circuit 4 converts the input data into DCT coefficients by subjecting the data to DCT (discrete cosine transform). The DCT coefficients are input to a quantization circuit (Q) 5, where they are quantized with a quantization step corresponding to a data storage amount (buffer storage amount) of a transmission buffer 7. Quantized coefficients (data) are input to the variable-length coding circuit 6.
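The transform and quantization steps can be illustrated with a one-dimensional example. The sketch below is a simplification under stated assumptions: the function names are hypothetical, an orthonormal DCT-II is applied to a single row of eight samples rather than to a two-dimensional block, and quantization is a plain uniform division by the step.

```python
import math


def dct_1d(samples):
    """Orthonormal DCT-II of a list of samples."""
    n = len(samples)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i, x in enumerate(samples))
        out.append(scale * acc)
    return out


def quantize(coeffs, step):
    """Uniform quantization: divide by the step and round with Python's round()."""
    return [round(c / step) for c in coeffs]


def dequantize(levels, step):
    """Inverse of quantize(), up to the rounding error."""
    return [q * step for q in levels]


coeffs = dct_1d([8] * 8)      # a flat row has only a DC component
print(quantize(coeffs, 4))    # [6, 0, 0, 0, 0, 0, 0, 0]
print(quantize(coeffs, 16))   # [1, 0, 0, 0, 0, 0, 0, 0]
```

A larger step yields smaller quantized values, which in turn cost fewer bits after variable-length coding; this is the lever that the buffer-based control acts on.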
The variable-length coding circuit 6 converts quantized data that is supplied from the quantization circuit 5 into a variable-length code such as a Huffman code. The variable-length coding circuit 6 also receives the quantization step (scale) from the quantization circuit 5 and the prediction mode (indicating which of intra-image prediction, forward prediction, backward prediction, and bidirectional prediction was set) and the motion vector from the motion vector detection circuit 2, and performs variable-length coding thereon. The transmission buffer 7 temporarily stores received coded data and outputs a quantization control signal that corresponds to the storage amount to the quantization circuit 5. When the residual data amount has increased to the allowable upper limit, the transmission buffer 7 reduces the data amount of quantized data by increasing the quantization scale of the quantization circuit 5 using the quantization control signal. Conversely, when the residual data amount has decreased to the allowable lower limit, the transmission buffer 7 increases the data amount of quantized data by decreasing the quantization scale of the quantization circuit 5 using the quantization control signal. Overflow or underflow of the transmission buffer 7 is prevented in this manner. Coded data stored in the transmission buffer 7 is read out with predetermined timing and output as a bit stream to a transmission line. On the other hand, quantized data that is output from the quantization circuit 5 is input to a de-quantization circuit (IQ) 8, where it is de-quantized in accordance with a quantization step supplied from the quantization circuit 5. Output data (DCT coefficients) from the de-quantization circuit 8 is input to an IDCT (inverse DCT) circuit 9, then subjected to inverse DCT processing, and stored in a frame memory (FM) 11 via the operation circuit 10.
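The feedback from the transmission buffer 7 to the quantization circuit 5 can be sketched as a simple threshold rule. This is only an illustration: the function name, the threshold fractions, the step of one per update, and the scale range of 1 to 31 are assumptions made for the example and are not taken from the description.

```python
def next_quantization_scale(scale, fullness, lower=0.2, upper=0.8,
                            min_scale=1, max_scale=31):
    """Return the quantization scale to use next.

    fullness is the buffer occupancy as a fraction of its capacity.
    All thresholds and limits are hypothetical example values.
    """
    if fullness >= upper:
        scale += 1      # coarser quantization, fewer bits, buffer drains
    elif fullness <= lower:
        scale -= 1      # finer quantization, more bits, buffer fills
    return max(min_scale, min(max_scale, scale))


print(next_quantization_scale(10, 0.9))  # 11
print(next_quantization_scale(10, 0.1))  # 9
print(next_quantization_scale(10, 0.5))  # 10
```

Clamping the result keeps the scale within a valid range even when the buffer stays near a limit for a long time.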
Next, an example of a decoder (image signal decoding apparatus) of MP@ML of MPEG will be described with reference to FIG. 25. Coded image data (bit stream) that has been transmitted via a transmission line is received by a receiving circuit (not shown), or reproduced by a reproduction circuit, temporarily stored in a reception buffer 21, and then supplied to a variable-length decoding circuit (IVLC) 22. Performing variable-length decoding on the data supplied from the reception buffer 21, the variable-length decoding circuit 22 outputs a motion vector and a prediction mode to a motion compensation circuit 27 and a quantization step to a de-quantization circuit 23. Further, the variable-length decoding circuit 22 outputs decoded quantized data to the de-quantization circuit 23. The de-quantization circuit 23 de-quantizes the quantized data that is supplied from the variable-length decoding circuit 22 in accordance with the quantization step also supplied from the variable-length decoding circuit 22, and outputs the resulting data (DCT coefficients) to an IDCT circuit 24. The data (DCT coefficients) that is output from the de-quantization circuit 23 is subjected to inverse DCT in the IDCT circuit 24 and supplied to an operation circuit 25 as output data. If the output data supplied from the IDCT circuit 24 (the input bit stream) is I-picture data, it is output from the operation circuit 25 as image data and then supplied to a frame memory 26 and stored there for generation of prediction image data for image data (P or B-picture data) that will be input to the operation circuit 25. This image data is also output, as it is, to the external system as a reproduction image.
If the output data supplied from the IDCT circuit 24 (the input bit stream) is a P or B-picture, the motion compensation circuit 27 generates a prediction image based on the image data stored in the frame memory 26 in accordance with the motion vector and the prediction mode that are supplied from the variable-length decoding circuit 22, and outputs it to the operation circuit 25. The operation circuit 25 adds the output data that is supplied from the IDCT circuit 24 and the prediction image data that is supplied from the motion compensation circuit 27, to produce output image data. In the case of a P-picture, the output data of the operation circuit 25 is input to the frame memory 26 and stored there as prediction image data (a reference image) for an image signal to be subsequently decoded.
In MPEG, various profiles and levels other than MP@ML are defined and various tools are prepared. Scalability is one of those tools. MPEG introduces a scalable coding scheme that realizes scalability for accommodating different image sizes and frame rates. For example, in the case of spatial scalability, an image signal having a small image size can be decoded by decoding only lower-layer bit streams, and an image signal having a large image size can be decoded by decoding lower-layer and upper-layer bit streams. An encoder of spatial scalability will be described with reference to FIG. 26. In the case of the spatial scalability, the lower layer corresponds to image signals having a small image size and the upper layer corresponds to image signals having a large image size. A lower-layer image signal is first input to the frame memory 1 and then coded in the same manner as in the case of MP@ML. However, the output of the operation circuit 10 is not only supplied to the frame memory 11 and used as lower-layer prediction image data, but is also used as upper-layer prediction image data after being enlarged to the upper-layer image size by an image enlargement circuit (up sampling) 31. According to FIG. 26, an upper-layer image signal is input to a frame memory 51. A motion vector detection circuit 52 determines a motion vector and a prediction mode in the same manner as in the case of MP@ML. A motion compensation circuit 62 generates prediction image data in accordance with the motion vector and the prediction mode that have been determined by the motion vector detection circuit 52 and outputs it to a weighting circuit (W) 34. The weighting circuit 34 multiplies the prediction image data by a weight W and outputs the weighted prediction image data to an operation circuit 33.
As described above, output data (image data) of the operation circuit 10 is input to the image enlargement circuit 31. The image enlargement circuit 31 enlarges the image data that has been generated by the operation circuit 10 to make its size equal to the upper-layer image size and outputs the enlarged image data to a weighting circuit (1−W) 32. The weighting circuit 32 multiplies the enlarged image data of the image enlargement circuit 31 by a weight (1−W) and outputs the result to the operation circuit 33. The operation circuit 33 adds the output data of the weighting circuits 32 and 34 and outputs the result to an operation circuit 53 as prediction image data. The output data of the operation circuit 33 is also input to an operation circuit 60, added to output data of an inverse DCT circuit 59 there, and then input to a frame memory 61 for later use as prediction image data for image data to be coded. The operation circuit 53 calculates a difference between the image data to be coded and the output data of the operation circuit 33, and outputs the result as difference data. However, in the case of an intraframe-coded macroblock, the operation circuit 53 outputs, as it is, the image data to be coded to a DCT circuit 54. The DCT circuit 54 performs DCT (discrete cosine transform) on the output of the operation circuit 53, to generate DCT coefficients, which are output to a quantization circuit 55. As in the case of MP@ML, the quantization circuit 55 quantizes the DCT coefficients in accordance with a quantization scale that is based on the data storage amount of a transmission buffer 57 and other factors, and outputs a result (quantized data) to a variable-length coding circuit 56. The variable-length coding circuit 56 performs variable-length coding on the quantized data (quantized DCT coefficients) and outputs a result as an upper-layer bit stream via the transmission buffer 57.
The output data of the quantization circuit 55 is de-quantized by a de-quantization circuit 58 with the quantization scale that was used in the quantization circuit 55, subjected to inverse DCT in the inverse DCT circuit 59, and then input to the operation circuit 60. The operation circuit 60 adds the outputs of the operation circuit 33 and the inverse DCT circuit 59 and inputs a result to the frame memory 61. The variable-length coding circuit 56 also receives the motion vector and the prediction mode that were detected by the motion vector detection circuit 52, the quantization scale that was used in the quantization circuit 55, and the weights W that were used in the weighting circuits 32 and 34, which are coded in the variable-length coding circuit 56 and then transmitted.
Next, an example of a decoder of the spatial scalability will be described with reference to FIG. 27. A lower-layer bit stream is input to the reception buffer 21 and then decoded in the same manner as in the case of MP@ML. However, the output of the operation circuit 25 is not only output to the external system and stored in the frame memory 26 for use as prediction image data for an image signal to be decoded later, but is also used as upper-layer prediction image data after being enlarged to the upper-layer image size by an image signal enlargement circuit 81. An upper-layer bit stream is supplied to a variable-length decoding circuit 72 via a reception buffer 71, and a variable-length code is decoded there. That is, a quantization scale, a motion vector, a prediction mode, and a weighting coefficient (weight W) are decoded together with DCT coefficients. The DCT coefficients (quantized data) decoded by the variable-length decoding circuit 72 are de-quantized by a de-quantization circuit 73 by using the decoded quantization scale, subjected to inverse DCT in an inverse DCT circuit 74, and then supplied to an operation circuit 75.
A motion compensation circuit 77 generates prediction image data in accordance with the decoded motion vector and prediction mode and inputs it to a weighting circuit 84. The weighting circuit 84 multiplies the output of the motion compensation circuit 77 by the decoded weight W and outputs a result to an operation circuit 83. Not only is the output of the operation circuit 25 supplied as lower-layer reproduction image data and output to the frame memory 26, but also it is output to a weighting circuit 82 after being enlarged by the image signal enlargement circuit 81 so as to have the same image size as the upper-layer image size. The weighting circuit 82 multiplies the output of the image signal enlargement circuit 81 by (1−W) by using the decoded weight W, and outputs the result to the operation circuit 83. The operation circuit 83 adds the outputs of the weighting circuits 82 and 84 and outputs the result to the operation circuit 75. The operation circuit 75 adds the output of the inverse DCT circuit 74 and the output of the operation circuit 83, and outputs the result as upper-layer reproduction image data and also supplies it to the frame memory 76 for use as prediction image data for image data to be decoded later.
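The upper-layer prediction used in both the encoder and the decoder is a weighted sum of the motion-compensated upper-layer prediction (weight W) and the enlarged lower-layer image (weight 1 minus W). The sketch below illustrates this on a single row of pixels. The function names are hypothetical, and enlargement by pixel repetition with a factor of two is an assumption; the description does not specify the enlargement filter or ratio.

```python
def enlarge_2x(row):
    """Enlarge a row by pixel repetition (stand-in for the enlargement circuit)."""
    return [p for p in row for _ in range(2)]


def weighted_prediction(mc_prediction, lower_layer_row, w):
    """Return W * motion-compensated prediction + (1 - W) * enlarged lower layer."""
    enlarged = enlarge_2x(lower_layer_row)
    if len(enlarged) != len(mc_prediction):
        raise ValueError("enlarged lower layer must match the upper-layer size")
    return [w * m + (1.0 - w) * e for m, e in zip(mc_prediction, enlarged)]


def reconstruct(residual, prediction):
    """Decoder side: add the inverse-DCT output to the prediction."""
    return [r + p for r, p in zip(residual, prediction)]


pred = weighted_prediction([100, 100, 100, 100], [50, 70], 0.5)
print(pred)                              # [75.0, 75.0, 85.0, 85.0]
print(reconstruct([1, -1, 0, 2], pred))  # [76.0, 74.0, 85.0, 87.0]
```

With W equal to 1 the prediction relies only on the upper layer, and with W equal to 0 it relies only on the enlarged lower-layer image.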
The above description is applied to a process for a luminance signal. A color difference signal is processed in a similar manner. The motion vector to be used in processing a color difference signal is obtained by halving a motion vector for a luminance signal in both vertical and horizontal directions.
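The derivation of the color difference motion vector can be written in one line. In this sketch the function name is hypothetical, and the result is kept as an exact fraction because the description does not state how a half-valued component is rounded.

```python
def chroma_motion_vector(luma_mv):
    """Halve a luminance motion vector (dx, dy) in both directions."""
    dx, dy = luma_mv
    return dx / 2, dy / 2


print(chroma_motion_vector((4, -6)))  # (2.0, -3.0)
print(chroma_motion_vector((3, 1)))   # (1.5, 0.5)
```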
While the MPEG scheme has been described above, various other high-efficiency coding schemes for a moving picture also have been standardized. For example, ITU-T (International Telecommunication Union, Telecommunication Standardization Sector) has standardized the schemes H.261 and H.263 as coding for communication. Basically, like the MPEG scheme, H.261 and H.263 are a combination of motion-compensation predictive coding and DCT coding. A coding apparatus and a decoding apparatus according to H.261 or H.263 are configured in the same manner as in the MPEG scheme though the details of header information etc. are different. Further, in the above-described MPEG scheme, the standardization of a new highly efficient coding scheme called MPEG4 is now underway. Major features of MPEG4 are that an image is coded on an object-by-object basis (an image is coded in units of a plurality of images) and that the image can be modified on the object-by-object basis. That is, on the decoding side, images of respective objects or a plurality of images can be combined to reconstruct one image.
In ISO-IEC/JTC1/SC29/WG11, as previously mentioned, the standardization work for MPEG4 is now underway. In this work, a scheme of handling a natural image and a computer graphics image within a common framework is being studied. In this scheme, a three-dimensional object is described by using VRML, and a moving picture and sound or audio are compressed according to the MPEG standard. A scene consisting of a plurality of three-dimensional objects, moving pictures, etc. is described according to VRML. The description of a scene (hereinafter abbreviated as a scene description), the description of a three-dimensional object, and AV data consisting of a moving image, sound or audio compressed according to the MPEG scheme, which have been obtained in the above manner, are given time stamps and multiplexed by a multiplexing circuit into a bit stream, which is transmitted as multiplexed bit stream. In a reception terminal that has received a multiplexed bit stream, a demultiplexing circuit extracts the scene description, the description of a three-dimensional object, and AV stream (a stream corresponding to AV data), decoders decode respective bit streams, and a scene that is reconstructed by a scene construction circuit is displayed on a display device.
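The time-stamped multiplexing and the subsequent demultiplexing can be sketched as a merge of several packet sequences. This is an illustration only: the packet layout (time stamp, stream name, payload), the function names, and the example values are hypothetical and do not reflect an actual MPEG multiplex format.

```python
import heapq


def multiplex(*streams):
    """Merge streams of (timestamp, stream_name, payload) in time-stamp order.

    Each input stream is assumed to be sorted by time stamp already.
    """
    return list(heapq.merge(*streams, key=lambda packet: packet[0]))


def demultiplex(packets):
    """Split a multiplexed packet list back into per-stream lists."""
    streams = {}
    for timestamp, name, payload in packets:
        streams.setdefault(name, []).append((timestamp, payload))
    return streams


scene = [(0, "scene", b"scene description")]
video = [(0, "video", b"v0"), (40, "video", b"v1")]
audio = [(20, "audio", b"a0")]

muxed = multiplex(scene, video, audio)
print([packet[0] for packet in muxed])  # [0, 0, 20, 40]
print(demultiplex(muxed)["video"])      # [(0, b'v0'), (40, b'v1')]
```

On the reception side, each per-stream list would be passed to its own decoder before the scene is reconstructed.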
In the above method, it is necessary to clarify a relationship between nodes that are described according to VRML (description of three-dimensional objects and scene description) and AV data of moving pictures, sounds, audio, etc. For example, it is necessary to indicate what AV stream should be texture-mapped with a certain three-dimensional object. In VRML, texture to be attached to (mapped with) a three-dimensional object is designated by a URL (Uniform Resource Locator which is a character string indicating a server on a network). This designation method corresponds to designation of the absolute address of an AV data file on the network. On the other hand, in a system according to the MPEG scheme, each AV stream is identified by designating its ID. This corresponds to designation of a relative path of a stream in a session (a communication line) when the session has been established. That is, in VRML, there is no method for identifying a stream other than using a URL, whereas an application such as MPEG real-time communication requires ID-based designation. The two schemes are therefore incompatible.
When viewed from another point, it can be said that VRML assumes a model in which a client requests information. On the other hand, MPEG assumes a model in which broadcast information or the like is transmitted under the control of a server. The difference in these models causes a problem that it is difficult to fuse together a computer graphics image and a natural image while compatibility with VRML2.0 is maintained.
The present invention has been made in view of the foregoing, and an object of the invention is therefore to enable a computer graphics image that is described according to VRML and an image or the like that is compressed according to the MPEG scheme to be transmitted in such a state that they are multiplexed into the same bit (data) stream.
In a method for producing three dimensional space modeling data defined by a plurality of nodes and image/audio data specified by a position included in the nodes, the following steps are carried out: extracting a respective position from a node of the three dimensional space modeling data; converting the extracted position into a stream ID corresponding to image/audio data associated with the position; replacing the position with the stream ID; and multiplexing the image/audio data and three dimensional space modeling data including the stream ID to produce a bit stream.
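The extract, convert, and replace steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions: nodes are modeled as Python dictionaries with a hypothetical `url` field, the mapping from URL to stream ID is supplied as a lookup table, and the function name, node fields, and example URL are all invented for the example. The multiplexing step is omitted.

```python
def replace_urls_with_stream_ids(nodes, url_to_stream_id):
    """Return copies of the nodes with each 'url' replaced by a 'stream_id'.

    nodes is a list of dicts; url_to_stream_id maps a URL string to the ID
    of the image/audio stream associated with that URL.
    """
    converted = []
    for node in nodes:
        node = dict(node)                  # leave the input untouched
        url = node.pop("url", None)        # extract the position, if any
        if url is not None:
            node["stream_id"] = url_to_stream_id[url]   # convert and replace
        converted.append(node)
    return converted


nodes = [
    {"type": "MovieTexture", "url": "http://example.com/movie.mpg"},
    {"type": "Shape"},
]
table = {"http://example.com/movie.mpg": 3}
print(replace_urls_with_stream_ids(nodes, table))
# [{'type': 'MovieTexture', 'stream_id': 3}, {'type': 'Shape'}]
```

Nodes without a position, such as the second one here, pass through unchanged.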
According to one aspect of the present invention, the three dimensional space modeling data is described by Virtual Reality Modeling Language (VRML), the position is represented by Uniform Resource Locator (URL) expressed in ASCII format, and the stream ID is expressed in binary format.
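The difference between an ASCII position and a binary stream ID can be illustrated as follows. The sketch assumes, purely for the example, that the stream ID fits in a 16-bit unsigned integer stored in big-endian order; the description does not specify the width or byte order, and the function names and the URL are hypothetical.

```python
import struct


def encode_stream_id(stream_id):
    """Pack a stream ID as a 16-bit big-endian unsigned integer (assumed width)."""
    return struct.pack(">H", stream_id)


def decode_stream_id(data):
    """Recover the stream ID from its 2-byte binary form."""
    return struct.unpack(">H", data)[0]


url = "http://example.com/movie.mpg".encode("ascii")
binary_id = encode_stream_id(3)
print(len(url), len(binary_id))     # 28 2
print(decode_stream_id(binary_id))  # 3
```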
According to another aspect of the present invention, the stream ID is converted into a character string, and it is determined whether to replace the position of the image/audio data with the stream ID or the character string depending on whether the image/audio data is supplied by one server or multiple servers.