1. Field of the Invention
The present invention relates to a coding/decoding apparatus, a coding/decoding system, and a multiplexed bit stream, and particularly to a system for synchronously combining and reproducing natural pictures, voices, and computer graphics.
2. Description of the Related Art
MPEG (Moving Picture Experts Group) is known as an international coding standard for compressing, multiplexing, and transferring or storing audio signals (or voice signals), video signals, and artificial scene data such as computer graphics, and then separating and expanding the signals and data to obtain the original signals. MPEG is defined by working group (WG) 11 within SC29, which is managed under JTC1 (Joint Technical Committee 1), the committee that handles common items in the data processing fields of ISO (International Organization for Standardization) and IEC (International Electrotechnical Commission). MPEG describes a mechanism for synchronously reproducing each medium from multiplexed data.
First, a mechanism for synchronously reproducing an audio signal and a video signal from multiplexed data is described in ISO/IEC 13818-1 "Information Technology Generic Coding of Moving Pictures and Associated Audio Systems" (popularly called MPEG-2 Systems). FIG. 53 of the accompanying drawings shows the construction of the fixed delay model used for the description. This figure shows an abstracted system architecture in which MPEG-2 is applied to compress audio signals and video signals.
In FIG. 53, encoder 71 compresses (encodes) an audio signal, and encoder 72 compresses (encodes) a video signal. Buffer 73 buffers the audio data compressed by the encoder 71, and buffer 74 buffers the video data compressed by the encoder 72. Multiplexing circuit 75 multiplexes the compressed audio data stored in the buffer 73 and the compressed video data stored in the buffer 74. At this time, a reference clock needed for synchronous reproduction and time stamps are embedded as additive information into the multiplexed data.
Specifically, the time stamps are a decoding time stamp representing a decoding timing and a display time stamp representing a display timing. The decoding time stamp is generally used only when interpolative prediction is carried out. This is because when the interpolative prediction is carried out, the decoding timing and the display timing are different from each other in some cases. In the other cases, the decoding time stamp is unnecessary.
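By way of illustration only, the relation between the two time stamps may be sketched as follows. The sketch assumes a hypothetical picture sequence given in display order, a time unit of one picture period, and a simplified reorder delay of one picture period whenever interpolatively predicted (B) pictures are present; the function names are not taken from any standard.

```python
def decoding_order(display_order):
    """Reorder pictures so that each anchor picture (I or P) is decoded
    before the B pictures that precede it in display order."""
    out, pending_b = [], []
    for index, kind in enumerate(display_order):
        if kind == "B":
            pending_b.append((index, kind))   # B must wait for the next anchor
        else:
            out.append((index, kind))
            out.extend(pending_b)
            pending_b = []
    out.extend(pending_b)
    return out


def assign_time_stamps(display_order):
    """Return (name, decoding_stamp, display_stamp) in decoding order.

    Assumption: one picture is decoded per period, and display is delayed
    by one period only when B pictures force reordering.
    """
    delay = 1 if "B" in display_order else 0
    stamps = []
    for dts, (index, kind) in enumerate(decoding_order(display_order)):
        stamps.append((f"{kind}{index}", dts, index + delay))
    return stamps
```

Under these assumptions, the display order I, B, B, P yields the decoding order I0, P3, B1, B2, in which the anchor pictures I0 and P3 carry decoding and display time stamps that differ, whereas a sequence without B pictures yields identical values for every picture, so that the decoding time stamp carries no additional information.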
Storage/transmission device 76 stores or transmits the multiplexed data created by the multiplexing circuit 75. Separation circuit (demultiplexing circuit) 77 separates compressed audio data, compressed video data, and a reference clock and time stamp used for synchronous reproduction from the multiplexed data supplied from the storage/transmission device 76. Buffer 78 buffers the compressed audio data supplied from the separation circuit 77, and buffer 79 buffers the compressed video data supplied from the separation circuit 77. Decoder 80 decodes and reproduces the compressed audio data stored in the buffer 78, and decoder 81 decodes and displays the compressed video data stored in the buffer 79.
The synchronous reproduction of the audio signals and video signals in FIG. 53 is implemented as follows. The reference clock embedded in the multiplexed data is used to control the oscillation frequency of a clock generating circuit that drives the decoder 80 and decoder 81; a PLL (Phase Locked Loop) is generally used for this purpose. The synchronization between the encoder side and the decoder side is established by the PLL. The time stamp embedded in the multiplexed data is used to convey the decoding timing of the decoder 80 and decoder 81 or the reproduction/display timing of the decoding result. The time axes of the encoder side and the decoder side are synchronized with each other, with a fixed delay set between them, by the reference clock, so that the decoding operation is started and the reproduction/display is carried out at the times intended at the encoder side.
Accordingly, the synchronous reproduction of the audio signals and video signals can be implemented insofar as a suitable time stamp is set at the encoder side. In the case of an application in which synchronization is not needed between the encoder side and the decoder side, the synchronous reproduction is carried out with the clock of the decoder itself without using the reference clock.
Next, ISO/IEC JTC1/SC29/WG11 N1825 "Working Draft 5.0 of ISO/IEC 14496-1" (popularly called MPEG-4 Systems) describes a mechanism for synchronously reproducing audio signals, video signals, and artificial scene data such as computer graphics from multiplexed data.
FIG. 54 shows a system decoder model (SDM) used for the description of the above mechanism. This model is an abstracted system decoder in which MPEG-4 is applied to compress audio signals, video signals, and artificial scene data such as computer graphics. The working draft does not describe the model or concrete construction of the encoder in detail; however, its syntax specifies that a reference clock and time stamps are embedded as additive information in the multiplexed data. Specifically, two time stamps are provided: a decoding time stamp representing a decoding timing, and a composite time stamp representing a timing at which decoded data can be supplied to a composition circuit.
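The manner in which a received reference clock may steer the decoder-side clock can be sketched in software as follows. This is a minimal proportional-integral loop written for illustration, not the PLL circuit of FIG. 53; the loop gains `kp` and `ki`, the function name, and the 90 kHz nominal frequency used in the usage note are assumptions.

```python
def recover_clock(samples, nominal_hz, kp=5.0, ki=1.0):
    """Track an encoder clock from (arrival_time_s, reference_ticks) samples.

    A local counter runs at the current frequency estimate; at each received
    reference value the phase error (in ticks) corrects that estimate.
    Returns the frequency estimate after each sample.
    """
    t_prev, local = samples[0]      # load the counter with the first reference
    freq = float(nominal_hz)
    integral = 0.0
    history = []
    for t, reference in samples[1:]:
        local += freq * (t - t_prev)        # free-running local counter
        error = reference - local           # phase error in ticks
        integral += ki * error              # removes the steady frequency offset
        freq = nominal_hz + kp * error + integral
        history.append(freq)
        t_prev = t
    return history
```

As a usage example under the stated assumptions, feeding the loop reference values sampled every 0.1 second from an encoder clock running at 90004.5 Hz, with a nominal decoder frequency of 90000 Hz, causes the estimate to settle at the encoder frequency, after which the two time axes differ only by a fixed offset.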
In FIG. 54, a separation circuit 91 separates from the multiplexed data compressed audio data, compressed video data, compressed scene data, and a reference clock and a time stamp used for synchronous reproduction. Buffer 92 buffers the compressed audio data supplied from the separation circuit 91, and buffer 93 buffers the compressed video data supplied from the separation circuit 91. Buffer 94 buffers the compressed artificial scene data supplied from the separation circuit 91. Decoder 95 decodes the compressed audio data stored in the buffer 92, decoder 96 decodes the compressed video data stored in the buffer 93, and decoder 97 decodes the compressed artificial scene data stored in the buffer 94.
Buffer 98 buffers the audio signal decoded by the decoder 95, buffer 99 buffers the video signal decoded by the decoder 96, and buffer 100 buffers the artificial scene data decoded by the decoder 97. Composition circuit 101 composes a scene on the basis of the audio signal stored in the buffer 98, the video signal stored in the buffer 99 and the artificial scene data stored in the buffer 100. At this time, the scene information that is composed is described in the artificial scene data, and in accordance with the scene information the audio signal is modulated or the video signal is deformed, and the signal is mapped to an object in the scene. Display circuit 102 reproduces/displays a scene supplied from the composition circuit 101.
The composition and reproduction of the audio signal, the video signal and the artificial scene data in FIG. 54 is implemented as follows:
A reference clock can be provided for each decoder. After it is extracted from the multiplexed data, it is input to a clock generating circuit provided for each decoder in order to control the oscillation frequency of that clock generating circuit, whereby synchronization between the encoder side and the decoder side can be established for each decoder. A time stamp can also be provided for each decoder. After it is extracted from the multiplexed data, it is used to convey the decoding timing of the decoder or the time at which the decoding result can be supplied to the composition circuit 101. The time axes of the encoder side and the decoder side are synchronized with each other, with a fixed delay set between them, by the reference clock, and the decoding is started at the time intended by the encoder side and the result is written into the buffer.
Subsequently, the composition circuit 101 takes out the audio signal, the video signal and the artificial scene data held in each buffer to perform scene composition. The times at which the audio signal, the video signal and the scene data are obtained by the composition circuit 101 are respectively given on the basis of the composite time stamps added to these signals and data. However, the timing for composing a scene is unclear, and the composition circuit 101 itself is set to start event processing in accordance with a discrete time event described in the scene data. Finally, the display circuit 102 reproduces and displays the scene supplied from the composition circuit 101.
Further, as a representative form of artificial scene data, VRML (Virtual Reality Modeling Language) is known as a description format for describing computer graphics, transmitting or storing the data thus described, and building and sharing a virtual three-dimensional space on the basis of the data. VRML is defined as an international standard by SC24, which is managed under JTC1 (Joint Technical Committee 1) for handling common items in the data processing fields of ISO (International Organization for Standardization) and IEC (International Electrotechnical Commission), in cooperation with a VRML consortium of associated companies. VRML further describes a method of taking an audio signal and a video signal into a scene.
The details of the description method are given in ISO/IEC DIS 14772-1 "The Virtual Reality Modeling Language" (popularly called VRML97). ISO/IEC DIS 14772-1 covers as support targets not only computer graphics but also ISO/IEC 11172 (popularly called MPEG-1), which is one of the MPEG standards. MPEG-1 is an international coding standard for audio signals and video signals. Specifically, the audio signals and the video signals are mapped as a sound source and as a moving picture texture for a three-dimensional object, respectively, in a three-dimensional scene constructed by VRML. Further, the description of a time event is supported in VRML, and a time event occurs according to a time stamp described in the VRML format.
The time event is further classified into two types: a continuous time event and a discrete time event. The continuous time event is an event, such as an animation, whose action is continuous on the time axis, and the discrete time event is an event in which an object in a scene starts after a given time elapses.
FIG. 55 shows the construction of a decoding processing system that receives the VRML format and constructs a three-dimensional scene (called a "Browser" in VRML). Buffer 111 receives through the Internet multiplexed data compressed by MPEG-1 and buffers the data received. Buffer 112 receives through the Internet the VRML format or the compressed VRML format and buffers the format received. At this time, the source of the VRML format may be different from that of the MPEG-1 data.
Separation circuit 113 separates compressed audio data and compressed video data from the MPEG-1 multiplexed data supplied from the buffer 111. Decoder 114 decodes the compressed audio data supplied from the separation circuit 113, and decoder 115 decodes the compressed video data supplied from the separation circuit 113. Decoder 116 decodes the compressed VRML format stored in the buffer 112. When the VRML format is not compressed, no action is taken. Memory 117 stores the audio signal decoded by the decoder 114, and memory 118 stores the video signal decoded by the decoder 115. Memory 119 stores the VRML format decoded by the decoder 116.
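The role of the composite time stamp in the model of FIG. 54 may be illustrated by the following sketch, in which a buffer placed after a decoder releases to the composition circuit only those decoded units whose composite time stamp has been reached. The class name, the method names and the use of plain numbers as time stamps are assumptions made for illustration.

```python
import bisect


class CompositionBuffer:
    """Holds decoded units, ordered by composite time stamp."""

    def __init__(self):
        self._stamps = []   # composite time stamps, kept sorted
        self._units = []    # decoded units, in the same order

    def write(self, composite_stamp, unit):
        """Called by a decoder when a unit has been decoded."""
        position = bisect.bisect_right(self._stamps, composite_stamp)
        self._stamps.insert(position, composite_stamp)
        self._units.insert(position, unit)

    def available(self, now):
        """Return the newest unit whose stamp is not later than `now`,
        or None if no unit may yet be supplied to the composition circuit."""
        position = bisect.bisect_right(self._stamps, now)
        return self._units[position - 1] if position else None
```

The sketch fixes only when each unit becomes available; it says nothing about the moment at which the composition circuit reads the buffers, which corresponds to the composition timing left open in the model described above.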
Composition circuit 120 synthesizes a scene on the basis of the audio signal stored in the memory 117, the video signal stored in the memory 118 and the artificial scene data stored in the memory 119. In this case, scene information to be composed is described in the artificial scene data. According to the scene information, the audio signal is modulated and the video signal is deformed, and then these signals are mapped into an object in the scene. Display circuit 121 reproduces/displays the scene supplied from the composition circuit 120.
The composition of the audio signal, the video signal and the VRML format in FIG. 55 and the reproduction thereof are implemented as follows:
After the loading of the MPEG-1 multiplexed data from an external source into the buffer 111 is completed, the decoder 114 decodes the compressed audio data and the decoder 115 decodes the compressed video data, and the audio signal and the video signal obtained through the above decoding operation are written into the memory 117 and the memory 118 respectively. Further, after the loading of the VRML format from an external source into the buffer 112 is completed, the decoder 116 decodes the VRML format when the VRML format is compressed or takes no action when the VRML format is not compressed, and the VRML format thus obtained is written into the memory 119. After the above processing is completed, that is, after the processing of the part surrounded by a dotted line indicated by reference numeral 122 is completed, the composition circuit 120 and the display circuit 121 start operating to perform composition (mixing), reproduction and display.
On the other hand, when only the video signal and the computer graphics are to be combined with each other, the chromakey system, which is already used for weather forecasts in present broadcasting systems, is known. According to the chromakey system, a person or an object is placed in front of a background of a single specified color, such as blue, and the overall picture is shot; the background-colored portion is then deleted from the picture, whereby only the person or the object in front of the background is extracted.
FIG. 56 shows the construction of a coding processing system for creating a composite picture of the video signal and the computer graphics by using the chromakey system, and compressing and multiplexing the composite picture and the audio signal. Chromakey processing circuit 131 deletes from an input video signal a portion having the color coincident with the background color. Composition circuit 132 creates a computer graphics image from artificial scene data given. Memory 133 stores a cut-out picture supplied from the chromakey processing circuit 131. In this case, memory 133 may store directly the picture data and inform merely a subsequent-stage convolution circuit 135 that the RGB value corresponding to the background color is deleted. Memory 134 stores the computer graphics picture generated by the composition circuit 132. The convolution circuit 135 overwrites the cut-out picture obtained from the memory 133 on the computer graphics image obtained from the memory 134. It may be also allowed to detect the RGB value corresponding to the background color and replace only pixels located within a specified range by a computer graphics image.
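The overwriting performed by the convolution circuit 135, in the variant that replaces only pixels whose RGB value lies within a specified range of the background color, may be sketched as follows. Pictures are represented here as nested lists of RGB tuples; the key color, the tolerance value and the function names are assumptions chosen for illustration.

```python
def is_background(pixel, key, tolerance):
    """True when every RGB component lies within `tolerance` of the key color."""
    return all(abs(p - k) <= tolerance for p, k in zip(pixel, key))


def chromakey_composite(video, graphics, key=(0, 0, 255), tolerance=16):
    """Replace background-colored video pixels with computer graphics pixels.

    `video` and `graphics` are equally sized lists of rows of (R, G, B) tuples.
    Foreground pixels of the video are kept unchanged.
    """
    return [
        [g if is_background(v, key, tolerance) else v
         for v, g in zip(video_row, graphics_row)]
        for video_row, graphics_row in zip(video, graphics)
    ]
```

The result is a flat two-dimensional picture: once the two sources have been merged in this way, the decoding side receives no scene data from which a three-dimensional scene could be reconstructed.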
Encoder 136 compresses (encodes) the audio signal. Encoder 137 compresses the composite picture obtained from the convolution circuit 135. Buffer 138 buffers the audio data compressed by the encoder 136, and buffer 139 buffers the composite picture data compressed by the encoder 137. Multiplexing circuit 140 multiplexes the compressed audio data stored in the buffer 138 and the compressed composite picture data stored in the buffer 139. At this time, the reference clock which is necessary for the synchronous reproduction and the time stamp are embedded as additive information into the multiplexed data.
The creation of the composite picture of the video signal and computer graphics is performed in the portion surrounded by a dotted line indicated by reference numeral 141. The other portions correspond to the coding portion of the coding/decoding system shown in FIG. 53. That is, the video signal and the computer graphics are first combined with each other to obtain a composite picture, and then the composite picture and the audio signal are compressed and multiplexed. The construction of the decoding side is the same as that of FIG. 53.
The coding/decoding synchronous reproduction system of the audio signal and the video signal shown in FIG. 53 relates to the coding, multiplexing, separating and decoding for the audio signal and the video signal, and no description is made on the processing of artificial scene data such as computer graphics.
Further, in the decoding synchronous reproduction system of the audio signal, the video signal and the artificial scene data shown in FIG. 54, the decoding timing and the timing at which each data may be supplied to the composition circuit are given. However, the timing at which all the data are composed and the timing at which the composite picture is displayed are not specified. In other words, the composition circuit is set to start its composite operation freely. Further, it is suggested that the composition (mixing) is started in accordance with a discrete time event described in the artificial scene data.
However, the artificial scene data suffers a buffer delay in the decoding operation, and thus a desired time may have passed at the time when the artificial scene data are supplied to the composition circuit 101. Therefore, the artificial scene data itself cannot be used to give an accurate timing for composing. Further, when a continuous time event is described in the artificial scene data, the composition start time is different between the coding side and the decoding side in some cases. Therefore, occurrence of an accurately coincident continuous time event cannot be ensured. Particularly, in the case of animation or the like for which motion is required to be continuously represented, the position of a moving object is displaced between the coding side and the decoding side. Due to the above problem, a composite picture desired by the coding side cannot be composed while it is accurately coincident at the decoding side.
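The displacement of a moving object can be made concrete with a short numerical sketch. It assumes a hypothetical object moving at constant speed, a common time axis already synchronized by the reference clock, and a decoding side that starts the animation only when the scene data leaves its buffer; all values are illustrative.

```python
def object_position(now, start_time, speed=100.0):
    """Position of an object animated at constant speed since `start_time`."""
    return speed * (now - start_time)


def positions_at(now, intended_start, buffer_delay):
    """Return (coding_side, decoding_side) positions at time `now`.

    The coding side starts the continuous time event at `intended_start`;
    the decoding side, lacking a composition timing, starts it
    `buffer_delay` later, when the scene data reaches the composition circuit.
    """
    coding_side = object_position(now, intended_start)
    decoding_side = object_position(now, intended_start + buffer_delay)
    return coding_side, decoding_side
```

With an intended start at time 10.0, a buffer delay of 0.25 and an observation at time 12.0, the coding side places the object at 200.0 while the decoding side places it at 175.0. If the decoding side were instead given the intended start time, the two positions would coincide.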
Further, the decoding and reproducing system of the audio signal, the video signal and the artificial scene data shown in FIG. 55 does not support stream data which are transmitted continuously on time axis. That is, the processing of a portion 122 surrounded by a dotted line must be finished before the reproduction is started.
Still further, in the coding/decoding synchronous reproducing system of the audio signal, the video signal and the artificial scene data shown in FIG. 56, the composite picture is degenerated into a mere two-dimensional picture at the coding side, and thus an interaction function which would be obtained by using the artificial scene data is lost. That is, there is a disadvantage that additive functions such as movement of a visual point in the three-dimensional space, and navigation cannot be implemented.
An object of the present invention is to provide a coding apparatus, a decoding apparatus, a coding/decoding system and a multiplexed bit stream which implements coding/decoding synchronous reproduction of an audio signal, a video signal and artificial scene data while excluding the disadvantage of the conventional systems described above, ensuring generation of a composite picture desired at the coding side, supporting stream data transmitted continuously on time axis, and supporting the interaction function in the decoding side.
A coding apparatus according to the present invention comprises: audio signal coding means for coding an audio signal; video signal coding means for coding a video signal; interface means for accepting information on a composite scene; scene data coding means for coding scene data supplied from the interface means; composition means for composing a scene from the audio signal supplied from the audio signal coding means, the video signal supplied from the video signal coding means and the composite scene data supplied from the scene data coding means; display means for reproducing/displaying the composite picture signal and the audio signal supplied from the composition means; clock supply means for supplying clocks to the audio signal coding means, the video signal coding means, the scene data coding means and the composition means; and multiplexing means for creating a bit stream on the basis of the time information and compressed audio data supplied from the audio signal coding means, the time information and compressed video data supplied from the video signal coding means, the time information and compressed scene data supplied from the scene data coding means, the time information supplied from the composition means and the clock value supplied from the clock supplying means.
According to the present invention, the coding apparatus further comprises means for detecting the status of the composition means and controlling the operation of the coding means of the video signal.
According to the present invention, the coding apparatus further comprises means for detecting the status of the coding means for the audio signal, the status of the coding means for the video signal and the status of the coding means for the scene data, and controlling the operation of the composition means.
According to the coding apparatus of the present invention, the clock supply means includes first clock supply means for supplying clocks to the audio signal coding means, second clock supply means for supplying clocks to the video signal coding means and third clock supply means for supplying clocks to the scene data coding means and composition means, and the multiplexing means multiplexes the clock values supplied from the first, second, and third clock supply means respectively.
According to the coding apparatus of the present invention, the clock supply means includes first clock supply means for supplying clocks to the audio signal coding means, second clock supply means for supplying clocks to the video signal coding means and composition means, and third clock supply means for supplying clocks to the scene data coding means, and the multiplexing means multiplexes the clock values supplied from the first, second, and third clock supply means respectively.
A decoding apparatus according to the present invention comprises: means for separating both of compressed data and time information of an audio signal, both of compressed data and time information of a video signal, both of compressed data and time information of scene data, time information of scene composition and clock information from a bit stream; means for decoding the audio signal on the basis of the compressed data and time information of the audio signal; means for decoding the video signal on the basis of the compressed data and time information of the video signal; means for decoding the scene data on the basis of the compressed data and time information of the scene data; means for composing a scene on the basis of the time information for the scene composition supplied from the separation means, the audio signal supplied from the decoding means for the audio signal, the video signal supplied from the decoding means for the video signal and the scene data supplied from the decoding means for the scene data; means for generating clocks according to the clock value supplied from the separating means and supplying the clocks to the decoding means for the audio signal, the decoding means for the video signal, the decoding means for the scene data and the composition means; means for reproducing/displaying the composite picture signal and the audio signal supplied from the composition means; and interface means for accepting an interaction from a viewer to the composite picture.
According to a first embodiment of the decoding apparatus, the separation means separates a plurality of independent clock values from the bit stream, and the independent clock values are input to means for supplying the clocks to the decoding means for the audio signal, means for supplying the clocks to the decoding means for the video signal, and means for supplying the clocks to the decoding means for the scene data and the composition means.
According to a second embodiment of the decoding apparatus, the separation means separates a plurality of independent clock values from the bit stream, and the independent clock values are input to means for supplying the clocks to the decoding means for the audio signal, means for supplying the clocks to the decoding means for the video signal and the composition means, and means for supplying the clocks to the decoding means for the scene data.
A multiplexed bit stream according to the present invention comprises an audio signal, a video signal and scene data, characterized in that a flag representing whether time information representing a decoding timing doubles as time information representing a composition timing is added to said time information.
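One possible layout of such time information is sketched below for illustration. The field widths, the byte order, the flag value and the function names are assumptions and do not represent the syntax of the bit stream of the invention: a one-byte flag field is followed by a 32-bit decoding time stamp and, only when the flag is cleared, by a separate 32-bit composition time stamp.

```python
import struct

DTS_DOUBLES_AS_CTS = 0x01   # hypothetical flag bit


def pack_time_info(decoding_stamp, composition_stamp):
    """Serialize time information, omitting the composition stamp when the
    decoding stamp doubles as the composition stamp."""
    if decoding_stamp == composition_stamp:
        return struct.pack(">BI", DTS_DOUBLES_AS_CTS, decoding_stamp)
    return struct.pack(">BII", 0, decoding_stamp, composition_stamp)


def unpack_time_info(data):
    """Return (decoding_stamp, composition_stamp) from packed time information."""
    flags, decoding_stamp = struct.unpack_from(">BI", data, 0)
    if flags & DTS_DOUBLES_AS_CTS:
        return decoding_stamp, decoding_stamp
    (composition_stamp,) = struct.unpack_from(">I", data, 5)
    return decoding_stamp, composition_stamp
```

In this layout the time information occupies 5 bytes when the flag is set and 9 bytes otherwise, so the flag saves the second field whenever the two timings coincide.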