The present invention relates to a system for synchronously reproducing/synthesizing an audio signal, a video signal, and computer graphics data.
Known international coding standards for a system which compresses, codes, and multiplexes an audio signal (or a speech signal) and a video signal, transmits/stores the multiplexed signal, and expands and decodes the transmitted/stored signal back into the original audio and video signals are MPEG1 and MPEG2, defined by the MPEG (Moving Picture Experts Group) in working group (WG) 11 of SC29 under JTC1 (Joint Technical Committee 1), which handles common matters in the data processing field for the ISO (International Organization for Standardization) and the IEC (International Electrotechnical Commission).
The MPEG standards assume a variety of applications. As for synchronization, both systems using phase lock and systems not based on phase lock are assumed.
In synchronization using phase lock, an audio signal coding clock (sampling rate of an audio signal) and a video signal coding clock (frame rate of a video signal) are phase-locked to a common SCR (System Clock Reference).
A time stamp representing the time of decoding/reproduction is added to the multiplexed bit stream. The decoding system realizes phase lock and sets a time reference; more specifically, synchronization between the coding system and the decoding system is established. In addition, the audio signal and the video signal are decoded on the basis of the time stamps, thereby realizing reproduction/display of the audio signal and the video signal in synchronization with each other.
When phase lock is not employed, the audio signal and the video signal are independently processed and decoded in accordance with corresponding time stamps added by the coding system.
FIG. 16 shows the configuration of a system for reproducing/displaying an audio signal and a video signal from an MPEG system stream based on phase lock, which is described in ISO/IEC 13818-1, “Information Technology - Generic Coding of Moving Pictures and Associated Audio: Systems”, November 1994.
Referring to FIG. 16, a demultiplexer 1 separates a bit stream in which an audio signal and a video signal are compressed and multiplexed in accordance with the MPEG standard, into a compressed audio signal stream, a time stamp, the SCR (System Clock Reference) or PCR (Program Clock Reference) of the audio signal, a compressed video signal stream, a time stamp, and the SCR or PCR of the video signal.
An audio buffer 2 buffers the compressed audio signal stream separated by the demultiplexer 1. An audio PLL (Phase Locked Loop) 3 receives the SCR/PCR of the audio signal separated by the demultiplexer 1 and generates a decoding clock. An audio signal decoder 4 decodes the compressed audio signal stream from the audio buffer 2 at a timing indicated by the time stamp of the audio signal in accordance with the decoding clock supplied from the audio PLL 3. An audio memory 5 stores the decoded audio signal supplied from the audio signal decoder 4 and outputs the audio signal.
A video buffer 7 buffers the compressed video signal stream separated by the demultiplexer 1. A video PLL 8 receives the SCR/PCR of the video signal separated by the demultiplexer 1 and generates a decoding clock. A video signal decoder 9 decodes the compressed video signal stream from the video buffer 7 at a timing indicated by the time stamp of the video signal in accordance with the decoding clock supplied from the video PLL 8. A video memory 10 stores the decoded video signal supplied from the video signal decoder 9 and outputs the video signal.
The audio PLL 3 and the video PLL 8 each control their oscillation frequency such that the SCR/PCR of the coding system, which is supplied from the demultiplexer 1, matches the counter value of their respective STC (System Time Clock) timers. With this processing, the time reference of the decoding system is set, and synchronization between the coding system and the decoding system is established.
Next, the audio signal and the video signal are decoded at the timing indicated by the time stamp, thereby realizing synchronous reproduction/display of the audio signal and the video signal.
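A simplified software analogue of the PLL behavior described above can be sketched as follows (the proportional control loop and its gain value are assumptions for illustration; actual PLL implementations are hardware oscillator circuits with more elaborate loop filters):

```python
# Minimal software analogue of the PLL described above: each arriving
# SCR/PCR is compared with the local STC counter, and the local clock
# frequency is nudged to drive the error toward zero (assumed simple
# proportional control; real designs use a loop filter and a VCO).

class SoftwarePll:
    def __init__(self, nominal_hz=27_000_000, gain=0.1):
        self.freq = float(nominal_hz)  # current frequency estimate (MPEG2 system clock is 27 MHz)
        self.gain = gain               # proportional loop gain (illustrative value)

    def on_clock_reference(self, received_scr, local_stc):
        """Adjust the local clock when a new SCR/PCR arrives."""
        error = received_scr - local_stc   # positive error: local clock runs slow
        self.freq += self.gain * error
        return self.freq
```

When the received reference is ahead of the local counter, the frequency estimate is raised; when it lags, the estimate is lowered, so the decoder's STC tracks the encoder's clock.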
Along with recent development of computer and LSI technologies, computer graphics (CG) is widely used in various fields. Accordingly, attempts to integrate an audio signal (or a speech signal), a video signal, and computer graphics data and to transmit/store the integrated data have been made extensively.
As shown in FIG. 15, a coding system 24 receives an audio signal, a video signal, and computer graphics data, codes these data, and multiplexes them or outputs them independently to a transmission system/storage system 25.
A decoding system 26 extracts the integrated data from the transmission system/storage system 25, decodes the data, and outputs the audio signal and integrated image data of the video signal and computer graphics. Interaction from the observer, e.g., viewpoint movement in a three-dimensional space on the display screen, is received through a pointing device such as a mouse or a joystick. A typical example is ISO/IEC WD 14772: “The Virtual Reality Modeling Language Specification: The VRML2.0 Specification” (VRML).
The VRML is a description language for transmitting/receiving CG data through a network represented by the Internet and for forming/sharing a virtual space. The VRML supports ISO/IEC 11172 (MPEG1), which is standardized as an audio/video signal coding standard. More specifically, on the coding system side, the MPEG1 stream used in the VRML description, the sound source position of an audio signal, and a three-dimensional object on which a video signal is mapped are designated. On the decoding system side, a three-dimensional space is formed in accordance with the received VRML description, the audio sound source and the video object are arranged in the three-dimensional space, and the audio signal and the video signal are synchronously reproduced/displayed in accordance with time stamp information contained in the MPEG1 stream.
The VRML also supports an animation of a three-dimensional object. More specifically, on the coding system side, the start and end times of each event, the duration of one cycle, the contents of each event, and interaction between events are described in a script. On the decoding system side, a three-dimensional space is formed in accordance with the received VRML description, events are generated on the basis of unique time management, and an animation is displayed.
Alternatively, on the coding system side, times ti and parameters Xi (color, shape, normal vector, direction, position, and the like) of the object at the times ti are described and defined. On the decoding system side, the parameters of the object at a time t (ti&lt;t&lt;ti+1) are obtained by interpolation, and an animation is displayed.
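The interpolation scheme described above can be sketched as follows for a scalar parameter (function and variable names are illustrative; VRML interpolator nodes apply the same principle to positions, orientations, colors, and so on):

```python
# Sketch of the keyframe interpolation described above: given keyframe
# times t[i] and parameter values x[i], the parameter at an
# intermediate time t is obtained by linear interpolation between the
# surrounding keyframes.

def interpolate(times, values, t):
    """Linearly interpolate a scalar parameter between keyframes.

    `times` is a sorted list of keyframe times ti; `values` holds the
    corresponding parameter values Xi.
    """
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    for i in range(len(times) - 1):
        if times[i] <= t <= times[i + 1]:
            alpha = (t - times[i]) / (times[i + 1] - times[i])
            return values[i] + alpha * (values[i + 1] - values[i])
```

For instance, with keyframes (0, 0.0) and (10, 1.0), the interpolated value at t=5 is 0.5, so the decoding system can render frames at any intermediate time without the coding system transmitting them.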
For the VRML, a binary format replacing the conventional script description has also been examined. This enables reduction of redundancy of a script description or shortening of the processing time for converting the script description into a high-speed rendering format on the decoding system, thereby improving the transmission efficiency and realizing high-speed three-dimensional display.
A description in, e.g., M. Deering, “Geometry Compression”, Computer Graphics Proceedings, Annual Conference Series, pp. 13-20, Aug. 1995 can be referred to as a means for reducing redundancy of a script description. An efficient system for compressing the vertex data representation describing a three-dimensional object is described in this reference.
FIG. 17 shows the arrangement of a conventional decoding system (in the VRML, this system is normally called a “browser”) for receiving the VRML description and displaying the three-dimensional space. Conventional decoding systems of this type are, e.g., “Live3D” available from Netscape, “CyberPassage” available from Sony, and “WebSpace” available from SGI, which are released to the public through the Internet.
Referring to FIG. 17, an AV buffer 21 buffers a bit stream in which an audio signal and a video signal are compressed and multiplexed. The demultiplexer 1 separates the bit stream in which the audio signal and the video signal are compressed and multiplexed, which is supplied from the AV buffer 21, into a compressed audio signal stream and a compressed video signal stream.
The audio signal decoder 4 decodes the compressed audio signal stream supplied from the demultiplexer 1. The audio memory 5 stores the decoded audio signal supplied from the audio signal decoder 4 and outputs the audio signal. A modulator 6 modulates the audio signal from the audio memory 5 on the basis of a viewpoint, the viewpoint moving speed, the sound source position, and the sound source moving speed, which are supplied from a rendering engine 15.
The video signal decoder 9 decodes the compressed video signal stream supplied from the demultiplexer 1. The video memory 10 stores the decoded video signal supplied from the video signal decoder 9.
A CG buffer 22 buffers a compressed computer graphics data stream (or a normal stream). A CG decoder 12 decodes the compressed computer graphics data stream supplied from the CG buffer 22 and generates decoded computer graphics data, and at the same time, outputs event time management information. A CG memory 13 stores the decoded computer graphics data supplied from the CG decoder 12 and outputs the computer graphics data.
An event generator 14 determines reference time on the basis of a clock supplied from a system clock generator 20 and outputs an event driving instruction in accordance with the event time management information (e.g., a time stamp) supplied from the CG decoder 12.
The rendering engine 15 receives the video signal supplied from the video memory 10, the computer graphics data supplied from the CG memory 13, the event driving instruction supplied from the event generator 14, and viewpoint movement data supplied from a viewpoint movement detector 17 and outputs the viewpoint, the viewpoint moving speed, the sound source position, the sound source moving speed, and the synthesized image of the video signal and the computer graphics data.
A video/CG memory 16 stores the synthesized image of the video signal and the computer graphics data and outputs the synthesized image. The viewpoint movement detector 17 receives a user input from a pointing device such as a mouse or a joystick and outputs it as viewpoint movement data.
Synchronization among the audio signal, the video signal, and the computer graphics data is realized by reproducing/displaying them using, as reference time, the system clock in the decoding system in accordance with the time stamp or event generation timing, as in synchronization not based on phase lock in the MPEG.
A synthesizing system for synchronizing a video signal and a computer graphics image is proposed in Japanese Patent Laid-Open No. 7-212653. This conventional synthesizing system delays a fetched video signal by the time required for generation of a computer graphics image, thereby realizing synchronous synthesis/display of the video signal and the computer graphics data.
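The delay-compensation idea in the cited system can be illustrated with a short sketch (the FIFO formulation, frame-granularity delay, and function names are assumptions for illustration, not details from the cited publication):

```python
# Illustrative sketch of delay compensation: the fetched video frames
# are held in a FIFO for the time the CG rendering takes, so that each
# CG image is composited with the video frame captured at the same
# instant it depicts.
from collections import deque

def composite_delayed(video_frames, cg_frames, render_delay):
    """Pair each CG frame with the video frame fetched `render_delay`
    frames earlier, emulating the delay line of the cited system."""
    fifo = deque()
    out = []
    for video, cg in zip(video_frames, cg_frames):
        fifo.append(video)
        if len(fifo) > render_delay:
            # The oldest buffered video frame matches the instant the
            # current CG image was started, so composite the two.
            out.append((fifo.popleft(), cg))
    return out
```

With a one-frame rendering delay, video frame 1 is composited with the CG image completed while frame 2 was being fetched, and so on, keeping the two sources aligned in time.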
In the conventional audio signal/video signal synthesizing/reproducing system shown in FIG. 16, processing of computer graphics data is not mentioned at all.
In addition, the conventional audio signal/video signal/computer graphics data synthesizing/reproducing system shown in FIG. 17 assumes only synchronization without phase lock, and a method of establishing synchronization between the coding system and the decoding system is not referred to.
In FIG. 17, a preprocessing section 23 enclosed by a broken line operates asynchronously with the coding system and writes decoding results in the corresponding memories 5, 10, and 13.
Reproduction of the audio and video signals read out from the memories and the animation of the computer graphics data based on an event driving instruction are executed using the system clock unique to the decoding system as reference time.
The conventional decoding system (VRML browser) fetches all audio/video/computer graphics mixed data in advance, and starts audio reproduction, video reproduction, and animation of the computer graphics data based on an event driving instruction only after all decoding results have been written in the memories.
For this reason, this system can hardly cope with an application to a communication/broadcasting system which continuously transfers data. In addition, since all processing operations depend on the system clock unique to the decoding system, synchronous reproduction becomes difficult when the transfer delay varies.
The system proposed in Japanese Patent Laid-Open No. 7-212653 has the following problems.
(1) The system does not cope with an audio signal.
(2) The system does not cope with compression.
(3) The system does not separately consider the coding system and the decoding system.