The present invention relates generally to methods and systems for synchronizing data, and more particularly to a method and system for synchronizing multimedia streams.
A common approach in multimedia inter-media and intra-media synchronization consists of introducing time-stamps in the media stream, which carry information relative to an absolute clock reference and to the events to be synchronized. The Program Clock Reference (PCR), Decoding Time Stamps (DTS) and Presentation Time Stamps (PTS) of MPEG-2 are examples of such an approach. The receiver decodes the time-stamps and synchronizes the streams accordingly.
In the case of small amounts of jitter due to transmission or processing delays, mechanisms of clock recovery and jitter compensation, such as Phase Locked Loops (PLLs), are employed at the receiver end. Normally, these mechanisms work satisfactorily when the amount of jitter is small.
The disadvantages of such approaches, however, are their lack of flexibility and adaptability. If the jitter of one of the time stamps is too high, the receiver has no means to compensate for it. In this case, synchronization is permanently lost for the event to be synchronized by that time stamp. Furthermore, if the encoder does not know the exact timing of the decoder, the encoder cannot specify all the time-stamps and the method is not applicable.
The present invention is therefore directed to the problem of developing a method and system for synchronizing multimedia streams that is highly reliable despite the presence of large amounts of jitter, and that operates independently and without a priori knowledge of the timing of the decoder when encoding the data.
The present invention solves this problem by switching between a slave, transmitter-driven synchronization mode and a local (i.e., receiver-driven) synchronization mode whenever the temporal references arriving from the transmitter become unreliable. In the alternative, the present invention employs both slave and local synchronization modes to solve this problem. The present invention is particularly effective in the synchronization of variable rate data streams such as text and facial animation parameters, but is also applicable to other data streams, such as video and audio streams.
According to the present invention, a method for synchronizing multimedia streams comprises the steps of using a transmitter-driven synchronization technique which relies upon a plurality of temporal references (or time stamps) inserted in the multimedia streams at an encoding end, using an internal inter-media synchronization technique at a decoding end if a performance measurement value of at least one of the plurality of temporal references exceeds some predetermined threshold, and extracting a coarsest inter-media synchronization and/or structural information present in the multimedia stream using the transmitter-driven technique and inputting the coarsest inter-media synchronization and/or structural information to a controller employing the internal inter-media synchronization technique.
According to the present invention, it is particularly advantageous if the above method switches back to the transmitter-driven synchronization technique whenever the performance measurement value stabilizes to an acceptable value.
Another particularly advantageous embodiment of the present invention occurs when the method provides a predetermined hysteresis when switching from one synchronization technique to the other, to avoid oscillations between the two synchronization techniques.
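The switching behavior described above can be sketched as follows. This is a minimal illustrative sketch, not the claimed implementation: the class name, the use of measured jitter as the performance measurement value, and the particular threshold values are all hypothetical. The hysteresis is obtained by using two distinct thresholds, so that the mode only switches back once the measurement has stabilized well below the level that triggered the fallback.

```python
# Hypothetical sketch of the claimed mode switching: the receiver normally
# follows transmitter-driven synchronization, falls back to local
# (receiver-driven) synchronization when a performance measurement of the
# incoming time stamps (here, measured jitter in milliseconds) exceeds a
# high threshold, and returns to transmitter-driven mode only when the
# measurement drops below a lower threshold. The dead band between the two
# thresholds provides the hysteresis that prevents oscillation.

TRANSMITTER_DRIVEN = "transmitter-driven"
LOCAL = "local"

class SyncModeSelector:
    def __init__(self, high_threshold=40.0, low_threshold=10.0):
        # high > low: the gap between them is the hysteresis band.
        assert high_threshold > low_threshold
        self.high = high_threshold
        self.low = low_threshold
        self.mode = TRANSMITTER_DRIVEN

    def update(self, jitter_ms):
        """Return the synchronization mode to use for the next interval."""
        if self.mode == TRANSMITTER_DRIVEN and jitter_ms > self.high:
            self.mode = LOCAL                # time stamps unreliable: fall back
        elif self.mode == LOCAL and jitter_ms < self.low:
            self.mode = TRANSMITTER_DRIVEN   # measurement stabilized: switch back
        return self.mode
```

Note that an intermediate jitter value (between the two thresholds) leaves the current mode unchanged, which is exactly the oscillation-avoiding behavior the hysteresis is intended to provide.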
According to the present invention, a system for synchronizing a multimedia stream includes a transmitter-driven synchronization controller, an internal inter-media synchronization controller and a processor. The transmitter-driven synchronization controller includes a control input, synchronizes the multimedia stream based on a plurality of time stamps in the multimedia stream inserted at an encoding end and extracts a coarsest inter-media synchronization and/or structural information present in the multimedia stream. The internal inter-media synchronization controller also includes a control input, is coupled to the transmitter-driven synchronization controller and receives the coarsest inter-media synchronization and/or structural information present in the multimedia stream. The processor receives the plurality of time stamps, is coupled to the control input of the transmitter-driven synchronization controller and the control input of the internal inter-media synchronization controller and activates the internal inter-media synchronization controller if a performance measurement value exceeds some predetermined threshold.
One particularly advantageous aspect of the present invention is that the processor activates the transmitter-driven synchronization controller whenever the performance measurement value stabilizes to an acceptable level.
It is also particularly advantageous when the processor includes predetermined hysteresis to avoid oscillations in switching between the internal inter-media synchronization controller and the transmitter-driven synchronization controller.
Further, according to the present invention, an apparatus for synchronizing a facial animation parameter stream and a text stream in an encoded animation includes a demultiplexer, a transmitter-based synchronization controller, a local synchronization controller, a text-to-speech converter, a phoneme-to-video converter and a switch. The demultiplexer receives the encoded animation, and outputs a text stream and a facial animation parameter stream, wherein the text stream includes a plurality of codes indicating a synchronization relationship between the text in the text stream and a plurality of mimics in the facial animation parameter stream. The transmitter-based synchronization controller is coupled to the demultiplexer, includes a control input, controls the synchronization of the facial animation parameter stream and the text stream based on the plurality of codes placed in the text stream during an encoding process, and outputs the plurality of codes. The local synchronization controller is coupled to the transmitter-based synchronization controller, and includes a control input. The text-to-speech converter is coupled to the demultiplexer, converts the text stream to speech, outputs a plurality of phonemes, and outputs a plurality of real-time time stamps and the plurality of codes in a one-to-one correspondence, whereby the plurality of real-time time stamps and the plurality of codes indicate a synchronization relationship between the plurality of mimics and the plurality of phonemes. The phoneme-to-video converter is coupled to the text-to-speech converter, and synchronizes a plurality of facial mimics with the plurality of phonemes based on the plurality of real-time time stamps and the plurality of codes.
The switch is coupled to the transmitter-based synchronization controller and the local synchronization controller, and switches between the transmitter-based synchronization controller and the local synchronization controller based on a predetermined performance measurement of the plurality of codes.
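The one-to-one correspondence between synchronization codes and real-time time stamps can be illustrated with a short sketch. This is a hypothetical illustration, not the claimed apparatus: the function name, the data shapes, and the string mimic values are all assumed for the example. The idea shown is that each code emitted by the text-to-speech converter arrives paired with a real-time time stamp, and the facial mimic carrying the same code is scheduled for display at that time.

```python
# Hypothetical sketch: map (code, real-time time stamp) pairs produced by
# the text-to-speech converter onto the facial mimics that carry the same
# synchronization codes, yielding a display schedule for the
# phoneme-to-video converter.

def schedule_mimics(code_timestamps, mimics):
    """code_timestamps: list of (code, real_time_ms) pairs emitted by the
    TTS converter, in one-to-one correspondence with the codes in the text.
    mimics: dict mapping a synchronization code to a facial mimic (here a
    placeholder string; in practice, a set of facial animation parameters).
    Returns (real_time_ms, mimic) pairs telling the renderer when to
    display each mimic."""
    schedule = []
    for code, t_ms in code_timestamps:
        if code in mimics:          # codes with no associated mimic are skipped
            schedule.append((t_ms, mimics[code]))
    return schedule
```

Because the time stamps are generated by the receiver's own text-to-speech converter in real time, this scheduling remains valid even when the transmitter could not know the decoder's exact timing in advance.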
Another aspect of the present invention includes a synchronization controller for synchronizing multimedia streams without a priori knowledge in the encoder of the exact timing of a decoder. According to the present invention, the synchronization controller operates in three possible modes: an encoder-based synchronization mode, a switching mode and a cooperating synchronization mode. In this instance, the switching synchronization mode normally uses the encoder-based synchronization technique, but switches to a decoder-based synchronization technique whenever the encoder-based synchronization technique becomes unreliable. The cooperating synchronization mode uses an encoder-based synchronization technique to provide a first level of synchronization, and uses a decoder-based synchronization technique to provide a second level of synchronization.
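The cooperating mode's two-level structure can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the claimed controller: the function name and data shapes are hypothetical, and the sketch assumes the coarse level is an anchor time recovered from the encoder's time stamps while the fine level consists of offsets measured locally by the decoder.

```python
# Hypothetical sketch of the cooperating synchronization mode: the
# encoder-based technique supplies only coarse inter-media synchronization
# (e.g. the start time of the current sentence or scene, recovered from the
# transmitted time stamps), while the decoder refines the timing of the
# individual events inside that coarse unit on its own.

def cooperating_sync(coarse_anchor_ms, local_offsets_ms):
    """coarse_anchor_ms: start time of the current coarse unit, taken from
    the encoder's time stamps (first synchronization level).
    local_offsets_ms: event offsets within the unit, measured by the
    decoder itself (second synchronization level).
    Returns the absolute presentation time of each event."""
    return [coarse_anchor_ms + offset for offset in local_offsets_ms]
```

Splitting the work this way keeps the transmitted time stamps coarse (and therefore robust to jitter), while the fine-grained timing never has to cross the channel at all.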
Finally, the present invention is applicable to the synchronization of any multimedia streams, such as text-to-speech data, facial animation parameter data, video data, audio data, and rendering data.