The invention relates to a method of post-synchronizing an information stream, which information stream comprises an audio signal and a video signal, the method comprising the step of: performing a translation process to obtain at least one translated audio signal.
The invention further relates to a transmitter for transmitting an information stream comprising at least one translated audio signal and a video signal.
The invention further relates to a receiver for receiving an information stream.
The invention further relates to a communication system comprising: a plurality of stations comprising means for transmitting and means for receiving an information stream, which information stream comprises an audio and a video signal; and a communication network for linking said stations.
The invention further relates to an information stream comprising a video signal and a plurality of audio signals relating to different languages, and to a storage medium.
Post-synchronizing an information stream is especially known from the field of movies and television programs. Post-synchronization means that the original audio signal is replaced by another audio signal that is normally a translation of the original audio signal. This has the advantage that an audience that does not understand the original language can understand the movie without having to read subtitles. It is however annoying to the audience that the movement of the lips no longer corresponds to the audio signal.
It is, inter alia, an object of the invention to overcome the above-mentioned problem. To this end, a first aspect of the invention provides a method characterized in that the method comprises the steps of: tracking said video signal to obtain original lip-objects; replacing said original lip-objects with new lip-objects, said new lip-objects corresponding to said translated audio signal.
The facilities to track and manipulate lip-objects are provided by an object-oriented coding technique, e.g. MPEG-4. Because of the object-oriented nature of such a coding technique, the lip-objects are regarded as separate objects that can be handled and manipulated separately. An overview of the MPEG-4 standard is given in the ISO/IEC document JTC1/SC29/WG11/N2459, October 1998, Atlantic City, further referred to as the "MPEG-4 standard". Further information can be found in the ISO/IEC document JTC1/SC29/WG11/N2195, March 1998, Tokyo, which describes MPEG-4 Applications. MPEG-4 is an ISO/IEC standard developed by MPEG (Moving Picture Experts Group). This standard provides the standardized technological elements enabling the integration of the production, distribution and content access paradigms of three fields: digital television, interactive graphics applications (synthetic content) and interactive multimedia. MPEG-4 provides ways to represent units of aural, visual or audiovisual content, called "media objects". These media objects can be of natural or synthetic origin; this means that they could be recorded with a camera or microphone, or generated with a computer. Audiovisual scenes are composed of several media objects, e.g. audio and video objects. MPEG-4 defines the coded representation of objects such as synthetic face objects and synthetic sound. MPEG-4 provides facilities to distinguish different objects of a scene. In particular, it is possible by lip-tracking to record the lips of a person as a separate object, a so-called lip-object. This lip-object can be manipulated. From the lip-object it is possible to extract lip-parameters that describe the lips on the basis of a lip-model. Such a lip-model can be stored locally, which makes it possible to construct lips by sending only the corresponding lip-parameters.
According to the invention, the original lip-objects are replaced with new lip-objects that correspond to the translated audio signal. In this way, a video signal is obtained wherein lip-movements better correspond to the translated signal. The translation becomes more natural and in an ideal case the viewer will not notice that the information stream is in fact a translation of an original information stream. Lip-objects comprise lips as well as relevant parts of the face.
According to the MPEG-4 standard, media objects can be placed anywhere in a given coordinate system. Transforms can be applied to change the geometrical or acoustical appearance of a media object. Streamed data can be applied to media objects in order to modify their attributes. Synchronization of elementary streams is achieved through time stamping of individual access units within elementary streams. Usually, the new lip-objects are synchronized with the translated audio signal.
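The time-stamp based synchronization mentioned above can be illustrated with a minimal sketch. It is not the MPEG-4 systems decoder model; it only assumes that every new lip-object and every unit of the translated audio signal carries a time stamp in milliseconds. The function name `lip_object_at` and the sample values are hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional


def lip_object_at(timestamps_ms: List[int], lip_objects: List[str],
                  t_ms: int) -> Optional[str]:
    """Return the lip-object whose time stamp is the latest one not after t_ms.

    timestamps_ms must be sorted in ascending order (assumption of this sketch).
    """
    index = bisect_right(timestamps_ms, t_ms) - 1
    return lip_objects[index] if index >= 0 else None


# Hypothetical access units: one new lip-object every 40 ms.
stamps = [0, 40, 80]
lips = ["lips@0", "lips@40", "lips@80"]

# Hypothetical time stamps of translated audio units; pair each with a lip-object.
audio_stamps = [0, 20, 45, 90]
paired = [(t, lip_object_at(stamps, lips, t)) for t in audio_stamps]
```

Each audio unit is paired with the most recent lip-object that is already due, so the lips shown at any moment are the ones stamped for that part of the translated audio signal.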
The tools for representing natural video in the MPEG-4 visual standard aim at providing standardized core technologies allowing efficient storage, transmission and manipulation of textures, images and video data for multimedia environments. These tools allow the decoding and representation of atomic units of image and video content, called video objects. An example of a video object could be a talking person or only his lips.
The face is an object capable of facial geometry ready for rendering and animation. The shape, texture and expressions of the face are generally controlled by a bit stream containing instances of Facial Definition Parameter (FDP) sets and/or Facial Animation Parameter (FAP) sets. Frame-based and temporal-DCT coding of a large collection of FAPs can be used for accurate speech articulation.
Viseme and expression parameters are used to code specific speech configurations of the lips and the mood of the speaker. A viseme is a sequence of one or more facial feature positions corresponding to a phoneme. A phoneme is a distinct speech element, the shortest representative unit of phonetics. Visemes form the basic units of visual articulatory mouth shapes. A viseme comprises mouth parameters which specify the mouth opening, height, width and protrusion. The face animation part of the standard allows sending parameters that calibrate and animate synthetic faces. The face models themselves are not standardized by MPEG-4; only the parameters are. The new lip-objects can always be manipulated to fit best in the video signal.
Advantageous embodiments of the invention are defined in the dependent claims. An embodiment of the invention provides a method, characterized by comprising the step of: obtaining said new lip-objects by tracking at least one further video signal, said further video signal comprising lip-movements corresponding to said translated audio signal. This embodiment describes a method to obtain the new lip-objects. Because the further video signal comprises lip-movements that correspond to the translated audio signal, the lip-objects that are derived from the further video signal correspond to the translated audio signal. Preferably, the further video signal is obtained by recording the lips of a translator or an original actor. Tracking lip-objects is performed on this further video signal to obtain the new lip-objects. It may be efficient to combine the recording of the lip-movement and the translation of the audio signal. A translator or an original actor can for example provide the translated audio signal as well as the lip-objects at the same time. The advantage of an original actor is that the correspondence of the lips is better, because the new lip-objects originate from the same lips as the original lip-objects.
A further embodiment of the invention provides a method wherein said translation process comprises the steps of: converting the original audio signal into translated text; and deriving said translated audio signal and said new lip-objects from said translated text. In this embodiment, the result of a translation process is translated text. The translated text can be obtained with keyboard input from a translator or by analyzing the audio signal. A computer may for example first convert the audio signal into text and thereafter translate the text into translated text. The translated text is in this case used to derive the translated audio signal, e.g. by use of a Text-To-Speech (TTS) coder. The translated text is also used to derive the new lip-objects. One letter or a combination of letters in the translated text defines a phoneme as well as a viseme. The phoneme and viseme definitions are for example stored in a database. A TTS coder is known from the MPEG-4 standard. A TTS coder accepts text, or text with prosodic parameters (pitch contour, phoneme duration, etc.), as its input and generates intelligible synthetic speech. It supports the generation of parameters that can be used to synchronize associated face animation, as well as international languages for text and international symbols for phonemes. Additional markups are used to convey control information within texts, which is forwarded to other components in synchronization with the synthesized text. MPEG-4 provides a standardized interface for the operation of a TTS coder rather than a normative TTS coder itself. In general, coders are available for generating sound based on structured inputs.
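The statement that one letter or a combination of letters defines a phoneme as well as a viseme can be sketched as a greedy longest-match lookup. This is only one possible reading, not the method prescribed here; the database `GRAPHEME_DB`, its entries and the symbol names are invented for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical database: letter combination -> (phoneme symbol, viseme id).
GRAPHEME_DB: Dict[str, Tuple[str, str]] = {
    "sh": ("S", "V_SH"),
    "s": ("s", "V_S"),
    "o": ("o", "V_O"),
    "e": ("e", "V_E"),
}


def text_to_units(text: str) -> List[Tuple[str, str]]:
    """Split text into (phoneme, viseme) pairs, preferring the longest letter match."""
    text = text.lower()
    max_len = max(len(key) for key in GRAPHEME_DB)
    units: List[Tuple[str, str]] = []
    i = 0
    while i < len(text):
        # Try the longest letter combination first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in GRAPHEME_DB:
                units.append(GRAPHEME_DB[chunk])
                i += n
                break
        else:
            i += 1  # No entry for this character (e.g. a space): skip it.
    return units


units = text_to_units("Shoe")
```

The phoneme half of each pair would feed the speech synthesis and the viseme half the construction of the new lip-objects, so both are derived from the same translated text.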
A further embodiment of the invention provides a method characterized by comprising the steps of: dividing said translated audio signal into phonemes; retrieving, from a database, visemes that correspond to said phonemes; and constructing said new lip-objects from said visemes. Preferably, said translation process comprises the steps of: converting said phonemes into text; translating said text into translated text; and deriving said translated audio signal from said translated text. Analyzing an audio signal to obtain phonemes and visemes is known from the art. U.S. Pat. No. 5,608,839 discloses a sound-synchronized video system in which a stream of unsynchronized audio signal, representing speech, and video signal of a speaker, is processed by decoding the signal. A plurality of visemes is memorized corresponding to phonemes in the audio signal. Visemes are fetched corresponding to phonemes in the audio signal, and a synchronism is imparted to the video signal and audio signal by applying the fetched visemes to the unsynchronized video signal of the stream in synchronism with corresponding phonemes in the audio signal of the stream. According to an embodiment, the fetching step includes fetching visemes of the lip movement. The system is suitable for use in a videophone. In this way, the delay that occurs in both directions in a video conferencing system is shortened.
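The steps of dividing the translated audio signal into phonemes, retrieving visemes from a database and constructing new lip-objects can be sketched as follows. The phoneme segmentation itself is assumed to be done elsewhere; the `Viseme` fields follow the mouth parameters named in this description (opening, height, width, protrusion), while the database contents, the `NEUTRAL` fallback and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Viseme:
    opening: float
    height: float
    width: float
    protrusion: float


# Hypothetical viseme database keyed by phoneme symbol; the values are invented.
VISEME_DB: Dict[str, Viseme] = {
    "a": Viseme(0.9, 0.8, 0.6, 0.1),
    "m": Viseme(0.0, 0.2, 0.5, 0.0),
    "o": Viseme(0.6, 0.7, 0.3, 0.7),
}

# Fallback mouth shape for phonemes without a database entry (an assumption).
NEUTRAL = Viseme(0.1, 0.3, 0.5, 0.0)


def build_lip_track(phonemes: List[Tuple[str, float]]) -> List[Tuple[float, Viseme]]:
    """Map (phoneme, start time in seconds) pairs to (start time, viseme) pairs."""
    return [(start, VISEME_DB.get(symbol, NEUTRAL)) for symbol, start in phonemes]


# Phonemes as they might come out of a segmentation of the translated audio.
track = build_lip_track([("m", 0.00), ("a", 0.08), ("x", 0.20)])
```

Each entry of `track` gives the mouth parameters from which a new lip-object could be constructed, together with the time at which it applies.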
Modeling of lip-objects is a known technique, which is especially advantageous in the field of compression. A lip-object can be defined using a lip-model and specific lip-parameters. This is very useful for compression because it suffices to transmit the lip-parameters to define a lip-object. Using lip-parameters is also useful in accordance with the invention because only a selection of the parameters has to be changed. When a lip-model is available at the receiver's end, it suffices to transmit the modified lip-parameters. If desired, the original lip-parameters may also be transmitted. Preferably, the new lip-objects are constructed from the original lip-objects by modifying the lip-parameters. This leads to a best fit for the new lip-objects. In some cases, e.g. where the difference between the original and the new lip-objects is small, it may be profitable to send the new lip-parameters to a receiver as difference signals, in addition to the original lip-parameters, which then also serve as a reference.
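Sending the new lip-parameters as difference signals relative to the original lip-parameters can be sketched as below. The parameter names and the integer (quantized) units are assumptions made for illustration; both parameter sets are assumed to contain the same names.

```python
from typing import Dict

LipParams = Dict[str, int]  # parameter name -> value in arbitrary quantized units


def encode_difference(original: LipParams, new: LipParams) -> LipParams:
    """Keep only the parameters that changed, stored as new minus original."""
    return {name: new[name] - original[name]
            for name in new if new[name] != original[name]}


def decode_difference(original: LipParams, diff: LipParams) -> LipParams:
    """Rebuild the new parameters from the original ones plus the differences."""
    restored = dict(original)
    for name, delta in diff.items():
        restored[name] = original[name] + delta
    return restored


original_params = {"opening": 50, "height": 30, "width": 75, "protrusion": 25}
new_params = {"opening": 20, "height": 30, "width": 80, "protrusion": 25}

diff = encode_difference(original_params, new_params)
restored = decode_difference(original_params, diff)
```

Only the changed parameters appear in `diff`, which is why this form is attractive when the original and new lip-objects differ little; the receiver recovers the new parameters by using the original ones as the reference.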
A transmitter according to the invention is characterized in that the transmitter comprises: tracking means for tracking said video signal to obtain original lip-objects; means for adding new lip-objects to the information stream to replace said original lip-objects, the new lip objects corresponding to said translated audio signal. If the original lip-objects in the video signal have been replaced by the new lip-objects before transmission, the information stream can be received and handled by an ordinary receiver. In a further embodiment the transmitter comprises: means for transmitting a plurality of audio signals relating to different languages and a plurality of lip-objects, which lip-objects are each linked to at least one of said plurality of audio signals. This information stream gives the receiver the possibility to select the desired language for audio as well as for video. It is known from the art to transmit multiple languages, however only in audio. By transmitting only lip-objects or even lip-parameters for multiple languages an efficient transmission and storage of multiple language movies and other audiovisual programs is obtained.
A first receiver according to the invention is characterized in that the receiver comprises: translation means for performing a translation process to obtain a translated audio signal; means for adding said translated audio signal to the information stream; tracking means for tracking said video signal to obtain original lip-objects; means for adding new lip-objects to the information stream that correspond to said translated audio signal; and outputting means for outputting said translated audio signal and said video signal, in which video signal said original lip-objects have been replaced with said new lip-objects. This first receiver comprises translation means in the receiver. The received information stream comprises an audio and a video signal in an original language. This embodiment has the advantage that the translation into a desired (user-selected) language is performed locally, i.e. independent of any transmitter or broadcast organization.
A second receiver according to the invention is characterized in that the receiver comprises tracking means for tracking said video signal to obtain original lip-objects; means for adding to the information stream new lip-objects that correspond to said translated audio signal; and outputting means for outputting said translated audio signal and said video signal, in which video signal said original lip-objects have been replaced with said new lip-objects. A difference from the known receiver of U.S. Pat. No. 5,608,839 is that the new lip-objects according to the invention correspond to a translated audio signal. Here the original audio signal is not out of synchronism with the video signal; rather, the lip-movements of the original lip-objects do not correspond to the translated audio signal, because the original lip-objects correspond to the original audio signal. A database in a receiver according to the invention should comprise phonemes and visemes of the desired languages.
A third receiver according to the invention receives an information stream which comprises: a video signal, a plurality of audio signals relating to different languages and a plurality of lip-objects, which lip-objects are each linked to at least one of said plurality of audio signals; which receiver comprises: a selector for obtaining a selected audio signal from said plurality of audio signals; outputting means for outputting said selected audio signal and said video signal, said video signal comprising selected lip-objects, which lip-objects are linked to said selected audio signal.
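The selector of the third receiver can be sketched with a simple data structure in which each audio signal is linked to a set of lip-objects. The class name, field names and sample values are hypothetical; a link table is used so that one set of lip-objects may be linked to more than one audio signal.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class InformationStream:
    video: str                          # video signal
    audio: Dict[str, str]               # language -> audio signal
    lip_objects: Dict[str, List[str]]   # lip-object set id -> lip-objects
    links: Dict[str, str]               # language -> linked lip-object set id


def select(stream: InformationStream, language: str) -> Tuple[str, List[str]]:
    """Return the selected audio signal and the lip-objects linked to it.

    Raises KeyError if the stream carries no audio signal for the language.
    """
    audio = stream.audio[language]
    lips = stream.lip_objects[stream.links[language]]
    return audio, lips


stream = InformationStream(
    video="video",
    audio={"en": "audio-en", "nl": "audio-nl"},
    lip_objects={"lips-en": ["en-0", "en-1"], "lips-nl": ["nl-0", "nl-1"]},
    links={"en": "lips-en", "nl": "lips-nl"},
)

selected_audio, selected_lips = select(stream, "nl")
```

The outputting means would then present `selected_audio` together with the video signal in which `selected_lips` are shown.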
A communication network according to the invention comprises means for performing a translation process to obtain at least one translated audio signal; means for tracking said video signal to obtain original lip-objects; and means for replacing said original lip-objects with new lip-objects, said new lip-objects being synchronized with said translated audio signal. Such a communication network comprises for example receivers and transmitters as discussed above.
Cheung et al., "Text-driven Automatic Frame Generation using MPEG-4 Synthetic/Natural Hybrid Coding for 2-D Head-and Shoulder Scene", Proc. Int. Conf. on Image Processing, vol. 2, Santa Barbara, 1997, pp. 69-72, describes a facial modeling technique based on MPEG-4 for automatic frame sequence generation of a talking head. With the definition and animation parameters on a generic face object, the shape, textures and expressions of an adapted frontal face can generally be controlled and synchronized by the phonemes transcribed from plain text. The segmenting type can be syllable, intonational phrase or phonetics. Since human speech of any language can be decomposed into its shortest representative phonetics set, lip/facial synchronization can be achieved. Plain text will be transcribed into orthographic phonetic symbols, a computer readable phonetic alphabet. By using a high quality phoneme-to-speech synthesizer for producing the speech, a text-driven lip-synch application can easily be developed. The amount of lip opening and the mouth shape in each frame represent the corresponding facial motion for the pronunciation of the phonemes.
None of the mentioned documents discloses or makes it obvious to replace original lip-objects with new lip-objects, which correspond to a translated signal. The documents however describe tools such as the use of lip-objects and techniques to synthesize audio from text. The manner of extracting phonemes from speech sequences, the manner of memorizing visemes corresponding to phonemes, the manner of extracting the correct facial features and applying them to a video signal are known from the art.
The aforementioned and other aspects of the invention will be apparent from and elucidated with reference to the embodiments described hereinafter.