The present invention is related to the article "Using Speech Acoustics to Drive Facial Motion", by Hani Yehia, Takaaki Kuratate and Eric Vatikiotis-Bateson (Proceedings of the 14th International Congress of Phonetic Sciences, Vol. 1, pp. 631-634, American Institute of Physics, August 1999), which is attached.
1. Field of the Invention
The present invention relates to an electronic communication technique. More specifically, it consists of a method and system for digital encoding and decoding of audiovisual speech, i.e. the facial image and sound produced by a speaker. The signal is encoded at low bit-rates. The speech acoustics is represented in parametric form, whereas the facial image is estimated from the speech acoustic parameters by means of a statistical model.
2. Description of the Background Art
Developments in wide area computer networks and digital communication techniques have contributed to the practical use of video conference systems. These systems enable persons at remote locations to hold a conference through a network. Also, telephone communication can be expanded to incorporate video information by means of the digital (CCD) cameras currently available. Such systems, however, require bit-rates sufficiently low for the users' demand to be compatible with channel capacity.
Using conventional techniques, transmission of image signals requires a bit-rate between two and three orders of magnitude larger than that required for the transmission of telephone speech acoustics. Thus, if video is to be transmitted over a telephone line, the frame rate has to be very low.
One way to solve this problem is to increase the bit-rate capacity of the channel. Such a solution is, however, expensive and, hence, not practical. Moreover, the increasing demand for real time video communications justifies efforts in the direction of innovative video compression techniques.
Video compression rate is limited if compression is done without taking into account the contents of the image sequence that forms the video signal. In the case of audiovisual speech coding, however, it is known that the image being encoded is that of a human face. The use of this information allows the development of compression techniques which are much more efficient. Furthermore, during speech, the acoustic signal is directly related to the speaker's facial motion. Thus, if the redundancy between audio and video signals is reduced, larger compression rates can be achieved. The technique described in this text goes in this direction.
The objective of the present invention is to provide a method and system of audiovisual speech coding, which is capable of transmitting and recovering a speaker's facial motion and speech audio with high quality even through a channel of limited capacity.
This objective is achieved in two steps. First, facial images are encoded based on the a priori information that the image being encoded is that of a human face. Second, the dependence between speech acoustics and facial motion is used to allow facial image recovery from the speech audio signal.
In the present invention, the method of transmitting a facial image includes the following steps: (1) setup, at the receiver, of a facial shape estimator which receives the speech audio signal as input and generates a facial image of the speaker as output; (2) transmission of the speech audio signal to the receiver; and (3) generation of the facial images which form the speaker's video signal.
Thus, transmission of only the speech audio signal enables the receiver to generate the speaker's facial video. The facial image can then be transmitted with high efficiency, using a channel of far lower bit-rate, as compared with the transmission bit-rate required for standard image coding.
Preferably, the setup step is divided into the following parts: (1.a) specification of an artificial neural network architecture to be used at both transmitter and receiver sides; (1.b) training of the artificial neural network on the transmitting side so that facial images determined from the speech audio signal match original facial images as well as possible; and (1.c) transmission of the weights of the trained artificial neural network to the receiver.
The artificial neural network of the transmitter side is trained and its parameters are sent to the receiver side before communication starts. So, the artificial neural network of the receiving side is set identically to that of the transmitter side when communication is established. Thus it is ready for audiovisual speech communication using only the speech audio to recover the speech video counterpart.
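As an illustration only, the transfer of trained weights in part (1.c) might be sketched as follows. The byte layout (a layer count, then each layer's dimensions followed by its weights as 64-bit floats) and the function names `pack_weights` and `unpack_weights` are assumptions made for this sketch; the description above does not specify a wire format.

```python
import struct

def pack_weights(layers):
    """Serialize a list of weight matrices (lists of rows) into bytes.

    Hypothetical layout: layer count, then for each layer its row and
    column counts followed by the weights as little-endian doubles.
    """
    chunks = [struct.pack("<I", len(layers))]
    for matrix in layers:
        rows, cols = len(matrix), len(matrix[0])
        flat = [value for row in matrix for value in row]
        chunks.append(struct.pack("<II", rows, cols))
        chunks.append(struct.pack(f"<{rows * cols}d", *flat))
    return b"".join(chunks)

def unpack_weights(payload):
    """Rebuild the list of weight matrices on the receiving side."""
    offset = 0
    (n_layers,) = struct.unpack_from("<I", payload, offset)
    offset += 4
    layers = []
    for _ in range(n_layers):
        rows, cols = struct.unpack_from("<II", payload, offset)
        offset += 8
        flat = struct.unpack_from(f"<{rows * cols}d", payload, offset)
        offset += 8 * rows * cols
        layers.append([list(flat[r * cols:(r + 1) * cols]) for r in range(rows)])
    return layers

# Illustrative weights of a tiny two-layer network (values are made up).
transmitter_weights = [
    [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]],  # hidden layer: 2 x 3
    [[1.0, -1.0]],                          # output layer: 1 x 2
]
payload = pack_weights(transmitter_weights)
receiver_weights = unpack_weights(payload)
```

Because the weights are sent as 64-bit floats, the receiver's copy is bit-identical to the transmitter's, so both networks produce the same estimate for the same audio input. A lower-precision format would shorten the setup payload at the cost of a small mismatch between the two sides.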
Preferably, the step of neural network training consists of measuring coordinates of predetermined portions of a speaker's face during speech production on the transmitting side; simultaneously extracting parameters from the speech audio signal; and adjusting the weights of the artificial neural network using the speech audio parameters as input and the measured facial coordinates as the reference signal.
The artificial neural network is trained for each speaker. Therefore, efficient real time transmission of facial images of an arbitrary speaker is possible.
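The training step described above can be sketched, under stated assumptions, with a small one-hidden-layer network trained by stochastic gradient descent. The network size, learning rate, squared-error loss and the synthetic data standing in for speech parameters and measured facial coordinates are all hypothetical choices for illustration; the description does not fix any of them.

```python
import math
import random

def init_network(n_in, n_hidden, n_out, rng):
    """Create a one-hidden-layer network with small random weights."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {"W1": matrix(n_hidden, n_in), "b1": [0.0] * n_hidden,
            "W2": matrix(n_out, n_hidden), "b2": [0.0] * n_out}

def forward(net, x):
    """Map audio parameters x to (hidden activations, estimated coordinates)."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(net["W1"], net["b1"])]
    y = [sum(w * hi for w, hi in zip(row, h)) + b
         for row, b in zip(net["W2"], net["b2"])]
    return h, y

def train_step(net, x, target, lr):
    """One gradient step on the squared error between estimate and reference."""
    h, y = forward(net, x)
    err = [yi - ti for yi, ti in zip(y, target)]
    # Back-propagate through the output layer before its weights are changed.
    dh = [(1.0 - h[j] ** 2) * sum(err[k] * net["W2"][k][j] for k in range(len(err)))
          for j in range(len(h))]
    for k in range(len(err)):
        for j in range(len(h)):
            net["W2"][k][j] -= lr * err[k] * h[j]
        net["b2"][k] -= lr * err[k]
    for j in range(len(h)):
        for i in range(len(x)):
            net["W1"][j][i] -= lr * dh[j] * x[i]
        net["b1"][j] -= lr * dh[j]

def mean_loss(net, data):
    """Average squared error over all (audio, coordinates) pairs."""
    total = 0.0
    for x, target in data:
        _, y = forward(net, x)
        total += 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, target))
    return total / len(data)

rng = random.Random(0)
# Synthetic stand-in data: two "audio parameters" in, two "coordinates" out.
data = []
for _ in range(40):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    data.append((x, [0.5 * x[0] - 0.3 * x[1], math.tanh(x[0] + x[1])]))

net = init_network(2, 4, 2, rng)
initial_loss = mean_loss(net, data)
for _ in range(200):
    for x, target in data:
        train_step(net, x, target, lr=0.05)
final_loss = mean_loss(net, data)
print(f"mean loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

In an actual system the inputs would be parameters extracted from the speaker's recorded speech and the targets would be the simultaneously measured facial coordinates, with one such network trained per speaker.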
Preferably, the method of face image transmission also includes the following steps: measuring, for each frame, coordinates of predetermined portions of the speaker's face during speech production; applying the speech audio signal to the trained artificial neural network of the transmitting side to obtain estimated values of the coordinates of the predetermined portions of the speaker's face; and comparing measured and estimated coordinate values to find the estimation error.
Once the error between the coordinate values estimated by the artificial neural network and the actual coordinates of the predetermined positions of the speaker's face is found on the transmitting side, it becomes possible to determine to what extent the face image of the speaker generated on the receiving side matches the speech.
Preferably, the method of face image transmission further includes the following steps: transmitting the estimation error to the receiving side; and correcting the output of the artificial neural network on the receiving side based on the estimation error received. The precision used to transmit the estimation error is, however, limited by the channel capacity (bit-rate) available.
As the error signal obtained on the transmitting side is transmitted to the receiving side, it becomes possible to correct the image obtained on the receiving side by using the error signal. As a result, a video signal of the speaker's face matching the speech signal can be generated.
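The correction step, with the transmission precision limited by the available bit-rate, might look like the following minimal sketch. Uniform quantization with a fixed step size is an assumption made here for illustration, as are the function names and the sample coordinate values.

```python
def quantize_error(measured, estimated, step):
    """Transmitter side: express the estimation error as integer step counts."""
    return [round((m - e) / step) for m, e in zip(measured, estimated)]

def apply_correction(estimated, quantized_error, step):
    """Receiver side: add the dequantized error to the network output."""
    return [e + q * step for e, q in zip(estimated, quantized_error)]

# Made-up coordinates for one frame.
measured = [12.34, -5.67, 0.02]   # coordinates measured at the transmitter
estimated = [12.0, -5.0, 0.0]     # network output, identical on both sides
step = 0.25                       # coarser steps need fewer bits per frame

q_error = quantize_error(measured, estimated, step)
corrected = apply_correction(estimated, q_error, step)
residual = [abs(c - m) for c, m in zip(corrected, measured)]
```

With this scheme the residual error after correction is at most half a quantization step per coordinate, so the step size trades reconstruction accuracy against the bit-rate spent on the error signal.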
Preferably, the method of face image transmission further includes the following steps: comparing the magnitude of the estimation error with a predetermined threshold value; transmitting the error to the receiving side when its magnitude exceeds the threshold value; and correcting the output of the artificial neural network on the receiving side based on the received error.
As the error signal obtained on the transmitting side is transmitted to the receiving side when its magnitude exceeds the predetermined threshold value, it becomes possible to correct the image obtained on the receiving side by using the error signal. As a result, the video signal of the speaker's face matching the speech signal can be obtained. The error signal is not always transmitted, and the bit-rate used to transmit it is chosen so that transmission of the speech signal is not hindered.
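A minimal sketch of the threshold test follows. Measuring the magnitude of the error as the Euclidean norm over all coordinates of a frame is an assumption, since the description does not say how the magnitude is computed; the function names and sample values are likewise hypothetical.

```python
import math

def encode_frame(measured, estimated, threshold):
    """Transmitter side: return the error only if it is large enough to send."""
    error = [m - e for m, e in zip(measured, estimated)]
    magnitude = math.sqrt(sum(d * d for d in error))
    return error if magnitude > threshold else None

def decode_frame(estimated, error):
    """Receiver side: correct the network output when an error was received."""
    if error is None:
        return list(estimated)
    return [e + d for e, d in zip(estimated, error)]

threshold = 2.0

# Frame with a small error: nothing is sent, the estimate is used as is.
small = encode_frame([10.0, 20.0], [10.5, 19.0], threshold)
frame_a = decode_frame([10.5, 19.0], small)

# Frame with a large error: the error is sent and the estimate is corrected.
large = encode_frame([13.0, 24.0], [10.0, 20.0], threshold)
frame_b = decode_frame([10.0, 20.0], large)
```

Frames whose estimate is already close to the measurement cost no extra bits, which is how the error channel can be kept from competing with the speech signal for capacity.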
According to another aspect of the present invention, the system for transmitting the audio signal and the video signal of the face of a speaker during speech production on the transmitting side to the receiving side includes: a transmission apparatus for transmitting the speech audio signal produced by the speaker to the receiving side; a facial shape estimation unit receiving the speech audio signal produced by the speaker transmitted from the transmitting apparatus and outputting an estimated facial shape of the speaker; and a receiving apparatus including a face image generation unit which generates facial images of the speaker. These images are based on the facial shapes estimated from the speech audio signal by the facial shape estimation unit.
It is possible for the receiving side to generate the video signal of the facial shape of the speaker on the transmitting side by simple transmission of only the speech audio signal from the transmitting side to the receiving side.
Thus, transmission of only the speech audio signal enables the receiver to generate the speaker's facial video. The facial image can then be transmitted with high efficiency, using a channel of far lower capacity, as compared with the transmission bit-rate required for standard image coding.
Preferably, the transmitting apparatus further includes a circuit which compares an estimation error with a predetermined threshold value and transmits the error to the receiving side when it exceeds the threshold value. Preferably, the receiving apparatus further includes an error correction circuit which corrects the outputs of the receiving side artificial neural network based on the received error.
As the estimation error signal obtained on the transmitting side is transmitted to the receiving side when its magnitude exceeds the predetermined threshold value, it becomes possible to correct the image obtained on the receiving side by using the estimation error signal. As a result, a video signal of the speaker's face matching the speech signal can be obtained. The error signal is not always transmitted, and the bit-rate used to transmit it is chosen so that transmission of the speech signal is not hindered.
According to a further aspect of the present invention, the apparatus for transmitting the face image, used in the system for transmitting the audio and video signals of the face of a speaker during speech production from the transmitting side to the receiving side, includes: a circuit for transmitting the speech audio signal to the receiving apparatus; an artificial neural network capable of learning, which uses parameters extracted from the speech produced by the speaker as input and gives as output information from which the face image of the speaker can be specified; and a circuit for transmission of feature parameters of the artificial neural network to the receiving side.
As the artificial neural network on the transmitting side is trained before communication and the feature parameters of the trained artificial neural network are transmitted to the receiving side, the artificial neural network on the receiving side can be set identically to the artificial neural network on the transmitting side. Therefore, it is possible for the receiving apparatus to estimate the speaker's facial shape and to generate motion pictures of the face from the received speech audio signal using the artificial neural network. Thus real time transmission of motion pictures of the face is possible with a small amount of data per frame.
Preferably, the apparatus for facial image transmission includes: a measuring circuit for time-sequentially measuring coordinates of predetermined portions of the speaker's face; a feature extracting circuit for time-sequentially extracting feature parameters of the speech produced by the speaker; and a circuit for training of an artificial neural network, which uses the speech audio features obtained by the feature extracting circuit as input, and the coordinates measured by the measuring circuit as the teacher signal. These data are obtained simultaneously while the speaker reads training sentences.
As the artificial neural network is trained for each speaker during the production of training sentences, highly efficient real time transmission of the face image of any speaker is possible.
According to another aspect of the present invention, the apparatus for recovering the face image includes: a facial shape estimation unit receiving feature parameters extracted from the speech acoustics produced by the speaker and outputting an estimated signal of the speaker's facial shape; and a face image generation circuit which generates a video signal of the speaker's facial shape. The sequence of images that forms the video signal is based on the facial shapes estimated from the speech audio signal by the facial shape estimation unit.
It is possible for the face image recovering apparatus to generate the video signal of the speaker's face directly from the speech audio signal produced by the speaker and transmitted to the receiving side. Thus, the facial image can be recovered with high efficiency, using a channel of far lower capacity (bit-rate), as compared with the transmission bit-rate and storage required for standard image coding.
According to a still further aspect of the present invention, the apparatus for transmitting the face image, used in the system for transmitting the audio and video signals of the face of a speaker during speech production from the transmitting side to the receiving side, includes: a speech audio signal transmission circuit transmitting the speaker's speech to the receiving apparatus; a trainable artificial neural network which uses feature parameters of the speech produced by the speaker as input, and yields as output information from which the speaker's face image can be specified; and a parameter transmission circuit which transmits the feature parameters (weights) of the artificial neural network to the receiving side.
According to a still further aspect of the present invention, the apparatus for recovering the face image includes: a facial shape estimation circuit which receives feature parameters extracted from the speech produced by the speaker and outputs an estimated facial shape of the speaker while he/she produces speech; and a face image generation circuit which generates a video signal of the speaker's facial shape. This video signal is based on the speaker's estimated facial shape output by the facial shape estimation circuit upon reception of the speech feature parameters.
The foregoing and other objects, features, aspects and advantages of the present invention will become more apparent from the following detailed description of the present invention when taken in conjunction with the accompanying drawings.