1. Field of the Invention
This invention relates to coding of facial animation parameters (FAPs) for synthetic "talking head" video and more specifically to spatial and/or temporal transform coding of FAPs that allows the simultaneous transmission of multiple synthetic talking head sequences over a band-limited channel.
2. Description of the Related Art
The existing and developing Moving Picture Experts Group (MPEG) standards provide techniques for coding and transmitting natural digital video signals over band-limited channels. Natural video has a very high bandwidth and thus must be compressed. The basic approach is to perform a motion-compensated prediction on adjacent frames to reduce temporal redundancy and then a two-dimensional discrete cosine transform (DCT) on 8×8 pixel blocks representing the prediction error in each frame to reduce the spatial redundancy. This lossy approach realizes significant coding gain, on the order of 30:1, with minimal visual artifacts.
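By way of illustration only, the spatial transform step described above can be sketched as a separable two-dimensional DCT applied to one 8×8 block of prediction error. The sketch assumes an orthonormal DCT-II; the function names (dct_matrix, dct2, idct2) are hypothetical and are not taken from any MPEG reference software. Motion compensation, quantization and entropy coding are omitted.

```python
import math

N = 8  # block size used by the MPEG spatial transform


def dct_matrix(n=N):
    """Return the n x n orthonormal DCT-II basis matrix (assumed normalization)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def dct2(block):
    """Forward 2-D DCT: transform the columns, then the rows (C * X * C^T)."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def idct2(coeffs):
    """Inverse 2-D DCT (C^T * Y * C), valid because C is orthonormal."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)


if __name__ == "__main__":
    # A flat prediction-error block concentrates all of its energy in the
    # single DC coefficient, which is what makes the transform compressible.
    flat = [[10.0] * N for _ in range(N)]
    print(round(dct2(flat)[0][0], 6))
```

A smooth block therefore yields only a few significant coefficients, and the remaining near-zero coefficients can be coarsely quantized or discarded.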
The MPEG-4 standard under development will also include the capability to generate and transmit synthetic "talking head" video for use in multimedia communication systems. The new standard will include a facial animation parameter (FAP) set that is defined based on the study of minimal facial actions and is closely related to muscle actions. The FAP set enables model-based coding of natural or synthetic talking head sequences and allows intelligible reproduction of facial expressions, emotions and speech pronunciations at the receiver. Currently, the FAP set contains 68 parameters that define the shape deformation or movements of a face. For example, the parameter open_jaw defines the displacement of the jaw in the vertical direction while the parameter head_yaw specifies the rotational yaw angle of the head from the top of the spine. All the FAPs are defined with respect to a neutral face and expressed in a local coordinate system fixed on the face. Many different encoding architectures can be designed to generate the FAP set, which will constitute the majority of the transmitted data.
Channel capacity, which is limited by modem capabilities, is currently 33.4 kbits/sec for plain old telephone service (POTS). Some state-of-the-art modems provide 56 kbits/sec downstream capability from a central location to a home but only 33.4 kbits/sec upstream. Since the 68 FAPs, each represented by 10 bits, at a 30 Hz video rate require only 20.4 kbits/sec, it is possible to transmit them uncoded and thus preserve their visual quality. However, this approach neither contemplates nor supports the simultaneous transmission of multiple talking heads as part of a single video signal, as may occur in a virtual meeting, for example, or the transmission of the FAPs as part of larger synthetic objects, for example, full-body animation.
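The bit-rate arithmetic above can be checked with a short sketch. It uses only the figures stated in this section (68 FAPs, 10 bits per FAP, 30 Hz, and 33.4 kbits/sec of channel capacity); the constant and function names are illustrative assumptions, and 1 kbit is taken as 1,000 bits.

```python
NUM_FAPS = 68          # parameters in the FAP set
BITS_PER_FAP = 10      # uncoded representation of each parameter
FRAME_RATE_HZ = 30     # video frame rate
CHANNEL_BPS = 33_400   # POTS modem capacity stated in the text


def uncoded_fap_rate_bps(num_heads=1):
    """Bits per second needed to send every FAP of every head uncoded."""
    return num_heads * NUM_FAPS * BITS_PER_FAP * FRAME_RATE_HZ


def max_uncoded_heads(channel_bps=CHANNEL_BPS):
    """Largest whole number of uncoded talking heads that fit in the channel."""
    return channel_bps // uncoded_fap_rate_bps(1)


if __name__ == "__main__":
    print(uncoded_fap_rate_bps(1))   # 20400 bits/sec, i.e. 20.4 kbits/sec
    print(max_uncoded_heads())       # heads that fit at 33.4 kbits/sec
```

Under these assumptions a single uncoded head fits within the channel, but two heads would require 40.8 kbits/sec, which exceeds the 33.4 kbits/sec available.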