The present invention relates generally to the compression of data in a signal having a video component and an audio component, and more particularly to a method and apparatus for reducing the data transmission requirements of signals transmitted between remote terminals of a video communication system.
Currently available video communication systems generate poor quality video images, with small display areas, jerky motion, blurring and blocky-looking artefacts, and in many instances the audio fails to synchronise fully with the video images. This is largely due to group delay introduced by the compression/decompression of the video signal for transmission.
The fundamental objective of recent developments in video communication systems has been to provide the best quality video image within the available data rate. Typically, video data is compressed prior to transmission and decompressed prior to generating an image following transmission.
Since the bandwidth is dictated by the available transmission medium, video communication systems requiring higher data rates generally require greater compression of the video image. Conventional compression rates for video compression systems are in the range of 100-to-1 to 300-to-1. However, high compression of the video image will invariably result in a loss in the quality of the video image, particularly in sequences with significant changes from frame-to-frame. Disadvantageously, an increase in compression requires either an increase in computational capability or an increase in group delay.
Recent developments in video communication systems have attempted to alleviate some of the problems described by reducing the level of data required by the receiver for generating the display video image. This has been achieved by selecting and compressing, for transmission to the receiver, video data only from those regions of the video image containing significant changes from frame-to-frame. However, the quality of the display video image remains compromised where high levels of motion occur in separate regions of the video image, for example in a video conference where the monitored event comprises a group of users.
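The region selection described above can be illustrated with a minimal sketch. It assumes a greyscale frame held as rows of pixel intensities, a fixed block size and a mean-absolute-difference threshold; the function name `changed_blocks`, the block size and the threshold are hypothetical choices made for illustration and are not taken from any of the systems referred to.

```python
from typing import List, Tuple

Frame = List[List[int]]  # greyscale frame as rows of pixel intensities (assumed layout)


def changed_blocks(prev: Frame, curr: Frame, block: int = 2,
                   threshold: float = 10.0) -> List[Tuple[int, int]]:
    """Return (row, col) origins of blocks whose mean absolute difference
    between the two frames exceeds `threshold` (illustrative criterion)."""
    height, width = len(curr), len(curr[0])
    selected = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            total, count = 0, 0
            # Accumulate the absolute pixel difference over this block.
            for y in range(top, min(top + block, height)):
                for x in range(left, min(left + block, width)):
                    total += abs(curr[y][x] - prev[y][x])
                    count += 1
            if total / count > threshold:
                selected.append((top, left))
    return selected


prev_frame = [[0] * 4 for _ in range(4)]
curr_frame = [row[:] for row in prev_frame]
curr_frame[2][2] = 100  # a single pixel changes in the bottom-right block
print(changed_blocks(prev_frame, curr_frame))  # [(2, 2)]
```

Only the blocks returned would be compressed and transmitted. When motion occurs in several separate regions at once, many blocks exceed the threshold together, which is the situation in which this approach saves little data.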
In video conferencing situations users derive a greater comfort factor from systems that are able to generate a display image in which the video component and the audio component are synchronised. Furthermore, it has been found that users are better able to comprehend audio data (speech) where the facial movements of other users are distinct. Therefore, it is desirable to maintain and even enhance the resolution of the display video image in regions comprising the facial features of the user.
A previous patent application, EP 95301496.6, "Video signal processing systems and methods utilising automated speech analysis", describes a method of increasing the frame rate of a video communication system by monitoring the utterances of the speaker and reconstructing non-transmitted frames between transmitted frames from stored facial feature information. The described system uses a fixed transmitted frame rate, with reconstructed frames in between, to increase the effective frame rate at the receiver. The group delay problem of the prior art is not addressed by this application. The system is also prone to errors in the decoder due to error propagation, and has no defined start-up method.
A video communication system, and a method for operating a video communication system, that reduce the levels of data required by the receiver for generating the display video image have been described previously in application number EP 97401772.5. That application described transmitting video data only for regions of successive frames that contain "substantial" differences frame-to-frame, while video data corresponding to the facial region of the "active" user at any instant is predicted from the received audio data. The audio data (speech) is transmitted to the receiver without the video data corresponding to the facial features. The received audio data is then used to predict the pixels of the display video image that have changed from a preceding frame, so that the current frame of the display video image can be reconstructed, resulting in a reduction in the data rate requirements of the video communication system and a reduction in group delay. This method is described further herein in conjunction with the present invention.
A model of the facial features associated with the speech patterns of the speaker is created to generate video at the receiver for portions of the video image that change rapidly. When the model is operating within given error boundaries, it will only be necessary to transmit the audio portion of the data. Since the accuracy of this model versus the initial video stream of the speaker affects the bandwidth required to send the video stream, it is desirable to improve the accuracy of the model to minimise the required bandwidth. The present invention is concerned with this model accuracy and reducing the time taken to synchronise the model by reducing the number of degrees of freedom of the model and by making the decision in a hierarchical manner.
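The decision of whether audio alone is sufficient can be sketched as follows. This is an illustrative assumption rather than the disclosed method: the error measure (mean absolute pixel difference between the model's prediction and the captured facial region), the bound of 5.0 and the names `mean_abs_error` and `choose_payload` are all hypothetical.

```python
from typing import List, Sequence


def mean_abs_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute difference between predicted and captured pixel values."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("regions must be non-empty and of equal size")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def choose_payload(predicted: Sequence[float], actual: Sequence[float],
                   error_bound: float = 5.0) -> List[str]:
    """Return the streams to send for this frame.

    While the model's prediction stays within the error bound, only audio
    is sent; otherwise video data is sent as well (hypothetical policy).
    """
    if mean_abs_error(predicted, actual) <= error_bound:
        return ["audio"]
    return ["audio", "video"]


print(choose_payload([10, 10], [11, 12]))  # ['audio']
print(choose_payload([0, 0], [20, 20]))    # ['audio', 'video']
```

Under such a policy, a more accurate model stays within the bound more often, so video data is sent less frequently and the required bandwidth falls.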
The method and system disclosed by the present teachings utilise a sub-phoneme decision making process to improve the accuracy of the facial model produced by limiting the number of options from which the model produced in a subsequent instant can be formed. Using this method and system the group delay of the system can be reduced to approximately the group delay for speech (i.e. less than 20 msec) plus the sampling period used to define a sub-phoneme (e.g. ~50 msec). Additionally, the speech and video are reproduced without significant echo as the speech and sound are substantially synchronised.
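The idea of limiting the options available at the next instant can be sketched as a constrained choice among successor states. Everything in the sketch is assumed for illustration: the four mouth-shape states, their "openness" values, the successor table and the function `next_state` are hypothetical and do not represent the facial model of the present teachings.

```python
from typing import Dict, List

# Hypothetical mouth-shape states with an illustrative "openness" value each.
OPENNESS: Dict[str, float] = {"closed": 0.0, "narrow": 0.3, "mid": 0.6, "wide": 1.0}

# Hypothetical rule: within one sampling period the mouth can only stay as it
# is or move to an adjacent shape, so each state has few allowed successors.
SUCCESSORS: Dict[str, List[str]] = {
    "closed": ["closed", "narrow"],
    "narrow": ["closed", "narrow", "mid"],
    "mid": ["narrow", "mid", "wide"],
    "wide": ["mid", "wide"],
}


def next_state(current: str, target_openness: float) -> str:
    """Pick the allowed successor whose openness is closest to the target
    value estimated from the audio (the estimate itself is assumed given)."""
    return min(SUCCESSORS[current],
               key=lambda s: abs(OPENNESS[s] - target_openness))


print(next_state("closed", 1.0))  # narrow
print(next_state("mid", 0.9))     # wide
```

Because only the successors of the current state are considered, each decision searches two or three candidates rather than the full set, which is one way in which fewer degrees of freedom could shorten the time taken for the model to synchronise.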