The present invention relates generally to the transmission of encoded audio and video information. More particularly the invention relates to adaptive time shifting of an encoded audio signal relative to an encoded video signal in a received audio/video message stream according to the preambles of claims 1 and 9. The invention also relates to a system for transmission of real-time audio and video information according to claim 14.
When audio and video information is presented jointly, i.e. moving images are shown together with a matching audio signal, a certain degree of synchronicity between the audio and the video information is demanded in order for the presentation to be acceptable to the human senses. For instance, a sound that can be deduced from the observation of a particular visual event must coincide sufficiently well in time with the presentation of such image information. Typically, the lip movements of a speaker must be at least relatively well synchronized with a playback of the speaker's voice. Human perception cannot, however, distinguish a very small deviation between a piece of audio information and a corresponding visual event. Thus, if the deviation is small enough, the audio information may be presented either slightly earlier or slightly later than the video information without this fact being noticeable by a human being. Experiments have shown that a one-way skew of less than 80 ms cannot be noticed and that a one-way skew of less than 150 ms is generally acceptable.
If, however, an audio signal and a corresponding video signal are presented with a deviation between the signals exceeding 400 ms, the presentation is perceived to have an exceptionally low quality. Unfortunately, the video delay in most of today's videoconference systems running at the ISDN (Integrated Services Digital Network) basic rate of 128 kbps is in the order of 400 ms.
In GSM (Global System for Mobile communication) the audio delay is approximately 90 ms. In a solution where a voice signal is transmitted via GSM and a corresponding video signal is sent by means of a 128 kbps video conferencing system, a delay of between 230 ms and 390 ms must be added to the audio signal in order to maintain a deviation of 80 ms or less between the audio and the video signal. Since the audio signal is sampled and converted into a digitally encoded signal, which is delivered in encoded audio frames at typically 20 ms intervals, the clock signal generating the delay of the audio signal must have a very high accuracy.
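The delay interval stated above follows from simple arithmetic on the figures given: the nominal difference between the video delay (about 400 ms) and the audio delay (about 90 ms), plus or minus the tolerated skew of 80 ms. The following is a minimal illustrative sketch of that calculation only; the function names are hypothetical, and the conversion to whole 20 ms audio frames is an assumption added for illustration.

```python
import math


def audio_delay_window(video_delay_ms, audio_delay_ms, max_skew_ms):
    """Return (low, high): extra audio delay in ms keeping |skew| <= max_skew_ms."""
    nominal = video_delay_ms - audio_delay_ms  # delay that aligns the signals exactly
    low = max(0, nominal - max_skew_ms)        # audio may lead video by max_skew_ms
    high = nominal + max_skew_ms               # audio may lag video by max_skew_ms
    return low, high


def whole_frame_delays(low_ms, high_ms, frame_ms):
    """Return (min_frames, max_frames) of whole audio frames inside the window."""
    return math.ceil(low_ms / frame_ms), math.floor(high_ms / frame_ms)


# Figures from the text: 400 ms video delay, 90 ms GSM audio delay, 80 ms skew.
low, high = audio_delay_window(400, 90, 80)
frames = whole_frame_delays(low, high, 20)
print(low, high, frames)
```

With the figures from the text the window is 230 ms to 390 ms, which corresponds to delaying the audio by between 12 and 19 whole frames of 20 ms each.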
The patent document EP, A1, 0 577 216 describes an audio/video-interface in which a FIFO (First In/First Out) buffer is used to accomplish a constant delay of received data such that a presented voice signal is synchronized with related lip movements of a speaker. The fullness of the FIFO buffer is controlled in response to a buffer centering signal, which defines a range between an upper and a lower threshold value corresponding to a desired delay interval. If the buffer fullness falls below the lower threshold value the same data elements are repeatedly read out until a sufficient delay of the data is achieved. If, however, the buffer fullness increases above the upper threshold level new data elements are instead written over previously stored data elements until the delay is reduced to the desired level.
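The threshold behaviour described above can be illustrated by a short sketch. It is not taken from the cited document: the class name, the choice of which stored element is overwritten, and the use of element counts as the measure of fullness are all assumptions made for illustration.

```python
from collections import deque


class CenteredFifo:
    """FIFO whose fullness is steered toward the range [lower, upper]."""

    def __init__(self, lower, upper):
        if not 0 < lower <= upper:
            raise ValueError("require 0 < lower <= upper")
        self.lower = lower
        self.upper = upper
        self.items = deque()

    def write(self, element):
        if len(self.items) > self.upper:
            # Fullness above the upper threshold: overwrite a stored element
            # (assumption: the most recently written one) so the delay shrinks.
            self.items[-1] = element
        else:
            self.items.append(element)

    def read(self):
        if not self.items:
            return None
        if len(self.items) < self.lower:
            # Fullness below the lower threshold: repeat the oldest element
            # without consuming it, so the delay can build up again.
            return self.items[0]
        return self.items.popleft()


fifo = CenteredFifo(lower=2, upper=3)
fifo.write("a")
print(fifo.read(), fifo.read())  # the same element is read out twice
```

Repeating or overwriting elements keeps the delay within the desired interval at the cost of duplicated or discarded data, which is the trade-off inherent in the approach as described.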
An alternative method and arrangement for maintaining a constant delay between a received audio signal and a received video signal is described in EP, A1, 0 598 295. Here samples of the audio signal are temporarily stored in a buffer memory to achieve a certain delay of the signal. The number of stored audio samples in the buffer memory is detected in every n:th field of the video signal. If this number reaches a specified value the read or write address of the buffer memory is preset such that the number of stored audio samples at the inspection point occurring at intervals of n video fields is kept constant.
The U.S. Pat. No. 6,104,706 discloses a solution where audio, video and possibly other kinds of data are time multiplexed into a packetized data stream in which each packet is assigned a particular priority. The packetized data stream is then transmitted in substantially the order of priority. Audio packets are given the highest priority, followed by video packets. Packets containing other types of data are given the lowest priority. Continuous real-time audio playback is maintained at the receiver side by delaying the playback of received audio packets in a FIFO buffer, which provides a delay time equal to a predicted average system delay for the communications system. The audio playback is slowed or accelerated in order to shrink or grow the difference in time between the sender and the receiver.
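The priority ordering described above (audio first, then video, then other data) may be sketched as follows. This is an illustrative assumption of how such ordering could be realized with a priority queue, not the mechanism of the cited patent; the names `PRIORITY` and `transmission_order` are hypothetical.

```python
import heapq
from itertools import count

# Lower number means higher priority, per the ordering stated in the text.
PRIORITY = {"audio": 0, "video": 1, "data": 2}


def transmission_order(packets):
    """Yield (kind, payload) pairs in priority order, stable within a kind."""
    seq = count()  # tie-breaker that preserves arrival order within a priority
    heap = [(PRIORITY[kind], next(seq), kind, payload) for kind, payload in packets]
    heapq.heapify(heap)
    while heap:
        _, _, kind, payload = heapq.heappop(heap)
        yield kind, payload


queued = [("data", "d0"), ("video", "v0"), ("audio", "a0"), ("video", "v1")]
print(list(transmission_order(queued)))
```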
The patent document EP, A1, 0 577 216 describes a semiautomatic system for accomplishing synchronicity between lip movements of a speaker and corresponding voice information by means of a programmable delay circuit in the audio channel. An area of the image represented by the video channel is manually defined within which motion related to sound occurs. Motion vectors are then generated for the defined area, and correlated with levels of the audio channel to determine a time difference between the video and the audio channels. The programmable delay circuit is controlled to compensate for this delay such that the voice signal can be presented in parallel with the relevant video information.
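The correlation step described above can be illustrated with a minimal sketch. It assumes that the motion in the defined image area has already been reduced to one magnitude value per sample instant and that the audio level is sampled at the same instants; the function name `estimate_lag` and the use of a mean-removed cross-correlation are assumptions, not details of the cited document.

```python
def estimate_lag(motion, audio_level, max_lag):
    """Return the lag in samples (positive = audio late) that best aligns the two series."""
    mean_m = sum(motion) / len(motion)
    mean_a = sum(audio_level) / len(audio_level)

    def score(lag):
        # Mean-removed cross-correlation over the overlapping samples only.
        return sum(
            (motion[i] - mean_m) * (audio_level[i + lag] - mean_a)
            for i in range(len(motion))
            if 0 <= i + lag < len(audio_level)
        )

    return max(range(-max_lag, max_lag + 1), key=score)


# Hypothetical data: the audio burst trails the motion burst by two samples.
motion = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
audio = [0, 0, 0, 0, 1, 3, 1, 0, 0, 0]
print(estimate_lag(motion, audio, max_lag=4))
```

The estimated lag, multiplied by the sample interval, gives the time difference that a programmable delay circuit would then have to compensate.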
All the above-mentioned documents refer to various delays of an audio signal. It is, however, very difficult to obtain a perceptually satisfying result when applying the known solutions if the delay is implemented by means of a system resource in a computer. In practice, computers having non-real-time operating systems cannot maintain a sufficient accuracy of an allocated system resource for a delayed audio signal to be aligned in time with a video signal within a degree of deviation that is acceptable to human perception. Naturally, it is even less possible in such a computer to decrease the deviation between such signals below what is noticeable by the human senses.