1. Field of the Invention
Embodiments of the invention relate generally to the field of digital networking communications. More particularly, an embodiment of the invention relates to methods and systems for packet switched networking that include synchronization of audio and video streams.
2. Discussion of the Related Art
The availability of a ubiquitous Internet has enabled several forms of communication between end-users. These include non-real-time messaging such as e-mail, quasi-real-time applications such as “Instant Messaging” or “chat”, and real-time communications such as speech (generally called Voice-over-IP or VoIP) and video (Video-over-IP). Video is almost always accompanied by speech, so that the two together form an audio-visual call.
One of the issues associated with communication over a packet-switched network, such as the Internet, is the notion of variable delay. That is, each packet delivered from source to destination could experience a different delay. This packet-delay variation (PDV) is a major contributor to the reduction of end-user Quality of Experience (QoE). For example, the variability of delay in speech packets requires the deployment of a jitter buffer to ensure that the play-out mechanism is not starved of packets. The inclusion of a jitter buffer implies an increase in delay. For speech communication, an increase in delay has a detrimental effect on the QoE as perceived by the (human) end-user. An increase in one-way delay also implies an increase in round-trip delay, which is detrimental because a given level of echo becomes more annoying as round-trip delay increases.
For the video portion of the call, the one-way delay and round-trip delay are important, but not as important as for the speech portion. However, what is extremely important in an audio-video call is the notion of “lip-synch”. Simply put, the audio must match the video reasonably well in a temporal manner. The term “lip-synch” is derived from the observation that in a typical human-to-human call the movement of the lips should correspond to the sound. Human beings are capable of envisioning the sound from a visual rendition of lip movement and are therefore able to distinguish whether the audio and video are aligned.
FIG. 1 provides a view of the key elements of an end-user deployment. For convenience the configuration shows an external device that provides connection to the Internet. This could be an xDSL (Digital Subscriber Line) device, a Cable Modem, or a wireless router (connected to an xDSL modem or Cable Modem). From the viewpoint of the invention described here, the specific method of Internet access is not material, and the invention is appropriate for all Internet access schemes that can support the bandwidth required for operation of Voice-over-IP and Video-over-IP and is therefore available as prior art.
The fundamental notion of audio-video synchronization, or lack thereof, can be explained with reference to FIG. 2, below.
The source of video information is a camera (“C”) or equivalent. This device generates the bits associated with the pixels that comprise each frame of video. The destination, or “sink” of video information is generally a display screen (“S”). The source of audio content is a microphone (“m”) or equivalent. This device generates the bits associated with the samples of the audio signal. The destination of audio content is generally a speaker (“s”).
Between the source and sink there are numerous stages of processing and transmission that add delay. For example, the video content coming from the camera is generally buffered (“B1”) to allow for the code executing the video signal processing to run somewhat independently of the camera speed. The buffer adds delay as does the signal processing itself (“SP-V”). The processed video signal is then packetized and in order to allow the packetization code (“P-V”) to run somewhat independently of the signal-processing, a buffering arrangement is used (“B2”). The video packets are launched into the network and experience a variable transit delay (“IP-V”) across the network. This necessitates the introduction of a jitter buffer (“B3”) that adds delay as well. The packets containing the video information are then processed (“D-V”) where the video information is extracted from the packet and decoded appropriately. This depacketization and signal processing adds delay. Since the processing speed may be different from the actual screen update speed, the video data is stored in a buffer (“B4”) from which the screen driver extracts the information to drive the actual display (screen, “S”).
A similar chain of events occurs for the audio signal. The signal from the microphone (“m”) is buffered (“b1”), processed (“SP-A”), buffered (“b2”), packetized for delivery across the network (“P-A”), delayed by the network (“IP-A”), passed through a jitter buffer (“b3”), processed (“D-A”) and delivered through a buffer (“b4”) to the driver code that delivers the signal to the speakers (“s”).
The total delay experienced by the video and audio signals in their path between source and sink can be written as:
TV = TB1 + TSPV + TB2 + TPV + TIPV + TB3 + TDV + TB4 (end-to-end video path)  (Eq. 2.1A)
TA = Tb1 + TSPA + Tb2 + TPA + TIPA + Tb3 + TDA + Tb4 (end-to-end audio path)  (Eq. 2.1B)
If TV≠TA then there is an absence of “lip-synch”: the sound (audio) and picture (video) are not in alignment. In Eq. 2.1A and Eq. 2.1B, TV is the end-to-end video delay and TA is the end-to-end audio delay. The other terms are defined below with a brief explanation as to their significance:
TB1: The buffering delay associated with the drivers that take video information from the camera and present it to the video-signal-processing block. When the camera sampling rate is synchronized to the clock rate associated with the signal processing, then the buffer delay is a constant.
TSPV: The delay associated with the signal processing. Very often this is subsumed in the buffering operation (B2).
TB2: The buffering delay associated with the transfer of processed (e.g. compressed or encoded) video signal information to the packetization block. This is generally a constant.
TPV: The video information is formatted into packets for delivery into the IP network. The delay could be variable if the packet launching is done “on demand” when the packet is ready, or it could be a constant if it is known that the packet delivery will be done at a constant rate. In cases where the bit-rate of the encoded video is “constant” (the constant bit-rate or CBR mode) and the packet size is predetermined as well, then the delay in this block can be calibrated. If variable bit-rate (the VBR mode) encoding methods are employed then the delay in this block is also variable.
TIPV: This delay includes all the delays associated with transmission of packets carrying the video information across a packet network. There are numerous contributors to this delay. At the source there is a variable delay based on the packet interface and the presence of other packets (of different services and applications) also contending for transmission bandwidth. At the receiver a similar situation could arise where incoming packets are held in receive buffers until they can be processed. This delay is also variable. Such pairs of transmit and receive delays are present in each intermediate device (e.g. switch or router) between the originating and terminating points, adding to the delay. The physical transmission of the signals between intermediate devices also introduces transmission delay. For a given route through the network this transmission delay will normally be fixed. If the route through the network is allowed to change then even this delay is variable. The delay can be viewed as the sum of a constant (fixed) part, TFV, and a variable part, and we consider the maximum of the variable part as TVV. Packets whose variable delay exceeds this maximum value are discarded as having arrived too late to be useful.
TB3: To address the variable transit delay through the packet network (TIPV, above), a jitter buffer arrangement is used. The intent of this arrangement is to make the combination of jitter buffer and packet network appear as a constant delay. Arriving packets are placed in a first-in-first-out (FIFO) buffer and then read out by the signal processing block. The nominal separation of read-address and write-address is half the buffer size. That way, the effective delay of network and jitter buffer combination is (nominally) constant at (TFV+TVV).
TDV: This comprises the delay introduced in the extraction of video information from received packets as well as the time involved in the signal processing associated with the decoding of the video. This delay is usually known and can be calibrated.
TB4: The computations done to construct the video screen signal can be asynchronous to the presentation device and therefore there is the need for a buffer. TB4 represents the associated delay.
Tb1: The buffering delay associated with the drivers that take audio information from the analog-to-digital converter (ADC) (that converts the analog signal from the microphone to digital format) and present it to the audio-signal-processing block. When the ADC sampling rate is synchronized to the clock rate associated with the signal processing, then the buffer delay is a constant.
TSPA: The delay associated with the signal processing. Very often this is subsumed in the buffering operation (b2).
Tb2: The buffering delay associated with the transfer of processed (e.g. compressed or encoded) audio signal information to the packetization block. This is generally a constant.
TPA: The audio information is formatted into packets for delivery into the IP network. The delay could be variable if the packet launching is done “on demand” when the packet is ready, or it could be a constant if it is known that the packet delivery will be done at a constant rate. In cases where the bit-rate of the encoded audio is “constant” (the constant bit-rate or CBR mode) and the packet size is predetermined as well, then the delay in this block can be calibrated. If variable bit-rate (the VBR mode) encoding methods are employed then the delay in this block is also variable.
TIPA: This delay includes all the delays associated with transmission of packets carrying the audio information across a packet network. (See the explanation of TIPV.) The delay can be viewed as the sum of a constant (fixed) part, TFA, and a variable part, and we consider the maximum of the variable part as TVA. Packets whose variable delay exceeds this maximum value are discarded as having arrived too late to be useful.
Tb3: The jitter buffer arrangement for audio that is akin to the jitter buffer arrangement for video (see TB3). The effective delay of network and jitter buffer combination is (nominally) constant at (TFA+TVA).
TDA: This comprises the delay introduced in the extraction of audio information from received packets as well as the time involved in the signal processing associated with the decoding of the audio. This delay is usually known and can be calibrated.
Tb4: The computations done to construct the audio signal can be asynchronous to the digital-to-analog (DAC) converter device that provides the analog signal to drive the speakers. Therefore there is the need for a buffer. Tb4 represents the associated delay.
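The jitter-buffer behavior described under TB3 and Tb3 can be illustrated with a short sketch. It is a minimal illustration under stated assumptions, not an implementation taken from this disclosure: the function name `playout_schedule` and all delay values are hypothetical. Each packet is scheduled for play-out at its send time plus the fixed part plus the maximum variable part of the network delay, and a packet that arrives after its scheduled play-out time is discarded.

```python
from typing import List, Optional, Tuple


def playout_schedule(
    packets: List[Tuple[float, float]],
    fixed_delay_ms: float,
    max_variable_delay_ms: float,
) -> List[Optional[float]]:
    """Return each packet's play-out time in ms, or None if it is discarded.

    packets holds (send_time_ms, network_delay_ms) pairs (hypothetical inputs).
    fixed_delay_ms plays the role of TFV (or TFA) and max_variable_delay_ms
    the role of TVV (or TVA) in the text.
    """
    total_ms = fixed_delay_ms + max_variable_delay_ms
    schedule: List[Optional[float]] = []
    for send_ms, network_delay_ms in packets:
        arrival_ms = send_ms + network_delay_ms
        playout_ms = send_ms + total_ms  # constant offset from send time
        # A packet that arrives after its play-out instant is too late to use.
        schedule.append(playout_ms if arrival_ms <= playout_ms else None)
    return schedule


# Illustrative values only: 20 ms fixed part, 30 ms maximum variable part.
example = playout_schedule([(0.0, 25.0), (20.0, 48.0), (40.0, 75.0)], 20.0, 30.0)
```

In this example the first two packets are played out 50 ms after they were sent, regardless of their individual network delays, and the third is discarded because its delay exceeds the 50 ms budget.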
Due to the nature of the human visual and auditory systems, a slight inequality can be tolerated. That is, if the difference between TV and TA is less than D ms (milliseconds), then the lack of alignment is not perceptible. It is well established that D is of the order of 40 ms.
The problem statement: For proper alignment between audio and video, the end-to-end path delay for both audio and video must be the same (within about 40 ms). Lack of alignment is referred to as loss of “lip-synch” and results in a severe degradation of end-user Quality of Experience (QoE) since it is very annoying.
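The problem statement can be expressed numerically as follows. The sketch sums the per-stage delays of Eq. 2.1A and Eq. 2.1B and compares the difference against the tolerance D. The stage values are invented for illustration, and the function names are hypothetical; only the structure of the sums and the 40 ms figure come from the text.

```python
from typing import Dict

LIP_SYNCH_TOLERANCE_MS = 40.0  # D in the text, "of the order of 40 ms"


def end_to_end_delay(stage_delays_ms: Dict[str, float]) -> float:
    """Sum the per-stage delays of one path (Eq. 2.1A or Eq. 2.1B)."""
    return sum(stage_delays_ms.values())


def in_lip_synch(
    video_ms: Dict[str, float],
    audio_ms: Dict[str, float],
    tolerance_ms: float = LIP_SYNCH_TOLERANCE_MS,
) -> bool:
    """True when |TV - TA| is within the tolerance."""
    skew_ms = end_to_end_delay(video_ms) - end_to_end_delay(audio_ms)
    return abs(skew_ms) <= tolerance_ms


# Illustrative (made-up) stage delays in milliseconds.
video = {"TB1": 10.0, "TSPV": 30.0, "TB2": 5.0, "TPV": 5.0,
         "TIPV": 60.0, "TB3": 30.0, "TDV": 20.0, "TB4": 15.0}
audio = {"Tb1": 5.0, "TSPA": 10.0, "Tb2": 5.0, "TPA": 20.0,
         "TIPA": 60.0, "Tb3": 20.0, "TDA": 5.0, "Tb4": 5.0}

tv = end_to_end_delay(video)          # 175.0 ms
ta = end_to_end_delay(audio)          # 130.0 ms
aligned = in_lip_synch(video, audio)  # False: the 45 ms skew exceeds 40 ms
```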
The general approaches to “lip-synch” that have been proposed in the industry (the prior art) are briefly described here and some of the reasons why they are not robust are explained.
One approach suggested is based on what is called the “Real-time Transport Protocol” (“RTP”). The term “RTP” is often considered a misnomer because the protocol does not always serve the purpose that such a name would suggest. The intent of RTP is to provide a timing reference along with the information. That is, in every RTP packet, there is a 32-bit field available for a time-stamp and a 32-bit field available to identify the synchronization source (“SSRC”). The time-stamp is used to indicate the progression of time according to the clock of the synchronization source. The difference in time-stamps between two packets provides an indication of the elapsed time according to the source clock. Often the time interval unit is chosen as the sampling interval associated with the sampling of the information signal (audio or video), and the time-stamp difference between two consecutive packets will then represent the number of signal samples used to generate the packet.
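The relationship between time-stamp difference and elapsed source time can be sketched as below. This is an illustrative sketch only: the function name `elapsed_seconds` is hypothetical, and the 8000 Hz clock rate with 160 samples per packet is an assumed example, not a value given in the text. The subtraction is done modulo 2^32 because the time-stamp field is 32 bits wide, and the sketch assumes the second packet is the later one and that less than one full wrap separates the two.

```python
RTP_TIMESTAMP_MODULUS = 1 << 32  # the time-stamp field is 32 bits wide


def elapsed_seconds(ts_first: int, ts_second: int, clock_rate_hz: int) -> float:
    """Elapsed source-clock time between two packets, from their time-stamps.

    Assumes ts_second belongs to the later packet and that fewer than 2**32
    ticks separate the two, so a single wrap of the counter is handled.
    """
    ticks = (ts_second - ts_first) % RTP_TIMESTAMP_MODULUS
    return ticks / clock_rate_hz


# Assumed example: audio sampled at 8000 Hz, 160 samples carried per packet.
gap = elapsed_seconds(1000, 1160, 8000)

# The same 160-tick gap when the 32-bit counter wraps between the packets.
gap_across_wrap = elapsed_seconds((1 << 32) - 80, 80, 8000)
```

Both calls yield a 160-tick difference, which at the assumed 8000 Hz clock corresponds to 20 ms of source time.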
The approach is depicted in FIG. 3, below. Essentially, if the video information and audio information are delivered in separate RTP streams, then there will be the notion of a “Video clock” that provides the timing to control the conversion of video into digital format in the camera by providing a reference to control the sampling frequency of the analog-to-digital conversion process in the camera. The same clock is used to generate time-stamps that are inserted into the RTP packets of the video stream in the unit labeled “P-V”. Likewise, there will be the notion of an “Audio clock” that provides the timing to control the conversion of audio into digital format in the microphone by providing a reference to control the sampling frequency of the analog-to-digital conversion process in the microphone. The same clock is then used to generate time-stamps that are inserted into the RTP packets of the audio stream in the unit labeled “P-A”.
The primary use for such RTP time-stamps is to establish a suitable timing-base (frequency) to control the play-back. This is shown in FIG. 3. The block labeled V-CR recovers the video sampling frequency from the time-stamps in the video RTP stream and can provide this to the playback unit labeled “S”. Likewise, the block labeled A-CR recovers the sampling frequency from the time-stamps in the audio RTP stream and can provide this to the playback unit labeled “s”. Using the proper recovered clock (frequency) at the playback unit (“S” and “s”) is absolutely necessary for good audio/video reproduction, but note that this approach does not solve the lip-synch problem since the delays in the two paths are not addressed.
The RTP time-stamps have secondary uses as well. For example, the variation in transit delay causes packets to arrive at times that differ from what would be expected from the embedded time-stamps. This difference is a measure of transit delay variation (also known as packet delay variation).
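A sketch of this secondary use follows. For each pair of consecutive packets it compares the spacing of the arrival times with the spacing implied by the time-stamps. The smoothing step with gain 1/16 is the interarrival jitter estimator defined in the RTP specification (RFC 3550); it is not described in the text above and is included only as one common way to summarize the differences. The function names, clock rate and packet values are assumptions for illustration.

```python
from typing import List, Tuple


def transit_differences(
    packets: List[Tuple[int, float]], clock_rate_hz: int
) -> List[float]:
    """Per-pair delay variation in seconds.

    packets holds (rtp_timestamp, arrival_time_s) pairs in arrival order.
    Each result is (arrival spacing) - (spacing implied by the time-stamps).
    """
    diffs: List[float] = []
    for (ts_prev, arr_prev), (ts_cur, arr_cur) in zip(packets, packets[1:]):
        expected_s = ((ts_cur - ts_prev) % (1 << 32)) / clock_rate_hz
        diffs.append((arr_cur - arr_prev) - expected_s)
    return diffs


def smoothed_jitter(diffs: List[float]) -> float:
    """Running estimate J += (|D| - J) / 16, as in RFC 3550."""
    jitter = 0.0
    for d in diffs:
        jitter += (abs(d) - jitter) / 16.0
    return jitter


# Assumed example: 8000 Hz clock, packets nominally 20 ms (160 ticks) apart.
received = [(0, 0.100), (160, 0.125), (320, 0.140)]
example_diffs = transit_differences(received, 8000)  # about +5 ms, then -5 ms
example_jitter = smoothed_jitter(example_diffs)
```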
The approach used in MPEG video transmission is to generate what is called a “transport stream”. As described in the relevant MPEG standard, the information is encapsulated in “MPEG frames” comprising 188 bytes. The format of these frames, the fields present therein, and the interpretation of these fields, and the manner of concatenating information over multiple MPEG frames, and other aspects of MPEG Transport Stream (abbreviated as MPEG-TS) generation are well described in the standard. Here we provide just the principle underlying the synchronization aspects of MPEG. The principle as described is also applicable to RTP but in that case there would be technical noncompliance to the published standard.
The key to the operation of MPEG-TS is the merging of information related to audio and video into a single stream in a process that is referred to in the art as multiplexing. Consequently the MPEG frames associated with video as well as audio are placed together in IP packets, typically following the RTP format. The implication of this combination is that the delay of the IP network, generally the most significant component of the end-to-end delay, is the same for both audio and video streams. This addresses a significant portion of the cause for lack of lip-synch.
Just for reference, the time-stamps included in the MPEG stream are used for frequency synchronization as well as phase synchronization. The recovered system clock is nominally equal to the send-side system clock except for a constant delay. This constant delay is of no consequence relative to lip-synch since both the audio and video will be delayed by the same amount. The key time-stamps employed are:
a. System Clock Time-Stamp (“STC”); Program Clock Reference (“PCR”). These are used to synchronize the receive-side system time clock with the sender side system time clock. This is necessary so that all other time-stamps are valid.
b. Decode Time-Stamps (“DTS”) and Presentation Time-Stamps (“PTS”). These are required for the audio as well as for the video streams. These time-stamps allow the receive side to apply the signal processing at the appropriate juncture and align the decoded signals for delivery to the appropriate output device (screen for video and speakers for audio).
Note that the DTS and PTS time-stamps are key to aligning the audio and video streams. That is, DTS and PTS for the audio and video streams are the current state of the art solutions for achieving lip-synch. The DTS and PTS time-stamps generated by the send-side are continually compared with the recovered system time clock. When there is agreement, the decoding or presentation of the block of data associated with the time-stamp is initiated. This approach is valid only if the recovered system time clock is locked to the send-side system time clock (up to a constant delay) and the STC and PCR are the mechanism for achieving this condition.
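The comparison of time-stamps against the recovered system time clock can be sketched as a small scheduling queue. This is a hedged illustration rather than a decoder design from this disclosure or from the MPEG standard: the class name `PresentationQueue`, the tick values and the unit labels are hypothetical, time-stamps are treated as plain integers in arbitrary clock ticks, and counter wrap-around is ignored. A unit is released when the recovered clock reaches its time-stamp; the same mechanism would apply to decode time-stamps.

```python
import heapq
from typing import List, Tuple


class PresentationQueue:
    """Holds decoded units until the recovered clock reaches their time-stamp."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._count = 0  # insertion counter keeps equal time-stamps in order

    def push(self, pts: int, unit: str) -> None:
        heapq.heappush(self._heap, (pts, self._count, unit))
        self._count += 1

    def due(self, recovered_clock: int) -> List[str]:
        """Release every unit whose time-stamp is at or before the clock."""
        ready: List[str] = []
        while self._heap and self._heap[0][0] <= recovered_clock:
            ready.append(heapq.heappop(self._heap)[2])
        return ready


queue = PresentationQueue()
queue.push(9000, "video frame 1")
queue.push(9000, "audio block 1")
queue.push(12600, "video frame 2")

early = queue.due(8999)     # nothing is due yet
on_time = queue.due(9000)   # the matching audio and video are released together
later = queue.due(12600)
```

Because the audio and video units carry the same time-stamp, they are released at the same clock instant, which is the alignment property the text attributes to the PTS. The result is only as good as the lock between the recovered clock and the send-side clock.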
The multiplexing scheme described in MPEG permits the delivery of audio and video and preserves the time alignment of the two streams from the point of multiplexing through to the receiver. However, what it does NOT do is account for differences in delay between the end-point source and the point of multiplexing. With reference to FIG. 2, a delay differential of ΔT=(TB1+TSPV+TB2)−(Tb1+TSPA+Tb2) can remain. If ΔT is substantial, then lip-synch problems can be experienced.
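A brief numeric illustration of the residual differential follows. The stage delays are invented for illustration and the function name is hypothetical; only the form of ΔT comes from the text. It shows that ΔT can exceed the tolerance of about 40 ms even when everything after the point of multiplexing is perfectly aligned.

```python
from typing import Tuple

Stages = Tuple[float, float, float]  # (buffer 1, signal processing, buffer 2) in ms


def pre_mux_differential_ms(video: Stages, audio: Stages) -> float:
    """Delta T = (TB1 + TSPV + TB2) - (Tb1 + TSPA + Tb2)."""
    return sum(video) - sum(audio)


# Made-up values: video encoding is assumed much slower than audio encoding.
delta_t = pre_mux_differential_ms((10.0, 60.0, 10.0), (5.0, 10.0, 5.0))
exceeds_tolerance = abs(delta_t) > 40.0
```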