Most modern digital television broadcast systems, including so-called “On-Demand” television services, Internet Protocol Television (IPTV) and the like, use digital compression and transmission techniques to deliver audiovisual content to the end viewer.
In these digital audiovisual systems, the audio and video data are compressed by encoders (using common compression standards, such as MPEG-2, MPEG-4, H.264 and the like) to produce an individual compressed Elementary Stream for each data type, each of which is then packetized into a Packetized Elementary Stream (PES). Other data, such as subtitle data, may also be packetized into PES packets. Multiple PES packets are in turn combined into Transport or Program Streams, which also include other non-PES packets containing data such as Service Information. The Transport or Program Streams are then sent over a communication network (e.g. a digital TV broadcast system, or a network based IPTV system) for delivery to the end viewer.
The Elementary Stream is packetized by encapsulating sequential data bytes from the Elementary Stream output from an encoder inside PES packets, which include PES headers.
A typical method of transmitting Elementary Stream data from a video or audio encoder is to first create PES packets from the elementary stream data and then to encapsulate these PES packets inside Transport Stream (TS) packets or Program Stream (PS) packets. The TS/PS packets can then be multiplexed with Service Information and transmitted using standard broadcasting techniques, such as defined by Digital Video Broadcasting (DVB) and ATSC (Advanced Television Systems Committee).
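The encapsulation of a PES packet inside Transport Stream packets can be sketched as follows. This is only a minimal illustration in Python, not a complete multiplexer: the function name `packetize_pes` and its parameters are hypothetical, and the sketch assumes the standard 188-byte MPEG-2 TS packet with a 4-byte header, padding the final packet with adaptation field stuffing. PCR insertion, scrambling and Service Information are deliberately left out.

```python
from typing import List

TS_PACKET_SIZE = 188   # fixed MPEG-2 Transport Stream packet size in bytes
TS_HEADER_SIZE = 4     # sync byte + 3 bytes of flags, PID and counter
SYNC_BYTE = 0x47


def packetize_pes(pes: bytes, pid: int, first_cc: int = 0) -> List[bytes]:
    """Split one PES packet into 188-byte TS packets (hypothetical helper)."""
    if not 0 <= pid <= 0x1FFF:
        raise ValueError("PID must fit in 13 bits")
    max_payload = TS_PACKET_SIZE - TS_HEADER_SIZE  # 184 bytes
    packets = []
    cc = first_cc & 0x0F  # 4-bit continuity counter
    offset = 0
    while offset < len(pes):
        chunk = pes[offset:offset + max_payload]
        # payload_unit_start_indicator marks the packet carrying the PES header
        pusi = 1 if offset == 0 else 0
        stuffing = max_payload - len(chunk)
        # adaptation_field_control: 0b01 = payload only, 0b11 = adaptation + payload
        afc = 0b11 if stuffing else 0b01
        header = bytes([
            SYNC_BYTE,
            (pusi << 6) | (pid >> 8),
            pid & 0xFF,
            (afc << 4) | cc,
        ])
        if stuffing == 0:
            adaptation = b""
        elif stuffing == 1:
            # A lone adaptation_field_length byte of zero pads exactly one byte
            adaptation = bytes([0])
        else:
            # length byte, empty flags byte, then 0xFF stuffing bytes
            adaptation = bytes([stuffing - 1, 0x00]) + b"\xff" * (stuffing - 2)
        packets.append(header + adaptation + chunk)
        offset += len(chunk)
        cc = (cc + 1) & 0x0F
    return packets
```

For example, a 200-byte PES packet yields two TS packets: the first carries 184 payload bytes, and the second carries the remaining 16 bytes behind a stuffed adaptation field.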
The Packetized Elementary Streams comprise Access Units, each Access Unit containing a small encoded portion of the video, audio or other type of data. Access Units are similar in many ways to Internet Protocol (IP) packets, as used in computer networks like the Internet, in that they are packetized data including header data that describes and controls how downstream equipment handles the respective payload data.
The end viewer uses a receiver, including decoding apparatus, to receive and decode the audiovisual data from the received Transport Streams, for playback on a TV or other viewing apparatus. The receiver and decoding apparatus may take the form of an integrated digital TV, a digital TV set top box, or a PC with the necessary decoding apparatus connected (e.g. a USB digital TV dongle).
Since the video, audio and any other associated data are sent over separate Packetized Elementary Streams, there are mechanisms in place to align the data, so that it is played back to the end viewer in synchronisation. This is important because the audio, video and other data are often meant to be output in synchronisation with one another, to maintain lip sync, subtitle sync, etc.
Typically, to ensure synchronisation across the transmission system, e.g. between an encoder's clock reference and the local clock reference in the decoder (or other downstream equipment, such as a re-multiplexer), the individual Packetized Elementary Streams within a single Transport Stream are synchronised using a central Program Clock Reference (PCR). This is achieved by periodically sending a PCR timestamp, based upon the encoder's local clock reference, in the Transport Stream output from the encoder, so that the downstream equipment's local clock reference can be updated with the requisite timing data from the encoder clock reference.
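To illustrate how downstream equipment might recover the PCR timestamp, the following Python sketch reads it from the adaptation field of a single TS packet. The function name `extract_pcr` is hypothetical. The sketch assumes the MPEG-2 Systems layout, in which the PCR is carried as a 33-bit base counting at 90 kHz plus a 9-bit extension counting at 27 MHz, and it performs no error handling beyond basic sanity checks.

```python
from typing import Optional

TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47
PCR_CLOCK_HZ = 27_000_000  # the full PCR value counts 27 MHz ticks


def extract_pcr(packet: bytes) -> Optional[int]:
    """Return the PCR in 27 MHz ticks, or None if the packet carries no PCR."""
    if len(packet) != TS_PACKET_SIZE or packet[0] != SYNC_BYTE:
        raise ValueError("not a 188-byte TS packet")
    # The upper bit of adaptation_field_control signals an adaptation field
    if not (packet[3] >> 4) & 0b10:
        return None
    adaptation_field_length = packet[4]
    # One flags byte plus six PCR bytes are needed
    if adaptation_field_length < 7:
        return None
    if not packet[5] & 0x10:  # PCR_flag
        return None
    b = packet[6:12]
    # 33-bit base, 6 reserved bits, 9-bit extension
    base = (b[0] << 25) | (b[1] << 17) | (b[2] << 9) | (b[3] << 1) | (b[4] >> 7)
    extension = ((b[4] & 0x01) << 8) | b[5]
    return base * 300 + extension
```

Dividing the returned value by `PCR_CLOCK_HZ` gives the encoder clock time in seconds, which a receiver could use to steer its own local clock reference.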
A Transport Stream is typically formed at the output of a multiplexer, which aggregates a number of Packetized Elementary Streams output from multiple encoders (or output from a memory store, having previously been encoded). If the encoders are local to one another, they should be locally synchronised to a single clock.
A System Clock Reference (SCR) is also typically provided. This is a time stamp carried within a Program Stream, as opposed to the Program Clock Reference (PCR), which appears in the aggregated Transport Stream that may contain multiple programs. In most common cases, the SCR values and PCR values function identically. However, in the MPEG-2 standard, the maximum allowed interval between SCRs is 700 ms, while the maximum allowed interval between PCRs is 100 ms. Both Program Streams and Transport Streams use the Presentation Time Stamp (PTS) and Decode Time Stamp (DTS) for Access Unit decoding and presentation.
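The interval limits quoted above lend themselves to a simple check. The Python sketch below is illustrative only: the function name `late_intervals` is hypothetical, and it assumes the clock references have already been extracted as integer tick counts on a 27 MHz clock (the PCR case), with the counter wrapping after 2^33 × 300 ticks. It compares the encoder-side time between successive references, not their arrival times at a receiver.

```python
from typing import List, Sequence, Tuple

PCR_CLOCK_HZ = 27_000_000
PCR_WRAP = (1 << 33) * 300       # PCR wraps after 2**33 base ticks
MAX_PCR_INTERVAL_S = 0.100       # limit quoted for PCRs
MAX_SCR_INTERVAL_S = 0.700       # limit quoted for SCRs


def late_intervals(
    timestamps: Sequence[int],
    limit_s: float,
    clock_hz: int = PCR_CLOCK_HZ,
    wrap: int = PCR_WRAP,
) -> List[Tuple[int, float]]:
    """Return (index, gap in seconds) for each gap that exceeds limit_s."""
    late = []
    for i in range(1, len(timestamps)):
        # Modulo arithmetic keeps the gap positive across a counter wrap
        delta_ticks = (timestamps[i] - timestamps[i - 1]) % wrap
        gap_s = delta_ticks / clock_hz
        if gap_s > limit_s:
            late.append((i, gap_s))
    return late
```

For instance, `late_intervals([0, 1_080_000, 4_050_000], MAX_PCR_INTERVAL_S)` flags only the second gap, which spans 110 ms of encoder time.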
The Presentation Time Stamp indicates the instant at which an Access Unit should be removed from the receiver buffer of a decoder, instantaneously decoded, and then presented for display.
The Decode Time Stamp indicates the time at which an Access Unit should be instantaneously removed from the receiver buffer and decoded. It differs from the Presentation Time Stamp only when picture reordering is used for B pictures. B pictures are encoded pictures which take input from other pictures in the sequence, either before or after the current picture. B pictures provide the greatest compression, but require a buffer to work, as data from before and after the point being decoded (or encoded) is required. If DTSs are used, PTSs must also be provided in the bit stream.
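As an illustration of how the two time stamps are carried, the Python sketch below reads the PTS and DTS from the start of a PES packet. The function names are hypothetical. The sketch assumes the MPEG-2 PES header layout used for audio and video streams, in which each 33-bit time stamp (counting 90 kHz ticks) is spread over five bytes with interleaved marker bits, and it treats a missing DTS as equal to the PTS. The small encoder is included only so that the example is self-contained.

```python
from typing import Optional, Tuple


def encode_timestamp(prefix: int, ts: int) -> bytes:
    """Pack a 33-bit time stamp into 5 bytes with marker bits (for the demo)."""
    return bytes([
        (prefix << 4) | (((ts >> 30) & 0x07) << 1) | 1,
        (ts >> 22) & 0xFF,
        (((ts >> 15) & 0x7F) << 1) | 1,
        (ts >> 7) & 0xFF,
        ((ts & 0x7F) << 1) | 1,
    ])


def decode_timestamp(b: bytes) -> int:
    """Unpack a 5-byte PTS or DTS field into a 33-bit tick count."""
    return (
        (((b[0] >> 1) & 0x07) << 30)
        | (b[1] << 22)
        | ((b[2] >> 1) << 15)
        | (b[3] << 7)
        | (b[4] >> 1)
    )


def parse_pts_dts(pes: bytes) -> Tuple[Optional[int], Optional[int]]:
    """Return (PTS, DTS) in 90 kHz ticks, or (None, None) if neither is present."""
    if pes[0:3] != b"\x00\x00\x01":
        raise ValueError("missing PES start code prefix")
    pts_dts_flags = pes[7] >> 6
    if pts_dts_flags == 0b10:
        # PTS only: no reordering, so decode time equals presentation time
        pts = decode_timestamp(pes[9:14])
        return pts, pts
    if pts_dts_flags == 0b11:
        return decode_timestamp(pes[9:14]), decode_timestamp(pes[14:19])
    return None, None


# Demo header: video stream_id 0xE0, both PTS and DTS present
demo_header = (
    b"\x00\x00\x01\xe0\x00\x00\x80\xc0\x0a"
    + encode_timestamp(0b0011, 183_600)
    + encode_timestamp(0b0001, 180_000)
)
demo_pts, demo_dts = parse_pts_dts(demo_header)
```

In the demo header the PTS is 3600 ticks (40 ms at 90 kHz) later than the DTS, as would be the case for a reference picture that must be decoded before a B picture that is displayed ahead of it.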
However, these synchronisation mechanisms are not infallible; hence delays can occur between associated Elementary Streams. The delays can have a multitude of causes: for example, the transmission equipment may be set up incorrectly, be badly implemented (so that it does not synchronise with equipment from other manufacturers), or simply break down with use. Hence, there is a need to measure, in real time, the audiovisual (AV) delays that occur in working digital audiovisual transmission systems, where the source of a delay may reside anywhere in the system, including the encoders, the communications medium and the decoders.
Previous solutions for measuring AV delay have mainly focused on the uncompressed domain; that is, on measurements spanning from before the audio and video streams are encoded to after they have been decoded at the viewer's end. These techniques have generally required a watermark to be added to the video stream, to which the audio may then be compared, i.e. they require special test video and audio streams with which to make a measurement.
These prior efforts all have one thing in common, in that they attempt to measure the absolute delay between an audio and visual stream when compared with perfect timing synchronisation.
Furthermore, in some cases, multiple Transport Streams are re-multiplexed downstream, so that Packetized Elementary Streams from one Transport Stream may be recombined with other Packetized Elementary Streams from another Transport Stream, perhaps for local content insertion purposes. The donor Transport Stream may contain PCR, PTS and DTS timing errors resulting from poor encoder output that may then be manifested in the new, resultant Transport Stream.