The Real-time Transport Protocol (RTP) is a well-known standard for transmitting real-time media data such as audio or video streams. While it does not guarantee real-time delivery of data, RTP does provide mechanisms for synchronizing multiple source media streams at a single destination, i.e., a single receiver or endpoint device. These mechanisms allow an endpoint, for example, to play out received audio and video streams synchronously using media rendering devices (e.g., an audio speaker and a video monitor). To facilitate the synchronous playout of multiple streams at a given destination, RTP packets typically contain RTP timestamps, each of which identifies the time at which the payload of the RTP packet was sampled, in units of the sampling clock frequency. The RTP timestamps of different streams, however, are not directly related to one another. In order to relate the RTP timebases of different streams, the sender periodically issues RTP Control Protocol (RTCP) packets, which contain information that maps the RTP timebase of each stream into a common reference or “wall clock” timebase, using the timestamp format of the Network Time Protocol (NTP). The sender uses the same reference timebase for each stream sent to each receiver. The receiver uses this RTCP information to determine the relative mapping between multiple streams arriving from the same sender, so that the audio and video streams are played out at the rendering devices with the proper relative timing and thus remain synchronized.
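By way of illustration only, the mapping from a stream's RTP timebase to the sender's reference timebase may be sketched as follows. The names `SenderReport`, `ntp_to_seconds`, and `rtp_to_wallclock` are hypothetical and do not come from any particular RTP library; the sketch assumes that the receiver has already parsed, from an RTCP packet, one NTP timestamp and the RTP timestamp corresponding to the same instant, and that it knows the sampling clock rate of the stream.

```python
from dataclasses import dataclass


def ntp_to_seconds(ntp_msw: int, ntp_lsw: int) -> float:
    """Convert a 64-bit NTP timestamp (32-bit seconds, 32-bit fraction) to seconds."""
    return ntp_msw + ntp_lsw / 2**32


@dataclass
class SenderReport:
    """Hypothetical container for the timing fields of one RTCP report."""
    ntp_seconds: float   # sender wall-clock time, in seconds
    rtp_timestamp: int   # RTP timestamp corresponding to the same instant
    clock_rate: int      # sampling clock frequency of the stream, in Hz


def rtp_to_wallclock(sr: SenderReport, rtp_timestamp: int) -> float:
    """Map an RTP timestamp of a stream into the sender's wall-clock timebase."""
    # RTP timestamps are 32-bit and wrap around, so take a signed 32-bit difference.
    diff = (rtp_timestamp - sr.rtp_timestamp) & 0xFFFFFFFF
    if diff >= 0x80000000:
        diff -= 0x100000000
    return sr.ntp_seconds + diff / sr.clock_rate


# Assumed example values: an 8 kHz audio stream and a 90 kHz video stream.
audio_sr = SenderReport(ntp_seconds=1000.0, rtp_timestamp=160000, clock_rate=8000)
video_sr = SenderReport(ntp_seconds=1000.5, rtp_timestamp=900000, clock_rate=90000)

audio_time = rtp_to_wallclock(audio_sr, 168000)
video_time = rtp_to_wallclock(video_sr, 945000)
```

In this assumed example both packets map to the same wall-clock instant, so a receiver would render them together even though their RTP timestamps share no common origin or clock rate.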
While a receiver normally uses the sender's NTP timebase to establish the relative timing relationship between audio and video streams, it cannot establish the absolute real time at which the streams should be played out at the rendering devices. As a result, synchronicity is problematic when multiple receivers attempt to play a single source RTP stream, because the end-to-end delay (from the sender's sampling of a media input to the output of the receiver's rendering device) is different for each receiver. By way of example, variations in the delays may result from differences in the average input jitter buffer depth, differences in the decoding delay, and variations in the rendering delays among the different receivers.
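The effect of these receiver-specific delays may be sketched, purely by way of example, as follows. The class `ReceiverDelays`, the function `playout_skew_ms`, and all delay values are hypothetical; the sketch models only the three delay components named above and assumes that they simply add.

```python
from dataclasses import dataclass


@dataclass
class ReceiverDelays:
    """Hypothetical per-receiver delay components, in milliseconds."""
    jitter_buffer_ms: int  # average input jitter buffer depth
    decode_ms: int         # decoding delay
    render_ms: int         # rendering delay

    def total_ms(self) -> int:
        # Assumed model: the receiver-side delay is the sum of the components.
        return self.jitter_buffer_ms + self.decode_ms + self.render_ms


def playout_skew_ms(a: ReceiverDelays, b: ReceiverDelays) -> int:
    """Return how much later receiver a renders the same sample than receiver b."""
    return a.total_ms() - b.total_ms()


# Assumed example values for two receivers playing the same source stream.
receiver_a = ReceiverDelays(jitter_buffer_ms=60, decode_ms=10, render_ms=20)
receiver_b = ReceiverDelays(jitter_buffer_ms=40, decode_ms=5, render_ms=15)

skew = playout_skew_ms(receiver_a, receiver_b)
```

With these assumed values, receiver A renders each sample 30 ms after receiver B, even though both receivers apply the same RTP-to-NTP mapping supplied by the sender.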