Digital Video Broadcasting-Handheld (DVB-H) is a standard for providing television broadcasts and other video and audio streams to mobile or handheld devices.
In DVB-H, time slicing is used, which means that different services (i.e. different TV channels) are transmitted in respective “slices” of time or bursts. FIG. 1 shows an exemplary DVB-H transmission structure. In this example, the DVB-H Transport Stream 2 is transmitted at 2 Mbps and contains four different services, each with an average bit rate of 500 kbps. As a result of the time slicing, each service is transmitted at the maximum bit rate of 2 Mbps for a quarter of the time. Therefore, a receiving device using a single service can deactivate the DVB-H receiver for 75% of the time. Thus, time slicing is used in DVB-H to reduce the power consumption in receiving devices.
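The arithmetic behind the 75% figure can be sketched as follows. This is an idealized illustration only: the function name is hypothetical, and the model assumes that the receiver can be switched on and off instantly, ignoring any wake-up or resynchronization time before a burst.

```python
def receiver_on_fraction(service_kbps: float, stream_kbps: float) -> float:
    """Fraction of time a receiver must be active to receive one service.

    Idealized model (an assumption of this sketch): a service with average
    rate `service_kbps` is sent in bursts at the full transport stream rate
    `stream_kbps`, and the receiver has no wake-up overhead.
    """
    if not 0 < service_kbps <= stream_kbps:
        raise ValueError("service rate must be positive and not exceed the stream rate")
    return service_kbps / stream_kbps


# Values from the FIG. 1 example: a 500 kbps service in a 2 Mbps stream.
on_fraction = receiver_on_fraction(500, 2000)
off_fraction = 1 - on_fraction

print(f"receiver active:   {on_fraction:.0%}")
print(f"receiver inactive: {off_fraction:.0%}")
```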
As shown in FIG. 1, in DVB-H streams, the video and audio information is sent via separate streams (in particular, separate User Datagram Protocol (UDP) sockets), labeled 225.0.0.1:4000 and 225.0.0.1:5000 respectively, using the Real-time Transport Protocol (RTP). This protocol is designed so that different media components of a multimedia session (for example video and audio) can be transported via different channels or routes, possibly coming from different sources (for example a microphone and a camera in video conferencing). When using RTP, it is possible for the broadcast audio and video streams to be out of synchronization by as much as a few seconds.
Consequently, the audio and video streams must be synchronized in time in the receiving device in order to avoid lip-sync problems. Even small deviations between the video and audio streams can be perceived by a user.
There are two separate synchronization issues for a DVB-H broadcast. The first synchronization issue occurs when the user selects or changes the received service (i.e. the user activates the receiving device, or switches from “Service 1” in FIG. 1 to “Service 2”). In this case, the receiving device must change to the new service and synchronize the new video and audio streams. This synchronization can take several seconds, which means that there will be a delay for the user before the new service is presented. In addition to the video and audio streams, there may be further components that need to be synchronized (for example graphics or subtitles).
The second issue is that the synchronization between the video and audio streams can drift over time, and may need to be corrected.
In RTP, these synchronization issues are mitigated by using RTP Control Protocol (RTCP) Sender Report packets that are sent along with the audio and video streams. As shown in FIG. 1, each of the video and audio streams is paired with a respective stream containing RTCP Sender Report packets, with stream 225.0.0.1:4001 carrying the RTCP Sender Report packets 4 for video stream 225.0.0.1:4000, and stream 225.0.0.1:5001 carrying the RTCP Sender Report packets 6 for audio stream 225.0.0.1:5000.
However, it can be seen from this figure that when a video and audio stream are first received, it is necessary to wait until RTCP Sender Report packets have been received for each of the audio and video streams before the streams can be synchronized (labeled the “sync point” in FIG. 1).
An exemplary structure of an RTCP Sender Report packet in accordance with the RTCP specification is shown in FIG. 2. The packet comprises a header section that specifies the version of the protocol being used (V), a padding indicator bit (P), the number of reception report blocks in the packet (RC), the packet type (PT, which here identifies a sender report, SR), the length of the packet in 32-bit words minus one, and the synchronization source identifier for the source of the sender report packet (SSRC). The packet also comprises a Sender Information section that specifies a 64-bit Network Time Protocol (NTP) time stamp (which is referred to herein as an absolute time), an RTP time stamp that corresponds to the same instant as the NTP time stamp but is expressed in the time base used by the RTP data packets in the video or audio stream, a sender's packet count that shows the total number of RTP data packets transmitted by the sender up until the transmission of the sender report, and a sender's octet count that shows the total number of payload octets transmitted by the sender up until the transmission of the sender report.
Every RTP data packet carries an RTP time stamp that reflects the sampling instant of the first octet in the RTP data packet. The RTP time stamps are usually specific to a particular media stream (i.e. video or audio), with each stream using its own starting point and clock frequency for counting increments in the time stamp. Thus, as different audio and video streams do not use the same time base (i.e. the clock frequency and start offset) for the RTP time stamps, the time stamps are not directly comparable.
Therefore, every audio and video RTP stream is paired with a respective stream containing RTCP packets as described above. As shown in FIG. 2, these RTCP Sender Report packets include an NTP time stamp and an RTP time stamp that represent the same time, but in different time bases. As the NTP time base is common to all of the different media components (e.g. audio and video), it is straightforward to synchronize all of the streams. In particular, a presentation time stamp (PTS) is calculated for each component using the timing information, with the PTS indicating the time at which the relevant data sample should be retrieved from a buffer, decoded and presented to a user.
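The mapping from stream-specific RTP time stamps to the common NTP time base can be sketched as follows. This is an illustrative example rather than a definitive implementation: the function name `rtp_to_common_time` is hypothetical, the clock rates (90 kHz for video, 48 kHz for audio) and all time stamp values are assumed for the example, and the resulting common time is only the basis from which a PTS might be derived.

```python
def rtp_to_common_time(rtp_ts: int, sr_rtp_ts: int,
                       sr_ntp_seconds: float, clock_rate_hz: int) -> float:
    """Map an RTP time stamp onto the NTP time base using one sender report.

    `sr_rtp_ts` and `sr_ntp_seconds` are the RTP and NTP time stamps taken
    from the same sender report, i.e. the same instant in two time bases.
    """
    # RTP time stamps are 32-bit and wrap around, so take a signed difference.
    delta = (rtp_ts - sr_rtp_ts) & 0xFFFFFFFF
    if delta >= 0x80000000:
        delta -= 0x100000000
    return sr_ntp_seconds + delta / clock_rate_hz


# Assumed example values: each stream has its own clock rate and offset.
video_time = rtp_to_common_time(rtp_ts=945_000, sr_rtp_ts=900_000,
                                sr_ntp_seconds=1000.0, clock_rate_hz=90_000)
audio_time = rtp_to_common_time(rtp_ts=60_000, sr_rtp_ts=48_000,
                                sr_ntp_seconds=1000.25, clock_rate_hz=48_000)

# Both samples map to the same instant, so they belong together on output.
print(video_time, audio_time, video_time - audio_time)
```

Although the raw RTP time stamps (945000 and 60000) cannot be compared directly, both samples map to the same instant in the common time base once the respective sender reports are applied.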
In DVB-H applications, it is recommended that RTCP packets are sent every five seconds. Consequently, when a change in service has been made, it can take up to five seconds before the next RTCP packet is received for each stream and the RTP time stamps of the audio and video streams can be related to each other using the timing information. This means that audio and video streams may be out of sync for up to the first five seconds after selecting or changing a service.
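The delay before the sync point can be illustrated with a simple model. The sketch below assumes, purely for illustration, that each stream's sender reports arrive strictly periodically at the recommended five-second interval; it ignores the effect of time slicing on arrival times. The function names and the example times are hypothetical.

```python
import math


def wait_for_next_report(switch_time: float, first_report_time: float,
                         interval: float = 5.0) -> float:
    """Seconds from `switch_time` until the next sender report arrives.

    Reports are assumed to arrive at first_report_time + k * interval.
    """
    if switch_time <= first_report_time:
        return first_report_time - switch_time
    k = math.ceil((switch_time - first_report_time) / interval)
    return first_report_time + k * interval - switch_time


def sync_delay(switch_time: float, first_report_times: list[float],
               interval: float = 5.0) -> float:
    """Delay until a sender report has been received for every stream."""
    return max(wait_for_next_report(switch_time, t, interval)
               for t in first_report_times)


# Assumed schedule: video reports at 1, 6, 11, 16 s; audio at 3, 8, 13 s.
# The user switches service at t = 12 s.
video_wait = wait_for_next_report(12.0, 1.0)
audio_wait = wait_for_next_report(12.0, 3.0)
delay = sync_delay(12.0, [1.0, 3.0])
print(video_wait, audio_wait, delay)
```

In this model the sync point is set by whichever stream's sender report arrives last, and the delay is bounded by the report interval.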
In addition, during a broadcast, when a new RTCP sender report packet is received and it is determined that it is necessary to correct the synchronization, the adjustment or correction can be perceived by the user as a slight jump or artefact in the presented audio or video.