With the rapid development of the Internet, streaming media technologies have been so widely applied that they are used for broadcasting, movie playback, remote education over the Internet, online news websites, and the like.
Approaches for transporting video and audio over the Internet mainly include downloading and streaming. Continuous time-based media delivered by streaming over the Internet are called streaming media, and the corresponding video and audio streaming media are usually called video streams and audio streams.
In streaming, video/audio signals are transported in a continuous way. A part of the streaming media is played at a client while the rest is downloaded in the background.
Streaming includes progressive streaming and real-time streaming. Real-time streaming refers to real-time transport, is particularly suitable for live events, and must be matched to the connection bandwidth, which means that image quality may be degraded when the Internet speed drops, so as to reduce the demand for transport bandwidth. "Real time" refers to an application in which the delivery of data must keep a precise time-based relationship with the generation of the data.
At present, streaming media transport usually adopts the Real-time Transport Protocol (RTP) and the Real-time Transport Control Protocol (RTCP). RTP, released by the Internet Engineering Task Force (IETF), is a transport protocol for multimedia data streams over the Internet. RTP is defined to operate in one-to-one or one-to-many transport, and provides time information and stream synchronization. RTP is typically carried over the User Datagram Protocol (UDP), but can also run over the Transmission Control Protocol (TCP), Asynchronous Transfer Mode (ATM), or any other protocol. RTP itself ensures only real-time data transport; it provides neither a reliable delivery mechanism for the transported data packets nor traffic or congestion control, all of which are provided by means of RTCP. RTCP is responsible for managing the transport quality and exchanging control information between active application processes. During an RTP session, each participant periodically transmits RTCP packets which contain statistics such as the number of data packets sent, the number of data packets lost, and the like. A server can thus make use of such information to change the transport rate dynamically, and even the payload type. The interoperation of RTP and RTCP can optimize the transport efficiency with effective feedback and minimum overhead, and is hence particularly suitable for real-time data transport over the Internet.
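To make the packet layout concrete, the following sketch parses the 12-byte fixed RTP header (version, payload type, sequence number, timestamp, and synchronization source identifier) as defined in RFC 3550. The function name and dictionary keys are illustrative choices, not names from any particular implementation.

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Parse the 12-byte fixed RTP header defined in RFC 3550."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,            # must be 2 for RTP
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "csrc_count": b0 & 0x0F,
        "marker": bool(b1 & 0x80),
        "payload_type": b1 & 0x7F,     # identifies the media coding
        "sequence_number": seq,        # detects loss and reordering
        "timestamp": ts,               # sampling instant of first byte
        "ssrc": ssrc,                  # identifies the stream's source
    }

# Example: version 2, payload type 96, seq 1000, timestamp 160, SSRC 0x1234
pkt = struct.pack("!BBHII", 0x80, 96, 1000, 160, 0x1234)
print(parse_rtp_header(pkt)["payload_type"])  # → 96
```

The sequence number and timestamp fields extracted here are the ones the receiver relies on for loss detection and time-ordered playback, as discussed below.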
RTP defines a timestamp-based synchronization method for a receiver to correctly restore the sequence of multimedia data packets and play them. The timestamp field in an RTP header carries time synchronization information for the data packets, and is critical for restoring the data in the proper time order. The value of the timestamp reflects the sampling instant of the first byte of a data packet, and the clock generating the transmitter timestamp shall increase continuously and monotonically, even if there is no data to be received or transmitted. In a silent period, the transmitter has no data to transmit but the timestamp keeps increasing; the receiver can tell that no data has been lost because no sequence numbers of the received data packets are missing, and can determine the time interval at which the data is output by comparing the timestamps of a previous and a subsequent packet. The initial timestamp for a session should be selected randomly, and the unit of the timestamp is determined by the payload type.
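The output interval derived from two timestamps is simply their difference divided by the payload's clock rate (e.g., 8000 Hz for telephone-quality audio, 90000 Hz for video). A minimal sketch, including the 32-bit wrap-around that a long-running timestamp clock eventually hits:

```python
def playout_interval(ts_prev: int, ts_next: int, clock_rate_hz: int) -> float:
    """Seconds between the sampling instants of two packets,
    handling 32-bit RTP timestamp wrap-around."""
    delta = (ts_next - ts_prev) & 0xFFFFFFFF  # unsigned 32-bit difference
    return delta / clock_rate_hz

# 8 kHz audio with 160-sample frames: consecutive packets are 20 ms apart
print(playout_interval(160, 320, 8000))  # → 0.02
```

Note that this interval is in units of the media clock only; the absolute initial timestamp is random, so timestamps from two different streams cannot be compared directly, which is the problem the RTCP mechanism below addresses.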
In addition, multimedia transport generally involves a mixed transport of various streams which need to be played simultaneously. Therefore, how to synchronize the various streams is a major issue for multimedia stream transport. RTCP plays an important role in enabling the receiver to synchronize multiple RTP streams. When audio and video data are transported together, for instance, two separate RTP streams are used due to their different coding, and the timestamps of the two streams run at different rates. In this case, the receiver shall synchronize the two streams so that the voices are consistent with the images.
For the synchronization of streams, RTCP requires that the transmitter assign a canonical name, uniquely identifying a data source, to each stream to be transported, and different streams from the same data source have the same canonical name. Thus, the receiver can be aware of which streams are associated. Information contained in a report message from the transmitter can be used by the receiver to coordinate the timestamps in the two streams. The report from the transmitter includes an absolute time value in the format of the Network Time Protocol (NTP), and this value is generated by the same clock that generates the timestamp field of the RTP packets. Since the same absolute time is used for all the streams and reports from the transmitter, the receiver can compare the absolute times of two streams from the same data source so as to determine how to map the timestamp value in one of the streams to the timestamp value in the other.
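The mapping described above can be sketched as follows: each sender report pairs an RTP timestamp with an NTP wall-clock time, so a timestamp in one stream can be converted to wall-clock time and then into the other stream's timestamp units. This is an illustrative sketch assuming simplified inputs (NTP times as plain seconds, each sender report given as an `(rtp_timestamp, ntp_seconds)` pair); the function names are hypothetical.

```python
def rtp_to_wallclock(ts: int, sr_rtp_ts: int, sr_ntp_secs: float,
                     clock_rate_hz: int) -> float:
    """Map an RTP timestamp to the sender's wall clock, given the most
    recent sender report pairing (sr_rtp_ts, sr_ntp_secs)."""
    delta = (ts - sr_rtp_ts) & 0xFFFFFFFF
    if delta >= 0x80000000:          # interpret as signed: allow timestamps
        delta -= 0x100000000         # slightly earlier than the report
    return sr_ntp_secs + delta / clock_rate_hz

def map_timestamp(ts_a: int, sr_a: tuple, rate_a: int,
                  sr_b: tuple, rate_b: int) -> int:
    """Convert a timestamp in stream A to the equivalent timestamp in
    stream B from the same data source, via the shared NTP wall clock."""
    wallclock = rtp_to_wallclock(ts_a, sr_a[0], sr_a[1], rate_a)
    return sr_b[0] + round((wallclock - sr_b[1]) * rate_b)

# Audio (8 kHz) report: rtp_ts=4000 at NTP 100.0 s;
# video (90 kHz) report: rtp_ts=180000 at NTP 100.0 s.
# Audio timestamp 12000 is 1 s after its report → video ts 180000 + 90000.
print(map_timestamp(12000, (4000, 100.0), 8000, (180000, 100.0), 90000))  # → 270000
```

This cross-stream mapping works only to the extent that the reported wall-clock values stay accurate, which is precisely what the network jitter discussed next disturbs.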
Nevertheless, since multimedia streams, such as audio streams and video streams, have different transport paths and environments, and network transport conditions vary in complex and unpredictable ways, delay and jitter may be introduced into the transport of the audio and video streams. In order to eliminate the jitter, the receiver buffers the multimedia streams upon reception, that is, it stores the received data packets in a buffer, and then synchronizes and plays them. Due to the jitter and the buffering process, the synchronization of the various streams becomes much more complex, and satisfactory synchronization cannot be achieved with RTP/RTCP alone.
The synchronization between an audio stream and a video stream is called lip synchronization, which is a major issue for multimedia transport. To enable voices and images to be better presented, lip synchronization keeps the voices consistent with the images, so that the audio stream plays in real-time consistency with the images. A crucial issue is how to incorporate prior multimedia real-time transport technologies to realize lip synchronization in a packet network environment.
To eliminate the jitter, a jitter buffer is provided at the receiver of the prior multimedia transport network. The jitter buffer is provided with a certain buffer depth and a fixed delay. For example, FIG. 1 is a schematic diagram illustrating two jitter buffers and their operating mechanisms in the prior art, where the jitter buffers 110 and 120 for the audio and video streams are provided with fixed delays A1 and A2, respectively. Once the playout time of the delayed media stream data in the buffers arrives, the audio and video streams are respectively played.
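The fixed-delay scheme can be sketched as follows: the buffer anchors itself to the first packet's arrival time, and every later packet is played at that anchor plus the fixed delay plus the packet's nominal offset in media time. This is a simplified illustrative model of the mechanism of FIG. 1, not an implementation from any particular system; the class and method names are hypothetical.

```python
import heapq

class FixedDelayJitterBuffer:
    """Reorders packets by timestamp and releases each one at
    (arrival time of first packet) + (fixed delay) + (nominal offset)."""

    def __init__(self, fixed_delay_s: float):
        self.fixed_delay = fixed_delay_s
        self.base_arrival = None   # arrival time of the first packet
        self.base_ts = None        # RTP timestamp of the first packet
        self.heap = []             # (timestamp, payload), kept ordered

    def push(self, arrival_s: float, timestamp: int, payload: bytes):
        if self.base_arrival is None:
            self.base_arrival = arrival_s
            self.base_ts = timestamp
        heapq.heappush(self.heap, (timestamp, payload))

    def playout_time(self, timestamp: int, clock_rate_hz: int) -> float:
        """When the packet carrying this timestamp should be played."""
        offset = (timestamp - self.base_ts) / clock_rate_hz
        return self.base_arrival + self.fixed_delay + offset

buf = FixedDelayJitterBuffer(fixed_delay_s=0.1)
buf.push(arrival_s=5.00, timestamp=0, payload=b"f0")
buf.push(arrival_s=5.05, timestamp=160, payload=b"f1")  # jitter-delayed arrival
print(buf.playout_time(160, 8000))  # 5.00 + 0.1 + 0.02 s
```

The jitter in the second packet's arrival (30 ms late relative to its 20 ms media offset) is absorbed as long as it stays below the fixed 100 ms delay; as the following paragraphs explain, a fixed delay fails exactly when the network jitter exceeds this static budget.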
In the prior art, since each stream has a fixed delay in the jitter buffer, the buffer can eliminate the effect resulting from the jitter, and a compensating synchronization offset can be determined from the delay between the two streams. However, the fixed delay is applicable only to a relatively stable network. For transport over a packet network, two independent streams have different paths and different Quality of Service registrations, and hence the audio and video streams experience different delays over the network. In addition, the jitter may cause the transport delays over the network to vary greatly and become unstable, so that the fixed delay in the jitter buffer cannot compensate for the synchronization offset, which ultimately results in the audio and video streams being out of synchronization, and the lip synchronization fails.
In practical applications, the following can be concluded from the above solution. Firstly, the delays of the audio and video streams in the jitter buffers are fixed and cannot be adjusted dynamically, and thus cannot adapt to network variations. For example, under good network conditions, the multimedia streams can be transported rapidly, and a large buffer delay results in a wasteful system delay; under poor network conditions, the jitter may be too strong to be eliminated, which may cause the two streams to fall out of synchronization, failing to attain the synchronization effect.
Secondly, the compensating synchronization offset between the two streams is fixed due to the fixed delays. When the network conditions vary, for example, become better or worse, the actual synchronization offset varies accordingly. As a result, the synchronization offset between the two streams may even be increased after the synchronization processing.
Thirdly, the audio and video streams are processed separately and share no common synchronization reference while being synchronized. Instead, the two streams are synchronized by introducing the fixed delays, which cannot be adjusted in accordance with feedback on the result of the synchronization between the two streams.
In the prior art, in short, a fixed delay is set for each of the multimedia streams, i.e., the audio and video streams, which are buffered in jitter buffers without any adjustment mechanism.