With the continuous development of Internet technologies, real-time media services provided over the network, such as voice communication (also known as network telephony) and network audio/video, have become quite popular. However, the quality of current networks cannot fully meet the requirements of real-time media services. Inherent network problems, such as time delay, jitter, packet loss, and out-of-order delivery, have an impact on real-time media services and thus directly affect their Quality of Service (QoS).
Among the factors that affect real-time media services, network time delay characteristics, such as time delay jitter, are very common. Time delay jitter refers to the variation in transmission time delay between adjacent data packets in the network. Taking instant voice communication as an example, a sender sends voice frames to the Internet at a fixed time interval, such as 10 ms, and the Internet forwards the voice frames to a receiver. When the Internet is in an ideal condition, the interval at which the voice frames arrive at the receiver is the same as the interval at which they were sent, so that the voice played by the receiver is consistent with the voice sent by the sender, and the requirements of voice communication are satisfied.
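The definition above can be illustrated with a minimal sketch. The function name `delay_jitter` and the timestamp values are assumptions chosen for illustration only; the sketch simply takes the per-frame transmission delay and the change in that delay between adjacent frames.

```python
def delay_jitter(send_ms, recv_ms):
    """Return per-frame delays and the delay change between adjacent frames.

    send_ms / recv_ms are hypothetical send and arrival timestamps (ms),
    one pair per voice frame, listed in sending order.
    """
    delays = [r - s for s, r in zip(send_ms, recv_ms)]
    # Jitter here is the difference in delay between each frame and the
    # frame sent immediately before it.
    jitter = [b - a for a, b in zip(delays, delays[1:])]
    return delays, jitter


# Frames sent every 10 ms; illustrative arrival times with varying delay.
send_ms = [0, 10, 20, 30, 40]
recv_ms = [50, 60, 85, 92, 100]
delays, jitter = delay_jitter(send_ms, recv_ms)
print(delays)  # [50, 50, 65, 62, 60]
print(jitter)  # [0, 15, -3, -2]
```

In an ideal network every entry of `jitter` would be zero, which is the case in which the arrival interval equals the sending interval.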
However, the quality of an actual network cannot meet the requirements of real-time media services. The voice frames sent by the sender usually experience different routing and network congestion in the network, so the time delay with which each voice frame arrives at the receiver differs from frame to frame. In this case, the interval at which the voice frames arrive at the receiver is no longer consistent with the interval at which they were sent, which produces distortion in the voice played by the receiver. For instance, playback stops after the received voice frames have been played, to wait for the arrival of subsequent voice frames, or voice frames are lost because of buffer overflow, which directly affects the QoS of the voice communication.
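The two failure modes named above, playback stopping on an empty buffer and frame loss on buffer overflow, can be reproduced with a simplified simulation. Everything here is an assumption made for illustration: the function name `simulate_playout`, the fixed buffer capacity counted in frames, and the choice to admit arrivals only at playout ticks.

```python
def simulate_playout(arrival_ms, frame_ms=10, start_ms=0, capacity=3):
    """Count playout stalls (empty buffer) and frames dropped on overflow.

    Simplification: arrivals are admitted in batches at each playout tick,
    so the overflow check happens once per tick rather than per arrival.
    """
    arrivals = sorted(arrival_ms)
    total = len(arrivals)
    buffered = played = underruns = dropped = 0
    i = 0
    t = start_ms
    while played + dropped < total:
        # Admit every frame that has arrived by the current playout tick.
        while i < total and arrivals[i] <= t:
            if buffered < capacity:
                buffered += 1
            else:
                dropped += 1  # buffer full: the frame is lost
            i += 1
        if buffered > 0:
            buffered -= 1
            played += 1
        else:
            underruns += 1  # nothing to play: playback stalls for one tick
        t += frame_ms
    return {"played": played, "underruns": underruns, "dropped": dropped}


ideal = simulate_playout([0, 10, 20, 30, 40])
jittered = simulate_playout([0, 10, 45, 46, 47, 48])
print(ideal)     # {'played': 5, 'underruns': 0, 'dropped': 0}
print(jittered)  # {'played': 5, 'underruns': 3, 'dropped': 1}
```

In the jittered run, the gap after the second frame causes three stalled ticks, and the burst that follows exceeds the three-frame buffer, so one frame is dropped.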
Since network time delay jitter cannot be fundamentally eliminated, its effect can only be alleviated through certain measures.
One technology for handling network time delay jitter is the Time-scale anti-jitter technology. This technology stores the received voice frames in a buffer at the receiver (sometimes known as a jitter buffer), and performs time-domain stretching or compression on all the voice frames stored in the buffer, using a Synchronous Overlap and Add (SOLA) algorithm, a Pitch Synchronous Overlap and Add (PSOLA) algorithm, a Waveform-similarity-based Synchronous Overlap and Add (WSOLA) algorithm, or the like. Specifically, the method includes the following. When it is determined that playback of all the voice frames in the buffer will finish before new voice frames arrive, i.e., when the network time delay increases, time-domain stretching is performed on all the voice frames stored in the buffer using any of the above algorithms, so as to extend the playing time of the voice frames. Conversely, when it is determined that a large number of voice frames will be received by the buffer within a short time, i.e., when the network time delay decreases, time-domain compression is performed on all the voice frames stored in the buffer using any of the above algorithms, so as to shorten the playing time of the voice frames and avoid overflow and loss of voice frames in the buffer.
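The stretch-or-compress decision described above can be sketched as follows. This is only an illustration of the decision step under stated assumptions: the function `choose_time_scale`, its parameters, and the clamping limits are hypothetical and not taken from any particular implementation, and the actual time-domain processing (SOLA, PSOLA, or WSOLA) is not shown.

```python
def choose_time_scale(buffered_ms, next_arrival_in_ms, incoming_ms,
                      capacity_ms, min_scale=0.5, max_scale=2.0):
    """Pick a playing-time scale factor for the frames in the buffer.

    Returns > 1.0 to stretch, < 1.0 to compress, 1.0 to leave unchanged.
    All arguments are hypothetical quantities in milliseconds of audio;
    min_scale and max_scale are assumed limits on the adjustment.
    """
    if buffered_ms <= 0:
        return 1.0  # nothing buffered, nothing to scale
    if buffered_ms < next_arrival_in_ms:
        # Playback would finish before new frames arrive: stretch so the
        # buffered audio lasts until the expected arrival.
        return min(next_arrival_in_ms / buffered_ms, max_scale)
    if buffered_ms + incoming_ms > capacity_ms:
        # A burst would overflow the buffer: compress the buffered audio
        # so that it fits alongside the incoming frames.
        return max((capacity_ms - incoming_ms) / buffered_ms, min_scale)
    return 1.0


print(choose_time_scale(30, 45, 0, 100))   # 1.5  (stretch)
print(choose_time_scale(80, 10, 40, 100))  # 0.75 (compress)
print(choose_time_scale(40, 20, 20, 100))  # 1.0  (unchanged)
```

The returned factor would then be handed to whichever overlap-and-add algorithm performs the actual stretching or compression of the buffered frames.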
The Time-scale technology adapts to changes in network time delay jitter by adjusting the playing time of all the voice frames stored in the buffer. However, because time-domain processing is performed on the voice frames, the original sampling frequency of the voice is changed. Thus, if the receiver plays the processed voice frames at the original sampling frequency, voice distortion is produced, which manifests as speech that sounds faster or slower than the original. It can therefore be seen that, in dealing with network time delay characteristics that may lead to interruption of media playback, frame loss, and the like, the Time-scale technology may cause the played media to be distorted in another form, and thus does not fundamentally reduce or eliminate the effect of network time delay jitter on real-time media services.