The transmission of voice and audio data over IP networks presents inherent challenges for end-to-end quality of service. Specifically, packet loss, packet delay and packet jitter can significantly degrade voice quality.
From the perspective of an endpoint (e.g. a phone) on an IP network, packet loss occurs in an arbitrary, unpredictable fashion. Packet loss is outside the endpoint's control and typically results from a collision or from network overload (e.g. in a router or gateway). Since packet loss can occur in the physical implementation of the network (e.g. collisions in cables), there is no guaranteed mechanism to inform the receiver when a packet is missing. Therefore, sequence numbers are used to allow the receiver to detect packet loss. Moreover, a lost packet is not retransmitted, since the associated delay is prohibitive in real-time telephony applications. Thus, the onus is on the receiving endpoint to implement some form of detection and compensation for packets lost in the network. The challenge in this respect is to adequately reconstruct the original signal and maintain a sufficient level of voice quality.
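The use of sequence numbers for loss detection can be sketched as follows. This is a minimal illustration rather than any particular protocol's rule: the function name, the 16-bit wrapping counter and the convention of treating a large backward jump as an old packet are all assumptions made for the example.

```python
SEQ_MOD = 1 << 16  # assumption: 16-bit sequence counter that wraps around


def missing_sequence_numbers(prev_seq, new_seq, seq_mod=SEQ_MOD):
    """Return the sequence numbers skipped between two consecutive arrivals."""
    gap = (new_seq - prev_seq) % seq_mod
    # A gap of 1 is the normal case. A gap of 0 (duplicate) or a gap in the
    # upper half of the range (assumed to be an old, reordered packet) is
    # not counted as a loss.
    if gap == 0 or gap >= seq_mod // 2:
        return []
    return [(prev_seq + i) % seq_mod for i in range(1, gap)]
```

For example, receiving packet 13 directly after packet 10 reports packets 11 and 12 as missing, and the same holds across the wrap from 65534 to 1.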
Packet delay and packet jitter are additional network phenomena that require compensation to maintain voice quality. As packets travel from a source endpoint to a destination endpoint, they are typically relayed through various routers or hubs along the way. As a result of variable queuing delays and variable routing paths, packets sent periodically and in sequence from a source can arrive at the destination endpoint out of order and with substantial delay and jitter. Typically, a receiver manages these issues by implementing a buffer of packets to smooth the variable jitter and to allow the receiver to re-arrange packets into their proper order. Unfortunately, such a buffer increases the nominal delay of the audio stream by an amount that depends on its size, and its size must therefore be minimized, since audio delay has its own negative effect on voice quality. A buffer kept small in this way cannot fully compensate for delay and jitter in the receiver, and it effectively increases the rate of packet loss in the system, since a packet that arrives after its playout time cannot be inserted into an ongoing audio stream.
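The buffering behaviour described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of any specific prior art implementation: the class name `JitterBuffer`, the `depth` parameter (number of packets held before playout starts) and the omission of sequence number wraparound are all choices made for the example.

```python
class JitterBuffer:
    """Reorders packets by sequence number and rejects packets that arrive late."""

    def __init__(self, depth):
        self.depth = depth      # packets to accumulate before playout starts
        self.slots = {}         # sequence number -> payload
        self.next_seq = None    # next sequence number due for playout
        self.late = 0           # packets discarded because their slot had passed

    def push(self, seq, payload):
        if self.next_seq is not None and seq < self.next_seq:
            self.late += 1      # playout time already passed: counts as lost
            return
        self.slots[seq] = payload

    def pop(self):
        """Return the payload for the next playout slot, or None if unavailable."""
        if self.next_seq is None:
            if len(self.slots) < self.depth:
                return None     # still prefilling the buffer
            self.next_seq = min(self.slots)
        seq = self.next_seq
        self.next_seq += 1
        # None here means the packet was lost or late; the caller must conceal it.
        return self.slots.pop(seq, None)
```

A larger `depth` tolerates more jitter and reordering but adds proportionally more playout delay, which is the trade-off described above.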
Most applications use the aforementioned buffer of packets, commonly called a jitter buffer, to handle jitter and packet delay. Routines that manage this buffer monitor incoming sequence numbers and detect both lost and late packets. In telephony applications, delay requirements usually constrain each packet to carry 10, 20 or 30 ms of audio. To compensate for a loss of this duration, the receiving endpoint can replay a previous packet, decrease the playout rate (assuming the jitter buffer is of sufficient size), interpolate samples, or implement a silence detection and insertion scheme.
Simply replaying a previous packet is computationally trivial, yet it often yields unsatisfactory results, since voice quality suffers dramatically as packet loss increases. A variation of this scheme is to replace the lost packet with an idle or all-zeros packet, but this too is quite noticeable under even marginal packet loss.
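Both replacement schemes can be sketched in a few lines. The function name `conceal`, the use of `None` to mark a lost frame and the `mode` parameter are assumptions made for illustration only.

```python
def conceal(frames, frame_len, mode="repeat"):
    """Fill lost frames (None entries) by replaying the last good frame or zeros."""
    out = []
    last = [0] * frame_len              # nothing to replay before the first frame
    for frame in frames:
        if frame is None:
            if mode == "repeat":
                frame = list(last)      # replay the previous good frame
            else:
                frame = [0] * frame_len  # substitute an all-zeros frame
        else:
            last = frame
        out.append(frame)
    return out
```

Note that under consecutive losses the "repeat" mode replays the same frame again and again, which is one reason the result degrades quickly as the loss rate rises.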
Decreasing the playout rate and interpolating between samples are effectively the same thing: both alter the receive sample rate to reduce the rate at which samples are consumed. In the prior art, playout adjustment is implemented in hardware by adjusting the sample clock or the sample frame length, whereas interpolation is implemented in software by inserting additional samples computed by averaging. Both methods have the undesirable side effect of shifting the frequency of the signal, because the sample rate changes. To minimize the frequency shift, only small adjustments to the sample rate can be made. However, under conditions of packet loss, small adjustments do not provide an adequate rate of compensation.
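The software interpolation approach, and the reason small adjustments compensate too slowly, can be illustrated with the following sketch. The function names and the `every` parameter (one averaged sample inserted per `every` input samples) are assumptions made for the example.

```python
def stretch_by_averaging(samples, every):
    """Insert one averaged sample after every `every` input samples."""
    out = []
    for i, s in enumerate(samples):
        out.append(s)
        at_insert_point = (i + 1) % every == 0 and i + 1 < len(samples)
        if at_insert_point:
            # New sample is the mean of its two neighbours.
            out.append((s + samples[i + 1]) / 2)
    return out


def frequency_scale(every):
    """Approximate factor applied to signal frequencies after stretching."""
    return every / (every + 1)
```

With `every = 100` the signal is lengthened by about 1% and its frequencies are lowered by about 1%. At that rate, roughly 2 seconds of audio must be played to make up for a single lost 20 ms packet.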
Prior art silence detection algorithms monitor the signal stream to determine the intervals between voice activity in which the signal consists merely of background noise. Silence insertion is the process of using the silence detection information to insert additional silence periods to compensate for lost packets. This method can be effective if there are many silence intervals, or if the jitter buffer is large enough to guarantee some silence intervals most of the time. Unfortunately, in voice conversations silence periods are often very short (between words), and they cannot be guaranteed within the time frame of a typical jitter buffer. Furthermore, silence detection imposes an additional processing burden compared to the other prior art methods of compensation.
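A silence detection and insertion scheme of this general kind might look like the following sketch. The text does not specify a detection algorithm, so the mean-energy test, the `threshold` parameter and both function names are assumptions chosen for illustration.

```python
def is_silence(frame, threshold):
    """Classify a non-empty frame as silence if its mean energy is below threshold."""
    energy = sum(s * s for s in frame) / len(frame)
    return energy < threshold


def insert_silence(frames, frames_needed, threshold):
    """Duplicate silent frames until `frames_needed` extra frames have been added.

    Returns the stretched frame list and the number of frames still owed.
    """
    out = []
    for frame in frames:
        out.append(frame)
        if frames_needed > 0 and is_silence(frame, threshold):
            out.append(list(frame))     # lengthen the pause by one frame
            frames_needed -= 1
    return out, frames_needed
```

If the buffered audio contains fewer silent frames than are needed, the second return value is non-zero, which reflects the limitation noted above: compensation is only possible when silence happens to be present in the buffer.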