The present invention relates to adaptive playout of audio packets transported over a packet network without reliance on packet header time-stamp information for jitter removal. In particular, the present invention relates to effective playout through removal of jitter and minimization of delay independent of packetization format.
Voice over packet networks, or VoIP, requires that the voice or audio signal be packetized and then transmitted. The transmission path will typically take the packets through both packet switched and circuit switched networks between the terminations of the transmission. The analog voice signal is first converted to a digital signal and compressed at a gateway connected between the terminal equipment and the packet network. The gateway produces a pulse code modulated (PCM) digital stream from the analog voice.
The PCM stream is analyzed in the gateway and processed according to the parameters of the gateway, such as echo suppression, silence detection and DTMF tone detection. Detected tones are passed separately without encoding. The voice PCM samples are passed to a CODEC for processing prior to packet assembly.
The CODEC creates voice frames from the PCM stream according to the parameters of the codec used. The creation of frames from the PCM stream typically includes compression. The frames are of known size and, based upon the specified rate, are of a determinable time duration. Each frame contains a set number of bits of the PCM stream dependent on the codec used for bi-directional conversion between analog audio and digital packets.
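The relationship between frame size, rate and frame duration can be illustrated with a minimal sketch. The function name `frame_duration_ms` and the example figures are assumptions chosen for illustration only and are not drawn from any particular gateway.

```python
def frame_duration_ms(frame_bits: int, bit_rate_bps: int) -> float:
    """Return the playing time, in milliseconds, represented by one frame.

    A frame of known size produced at a specified bit rate covers
    frame_bits / bit_rate_bps seconds of audio.
    """
    if frame_bits <= 0 or bit_rate_bps <= 0:
        raise ValueError("frame size and bit rate must be positive")
    return 1000.0 * frame_bits / bit_rate_bps


# Illustrative figures (assumed): an 80-bit frame at 8000 bit/s.
duration = frame_duration_ms(80, 8000)
print(duration)
```

With these assumed figures each frame represents 10 ms of audio, so a packet carrying two such frames would represent 20 ms.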
The frames are then assembled into packets by a packet assembler which combines a set number of sequential frames into a single packet data payload. A header, such as a Real-time Transport Protocol (RTP) header, is attached to each packet payload to provide a sequence number for identification of the packet and a time stamp for the packet. In the case of RTP format, information about the length of the packet is provided in the IP header. The gatekeeper then assigns an IP address to the packet corresponding to the designated destination of the voice signal to which the packet belongs. An IP header is added to the packet to designate the origination and destination IP addresses for the packet. A UDP header containing source and destination ports can also be added to the packet.
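The assembly of frames into a payload with an RTP header may be sketched as follows. This is a simplified illustration, not a complete RTP implementation: it assumes the 12-byte fixed RTP header with no CSRC entries, padding or extension, and the names `build_rtp_packet` and `parse_rtp_packet`, along with the default payload type, are hypothetical. The IP and UDP headers are assumed to be added by lower layers and are not shown.

```python
import struct

# Fixed RTP header: flags byte, marker/payload-type byte, 16-bit sequence
# number, 32-bit time stamp, 32-bit SSRC, all in network byte order.
RTP_HEADER = struct.Struct("!BBHII")


def build_rtp_packet(frames, seq, timestamp, ssrc, payload_type=0):
    """Combine sequential codec frames into one payload and prepend a header."""
    payload = b"".join(frames)
    first_byte = 2 << 6                 # version 2, no padding/extension/CSRC
    second_byte = payload_type & 0x7F   # marker bit left clear
    header = RTP_HEADER.pack(
        first_byte, second_byte, seq & 0xFFFF,
        timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF,
    )
    return header + payload


def parse_rtp_packet(packet):
    """Read back the header fields; assumes no CSRC entries are present."""
    first, second, seq, timestamp, ssrc = RTP_HEADER.unpack_from(packet)
    return {
        "version": first >> 6,
        "payload_type": second & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[RTP_HEADER.size:],
    }


# Two assumed 10-byte frames combined into a single packet.
packet = build_rtp_packet([b"\x01" * 10, b"\x02" * 10],
                          seq=7, timestamp=160, ssrc=0x1234)
fields = parse_rtp_packet(packet)
print(len(packet), fields["seq"], fields["timestamp"])
```

Note that the header carries no length field; the receiver infers the payload length from the size of the received datagram.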
The packets are routed through the packet network based upon the IP address information. The packet may pass through several switches and routers and the signal in digital and analog form may pass through both packet switches and circuit switches respectively. The packet is likely to accumulate delay as it passes between the near and far end terminal equipment, through the near and far end gateways, through the packet and PSTN networks and through switches.
Because this accumulated delay is erratic and unpredictable and further because each packet may take a different path through the networks, delay can cause the packets to arrive out of sequence and/or with gaps or overlaps. Gapping and overlapping of packets are referred to as delay and the variance in delay from one packet to the next is called jitter. Delay and jitter are measured by comparison of the end time stamp of one packet with the start time stamp of the next packet. If the next packet is received before the end time stamp of the previous packet, there is overlapping delay. If the next packet is received after the end time stamp of the previous packet, the difference in time is the delay gap. Conditions in the packet network can also cause the loss of packets, referred to as packet loss.
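The comparison of time stamps described above may be illustrated by the following sketch. The function names `classify_spacing` and `jitter_series` and the use of milliseconds are assumptions made for illustration; jitter is approximated here simply as the absolute change in delay between consecutive packets.

```python
def classify_spacing(prev_end_ms, next_start_ms):
    """Compare the end of one packet with the start of the next.

    Returns a (kind, amount_ms) pair, where kind is "gap", "overlap"
    or "contiguous".
    """
    diff = next_start_ms - prev_end_ms
    if diff > 0:
        return ("gap", diff)        # next packet starts after the previous ends
    if diff < 0:
        return ("overlap", -diff)   # next packet starts before the previous ends
    return ("contiguous", 0)


def jitter_series(delays_ms):
    """Absolute change in delay from each packet to the next."""
    return [abs(b - a) for a, b in zip(delays_ms, delays_ms[1:])]


print(classify_spacing(20, 25))
print(classify_spacing(20, 17))
print(jitter_series([40, 55, 45]))
```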
Voice packets are generated at a constant rate and represent continuous and ordered speech. Voice packets should be played out at the receiving end in the same order and at the same rate to accurately reproduce the original analog speech. Because of some inherent loss and delay in a packet network, the packets are reassembled and played out as close to the original sequence and rate as possible to achieve acceptable reproduction.
The receiving gateway will first remove the IP and UDP headers from the packets. Next the RTP information is read and the voice frames are extracted from the packet. The RTP information is used to ensure that the packets are in the proper sequence. If a packet is missing or out of order, the gateway must compensate for the missing frames in that packet in order to avoid undesirable distortion of the voice signal after frame reassembly. If one or more frames in a sequence are missing, the previous frame is repeated at a decreased volume to fill in the gap(s) left by the missing frame(s). If the missing packet subsequently arrives too late for inclusion in the reassembled sequence of frames, it is discarded.
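The repetition of the previous frame at a decreased volume may be sketched as follows. Frames are modeled as lists of integer PCM samples, with `None` marking a missing frame. The name `conceal_missing`, the attenuation factor of 0.5, the substitution of silence when no earlier frame exists, and the progressive attenuation across consecutive losses are all assumptions made for illustration.

```python
def conceal_missing(frames, frame_len, attenuation=0.5):
    """Fill each missing frame with an attenuated copy of the frame before it.

    frames      -- list of frames in sequence order; None marks a missing frame
    frame_len   -- samples per frame, used for silence if the first frame is lost
    attenuation -- assumed volume reduction applied to each repeated frame
    """
    output = []
    previous = [0] * frame_len  # silence until a real frame has been seen
    for frame in frames:
        if frame is None:
            # Repeat the last played frame at reduced volume. Consecutive
            # losses therefore fade progressively toward silence.
            frame = [int(sample * attenuation) for sample in previous]
        output.append(frame)
        previous = frame
    return output


received = [[100, -100], None, None, [8, 8]]
print(conceal_missing(received, frame_len=2))
```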
To compensate for jitter and/or packet loss, the receiving gateway utilizes the sequence number and time stamp of the RTP header to smooth the playout by removing gaps and overlaps in the frame sequence. The receiving gateway includes a Voice Playout Unit (VPU) with a FIFO memory buffer for temporary storage of the packets. The purpose of the buffer is to remove the effect of packet arrival jitter from the voice playout. This is accomplished by adding delay before playout, such that the delay is greater than or equal to the maximum jitter encountered. The maximum delay available will be determined by the size of the buffer. Any extra delay before playout will not distort the audio but may reduce the perceived quality, because unnecessary delay added to the system can be noticed by users if it is of sufficient length. Insufficient delay will cause poor playout quality, because data will be unavailable for playout when packets are late, causing the playout to hesitate and sound distorted.
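The effect of the playout delay on late packets may be illustrated with a minimal sketch. It assumes that packets are identified by their position in the sequence, that each packet represents a fixed playing time, and that the first packet arrives with no jitter so that its arrival time can serve as the reference. The name `late_packets` and the arrival times are hypothetical.

```python
def late_packets(arrival_ms, frame_ms, playout_delay_ms):
    """Return the sequence positions of packets that miss their playout slot.

    arrival_ms       -- arrival time of each packet, indexed by sequence position
    frame_ms         -- playing time represented by each packet
    playout_delay_ms -- delay added before playout of the first packet
    """
    if not arrival_ms:
        return []
    start = arrival_ms[0] + playout_delay_ms
    late = []
    for seq, arrived in enumerate(arrival_ms):
        deadline = start + seq * frame_ms  # time at which this packet is played
        if arrived > deadline:
            late.append(seq)
    return late


# Assumed arrivals for 20 ms packets; the third packet is 25 ms late.
arrivals = [0, 22, 65, 61, 80]
print(late_packets(arrivals, frame_ms=20, playout_delay_ms=10))
print(late_packets(arrivals, frame_ms=20, playout_delay_ms=25))
```

In this example a 10 ms delay leaves the third packet unavailable at its playout time, while a delay equal to the 25 ms maximum jitter allows every packet to be played. Any delay beyond 25 ms would add latency without reducing the number of late packets.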
The reassembled sequence of frames is processed in a codec to return the PCM stream for playout.