Convergence of the telephone network and the Internet is driving the move to packet-based transmission for telecommunication networks. As will be appreciated, a “packet” is a group of consecutive bytes (e.g., a datagram in TCP/IP) sent from one computer to another over a network. In Internet Protocol (IP) telephony, also known as Voice over IP (VoIP), a telephone call is sent as a series of data packets over a fully digital communication channel. This is effected by digitizing the voice stream, encoding the digitized stream with a codec, and dividing the resulting stream into a series of packets (typically in 20 millisecond increments). Each packet includes a header, a trailer, and a data payload of one to several frames of encoded speech. Integration of voice and data onto a single network offers significantly improved bandwidth efficiency for both private and public network operators.
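By way of illustration only, the division of a digitized voice stream into 20 millisecond packets can be sketched as follows. The 8000 Hz sample rate, the 16-bit sample format, and the six-byte header (sequence number plus sample timestamp) are assumptions made for this example and are not taken from any particular protocol. The codec step is omitted, so the payload is raw sample data rather than encoded speech, and no trailer is generated.

```python
import struct

SAMPLE_RATE_HZ = 8000   # assumed narrowband telephony sample rate
FRAME_MS = 20           # packet interval mentioned in the text
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 160 samples

def packetize(samples):
    """Split 16-bit samples into packets, each carrying one 20 ms frame.

    The header layout (2-byte sequence number, 4-byte timestamp in
    samples) is hypothetical and used only for illustration.
    """
    packets = []
    for seq, start in enumerate(range(0, len(samples), SAMPLES_PER_FRAME)):
        frame = samples[start:start + SAMPLES_PER_FRAME]
        header = struct.pack("!HI", seq & 0xFFFF, start)
        payload = struct.pack(f"!{len(frame)}h", *frame)
        packets.append(header + payload)
    return packets

def parse_header(packet):
    """Return (sequence number, timestamp) from the hypothetical header."""
    return struct.unpack("!HI", packet[:6])
```

Under these assumptions, 50 milliseconds of audio (400 samples) yields three packets, the last of which carries a partial frame.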
In voice communications, high end-to-end voice quality in packet transmission depends principally on the speech codec used, the end-to-end delay across the network, the variation in that delay (jitter), and packet loss across the channel. To prevent excessive voice quality degradation from transcoding (conversion of encoded speech from one codec to another), it is necessary to control whether and where transcodings occur and what combinations of codecs are used. End-to-end delays on the order of milliseconds can have a dramatic impact on voice quality. When end-to-end delay exceeds about 150 to 200 milliseconds one way, voice quality is noticeably impaired. Voice packets can take any of a large number of routes to a given destination and can arrive at different times, with some arriving too late for use by the receiver. Some packets can be discarded by network components, such as routers, due to network congestion. When an audio packet is lost, one or more frames are lost too, with a concomitant loss in voice quality.
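As a rough illustration of the delay and jitter quantities discussed above, the following sketch computes per-packet one-way delay, a simple jitter figure, and the set of packets whose delay exceeds a limit. The function name, the use of the mean absolute difference between consecutive delays as the jitter measure, and the 150 millisecond default limit (the lower end of the range given above) are assumptions of this example. Synchronized sender and receiver clocks are also assumed.

```python
def delay_stats(send_ms, recv_ms, max_delay_ms=150):
    """Return (delays, jitter, late_indices) for matched packet timestamps.

    delays       -- one-way delay of each packet in milliseconds
    jitter       -- mean absolute difference between consecutive delays
                    (a simple measure chosen for illustration)
    late_indices -- indices of packets whose delay exceeds max_delay_ms
    """
    delays = [r - s for s, r in zip(send_ms, recv_ms)]
    diffs = [abs(b - a) for a, b in zip(delays, delays[1:])]
    jitter = sum(diffs) / len(diffs) if diffs else 0.0
    late = [i for i, d in enumerate(delays) if d > max_delay_ms]
    return delays, jitter, late

# Four packets sent 20 ms apart; the third is held up in the network.
delays, jitter, late = delay_stats([0, 20, 40, 60], [50, 75, 200, 112])
```

In this example the third packet arrives after the fourth, and its 160 millisecond delay places it beyond the assumed limit.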
Conventional VoIP architectures have developed techniques to resolve network congestion and relieve the above issues. In one technique, voice activity detection (VAD) or silence suppression is employed to detect the absence of audio (or detect the presence of audio) and conserve bandwidth by preventing the transmission of “silent” packets over the network. Most conversations include about 50% silence. When only silence is detected for a specified amount of time, VAD informs the Packet Voice Protocol and prevents the encoder output from being transported across the network. VAD is, however, unreliable, and the sensitivity of many VAD algorithms is imperfect. Compounding these problems, VAD has only a binary output (namely silence or no silence) and in borderline cases must decide whether to drop or send the packet. When the “silence” threshold is set too low, VAD is rendered meaningless, and when it is set too high, audio information can be erroneously classified as “silence” and lost to the listener. The loss of audio information can cause the audio to be choppy or clipped. In another technique, a receive buffer is maintained at the receiving node to provide additional time for late and out-of-order packets to arrive. Typically, the buffer has a capacity of around 150 milliseconds. Most but not all packets will arrive before the time slot for the packet to be played is reached. The receive buffer can be filled to capacity, at which point packets may be dropped. In extreme cases, substantial, consecutive parts of the audio stream are lost due to the limited capacity of the receive buffer, leading to severe reductions in voice quality.
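The binary nature of the silence-suppression decision described above can be illustrated with a minimal energy-based sketch. The use of root-mean-square frame energy as the detection feature, the function names, and the `hangover_frames` parameter (standing in for the “specified amount of time”) are assumptions of this example. Practical VAD algorithms are generally more elaborate.

```python
import math

def frame_rms(frame):
    """Root-mean-square amplitude of one frame of samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def vad_decisions(frames, threshold, hangover_frames=1):
    """Return one boolean per frame: True = send, False = suppress.

    A frame is "silent" when its RMS falls below threshold. Sending
    stops only after more than hangover_frames consecutive silent
    frames, and resumes on the first non-silent frame.
    """
    decisions = []
    silent_run = 0
    for frame in frames:
        if frame_rms(frame) >= threshold:
            silent_run = 0
        else:
            silent_run += 1
        decisions.append(silent_run <= hangover_frames)
    return decisions

loud = [1000] * 160   # clearly audible speech
soft = [150] * 160    # quiet speech (the borderline case)
quiet = [10] * 160    # background noise
frames = [loud, quiet, quiet, quiet, soft, loud]

well_set = vad_decisions(frames, threshold=100)
too_high = vad_decisions(frames, threshold=500)
too_low = vad_decisions(frames, threshold=5)
```

With the threshold at 100, only the trailing background-noise frames are suppressed. At 500 the quiet speech frame is also dropped, which corresponds to the clipping described above, and at 5 nothing is suppressed, so no bandwidth is saved.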
Although packet loss concealment algorithms at the receiver can reconstruct missing packets, the reconstruction is based on the contents of one or more temporally adjacent packets, which can be acoustically dissimilar to the missing packet(s), particularly when several consecutive packets are lost. The reconstructed packet(s) can therefore bear very little relation to the contents of the missing packet(s).
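The limitation described above can be illustrated with frame repetition, one of the simplest concealment strategies, in which each lost frame is replaced by a copy of the most recently received frame. The text does not identify a specific concealment algorithm, so the choice of frame repetition, the function names, and the toy two-sample frames below are assumptions of this example.

```python
def conceal(received, frame_len):
    """Replace lost frames (None) with a copy of the last good frame.

    If no frame has been received yet, a lost frame is replaced by
    silence (zeros) of length frame_len.
    """
    out = []
    last_good = None
    for frame in received:
        if frame is None:
            frame = list(last_good) if last_good is not None else [0] * frame_len
        else:
            last_good = frame
        out.append(frame)
    return out

def frame_error(original, reconstructed):
    """Mean absolute sample difference between two equal-length frames."""
    return sum(abs(a - b) for a, b in zip(original, reconstructed)) / len(original)

# Toy signal whose content changes while two consecutive frames are lost.
original = [[1, 1], [2, 2], [9, 9], [20, 20], [3, 3]]
received = [[1, 1], [2, 2], None, None, [3, 3]]

repaired = conceal(received, frame_len=2)
errors = [frame_error(o, r) for o, r in zip(original, repaired)]
```

In this toy case the error of the reconstructed frames grows from 7 for the first lost frame to 18 for the second, because both are copies of a frame that precedes the change in the signal.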