In audio coding (sometimes called “audio compression”), a coder encodes an input audio signal into a compressed digital bit stream for transmission or storage, and a decoder decodes the transmitted or stored bit stream into an output audio signal. The combination of the coder and the decoder is called a codec. The input audio signal is typically partitioned into segments called “frames” and the coder encodes each frame to produce a compressed bit stream that represents the frame. As used herein, the term “frame” may alternately be used to refer to a segment of the input audio signal or the compressed bit stream that represents such a segment.
In a voice over packet network, such as a Voice over Internet Protocol (VoIP) network, frames of encoded voice signals must be encapsulated within the payload of one or more packets prior to transmission. Most conventional speech coders that packetize encoded voice signals do not allow a single frame to be split across multiple packets. In fact, the well-known Real-time Transport Protocol (RTP) standard (an Internet Engineering Task Force (IETF) standard that defines a protocol for delivering audio and video over the Internet) specifically discourages the splitting of frames across multiple packets. This is because most speech decoders require an entire frame of encoded voice data to be present to successfully perform a decoding operation. Thus, if a frame were split across multiple packets and one of those packets were lost during transmission (or delayed long enough to be deemed lost), most conventional decoders would be unable to use the remaining packets even if they were received successfully. Allowing frames to be split across multiple packets therefore has the effect of increasing the effective packet loss rate of a communication system.
A fundamental deficiency of voice over packet networks is that the end-to-end delay or latency associated with a telephone call is inevitably higher than that of conventional circuit-switched networks. In part, this is because a circuit-switched network can perform sample-by-sample transmission of voice signals. That is to say, in a circuit-switched network, each sample of input speech is encoded into a small number of bits (e.g., 8 bits) using a technique such as pulse code modulation (PCM) and then the bits are immediately transmitted over the network. In contrast, as described above, in a voice over packet network, at least one entire frame of encoded voice signals must be collected and packetized before transmission can occur. For example, a coder in a voice over packet network that encodes 8 kHz-sampled speech at a bit rate of 16 kilobits/second (kbit/s) with a 20 millisecond (ms) frame size must collect and packetize at least 40 bytes of encoded data before transmission can occur.
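The 40-byte figure in the example above follows directly from the bit rate and the frame size. The following is a minimal sketch of that arithmetic; the function names are illustrative only and do not correspond to any codec API.

```python
def frame_payload_bytes(bit_rate_bps: int, frame_ms: int) -> int:
    """Bytes of encoded data produced for one frame (rounded up to whole bytes)."""
    bits_times_1000 = bit_rate_bps * frame_ms      # bits per frame, scaled by 1000
    return -(-bits_times_1000 // 8000)             # ceiling division by (8 * 1000)


def samples_per_frame(sample_rate_hz: int, frame_ms: int) -> int:
    """Number of input samples that must be collected before a frame can be encoded."""
    return sample_rate_hz * frame_ms // 1000


# Example from the text: 8 kHz-sampled speech, 16 kbit/s, 20 ms frames.
payload = frame_payload_bytes(16_000, 20)
samples = samples_per_frame(8_000, 20)
print(f"{samples} samples are buffered, yielding {payload} bytes per frame")
```

Under these assumptions, the coder cannot emit anything until a full 20 ms of speech has been buffered, which is the source of the added delay relative to sample-by-sample transmission.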
Achieving low end-to-end delay is important for two-way communications because if the delay becomes too long, call quality will suffer. For example, any acoustic or electric echo associated with an end-to-end connection will become more noticeable as delay increases. This is because the longer the echo is delayed, the more easily the ear can detect it. In order to address this problem, echo cancellers that are capable of providing increased echo attenuation are typically used. This, in turn, increases the cost and complexity of the telephony devices used for voice communication. Significant delays, such as delays of 150 ms or longer, can cause real problems in terms of interaction between participants in a phone conversation, causing each participant to talk over the other and to miss what the other participant is saying.
As noted above, a coder in a voice over packet network must accumulate and packetize at least one frame's worth of encoded voice signals prior to transmission. Most conventional low bit-rate codecs (i.e., codecs that operate at a rate of 2 bits per sample or lower) use at least a 10 ms frame size. For example, G.729 codecs use a 10 ms frame size. Many other conventional low bit-rate codecs use a frame size as large as 20 ms or 30 ms.
One way of reducing the delay associated with voice over packet communication is to reduce the frame size, thereby decreasing the amount of encoded data that must be accumulated and packetized prior to transmission. BroadVoice™ is a speech codec family developed by Broadcom Corporation of Irvine, Calif. for VoIP applications, including Voice over Cable, Voice over DSL, and IP phone applications. The BroadVoice™ codec family contains two codec versions. The narrowband version of BroadVoice™, called BroadVoice16, or BV16 for short, encodes 8 kHz-sampled narrowband speech at a bit rate of 16 kbit/s. The wideband version of BroadVoice™, called BroadVoice32, or BV32, encodes 16 kHz-sampled wideband speech at a bit rate of 32 kbit/s. To minimize the delay in real-time two-way communications, both BV16 and BV32 encode speech with a very small frame size of 5 ms. This allows VoIP systems based on BroadVoice™ to have a very low end-to-end system delay, by using a packet size as small as 5 ms if necessary. For example, by using a 5 ms packet size, a VoIP system based on BV16 can transmit a packet after encoding and packetizing only 10 bytes of data, and a VoIP system based on BV32 can transmit a packet after encoding and packetizing only 20 bytes of data.
However, one drawback associated with using a small frame and packet size for transmitting encoded voice signals is that the packet payload will be relatively small compared with the packet header. Many VoIP networks use a combination of Real-time Transport Protocol (RTP), User Datagram Protocol (UDP) and Internet Protocol (IP) to transport voice packets over the Internet. For RTP/UDP/IPv4, the packet header length typically amounts to 40 bytes, while for RTP/UDP/IPv6, the header length typically amounts to 60 bytes. As discussed above, a system based on BV32 can transmit packets having only a 20-byte payload, while a system based on BV16 can transmit packets having only a 10-byte payload. Thus, a system using RTP/UDP/IP and BV32 with a frame/packet size of 5 ms would transmit packets in which the header is two to three times the size of the payload, and a system using RTP/UDP/IP and BV16 with a frame/packet size of 5 ms would transmit packets in which the header is four to six times the size of the payload. The net effect of transmitting packets having such a disproportionately large header is that the share of the transmitted bits that actually carries encoded speech is substantially decreased. Stated another way, the net effect is that a large amount of transmission bandwidth is “wasted” transporting packet header information rather than encoded speech. This is highly undesirable, particularly when the network is heavily loaded and transmission bandwidth is limited.
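The header-to-payload ratios quoted above can be reproduced with a short calculation. The sketch below assumes minimum-size headers (12 bytes for RTP, 8 bytes for UDP, 20 bytes for IPv4 and 40 bytes for IPv6, with no options or extensions), which sum to the 40-byte and 60-byte figures given in the text; the function names are illustrative.

```python
RTP_HEADER_BYTES = 12                        # fixed RTP header, no CSRC list or extension
UDP_HEADER_BYTES = 8
IP_HEADER_BYTES = {"IPv4": 20, "IPv6": 40}   # base headers without options


def header_bytes(ip_version: str) -> int:
    """Total RTP/UDP/IP header size for one packet."""
    return RTP_HEADER_BYTES + UDP_HEADER_BYTES + IP_HEADER_BYTES[ip_version]


def payload_bytes(codec_bps: int, packet_ms: int) -> int:
    """Encoded speech carried in one packet of the given duration."""
    return codec_bps * packet_ms // 8000


def wire_bit_rate(codec_bps: int, packet_ms: int, ip_version: str) -> int:
    """Bit rate on the network, counting headers as well as payload."""
    total = header_bytes(ip_version) + payload_bytes(codec_bps, packet_ms)
    return total * 8 * 1000 // packet_ms


for name, bps in (("BV16", 16_000), ("BV32", 32_000)):
    for ip in ("IPv4", "IPv6"):
        ratio = header_bytes(ip) / payload_bytes(bps, 5)
        rate = wire_bit_rate(bps, 5, ip)
        print(f"{name} over {ip}: header/payload = {ratio:.0f}, wire rate = {rate} bit/s")
```

With 5 ms packets, for instance, BV16 over IPv4 occupies 80 kbit/s on the network even though only 16 kbit/s of that is encoded speech.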
One way of reducing the overhead of large packet headers is to implement a packet header compression scheme, a variety of which are known in the art. Generally speaking, packet header compression works by suppressing selected packet header fields in a series of packets communicated from a transmitting device to a receiving device. The selected packet header fields are typically non-varying or vary in some predictable way such that they can be reconstructed by the receiving device based on “learned” initial values for those fields. However, it is not always feasible to implement packet header compression in a communication system. For example, because implementing a packet header compression protocol requires special logic to be installed at every communication end-point, it may simply be too expensive or inconvenient to deploy.
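The general idea of suppressing predictable header fields can be illustrated with a toy model. The sketch below is not any particular standardized header compression scheme; the field names, the `HeaderContext` class, and the assumption that the sequence number advances by one and the timestamp by a fixed stride per packet are all hypothetical choices made for illustration.

```python
class HeaderContext:
    """State that both ends hold after one full header has been exchanged."""

    def __init__(self, first_header: dict, ts_stride: int):
        self.last = dict(first_header)   # the "learned" initial field values
        self.ts_stride = ts_stride       # assumed timestamp increment per packet

    def predict(self) -> dict:
        """Header expected next if every field behaves predictably."""
        nxt = dict(self.last)
        nxt["seq"] = (self.last["seq"] + 1) % 2**16
        nxt["ts"] = (self.last["ts"] + self.ts_stride) % 2**32
        return nxt


def compress(ctx: HeaderContext, header: dict) -> dict:
    """Transmitter side: keep only the fields that differ from the prediction."""
    predicted = ctx.predict()
    delta = {k: v for k, v in header.items() if predicted[k] != v}
    ctx.last = dict(header)
    return delta


def decompress(ctx: HeaderContext, delta: dict) -> dict:
    """Receiver side: rebuild the full header from the prediction plus the delta."""
    header = ctx.predict()
    header.update(delta)
    ctx.last = dict(header)
    return header


first = {"src": "10.0.0.1", "dst": "10.0.0.2", "ssrc": 1234, "seq": 7, "ts": 800}
tx, rx = HeaderContext(first, ts_stride=40), HeaderContext(first, ts_stride=40)

second = dict(first, seq=8, ts=840)      # fully predictable: nothing needs sending
sent = compress(tx, second)
restored = decompress(rx, sent)
print(sent, restored == second)
```

The sketch also shows why such schemes need matching logic at both ends: the receiver can only rebuild a header if it holds the same context and prediction rules as the transmitter.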
Another method of reducing the overhead of large packet headers is to place a greater amount of encoded speech in each voice packet when network congestion increases. One example of such a system is described in U.S. Pat. No. 6,421,720, entitled “Codec-Independent Technique for Modulating Bandwidth In Packet Network,” issued to C. W. Fitzgerald. Fitzgerald discloses modifying the amount of encoded speech information in each transmitted packet, using the end-to-end packet delay over the path carrying the voice packets as a measure of network congestion.
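The general notion of packing more frames into each packet as measured delay grows can be sketched as follows. This is not the algorithm of the Fitzgerald patent; the thresholds, the linear mapping, and the function name are hypothetical values chosen only to illustrate the trade-off.

```python
def frames_per_packet(delay_ms: float,
                      low_ms: float = 50.0,     # hypothetical "uncongested" delay
                      high_ms: float = 150.0,   # hypothetical "congested" delay
                      min_frames: int = 1,
                      max_frames: int = 6) -> int:
    """Map a measured end-to-end delay to a number of frames per packet."""
    if delay_ms <= low_ms:
        return min_frames
    if delay_ms >= high_ms:
        return max_frames
    fraction = (delay_ms - low_ms) / (high_ms - low_ms)
    return min_frames + int(fraction * (max_frames - min_frames))


for delay in (20, 80, 120, 200):
    n = frames_per_packet(delay)
    print(f"measured delay {delay} ms -> {n} frame(s) per packet")
```

Placing more frames in each packet amortizes the fixed header over a larger payload, at the cost of a longer packetization delay, since every frame in the packet must be encoded before the packet can be sent.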
Further limitations and disadvantages of conventional and traditional approaches will become apparent to one of skill in the art, through comparison of such systems with the present invention as set forth in the remainder of the present application with reference to the drawings.