Interest in providing real-time speech, or voice, applications in packet-switched communication systems is large and increasing. One of the main driving forces is the more efficient use of the available transmission capacity offered by packet-switched technology as compared to circuit-switched technology. In many existing communication systems that offer both data transmission and voice transmission, such as GSM and UMTS, voice is handled primarily by circuit-switched technology and data by packet-switched technology. A further advantage of using packet-switched technology also for voice applications is the ability to use the same technology for all types of information transmission, and thus obtain a fully integrated system. A major part of the interest has concerned speech transmission over the Internet, often referred to as Voice over IP (VoIP) or Internet telephony. The interest encompasses both the traditional fixed Internet and wireless solutions, for example based on GSM or UMTS. In the following, VoIP is used to exemplify packet-switched speech transmission, and the term should be interpreted as including all types of speech transmission using packet-switched technology.
Voice over IP is governed by a series of open standards, including H.323, SIP (Session Initiation Protocol), and RTP (Real-time Transport Protocol), which are available for controlling voice calls that are transmitted using IP. The RTP standard has been set by the IETF (Internet Engineering Task Force) and is specified in RFC 3550. In a VoIP communication session, at the sending side, an incoming voice signal is sampled, quantized, and digitized in chunks of predetermined size, for example 20 ms, referred to as speech frames. Each speech frame is then encoded by a speech codec into a set of voice parameters. A VoIP packet is formed comprising the voice parameters, an RTP header, a UDP (User Datagram Protocol) header and an IP header. The RTP header comprises a sequence number and a time stamp. The receiving side extracts the RTP packet from the UDP segment, then extracts the voice parameters from the RTP packet. A decoder reconstructs the speech, which is presented to the user on the receiving side.
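As an illustration of the packet structure described above, the following is a minimal sketch of how the 12-byte fixed RTP header of RFC 3550 may be built and parsed. The function names, the payload type value 96 and the SSRC value used in the usage note are assumptions chosen for illustration only. The sketch omits the CSRC list, header extensions and padding.

```python
import struct

RTP_VERSION = 2  # version field value defined by RFC 3550


def pack_rtp_header(payload_type, sequence_number, timestamp, ssrc, marker=False):
    """Build the 12-byte fixed RTP header (no CSRC list, no extension, no padding)."""
    byte0 = RTP_VERSION << 6  # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    # Network byte order: 2 x uint8, uint16 sequence number, uint32 timestamp, uint32 SSRC.
    return struct.pack(
        "!BBHII",
        byte0,
        byte1,
        sequence_number & 0xFFFF,
        timestamp & 0xFFFFFFFF,
        ssrc & 0xFFFFFFFF,
    )


def unpack_rtp_packet(packet):
    """Split an RTP packet into its fixed header fields and the payload."""
    byte0, byte1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "sequence_number": seq,
        "timestamp": ts,
        "ssrc": ssrc,
        "payload": packet[12:],  # the encoded voice parameters
    }
```

For example, a sender could call pack_rtp_header(96, 1, 160, 0x1234) and append the encoded voice parameters of one speech frame. The receiver can then use the sequence number to detect lost or reordered packets and the time stamp to schedule playout.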
As previously mentioned, one of the objectives of VoIP services is the ability to adapt to the transmission capabilities of the link or system. One means to reduce the bit rate is to exploit variable rate coding. This is utilized in GSM, where it is known as DTX (Discontinuous Transmission): when a user is silent, a lower bit rate can be used, and some background noise frames can even be omitted. However, even if the bit rate is low, the transmitted speech parameters must still be packed into an IP/UDP/RTP packet, which adds header overhead. This overhead may be reduced to 3 or 4 bytes using header compression techniques such as ROHC (Robust Header Compression). Lower layers of the IP stack, such as the data link layer and the physical layer, cause additional packetization overhead. In all, although the average source bit rate can be greatly reduced with variable rate coding, the parameters that are produced still need to be transmitted with packetization overhead that is independent of the size of the payload. Hence, VR codecs (variable rate codecs) in VoIP applications often suffer from the problem that the source bit rate reductions they can provide do not translate into corresponding savings in gross transmission rate. This condition is recognised in the art, and some approaches have been reported to address the problem.
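The effect of the fixed per-packet overhead can be illustrated with a short calculation. The sketch below assumes uncompressed IPv4, UDP and RTP headers at their minimum sizes of 20, 8 and 12 bytes, one 20 ms speech frame per packet, and hypothetical payload sizes of 32 bytes for an active speech frame and 7 bytes for a low-rate frame. Lower-layer overhead is ignored, and all names are chosen for illustration only.

```python
IPV4_HEADER_BYTES = 20  # minimum IPv4 header, no options
UDP_HEADER_BYTES = 8
RTP_HEADER_BYTES = 12   # fixed RTP header, no CSRC list
UNCOMPRESSED_HEADER_BYTES = IPV4_HEADER_BYTES + UDP_HEADER_BYTES + RTP_HEADER_BYTES


def gross_bit_rate(payload_bytes, header_bytes, frame_ms=20):
    """Bit rate in bit/s when one frame of payload_bytes is sent every frame_ms."""
    packets_per_second = 1000 / frame_ms
    return (payload_bytes + header_bytes) * 8 * packets_per_second


def relative_saving(high_rate, low_rate):
    """Fraction of high_rate that is saved by switching to low_rate."""
    return (high_rate - low_rate) / high_rate


# Hypothetical payload sizes, assumed for illustration only.
ACTIVE_PAYLOAD_BYTES = 32
LOW_RATE_PAYLOAD_BYTES = 7

source_saving = relative_saving(
    gross_bit_rate(ACTIVE_PAYLOAD_BYTES, 0),
    gross_bit_rate(LOW_RATE_PAYLOAD_BYTES, 0),
)
gross_saving = relative_saving(
    gross_bit_rate(ACTIVE_PAYLOAD_BYTES, UNCOMPRESSED_HEADER_BYTES),
    gross_bit_rate(LOW_RATE_PAYLOAD_BYTES, UNCOMPRESSED_HEADER_BYTES),
)
print(f"source bit rate saving: {source_saving:.0%}")
print(f"gross bit rate saving:  {gross_saving:.0%}")
```

Under these assumptions the source bit rate drops from 12.8 kbit/s to 2.8 kbit/s, whereas the gross bit rate only drops from 28.8 kbit/s to 18.8 kbit/s, because the 40 header bytes are sent with every packet regardless of payload size. Passing a header size of 4 bytes to gross_bit_rate gives a rough picture of the case with header compression.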
The IS-95/3GPP2 codecs TIA/IS-96 and TIA/IS-733 contain a feature called blank and burst, which is applied such that, under external network control, the encoding of a given frame can be skipped to make room for control signaling frames. The encoder memory is set to a known state, and when the decoder detects the blank frame, the decoder memory is set to the same known state.
Ref. [1] to Sannek et al. discloses a method to tag frames that can be covered by error concealment using an ECU (error concealment unit). Frames that can be covered by ECUs are assigned lower priority, such that if congestion occurs in a network, the lower-priority packets are dropped first. A similar approach is tested in ref. [2] to Lara-Barron, but for an embedded DPCM (differential pulse code modulation) codec, where a different encoding is used for lower-priority frames than for normal-priority frames.
The blank and burst feature in IS-96 and IS-733 [3] is controlled externally, which means that it may cause very audible artifacts.
The problem with Sannek's approach in ref. [1] is that the encoder is unaware of the fact that a frame has been dropped. This leads to a state mismatch between encoder and decoder. Therefore, frame dropping in the network must be used conservatively in order not to degrade the quality of the rendered speech too much.
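The state mismatch can be illustrated with a toy differential coder. This is only a sketch under simplifying assumptions and is not the codec or the concealment method of ref. [1]: each frame carries a single difference value, the decoder conceals a dropped frame by repeating its last output, and all names are hypothetical.

```python
def encode(samples):
    """Toy differential encoder: each frame carries the change from the previous sample."""
    previous = 0
    deltas = []
    for sample in samples:
        deltas.append(sample - previous)
        previous = sample  # encoder state advances for every frame it produces
    return deltas


def decode(frames):
    """Toy decoder: None marks a frame dropped in the network."""
    state = 0
    output = []
    for delta in frames:
        if delta is not None:
            state += delta
        # For a dropped frame the last output is repeated (simple concealment),
        # so the decoder state does not follow the encoder state.
        output.append(state)
    return output


samples = [10, 12, 15, 15, 14]
frames = encode(samples)

received = list(frames)
received[2] = None  # the network drops the third frame without informing the encoder

decoded = decode(received)
errors = [s - d for s, d in zip(samples, decoded)]
print(decoded)
print(errors)
```

In this toy case the error introduced by the dropped frame does not vanish after the drop: every later output is offset by the same amount, because the encoder kept advancing its state while the decoder did not.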
The problem with Lara-Barron's approach in ref. [2] is that bandwidth is not saved and the packet rate is only marginally reduced.