FIG. 1 depicts an example of a conversation session between two terminals 1, 2 using VoIP. In this non-limiting example the terminals 1, 2 are communicating with each other via a wireless communication network 3 and the internet 4. The communication is based on packet transmission using a real-time protocol such as RTP. The RTP packets are encapsulated in packets of a lower layer protocol, such as Internet Protocol (IP). A packet data protocol (PDP) context is created for the VoIP session. The wireless communication network reserves some network resources for the PDP context. These network resources are called radio bearers in 3rd generation wireless communication systems. During the conversation, audio information such as speech is converted into digital form in the terminals 1, 2. The digital data is then encapsulated to form packets which can be transmitted via the networks to the terminal on the other side of the connection. That terminal receives the packets and performs the necessary steps to recover the audio information.
In the following, it is assumed that the real-time protocol (RTP) and real-time control protocol (RTCP) traffic are carried in the same PDP context and radio bearer.
The real-time transport protocol (RTP) provides end-to-end delivery services for data with real-time characteristics, such as interactive audio and video. Those services include payload type identification, sequence numbering, timestamping and delivery monitoring. Applications typically run RTP on top of UDP to make use of its multiplexing and checksum services; both protocols contribute parts of the transport protocol functionality. However, RTP may be used with other suitable underlying network or transport protocols. RTP supports data transfer to multiple destinations using multicast distribution if provided by the underlying network.
The audio conferencing application used by each conference participant sends audio data in small chunks of, for example, 20 ms duration. Each chunk of audio data is preceded by an RTP header; RTP header and data are in turn contained in a UDP packet. The RTP header indicates what type of audio encoding (such as AMR, AMR-WB, PCM, ADPCM or LPC) is contained in each packet so that senders can change the encoding during a conference, for example, to accommodate a new participant that is connected through a low-bandwidth link or react to indications of network congestion.
If both audio and video media are used in a conference, they are normally transmitted as separate RTP sessions. That is, separate RTP and RTCP packets are transmitted for each medium using two different UDP port pairs and/or multicast addresses. There is no direct coupling at the RTP level between the audio and video sessions, except that a user participating in both sessions should use the same distinguished (canonical) name in the RTCP packets for both so that the sessions can be associated.
One motivation for this separation is to allow some participants in the conference to receive only one medium if they choose. Despite the separation, synchronized playback of a source's audio and video can be achieved using timing information carried in the RTCP packets for both sessions.
RTP packet is a data packet consisting of the fixed RTP header, a possibly empty list of contributing sources, and the payload data. RTP payload is the data transported by RTP in a packet, for example audio samples or compressed video data. Some underlying protocols may require an encapsulation of the RTP packet to be defined. Typically one packet of the underlying protocol contains a single RTP packet, but several RTP packets may be contained if permitted by the encapsulation method.
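As an illustration of the packet layout described above, the following minimal sketch splits a received RTP packet into its fixed header, its list of contributing sources, and its payload. The 12-byte fixed header layout is the one defined for RTP in RFC 3550; the names `RtpPacket` and `parse_rtp` are hypothetical, and the sketch deliberately does not strip header extensions or padding.

```python
import struct
from typing import NamedTuple, Tuple


class RtpPacket(NamedTuple):
    version: int
    padding: bool
    extension: bool
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int
    csrc: Tuple[int, ...]
    payload: bytes


def parse_rtp(packet: bytes) -> RtpPacket:
    """Split an RTP packet into fixed header fields, CSRC list and payload.

    Illustrative only: a header extension or trailing padding, if present,
    is left inside `payload`.
    """
    if len(packet) < 12:
        raise ValueError("shorter than the 12-byte fixed RTP header")
    # Fixed header: two flag bytes, sequence number, timestamp, SSRC.
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    csrc_count = b0 & 0x0F  # number of contributing sources (may be 0)
    header_len = 12 + 4 * csrc_count
    if len(packet) < header_len:
        raise ValueError("truncated CSRC list")
    csrc = struct.unpack("!%dI" % csrc_count, packet[12:header_len])
    return RtpPacket(
        version=b0 >> 6,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,  # identifies the encoding carried
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
        csrc=csrc,
        payload=packet[header_len:],
    )
```

The payload type value used by a given session for AMR or AMR-WB is negotiated out of band, so a receiver would compare `payload_type` against the negotiated value rather than a fixed constant.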
The RTP control protocol (RTCP) is based on the periodic transmission of control packets to all participants in the session, using the same distribution mechanism as the data packets. The underlying protocol should normally provide multiplexing of the data and control packets, for example using separate port numbers with UDP. RTCP performs four functions:
1. The primary function is to provide feedback on the quality of the data distribution. This is an integral part of the RTP's role as a transport protocol and is related to the flow and congestion control functions of other transport protocols. The feedback may be directly useful for control of adaptive encodings, but experiments with IP multicasting have shown that it is also critical to get feedback from the receivers to diagnose faults in the distribution. Sending reception feedback reports to all participants allows one who is observing problems to evaluate whether those problems are local or global. With a distribution mechanism like IP multicast, it is also possible for an entity such as a network service provider who is not otherwise involved in the session to receive the feedback information and act as a third-party monitor to diagnose network problems. This feedback function is performed by the RTCP sender and receiver reports.
2. RTCP carries a persistent transport-level identifier for an RTP source called the canonical name or CNAME. Since the SSRC identifier may change if a conflict is discovered or a program is restarted, receivers require the CNAME to keep track of each participant. Receivers may also require the CNAME to associate multiple data streams from a given participant in a set of related RTP sessions, for example to synchronize audio and video. Inter-media synchronization also requires the NTP and RTP timestamps included in RTCP packets by data senders.
3. The first two functions require that all participants send RTCP packets, therefore the rate must be controlled in order for RTP to scale up to a large number of participants. By having each participant send its control packets to all the others, each can independently observe the number of participants. This number is used to calculate the rate at which the packets are sent.
4. A fourth, optional function is to convey minimal session control information, for example participant identification to be displayed in a user interface of a terminal. This is most likely to be useful in loosely controlled sessions where participants enter and leave without membership control or parameter negotiation. RTCP serves as a convenient channel to reach all the participants, but it is not necessarily expected to support all the control communication requirements of an application. A higher-level session control protocol, which is beyond the scope of this document, may be needed.
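The rate control mentioned in the third function can be illustrated with a simplified calculation. The sketch below is not the full algorithm of RFC 3550: it omits the randomization of the interval and the separate bandwidth shares for senders and receivers. The default values (5 % of the session bandwidth for RTCP, a minimum interval of 5 seconds) are the recommended defaults of RFC 3550 and are assumptions here, as is the function name `rtcp_interval_s`.

```python
def rtcp_interval_s(members, avg_rtcp_size_bytes, session_bw_bps,
                    rtcp_fraction=0.05, min_interval_s=5.0):
    """Simplified deterministic RTCP reporting interval in seconds.

    Assumptions: all members share one RTCP bandwidth pool, and no
    randomization is applied to the result.
    """
    # Bandwidth reserved for control traffic, in bytes per second.
    rtcp_bw_bytes_per_s = session_bw_bps * rtcp_fraction / 8.0
    # Each member may send one average-sized packet per interval.
    interval = members * avg_rtcp_size_bytes / rtcp_bw_bytes_per_s
    return max(min_interval_s, interval)


# Two-party call at 12.2 kbps: the minimum interval applies.
print(rtcp_interval_s(2, 100, 12200))
# Same bandwidth with 100 observed participants: reports become rarer.
print(rtcp_interval_s(100, 100, 12200))
```

The point of the example is only that the interval grows with the observed number of participants, so that the aggregate control traffic stays a bounded share of the session bandwidth.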
An RTCP packet is a control packet consisting of a fixed header part similar to that of RTP data packets, followed by structured elements that vary depending upon the RTCP packet type. Typically, multiple RTCP packets are sent together as a compound RTCP packet in a single packet of the underlying protocol; this is enabled by the length field in the fixed header of each RTCP packet.
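The role of the length field can be shown with a short sketch that splits a compound RTCP packet into its individual RTCP packets. It relies on the RTCP header layout of RFC 3550, in which the 16-bit length field gives the packet size in 32-bit words minus one, header included. The function name `split_compound_rtcp` is hypothetical, and the sketch performs no validation beyond length checks.

```python
import struct


def split_compound_rtcp(datagram: bytes):
    """Return a list of (packet_type, raw_bytes) for each RTCP packet."""
    packets = []
    offset = 0
    while offset < len(datagram):
        if len(datagram) - offset < 4:
            raise ValueError("truncated RTCP header")
        # First byte carries version/padding/count; it is not needed here.
        _flags, packet_type, length_words = struct.unpack_from(
            "!BBH", datagram, offset)
        # Length is expressed in 32-bit words minus one.
        size = 4 * (length_words + 1)
        if offset + size > len(datagram):
            raise ValueError("RTCP length field exceeds datagram")
        packets.append((packet_type, datagram[offset:offset + size]))
        offset += size
    return packets
```

For example, a datagram holding an empty receiver report (packet type 201) followed by a source description packet (packet type 202) would be returned as two entries, each sized according to its own length field.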
Next, some information about speech codecs for VoIP services will be provided. Two types of speech codecs are basically in use for VoIP services in 3GPP networks: the Adaptive Multi-Rate (AMR) codec and the Adaptive Multi-Rate Wideband (AMR-WB) codec.
The Adaptive Multi-Rate (AMR) Speech Codec was originally developed and standardized by the European Telecommunications Standards Institute (ETSI) for GSM cellular systems. It is now chosen by the Third Generation Partnership Project (3GPP) as the mandatory codec for third generation (3G) cellular systems. The AMR codec is a multi-mode codec that supports 8 narrow band speech encoding modes with bit rates between 4.75 and 12.2 kbps. The sampling frequency used in AMR is 8000 Hz and the speech encoding is performed on 20 ms speech frames. Therefore, each encoded AMR speech frame represents 160 samples of the original speech. Among the 8 AMR encoding modes, three are already separately adopted as standards of their own. Particularly, the 6.7 kbps mode is adopted as PDC-EFR, the 7.4 kbps mode as IS-641 codec in TDMA, and the 12.2 kbps mode as GSM-EFR.
For AMR the maximum RTP payload size (encapsulating 1 speech frame encoded at 12.2 kbps into one RTP packet) is 35 bytes. When adding RTP/UDP/IPv4 or RTP/UDP/IPv6 (no IPv6 header extensions assumed) header, the maximum SDU sizes are respectively 75 bytes and 95 bytes.
When Robust Header Compression (ROHC) is used the maximum SDU sizes for AMR are 41 bytes and 42 bytes, respectively, when adding compressed RTP/UDP/IPv4 or RTP/UDP/IPv6 headers. For AMR-WB the maximum SDU sizes are 70 bytes and 71 bytes, respectively, when adding compressed RTP/UDP/IPv4 or RTP/UDP/IPv6 headers.
The Adaptive Multi-Rate Wideband (AMR-WB) speech codec was originally developed by 3GPP to be used in GSM and 3G cellular systems. Similar to AMR, the AMR-WB codec is also a multi-mode speech codec. AMR-WB supports 9 wide band speech coding modes with respective bit rates ranging from 6.6 to 23.85 kbps. The sampling frequency used in AMR-WB is 16000 Hz and the speech processing is performed on 20 ms frames. This means that each AMR-WB encoded frame represents 320 speech samples.
For AMR-WB the maximum RTP payload size (encapsulating 1 speech frame encoded at 23.85 kbps into one RTP packet) is 64 bytes. When adding RTP/UDP/IPv4 or RTP/UDP/IPv6 (no IPv6 header extensions assumed) header, the maximum SDU sizes are respectively 104 bytes and 124 bytes.
124 bytes is therefore the maximum packet size for speech traffic.
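The uncompressed SDU sizes quoted above follow from adding the standard header lengths to the maximum RTP payload size. The sketch below reproduces that arithmetic; the header lengths (12 bytes RTP without contributing sources, 8 bytes UDP, 20 bytes IPv4 without options, 40 bytes IPv6 without extension headers) are the standard minimum sizes, and the names used are illustrative only.

```python
RTP_HEADER = 12   # fixed RTP header, no CSRC entries
UDP_HEADER = 8
IPV4_HEADER = 20  # no IPv4 options
IPV6_HEADER = 40  # no IPv6 extension headers

# Maximum RTP payload sizes stated in the text, in bytes.
MAX_RTP_PAYLOAD = {"AMR": 35, "AMR-WB": 64}


def max_sdu_size(codec, ip_header):
    """Maximum uncompressed SDU size in bytes for one speech frame."""
    return MAX_RTP_PAYLOAD[codec] + RTP_HEADER + UDP_HEADER + ip_header


for codec in ("AMR", "AMR-WB"):
    print(codec,
          max_sdu_size(codec, IPV4_HEADER),
          max_sdu_size(codec, IPV6_HEADER))
```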
The multi-rate encoding (i.e., multi-mode) capability of AMR and AMR-WB is designed for preserving high speech quality under a wide range of transmission conditions. With AMR or AMR-WB, mobile radio systems are able to use the available bandwidth as effectively as possible. E.g., in GSM it is possible to dynamically adjust the speech encoding rate during a session so as to continuously adapt to the varying transmission conditions by dividing the fixed overall bandwidth between speech data and error protective coding to enable the best possible trade-off between speech compression rate and error tolerance. To perform mode adaptation, the decoder (speech receiver) needs to signal to the encoder (speech sender) the new mode it prefers. This mode change signal is called Codec Mode Request or CMR.
Since in most sessions speech is sent in both directions between the two ends, the mode requests from the decoder at one end to the encoder at the other end are piggy-backed over the speech frames in the reverse direction. In other words, there is no out-of-band signaling needed for sending CMRs.
Every AMR or AMR-WB codec implementation is required to support all the respective speech coding modes defined by the codec and must be able to handle mode switching to any of the modes at any time. However, some transport systems may impose limitations in the number of modes supported and how often the mode can change due to bandwidth limitations or other constraints. For this reason, the decoder is allowed to indicate its acceptance of a particular mode or a subset of the defined modes for the session using out-of-band means.
For example, the GSM radio link can only use a subset of at most four different modes in a given session. This subset can be any combination of the 8 AMR modes for an AMR session or any combination of the 9 AMR-WB modes for an AMR-WB session.
Moreover, for better interoperability with GSM through a gateway, the decoder is allowed to use out-of-band means to set the minimum number of frames between two mode changes and to limit the mode change among neighbouring modes only.
Both the above described codecs support voice activity detection (VAD) and generation of comfort noise (CN) parameters during silence periods. Hence, the codecs have the option to reduce the number of transmitted bits and packets during silence periods to a minimum. The operation of sending CN parameters at regular intervals during silence periods is usually called discontinuous transmission (DTX) or source controlled rate (SCR) operation. The AMR or AMR-WB frames containing CN parameters are called Silence Indicator (SID) frames.
The term silence does not necessarily mean absolute silence but it is a situation in which the level of voice falls so low that the voice activity detection fails, i.e. the codec determines that there is no speech to encode.
The Internet, like other packet networks, occasionally loses and reorders packets and delays them by variable amounts of time. To cope with these impairments, the RTP header contains timing information and a sequence number that allow the receivers to reconstruct the timing produced by the source, so that, for example, chunks of audio are contiguously played out the speaker every 20 ms. This timing reconstruction is performed separately for each source of RTP packets in the conference. The sequence number can also be used by the receiver to estimate how many packets are being lost.
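One possible way for a receiver to estimate loss from the sequence numbers is sketched below. It is a simplified illustration rather than the exact bookkeeping of any particular implementation: it extends the 16-bit RTP sequence numbers across wrap-around, tolerates reordering and duplicates, and reports the gap between the number of packets expected and the number actually seen. The function name `estimate_lost_packets` is hypothetical.

```python
def estimate_lost_packets(arrivals):
    """Estimate lost packets from 16-bit sequence numbers in arrival order."""
    if not arrivals:
        return 0
    highest = arrivals[0]
    received = {highest}
    for seq in arrivals[1:]:
        # Place the 16-bit value in the same wrap-around cycle as `highest`,
        # then move it one cycle if that brings it closer.
        ext = (highest & ~0xFFFF) | seq
        if ext - highest > 0x8000:
            ext -= 0x10000
        elif highest - ext > 0x8000:
            ext += 0x10000
        received.add(ext)  # the set ignores duplicates
        highest = max(highest, ext)
    expected = max(received) - min(received) + 1
    return expected - len(received)
```

With the arrival sequence 65533, 65534, 0, 1, 3 the sketch reports two lost packets (65535 and 2), since the counter wraps from 65535 to 0 between the second and third arrival.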
The AMR and AMR-WB payload format supports several means, including forward error correction (FEC) and frame interleaving, to increase robustness against packet loss.
The simple scheme of repetition of previously sent data is one way of achieving FEC. Another possible scheme which is more bandwidth efficient is to use payload external FEC, e.g., RFC2733, which generates extra packets containing repair data. The whole payload can also be sorted in sensitivity order to support external FEC schemes using UEP.
With AMR or AMR-WB, it is possible to use the multi-rate capability of the codec to send redundant copies of the same mode or of another mode, e.g., one with lower-bandwidth.
AMR or AMR-WB Speech Over IP
A conversational service puts requirements on the payload format. Low delay is one very important factor, i.e., few speech frame-blocks per payload packet. Low overhead is also required when the payload format traverses low bandwidth links, especially as the frequency of packets will be high. For low bandwidth links it is also an advantage to support UED which allows a link provider to reduce delay and packet loss or to reduce the utilization of link resources.
A Streaming service has less strict real-time requirements and therefore can use a larger number of frame-blocks per packet than conversational service. This reduces the overhead from IP, UDP, and RTP headers. However, including several frame-blocks per packet makes the transmission more vulnerable to packet loss, so interleaving may be used to reduce the effect packet loss will have on speech quality. A streaming server handling a large number of clients also needs a payload format that requires as few resources as possible when doing packetization. The octet-aligned and interleaving modes require the least amount of resources, while CRC, robust sorting, and bandwidth efficient modes have higher demands.
Another scenario occurs when AMR or AMR-WB encoded speech will be transmitted from a non-IP system (e.g., a GSM or a circuit switched 3GPP network) to an IP/UDP/RTP VoIP terminal, and/or vice versa.
In such a case, it is likely that the AMR or AMR-WB frame is packetized in a different way in the non-IP network and will need to be re-packetized into RTP at the gateway. Also, speech frames from the non-IP network may come with some UEP/UED information (e.g., a frame quality indicator) that will need to be preserved and forwarded on to the decoder along with the speech bits.
A third likely scenario is that IP/UDP/RTP is used as transport between two non-IP systems, i.e., IP is originated and terminated in gateways on both sides of the IP transport.
AMR and AMR-WB RTP Payload Formats
The AMR and AMR-WB payload formats have identical structure, so they are specified together. The only differences are in the types of codec frames contained in the payload. The payload format consists of the RTP header, payload header and payload data.
The duration of one speech frame-block is 20 ms for both AMR and AMR-WB. For AMR, the sampling frequency is 8 kHz, corresponding to 160 encoded speech samples per frame from each channel. For AMR-WB, the sampling frequency is 16 kHz, corresponding to 320 samples per frame from each channel. Thus, the timestamp is increased by 160 for AMR and 320 for AMR-WB for each consecutive frame-block.
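The timestamp increments stated above can be derived directly from the sampling frequency and the 20 ms frame duration. The following sketch shows the calculation and generates the timestamps of consecutive frame-blocks; the modulo operation reflects the 32-bit width of the RTP timestamp field, and the function names are illustrative only.

```python
def timestamp_increment(sampling_rate_hz, frame_ms=20):
    """Number of samples, and thus timestamp units, per frame-block."""
    return sampling_rate_hz * frame_ms // 1000


def frame_timestamps(first_timestamp, sampling_rate_hz, count):
    """RTP timestamps of `count` consecutive frame-blocks."""
    step = timestamp_increment(sampling_rate_hz)
    # The RTP timestamp is a 32-bit field and wraps around.
    return [(first_timestamp + i * step) % 2**32 for i in range(count)]


print(timestamp_increment(8000))    # AMR
print(timestamp_increment(16000))   # AMR-WB
print(frame_timestamps(0, 16000, 4))
```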
Payload Structure
The complete payload consists of a payload header, a payload table of contents, and speech data representing one or more speech frame-blocks.
Transmission of RTP and RTCP Packets
The basic problem in VoIMS is given by the uncontrolled nature of the RTCP traffic, and its possible impact on the RTP traffic, which carries voice data. FIG. 2 shows the situation. In this situation, RTP/UDP/IPv6 headers of the RTP packets are compressed using ROHC RTP/UDP/IP profile, and the UDP/IPv6 headers of the RTCP packets using ROHC UDP/IP profile.
FIG. 2 shows that, normally, RTCP packets are much longer than RTP packets. Every RTP packet is sent during one 20 ms Transmission Time Interval (TTI), whereas the transmission of one RTCP packet covers multiple transmission time intervals. Since the transmission of RTP and RTCP occurs on the same radio bearer, RTCP packets may cause RTP packets to be delayed or even lost (depending on the RLC discard timer). Ultimately, this impairs the perceived speech quality.
In the above described example it is assumed that the bearer is dimensioned for (maximum) 12.2 kbps AMR mode (RTP payload 32 bytes), so that there is room for ROHC First Order (FO) header and PDCP header, together max. 9 bytes.
It is noted here that the maximum size of the FO header depends on the ROHC implementation. Also, occasional ROHC feedback headers may increase the size of the ROHC header. The dimensioning of the bearer may be somewhat higher or lower, depending on the assumed ROHC header size and depending on the allowed delay.
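The effect of this dimensioning can be illustrated with a rough calculation of how long one RTCP packet occupies the bearer. The sketch assumes that the bearer carries exactly 32 + 9 = 41 bytes per 20 ms TTI, that a larger packet is simply segmented over consecutive TTIs, and that no further radio layer overhead applies. The RTCP packet size of 120 bytes used in the example is a hypothetical value, not a figure from the text.

```python
import math

TTI_MS = 20
# Bearer capacity per TTI: 32-byte RTP payload plus at most 9 bytes
# of ROHC FO and PDCP headers, as assumed in the example above.
BEARER_BYTES_PER_TTI = 32 + 9


def ttis_needed(packet_bytes, bytes_per_tti=BEARER_BYTES_PER_TTI):
    """Number of TTIs needed to transmit one packet on the bearer."""
    return math.ceil(packet_bytes / bytes_per_tti)


def bearer_busy_ms(packet_bytes):
    """Time during which the packet occupies the bearer, in milliseconds."""
    return ttis_needed(packet_bytes) * TTI_MS


# Hypothetical 120-byte RTCP packet.
print(ttis_needed(120), bearer_busy_ms(120))
```

Under these assumptions a 120-byte RTCP packet occupies three TTIs, i.e. 60 ms during which the speech frames produced every 20 ms must wait or be discarded.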
The example presented the case in UTRAN with usage of Robust Header Compression. The same conclusions can be drawn without usage of ROHC. A similar situation holds also in GERAN networks: instead of the TTI concept, there is a fixed number of time slots reserved for the transmission of the header compressed RTP packet once in 20 ms (e.g., one time slot in each of the consecutive 4 or 5 TDMA frames of 4.615 ms duration).
Several alternatives have been considered recently to overcome the problem presented in the previous section. One proposed alternative suggests removing RTCP for VoIMS application contexts, i.e. RTCP packets are not transmitted at all in the VoIMS application context. Another proposed alternative suggests that RTP packets and RTCP packets are carried over separate PDP contexts and radio bearers. Yet another proposed alternative relates to RTP frame stealing. This means that RTCP may be prioritised over RTP. In other words, the RTCP packets have higher transmission priority than RTP packets.
It is recognized that the first two proposed alternatives do not lead to interoperable or efficient solutions. The first alternative, among other things, causes interworking problems when the VoIP endpoints (the terminals at the endpoints of the conversation) are connected to different communication networks, for example, a wireless 3GPP network and a network supporting IETF standards (for example, the internet). The second alternative results in at least an increase in the number of PDP contexts used (at least two for each session) and inaccurate Round Trip Time (RTT) computations. The third alternative leads to an increase in the speech frame error rate (FER), since speech or Silence Indicator (SID) packets are discarded when RTCP is prioritised over RTP.
In addition to the previous alternatives, there are also other proposals, currently discussed in 3GPP RAN WG2 to overcome the problem in UTRAN:
4. Segmentation and concatenation over the radio interface
5. RB/TrCH/PhyCH Reconfiguration
6. Allocation of secondary scrambling code
These methods are primarily for the downlink only, where the number of orthogonal spreading codes is limited. It is assumed that in uplink the bearer can be over-dimensioned. However, problems appear also in uplink, if the RTCP packets are larger than assumed.
Regarding the fourth method, reference [10] (the full citations for this reference and the other prior art references are presented in Appendix A) states that “this mechanism requires delaying of some of the RTP packets for the transmission of the RTCP packets to be completed.” ... “The net result is the additional delay and the delay variation (jitter) imposed on RTP (voice) packets, which is not desirable.”
According to [10], the drawback of the fifth method is that the “mechanism relies on the radio interface to reconfigure the bearer used for IMS voice to allow higher bandwidth during the transport of RTCP packets. However such reconfiguration could take multiple 100s of milliseconds and such a large amount of delay imposed on voice service is also not desirable.”
The drawback of the sixth method is an increased interference, as mentioned in reference [11]. The interference is dependent on various factors, e.g., the interference may increase drastically when the number of simultaneous connections with a secondary scrambling code gets higher.
Also the possibility to separate RTCP and RTP over different radio bearers (even though they are on the same PDP context and, hence, on the same RAB) has been discussed in reference [12]. The drawback of this solution is that a certain amount of resources needs to be constantly reserved for the two bearers. This amount of resources is typically higher than in the case of one bearer, due to lower multiplexing gain of two separate bearers.
In general, the three above-mentioned radio access level solutions are, at best, only partial solutions: they are specific to UTRAN (i.e. not applicable to GERAN, e.g., the usage of the secondary scrambling code), and/or they are not applicable in legacy networks (e.g. the reconfiguration).
Even if some of the above-mentioned solutions were used, the unpredictable, large size of RTCP packets would in most cases still cause unexpected effects, e.g., loss of RTP packets.