Practically all modern telephony applications use speech compression to increase the efficiency with which the transmission media are used. The functional entity that performs the compression is called a speech codec. Most of the modern speech codecs operate by processing the speech signal in short segments called frames. For instance, all GSM (global system for mobile communications) codecs, including the AMR (adaptive multi-rate) codec, use 20 ms frames.
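As an illustration of frame-based processing, the following minimal sketch splits a stream of speech samples into 20 ms frames. The 8 kHz sampling rate (giving 160 samples per frame) is the usual narrowband telephony rate and is assumed here; the function name `split_into_frames` is hypothetical.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=20):
    """Split a sequence of PCM samples into fixed-length frames."""
    # Samples per frame: 160 for 20 ms at the assumed 8 kHz rate.
    frame_len = sample_rate_hz * frame_ms // 1000
    # Drop any trailing partial frame; a real codec would buffer it instead.
    usable = len(samples) - len(samples) % frame_len
    return [samples[i:i + frame_len] for i in range(0, usable, frame_len)]


# 400 samples yield two complete 160-sample frames; 80 samples are left over.
frames = split_into_frames(list(range(400)))
```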
One well-known property of a telephone link is that it is very sensitive to delay, in particular to the time taken to transmit speech data from sender to receiver and back (the so-called round trip delay). Practical tests have shown that even a relatively short round trip delay (around 400 ms) degrades the interactivity of the discussion, and round trip delays over 800 ms are found to reduce the quality of service (QoS) to an unacceptable level. Therefore, a telephony system should generally be designed so that the maximum round trip delay can be kept below a predetermined threshold, so as to provide predictable and acceptable quality.
Traditional telephony services use the circuit switched (CS) approach. This means that the parties to the connection communicate over a transmission channel that is reserved for the whole duration of the communication. This implies that the data is transmitted over a fixed route, and also the transmission time is fixed and predictable. Therefore, this kind of telephone network can offer reliable service with controlled QoS. An important group of applications employing CS telephone services are some cellular mobile systems, e.g. GSM.
On the other hand, the emergence of the Internet has created a new platform for telephony applications: there are already a number of telephony applications which use packet switched (PS) networks (such as the Internet) to transmit speech data. Most, although not all, PS networks are based on the Internet Protocol (IP), as the Internet itself is, and telephony applications running on this kind of network are referred to as IP telephony or Voice-over-IP (VoIP). The basic idea of a PS network is that the transmitted data is decomposed into small sub-blocks called packets, and the receiving application uses the received packets to recompose the original data. Each packet can be transmitted from source to destination independently of other packets, and it is up to the network to route packets from source to destination. This implies that packets belonging to the same stream may well use different routes to reach the destination. Furthermore, a PS network in general provides only a so-called ‘best effort’ service: the packets are transmitted from source to destination without any guarantees about the QoS. Therefore, it is possible that some of the packets are lost during transmission, and the time required for the transmission from source to destination is in the general case unpredictable. Due to varying load in the network and possibly also to different transmission paths of the packets, the transmission delay can vary from packet to packet within a stream. This variation in transmission time is called jitter. Considering the Internet in general, the transmission delay can vary from a negligible level to even several seconds. The same applies to jitter, and the two are usually related: in many cases a long transmission time also means large jitter. This unpredictable delay behaviour is likely to cause quality problems for VoIP services.
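One way to quantify jitter is the running interarrival jitter estimate defined in IETF RFC 1889 for RTP receivers, which smooths the difference in packet spacing at the receiver relative to the sender with a gain of 1/16. The sketch below is an illustration under that assumption, not a measurement method taken from this text; the function name and the use of millisecond timestamps are hypothetical choices.

```python
def interarrival_jitter(send_times, recv_times):
    """Running jitter estimate in the style of RFC 1889.

    send_times and recv_times are per-packet timestamps in the same unit
    (milliseconds here). Only timestamp differences are used, so the sender
    and receiver clocks do not need to be synchronised.
    """
    jitter = 0.0
    for i in range(1, len(send_times)):
        # Change in transit time between consecutive packets.
        d = (recv_times[i] - recv_times[i - 1]) - (send_times[i] - send_times[i - 1])
        # First-order smoothing with gain 1/16.
        jitter += (abs(d) - jitter) / 16.0
    return jitter


# Packets sent every 20 ms with a constant 50 ms delay give zero jitter.
steady = interarrival_jitter([0, 20, 40], [50, 70, 90])
# Delaying the second packet by an extra 10 ms makes the estimate non-zero.
disturbed = interarrival_jitter([0, 20, 40], [50, 80, 90])
```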
However, in a relatively small and closed IP network, such as a company LAN (Local Area Network), the delay and jitter can often be limited to a desired range by network design and by controlling the amount of traffic that is allowed into the network.
As an example, in the current GSM system the CS approach has been extended to cover data services over a CS radio channel. Because of the narrow bandwidth offered by the radio system (which was originally designed for speech services), the data rates offered are relatively low. In spite of this, these services have gained popularity, and rapid advances in radio technology are expected to significantly increase available data rates in the near future. On the other hand, the Internet offers a vast range of services, and therefore it would be appealing to combine these ‘two worlds’ to extend the coverage of the ‘Internet services’ also to mobile users. The convergence is also appealing from the telephony point of view, the scenario being that of a connection between a terminal in a cellular mobile (radio) network and a terminal in a VoIP domain.
One proposed system would include both CS and PS radio access networks (RANs), together with a PS core network (CN). Furthermore, the CN part of the network could be connected to an external PS network (such as the Internet or a company LAN) through a gateway (GW), thus enabling a connection to a terminal connected to this external network via its own access network (AN). This could conceivably enable seamless and transparent connection between terminals anywhere within reach of a concatenation of networks. FIG. 1 presents a greatly simplified illustration of this arrangement.
In a PS network, speech frames are typically transmitted in Real-time Transport Protocol (RTP) packets. (See IETF RFC 1889 “RTP: A Transport Protocol for Real-Time Applications”, 1996). Furthermore, RTP is typically run over the User Datagram Protocol (UDP) and IP. (See IETF RFC 768 “User Datagram Protocol”, 1980). GSM speech frames can be encapsulated into RTP packets according to the standard specified in ETSI TS 101 318 “Telecommunications and Internet Protocol Harmonization Over Networks (TIPHON); Using GSM speech codecs within ITU-T Recommendation H.323”, v1.1.1, 1998. Currently, the IETF is also working on specifying a method to encapsulate AMR speech frames into RTP. This will be an important specification for 3G work, since the AMR codec has been selected as the only mandatory speech codec for 3G systems.
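The encapsulation can be sketched as prepending the 12-byte fixed RTP header of RFC 1889 to one coded speech frame. The following is a minimal illustration only: it assumes no CSRC entries, no padding and no header extension, it assumes static payload type 3 for GSM as listed in the companion audio/video profile (RFC 1890), and the 33-byte all-zero payload is a placeholder rather than a real coded frame. The function name `build_rtp_packet` is hypothetical.

```python
import struct

RTP_VERSION = 2
GSM_PAYLOAD_TYPE = 3  # assumed: static payload type for GSM in RFC 1890


def build_rtp_packet(payload, seq, timestamp, ssrc,
                     payload_type=GSM_PAYLOAD_TYPE, marker=False):
    """Prepend a fixed 12-byte RTP header to one speech frame."""
    # Byte 0: version in the top two bits; padding, extension and CSRC count are 0.
    byte0 = RTP_VERSION << 6
    # Byte 1: marker bit followed by the 7-bit payload type.
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack("!BBHII", byte0, byte1,
                         seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
    return header + payload


# One 20 ms frame advances the timestamp by 160 ticks of an 8 kHz clock.
packet = build_rtp_packet(bytes(33), seq=1, timestamp=160, ssrc=0x1234)
```

In a real sender the resulting bytes would be handed to a UDP socket, and the sequence number and timestamp would be advanced for every frame.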
The RTP Control Protocol (RTCP) is an integral part of the RTP specification. Whenever RTP packets are used, RTCP packets should also be transmitted. (See IETF RFC 1889 “RTP: A Transport Protocol for Real-Time Applications”, 1996). RTCP is used to monitor quality of service and to convey information about the participants. RTCP packets are transmitted periodically, typically less often than RTP packets in order to save bandwidth (see section 6.2 of IETF RFC 1889).
In the communication situation described above (and illustrated in FIG. 1), radio bandwidth is arguably the scarcest resource on a path from a fixed VoIP terminal to a mobile terminal in a cellular network. Furthermore, transmission over a RAN is likely to introduce a considerable amount of delay. Therefore, the radio link can be regarded as the ‘bottleneck’ within this connection, and it would be advantageous to try to optimise the use of radio bandwidth.
The efficient use of radio bandwidth requires strict scheduling of transmitted data, and this usually means that radio frames must be transmitted at fixed intervals. Furthermore, efficient radio transmission usually also implies that the data from different sources (‘logical channels’) is transmitted on the same radio block (‘physical channel’). In pure CS environments this normally does not have any effect on the performance or delay of the system. On the other hand, the entity controlling the radio transmission timing does not have any control over the transmission times of a terminal that is located in the PS VoIP domain. Transmission over the external PS domain is asynchronous, and in such a case the frames from different sources scheduled for radio transmission in the same radio block arrive at the RAN at different times and have to be buffered to wait for further transmission over the radio link.
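The buffering delay that results from this mismatch can be illustrated with a short calculation. The sketch below assumes, purely for illustration, that radio transmit instants recur every 20 ms from a given offset and that a frame is sent at the first instant at or after its arrival; the function name and parameters are hypothetical.

```python
import math


def buffering_delay_ms(arrival_ms, period_ms=20.0, offset_ms=0.0):
    """Time a frame waits in the RAN buffer for the next radio transmit instant."""
    # Index of the first transmit instant that is not earlier than the arrival.
    k = math.ceil((arrival_ms - offset_ms) / period_ms)
    next_tx_ms = offset_ms + k * period_ms
    return next_tx_ms - arrival_ms


# Frames from asynchronous sources arriving at different times all wait for
# the same transmit instant at 20 ms, so each sees a different delay.
delays = [buffering_delay_ms(t) for t in (3.0, 17.5, 20.0)]
```

A frame that arrives just after a transmit instant waits almost a full period, which is why the alignment between arrival times and the radio schedule matters for the end-to-end delay.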
FIG. 2 shows schematically the arrangement of a GSM mobile station, BTS (Base Transceiver Station) and BSC (Base Station Controller). The GSM mobile is connected via the radio interface to a BTS. Speech frames are transmitted between the BTS and the BSC in TRAU (Transcoder/Rate Adaptor Unit) frames. Speech frames are encoded/decoded in the TRAU unit, which is typically located in the BSC. The delay between the GSM mobile and the BSC may change during a call, since:
1. the time slot may change,
2. the GSM mobile may change from one BTS to another BTS inside the BSC area.
Normally, TRAU frames are transmitted every 20 ms. However, it is possible to change the length of the TRAU frames (and thus the transmission period) by changing the number of stop-bits located at the end of the TRAU frame.
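The relationship between frame length and transmission period can be shown with simple arithmetic. The sketch below assumes a 320-bit full-rate TRAU frame carried on a 16 kbit/s sub-channel, figures taken from GSM 08.60 rather than from the text above, and treats the number of added or removed bits as a free parameter; the function name is hypothetical.

```python
def trau_period_ms(frame_bits=320, extra_stop_bits=0, line_rate_bps=16000):
    """Transmission period of one TRAU frame on a fixed-rate link."""
    # Each bit occupies 1 / line_rate seconds, so the period scales with length.
    return (frame_bits + extra_stop_bits) * 1000.0 / line_rate_bps


nominal = trau_period_ms()                      # nominal frame
longer = trau_period_ms(extra_stop_bits=4)      # frame lengthened by 4 bits
shorter = trau_period_ms(extra_stop_bits=-4)    # frame shortened by 4 bits
```

Under these assumptions each bit lasts 62.5 microseconds, so lengthening or shortening a frame by a few bits shifts all subsequent frames by a fraction of a millisecond.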
To handle uplink timing, the BTS sends TRAU frames as they are received from the radio channel. The TRAU unit located in the BSC decodes the TRAU frames into speech samples, which are sent to the PCM line. Since the sampling interval on the PCM line is fixed, the TRAU unit can skip or repeat speech samples to adjust the timing when the arrival interval of a TRAU frame differs from the nominal frame length of 20 ms.
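The skipping and repeating of samples can be sketched as follows. This is a simplified illustration, not the procedure of GSM 08.60: it assumes 8 kHz PCM (125 microseconds per sample), repeats the final sample of a decoded frame to stretch it and drops trailing samples to shorten it. Both function names are hypothetical.

```python
def offset_to_samples(offset_ms, sample_rate_hz=8000):
    """Convert a timing offset into a whole number of PCM samples."""
    return round(offset_ms * sample_rate_hz / 1000)


def adjust_timing(samples, shift):
    """Stretch or shorten one decoded frame by `shift` samples.

    shift > 0 repeats the last sample `shift` times (output runs later);
    shift < 0 skips the last `-shift` samples (output catches up).
    """
    if shift > 0:
        return samples + samples[-1:] * shift
    if shift < 0:
        return samples[:shift]
    return samples


frame = [1, 2, 3, 4]
stretched = adjust_timing(frame, 2)    # two repeated samples appended
shortened = adjust_timing(frame, -1)   # one sample skipped
```

A real transcoder would choose where to repeat or skip so that the discontinuity is as inaudible as possible; operating on the frame end is only the simplest choice.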
To handle downlink timing, the BTS sends TRAU frames to the radio channel at fixed intervals determined by the timing of the radio channel. At the beginning of the call the BSC has no information about the timing at the BTS. Additionally, if the time slot or the BTS changes, the optimal timing changes too. To adjust the timing, the BTS sends timing information to the BSC. According to this time alignment information, the BSC adjusts the transmission time of the downlink TRAU frames. Again, the transmission time can be adjusted by repeating or skipping PCM speech samples.
The above mentioned timing method is explained in detail in GSM 08.60 “Digital cellular telecommunications system (Phase 2+); In-band control of remote transcoders and rate adaptors for full rate traffic channels”, v8.1.0, 1999 at chapter 4.6.1 “Time Alignment of the speech service frames”.