The new 3GPP EVS codec was originally standardized for Enhanced Voice Services (EVS) in the Evolved Packet System (EPS) with LTE (Long Term Evolution) access, i.e. for application in an IP environment with the IMS (IP Multimedia Subsystem) as the application core network. This means that the speech data is transmitted in IP packets, and the transmission of these packets is prone to delay jitter and packet loss. Like many other speech and audio codecs, the EVS encoder operates on signal frames of 20 ms length and generates a set of coded parameters for each frame. These parameter sets are also referred to as coded speech frames or data frames. The EVS decoder expects to receive these frames at the same rate of one set every 20 ms and decodes them into the reconstructed output signal. The input signal to the encoder and the output signal from the decoder are 16-bit linear PCM (Pulse Code Modulation) encoded waveforms, sampled at 8, 16, 32 or 48 kHz.
The transmission of the speech data packets in a packet-switched (PS) system like the EPS using the RTP/UDP/IP (Real-time Transport Protocol/User Datagram Protocol/Internet Protocol) protocols means that the packets (each containing one or several coded speech frames) may arrive at a receiver asynchronously, i.e. at irregular time instants. This is especially the case in the LTE radio access network, but also in other access networks, like WiFi. An essential receiver component is hence a de-jitter buffer (often referred to as jitter buffer) that accepts the asynchronously arriving packets, stores them or the contained speech data frames, and conveys them at regular time intervals to a synchronously operating decoder. The decoder may for instance be the EVS decoder, which requires speech data frames at a constant frame interval of 20 ms. Depending on the amount of delay jitter, the depth of the jitter buffer needs to be chosen large enough to ensure that even speech frames arriving late can still be propagated to the speech decoder at the time instant when they are needed. On the other hand, the jitter buffer depth should be as small as possible in order to keep the speech path delay, i.e. the speech delay from the sending end to the receiving end, as short as possible. The longer the speech path delay in a speech conversation, the more the conversational quality will be affected. If the jitter buffer depth is too small, the likelihood increases that a coded speech frame is not available when it needs to be provided to the speech decoder. Such frames are effectively lost and are correspondingly signaled as lost or erased frames to the decoder. The decoder then applies frame loss concealment, meaning that an artificial frame is generated in place of the lost speech frame such that the loss is as inaudible as possible.
If a late speech frame that has been declared lost arrives at a later point in time, it is usually discarded. Alternatively, it may be conveyed to the decoder for decoding at the next frame instant, in which case the jitter buffer contents and the speech path delay increase by this one frame.
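The trade-off between jitter buffer depth and frame loss can be illustrated with a minimal sketch. It assumes that frame n is due at the decoder at n times 20 ms plus the buffer depth, and that a frame arriving after this deadline is signaled as lost. The function name and the arrival times are hypothetical and chosen only for illustration.

```python
from typing import List, Optional

FRAME_MS = 20  # frame interval of the synchronously operating decoder


def lost_frames(arrival_ms: List[Optional[int]], depth_ms: int) -> List[int]:
    """Return the indices of frames that miss their decoding deadline.

    arrival_ms[n] is the (hypothetical) arrival time of frame n in ms,
    or None if the frame never arrives. Frame n is assumed to be due at
    n * FRAME_MS + depth_ms.
    """
    lost = []
    for n, arrival in enumerate(arrival_ms):
        deadline = n * FRAME_MS + depth_ms
        if arrival is None or arrival > deadline:
            lost.append(n)  # signaled to the decoder as a lost frame
    return lost


# Frame 2 is delayed heavily by jitter (arrives at 95 ms).
arrivals = [5, 30, 95, 70, 85]
shallow = lost_frames(arrivals, depth_ms=20)  # short delay, frame 2 is lost
deep = lost_frames(arrivals, depth_ms=60)     # 40 ms more delay, no loss
```

In this example the shallow buffer loses frame 2, while the deeper buffer delivers all frames at the cost of 40 ms of additional speech path delay.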
It is to be noted that jitter buffers may also be deployed in network nodes with incoming packet-switched connections and outgoing connections that can either be circuit-switched (CS) or PS. The purpose is in any case the de-jittering of the asynchronously arriving data.
Jitter buffers typically operate on frames. When the frames arrive in packets, they are first de-packetized and then placed into the jitter buffer at their proper time positions, according to their time stamps. If several frames are contained in a packet (which is a possibility with, e.g., the RTP payload format of the EVS codec according to 3GPP TS 26.445, Annex A), the time stamp of the RTP header applies only to the first frame contained in the packet. In that case the respective time stamps of the other included frames are obtained by analyzing the RTP payload (i.e. the frames included in the packet). If the outgoing connection is also PS using RTP/UDP/IP, the frames taken out of the jitter buffer will be re-packetized. A jitter buffer may also operate on RTP packets rather than on frames, especially in case of an outgoing PS connection.
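As a sketch of how the time stamps of the further frames in a packet might be derived, the following example assumes an RTP clock rate of 16 kHz and that the frames in a packet are consecutive 20 ms frames. Both are assumptions made for illustration, and the function name is hypothetical.

```python
from typing import List

RTP_CLOCK_HZ = 16000  # assumed RTP clock rate
FRAME_MS = 20         # frame duration
TICKS_PER_FRAME = RTP_CLOCK_HZ * FRAME_MS // 1000  # 320 ticks per frame


def frame_timestamps(rtp_timestamp: int, num_frames: int) -> List[int]:
    """Derive one time stamp per frame from the RTP header time stamp.

    The header time stamp applies to the first frame only. Each further
    frame is assumed to follow 20 ms later. RTP time stamps are 32-bit
    values, so the result wraps modulo 2**32.
    """
    return [(rtp_timestamp + i * TICKS_PER_FRAME) % 2**32
            for i in range(num_frames)]


stamps = frame_timestamps(1000, 3)  # [1000, 1320, 1640]
```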
Packet delay jitter is generally not the only cause of frame loss. Wireless transmission systems in particular, but also wireline systems, may be prone to transmission errors. In transmissions using the RTP/UDP/IP protocols, packets affected by detectable errors are usually discarded. Likewise, there may be many other reasons why RTP/UDP/IP packets do not arrive at the receiver or do not arrive in time. In any case, the frames contained in such packets are in general lost, and the jitter buffer may signal to the decoder that the corresponding frames are lost.
Frame loss concealment is only one technique to mitigate the effects of frame loss. Another is to use forward error correction (FEC), which in a very general sense means adding redundancy to the transmitted information that allows the receiver to recover lost information or at least a part of it. In packet-based transmission systems using the RTP/UDP/IP protocols, application layer FEC is a known technique. One such technique is redundancy transmission, in which a frame transmitted with one packet is re-transmitted within another packet. Hence, if the packet containing the primary frame data is lost, there is still a possibility that the receiver gets a redundant copy of this data with another packet. In many realizations of redundancy transmission, each packet contains a primary frame and the redundant copy of the data of an earlier frame. If the packet with the primary frame gets lost but the jitter buffer in the receiver gets the equivalent redundant copy of the frame before it needs to be provided to the decoder, the loss will not have an effect. Partial redundancy is another flavor of redundancy transmission, in which only the most important parameters (a part of all parameters) are sent in another packet, allowing the receiver to recover the lost frame in a better way.
The EVS codec standard comprises a complete RTP transmission framework, including a jitter buffer management system and specifications for the RTP payload format. The decoder comprises an advanced frame loss concealment system. The EVS codec itself comprises a large number of operating modes, at various bit rates from 5.9 kbps (variable bit rate) to 128 kbps, and a multitude of audio bandwidth modes comprising narrowband (NB), wideband (WB), super-wideband (SWB) and fullband (FB).
A special feature of the EVS codec is its “channel-aware” operation mode (CA mode). In short, the CA mode sends a partial redundant copy of a frame some packets later. It is described in sections 5.8.1 and 5.8.2 of specification 3GPP TS 26.445.
The operation of the CA mode is further explained with FIG. 1. FIG. 1 shows a sequence of received frames 10, where frame n 10a is due for decoding but is unavailable. Frames n+1 to n+5 have arrived and are queued in the jitter buffer. Each frame contains a primary portion 11 and a redundancy portion 13 for a previous frame that is displaced by the FEC offset. The FEC offset is provided as “RF frame offset” parameter 15 in each frame (RF=3 in the example). This parameter indicates the frame for which the redundancy portion is valid by means of displacement relative to the frame containing the redundancy. Hence, frame n+3 contains the partial redundant copy of the lost frame n, as indicated by RF=3.
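The lookup described for FIG. 1 can be sketched as follows. The data structure and names are hypothetical and do not reflect the actual EVS bit stream layout; the sketch only shows how the RF frame offset identifies the buffered frame that carries the partial redundant copy of a lost frame.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CaFrame:
    """Simplified CA mode frame (hypothetical structure)."""
    index: int                   # frame number, e.g. n+3
    primary: bytes               # primary portion
    redundancy: Optional[bytes]  # partial redundant copy of an earlier frame
    rf_offset: int               # RF frame offset, 0 if no redundancy


def find_partial_copy(buffer: Dict[int, CaFrame],
                      lost_index: int) -> Optional[bytes]:
    """Search the jitter buffer for the partial copy of a lost frame."""
    for frame in buffer.values():
        if frame.redundancy is None or frame.rf_offset <= 0:
            continue
        # The redundancy is valid for the frame rf_offset frames earlier.
        if frame.index - frame.rf_offset == lost_index:
            return frame.redundancy
    return None


# Situation of FIG. 1 with n = 10: frame n is missing, n+1..n+5 are queued.
n = 10
queued = {
    k: CaFrame(index=k, primary=b"P%d" % k,
               redundancy=b"R%d" % (k - 3), rf_offset=3)
    for k in range(n + 1, n + 6)
}
copy_of_n = find_partial_copy(queued, n)  # carried by frame n+3
```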
The CA mode of the EVS codec can send the partial redundancy with 2, 3, 5 or 7 frames offset, i.e. 40, 60, 100 or 140 ms after the primary frame. The offset can be adapted such that when the packet loss rate is zero or low then no partial redundancy is sent, when the packet loss rate is higher but the losses occur mainly as single losses or few losses in a row then a short offset is used, for example offset 2 or 3, and when the packet loss rate is high and long loss bursts are detected then the offset is increased, e.g. to 5 or 7.
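A minimal sketch of such an adaptation rule is given below. The loss rate threshold and the rule of picking the smallest offset that is at least as long as the observed loss burst are assumptions made for illustration; they are not taken from the EVS specification.

```python
from typing import Optional

FRAME_MS = 20
ALLOWED_OFFSETS = (2, 3, 5, 7)  # FEC offsets supported by the CA mode


def offset_delay_ms(offset: int) -> int:
    """Delay of the partial redundant copy relative to the primary frame."""
    return offset * FRAME_MS


def choose_offset(loss_rate: float, max_burst: int,
                  min_loss_rate: float = 0.01) -> Optional[int]:
    """Pick a FEC offset, or None to send no partial redundancy.

    min_loss_rate is a hypothetical threshold. For a loss burst of
    max_burst consecutive frames, the copy of the first lost frame
    survives only if the offset is at least max_burst.
    """
    if loss_rate < min_loss_rate:
        return None
    for offset in ALLOWED_OFFSETS:
        if offset >= max_burst:
            return offset
    return ALLOWED_OFFSETS[-1]  # burst longer than any supported offset


delays = [offset_delay_ms(o) for o in ALLOWED_OFFSETS]  # 40, 60, 100, 140 ms
```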
Using no partial redundancy, or partial redundancy with a small offset, allows for maintaining a short end-to-end delay when the operating conditions are good. However, as described above, this is useful only if the losses are well spread out over time. If long loss bursts occur, the short offsets become unusable since both the primary encoding and the partial redundancy would be lost.
The longer offsets allow for maintaining good quality during periods with long loss bursts. However, the end-to-end delay will increase significantly. These offsets should therefore only be used when really needed. Otherwise, this would have a significant impact on the conversational quality.
To make the CA mode adaptive, the receiver evaluates the packet losses in the received media and decides if partial redundancy should be used and with which offset. The receiver then sends a Codec Mode Request (CMR) back to the sender, which changes the encoding to enable or disable the partial redundancy encoding and/or changes the offset as requested. This means that it takes (at least) a round-trip time before the receiver starts receiving packets according to the CMR that it sent.
A relevant description of the signaling parameters of the EVS CA mode is found in 3GPP TS 26.445. In particular, the coding of the FEC offset parameter (RF parameter) is detailed in the parts of the specification pertaining to the CA mode.
While the EVS codec has originally been standardized for packet-switched (PS) transmission systems, there are now standardization efforts ongoing targeting applications of the EVS codec in circuit-switched (CS) radio access systems, specifically UTRAN (UMTS Terrestrial Radio Access Network). The transmission in these CS radio access systems (as opposed to PS systems) is synchronous, i.e. coded speech frames are transmitted according to the 20 ms frame clock. As a consequence, coded speech frames arrive at the receiving end of the radio access without delay jitter and hence there is no need to use a jitter buffer in CS user equipments (UEs).
The fact that a CS radio access system transmits at exactly regular time intervals of e.g. 20 ms creates problems when receiving frames from a PS system in RTP packets with substantial delay jitter. According to the existing solution, a jitter buffer is inserted in a network node (e.g. a media gateway) between the PS and CS systems. With the help of the jitter buffer, this network node propagates the available frames in a synchronous stream to the CS system. If a frame is lost, i.e. not present in the jitter buffer when the sending time for the frame has come, then typically nothing is sent to the CS system, and the CS UE performs error concealment. This is also the case when the redundant secondary information is already inside the jitter buffer. One problem is that existing solutions do not and cannot take advantage of the CA mode in this jitter buffer. The frames are just forwarded, with primary and (delayed) secondary information, to the CS system, but the secondary information, i.e. the redundancy portion, is not used.
The fact that an existing CS UE does not see delay jitter on its radio access means that it does not need a jitter buffer, and hence a jitter buffer is generally not implemented and not available in a CS UE. The term “CS UE” could refer to a UE that is not capable of PS radio access, but could also refer to a functionality for CS of a UE which is capable of both CS and PS radio access. In a CS UE, coded speech frames are typically decoded within less time than the duration of a frame (e.g. in less than 20 ms in case of EVS) after reception, to keep the speech path delay small. The consequence when using the EVS CA mode is that the partial redundancy data of the received speech frames will be useless in the CS UE, since the partial redundancy arrives too late to be useful for decoding. If, for instance, the CA mode is operated with a FEC offset of 3, then the partial redundant copy would arrive 3 frames (i.e. 60 ms) after the primary data needs to be provided to the decoder. Hence, the partial redundant copy is not of any use and the purpose of the CA mode to increase the robustness against frame loss cannot be achieved. On the contrary, transmission resources are wasted for the unusable partial redundant copies.
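The timing argument can be made concrete with a short sketch. It ignores delay jitter and assumes that the only quantity that matters is how many frames are buffered ahead of the frame that is due for decoding; the function names are hypothetical.

```python
FRAME_MS = 20


def partial_copy_usable(rf_offset: int, buffered_frames: int) -> bool:
    """True if the partial copy is present when the primary frame is due.

    The copy travels rf_offset frames after the primary frame, so the
    receiver must hold at least rf_offset frames ahead of the decoder.
    """
    return buffered_frames >= rf_offset


def lateness_ms(rf_offset: int, buffered_frames: int) -> int:
    """Time by which the partial copy misses the decoding deadline."""
    return max(0, (rf_offset - buffered_frames) * FRAME_MS)


# CS UE without a jitter buffer, FEC offset 3: the copy is 60 ms late.
cs_ue_usable = partial_copy_usable(3, buffered_frames=0)
cs_ue_late = lateness_ms(3, buffered_frames=0)
```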
Another problem occurs e.g. in a transcoding-free inter-connect scenario between a 4G-UE (UE A) 200 residing in PS domain 220 (e.g. with LTE access) and a 3G-UE (UE B) 201 residing in CS domain 230. This scenario is shown in FIG. 2. The PS (IMS) domain 220 is terminated by ATGW (Access Transfer Gateway) A 202 (user plane (UP)) and ATCF (Access Transfer Control Function) A 203 (control plane). The UP data (coded EVS frames) contained in RTP/UDP/IP packets sent from the 4G-UE, denoted “UE A” 200 in FIG. 2, will arrive at ATGW A 202 with possible delay jitter and loss. The ATGW A 202 propagates or forwards the packets to MGW (Media Gateway) B 204 residing in the CS core network. Transmission from MGW B 204 onwards towards 3G-base station “nodeB” B (NB B) 205 in CS uses a synchronous Iu user plane protocol. The transmission between ATGW A 202 and MGW B 204 and any further node that may be in the speech path is typically PS-based, but may also be CS. Any of these nodes may comprise a de-jitter buffer, and at least the last node, from which onwards a synchronous transmission protocol is used, has to comprise a de-jitter buffer to provide the regular, synchronous flow for the CS domain.
A problem occurs when using the CA mode in the call direction from 4G-UE A 200 to 3G-UE B 201 and may be explained by an example where the MGW B 204 performs de-jitter buffering. In case frame n is unavailable, e.g. lost or too late, when it is due for transmission on the synchronous interface, a Jitter Buffer Management (JBM) method would either not transmit any frame at all, or indicate a NO_DATA frame, or possibly repeat the previously received frame, or apply more sophisticated techniques to construct a valid speech frame from previously received frames. The decoder in the 3G-UE B 201 would either decode the frame, if it is a repeated previous frame or any valid speech frame, or it might generate an artificial frame using its frame loss concealment techniques. The frame containing the partial redundant copy arriving after the FEC offset time period would in any case be useless, and hence the situation would be as described above; the purpose of the CA mode to increase the robustness against frame loss cannot be achieved. Rather, the transmission resources used for the unusable partial redundant copies are wasted. The same problem occurs even if other jitter buffers in the speech path would replace unavailable frames (packets) by NO_DATA frames or repetition frames or by packets containing such frames.
FIG. 3 illustrates another problem that occurs in case a JBM in a network node inserts or removes frames in order to re-adjust the buffer depth. This may happen in case of buffer underrun or overflow, respectively. An underrun, for instance, may cause the insertion of frame “i” 30 by the JBM. The consequence of this is that the FEC offset indicated by the RF frame offset parameter included in the CA mode frames may become incorrect. The RF frame offset parameter becomes incorrect for all frames after the inserted frame whose partial redundant copy is valid for a frame before the inserted frame. The analogous problem occurs in case of a frame deletion. The consequence of an incorrect FEC offset may be degraded quality when decoding a frame by using the partial redundant copy, because the partial redundant data are not valid for the frame for which they are decoded.
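The effect of a frame insertion on the RF frame offset can be reproduced with a small sketch. Frames are represented by hypothetical labels, and the receiver is assumed to resolve the offset by counting positions in the stream it receives.

```python
from typing import List, Tuple


def offset_mismatches(original: List[str], outgoing: List[str],
                      rf_offset: int) -> List[Tuple[str, str, str]]:
    """List frames whose RF frame offset points at the wrong frame.

    original is the stream as encoded, outgoing is the stream after the
    JBM inserted frames. Each tuple is (carrying frame, frame the
    redundancy was encoded for, frame the offset now points at).
    """
    mismatches = []
    for pos, frame in enumerate(outgoing):
        if frame not in original:
            continue  # inserted frame, carries no partial redundancy here
        src = original.index(frame)
        if src < rf_offset or pos < rf_offset:
            continue  # redundancy refers to a frame before the stream start
        intended = original[src - rf_offset]
        resolved = outgoing[pos - rf_offset]
        if intended != resolved:
            mismatches.append((frame, intended, resolved))
    return mismatches


encoded = ["n", "n+1", "n+2", "n+3", "n+4", "n+5", "n+6"]
# The JBM inserts frame "i" after n+1 because of a buffer underrun.
sent = ["n", "n+1", "i", "n+2", "n+3", "n+4", "n+5", "n+6"]
wrong = offset_mismatches(encoded, sent, rf_offset=3)
```

With an offset of 3, only frames n+3 and n+4 are affected: they follow the inserted frame, but their partial redundant copies belong to frames n and n+1, which precede it.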