Voice over IP is a convergence between the telecom and datacom worlds, in which speech signals are carried by data packets, e.g. Internet Protocol (IP) packets. The recorded speech is encoded by a speech codec on a frame-by-frame basis. A data frame is generated for each speech frame. One or several data frames are packed into RTP packets. The RTP packets are further packed into UDP packets, and the UDP packets are packed into IP packets. The IP packets are then transmitted from the sending client to the receiving client over an IP network.
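By way of a non-limiting illustration, the packing of one encoded speech frame into an RTP packet may be sketched as follows. The 12-byte fixed RTP header layout is the standard one; the function name build_rtp_packet, the 32-byte frame size, the dynamic payload type 96 and the SSRC value are assumptions chosen only for the example and do not appear in the description above.

```python
import struct

RTP_VERSION = 2

def build_rtp_packet(payload: bytes, seq: int, timestamp: int, ssrc: int,
                     payload_type: int = 96, marker: bool = False) -> bytes:
    """Prepend a 12-byte fixed RTP header (no CSRC list, no header extension)."""
    byte0 = RTP_VERSION << 6                      # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
    return header + payload

# Hypothetical encoded speech frame: 32 bytes covering 20 ms (160 samples at 8 kHz).
frame = bytes(32)
packet = build_rtp_packet(frame, seq=1, timestamp=160, ssrc=0x1234ABCD)

# The RTP packet is then carried in UDP (8-byte header) over IPv4
# (20-byte header when no options are present).
UDP_HEADER, IPV4_HEADER = 8, 20
total_on_wire = len(packet) + UDP_HEADER + IPV4_HEADER
```

In this sketch the headers account for 40 of the 72 bytes sent per frame, which is one reason several data frames are sometimes packed into one RTP packet.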
A problem associated with packet-based networks is delay jitter. Delay jitter implies that even though packets are transmitted at a regular interval, for example one frame every 20 ms, the packets arrive at the receiver irregularly. Packets may even arrive out of order. The most common reason for receiving packets out of order, at least in fixed networks, is that the packets travel different routes. In wireless networks, another reason may be that re-transmission is used. For example, when packet N is sent on the uplink (i.e. from the mobile terminal to the base station), there may be bit errors that cannot be corrected, so that re-transmission has to be performed. However, the signaling for re-transmissions may be so slow that the next packet in the queue (packet N+1) is sent before packet N is re-transmitted. The packets are then received out of order if packet N+1 is correctly received before the re-transmitted packet N is correctly received.
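The re-transmission example may be illustrated with a small simulation. The send times follow the 20 ms interval mentioned above; the per-packet delays, including the extra delay of the re-transmitted packet, and the function name arrival_sequence are hypothetical values chosen only to make the effect visible.

```python
def arrival_sequence(send_times_ms, delays_ms):
    """Return (arrival_time_ms, sequence_number) pairs sorted by arrival time."""
    return sorted((send + delay, seq)
                  for seq, (send, delay) in enumerate(zip(send_times_ms, delays_ms)))

send_times = [0, 20, 40, 60]      # one packet every 20 ms
delays = [30, 55, 30, 30]         # packet 1 is delayed by a re-transmission (assumed)

arrivals = arrival_sequence(send_times, delays)
order = [seq for _, seq in arrivals]                        # [0, 2, 1, 3]
gaps = [b[0] - a[0] for a, b in zip(arrivals, arrivals[1:])]  # [40, 5, 15]
```

Although the packets are sent 20 ms apart, the inter-arrival gaps in this sketch are 40, 5 and 15 ms, and packet 2 arrives before packet 1.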
In VoIP clients, a jitter buffer means is used to equalize delay jitter in the transmission so that the speech samples can be played out at a constant sampling rate, for example one frame every 20 ms. (Play-out is used in this description to indicate the transmission of the speech to the sound card.) The fullness level of the jitter buffer means is proportional to the amount of delay jitter in the packet flow, and the objective is to keep the amount of late losses at an acceptable level while keeping the delay as low as possible. A late loss is a packet that is properly received but that arrives too late to be useful for the decoder. The importance of keeping the delay low can be explained as follows: a long buffering time in the jitter buffer means increases the end-to-end delay. This reduces the perceived conversational quality because the system will be perceived as “slow”. Long delays increase the risk that the users talk at the same time and may also give the impression that the other user is “slow” (thinking slowly).
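The trade-off between buffering delay and late losses may be sketched as follows. The sketch assumes that play-out of frame 0 starts a fixed buffering time after the first packet arrives and that one frame is due every 20 ms thereafter; the function name late_losses and the arrival times are hypothetical.

```python
FRAME_MS = 20

def late_losses(arrival_ms, buffer_ms):
    """Return the indices of frames that arrive after their play-out time.

    arrival_ms[n] is the arrival time of frame n, or None if the packet was
    lost in the network. Frame 0 is assumed to have arrived.
    """
    playout_start = arrival_ms[0] + buffer_ms
    late = []
    for n, arrived in enumerate(arrival_ms):
        due = playout_start + n * FRAME_MS
        if arrived is not None and arrived > due:
            late.append(n)
    return late

arrivals = [30, 50, 135, 90, 110]   # frame 2 is hit by delay jitter (assumed values)
short_buffer = late_losses(arrivals, buffer_ms=40)   # [2]: frame 2 is a late loss
long_buffer = late_losses(arrivals, buffer_ms=80)    # []: no late loss, 40 ms more delay
```

With the shorter buffering time, frame 2 is received correctly but is too late for the decoder; the longer buffering time avoids the late loss at the cost of 40 ms of additional end-to-end delay.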
The jitter buffer means stores packets or frames for a certain time. A typical way of defining this is to say that the jitter buffer means is filled up to a certain “level”, denoted the fullness level. Since the size of the frames may vary, this level is often measured in time, i.e. in milliseconds, rather than in number of frames. The jitter buffer means level can be set in a number of different ways.
Fixed size: The fixed size implies that the jitter buffer fullness level is fixed and pre-configured. After a DTX period, the jitter buffer means is initially filled up to a fixed time, e.g. a fixed number of frames (such as 5 frames), before speech play-out is resumed. This initial margin provides protection against delay jitter and late losses.
Adaptive jitter buffer means size: The jitter buffer fullness level varies with the delay jitter. As in the case of the fixed size, an initial number of frames is buffered before speech play-out is resumed after a DTX period. However, during the active voice (non-DTX) period the fullness level of the jitter buffer means may vary, based on analysis of the incoming packets. It is possible to collect the statistics over several talk spurts. However, the jitter buffer fullness level is usually reset to the “default level” at every speech onset.
Adaptive jitter buffer means size with improved interactivity: In order to reduce the perceived delay, it is possible to initialize the jitter buffer means with a shorter time than in the case of the adaptive jitter buffer means size, so that the speech play-out is started as soon as the first speech packet is received after DTX. In order to reach the jitter buffer fullness level, time scaling is used to stretch the initial decoded frames so that the packets are extracted from the jitter buffer means at a reduced pace. Time scaling implies that the speech frames are played out adaptively, i.e. that a speech frame that normally contains 20 ms of speech may be stretched so that 30 ms of speech is generated. An alternative to starting play-out after the first received packet is to wait for one or two extra packets. WO-200118790 A1 and US2004/0156397 A1 describe time scaling.
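The build-up of the buffer level through time scaling may be illustrated by a short calculation. It assumes a steady incoming flow of one 20 ms frame every 20 ms and an initially empty buffer, so that each stretched frame raises the buffer level by the difference between the stretched and the normal frame length; the function name frames_to_reach_target and the 60 ms target in the usage lines are assumptions made for the example.

```python
def frames_to_reach_target(target_ms, frame_ms=20, stretched_ms=30, initial_ms=0):
    """Return (stretched frames needed, adaptation period in ms).

    While one stretched frame is played out for stretched_ms, stretched_ms of
    speech arrives but only frame_ms of speech is taken from the buffer, so
    the level grows by (stretched_ms - frame_ms) per played frame.
    """
    gain_per_frame = stretched_ms - frame_ms
    if gain_per_frame <= 0:
        raise ValueError("frames must be stretched for the buffer level to grow")
    deficit = max(0, target_ms - initial_ms)
    frames = -(-deficit // gain_per_frame)      # ceiling division
    return frames, frames * stretched_ms

moderate = frames_to_reach_target(60, stretched_ms=30)    # (6, 180)
aggressive = frames_to_reach_target(60, stretched_ms=40)  # (3, 120)
```

Under these assumptions, stretching 20 ms frames to 30 ms reaches a 60 ms level after 6 frames (180 ms), whereas stretching to 40 ms reaches it after 3 frames (120 ms).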
DTX stands for discontinuous transmission and implies that a special type of information is transmitted on the channel when no voice is present and the input signal contains only (background) noise. The encoder evaluates the background noise and determines a set of parameters that describe the noise (Silence Description, SID, parameters). The SID parameters are transmitted to the receiving terminal so that a similar noise, comfort noise, can be generated. The SID parameters are transmitted less frequently than normal speech frames in order to save power and transmission resources.
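The saving in transmission resources may be illustrated by counting the packets sent during a silence period. The SID interval of 8 frames used below is a hypothetical value chosen for the example and is not taken from the description above, as are the function name and the 2-second silence period.

```python
FRAME_MS = 20

def packets_during_silence(silence_ms, sid_interval_frames=8):
    """Return (packets without DTX, packets with DTX) for one silence period.

    Without DTX one packet is sent per frame. With DTX, a SID update is
    assumed to be sent for every sid_interval_frames-th frame, starting with
    the first frame of the silence period.
    """
    frames = silence_ms // FRAME_MS
    sid_updates = sum(1 for n in range(frames) if n % sid_interval_frames == 0)
    return frames, sid_updates

without_dtx, with_dtx = packets_during_silence(2000)   # (100, 13)
```

Under these assumptions, a 2-second silence period requires 13 SID transmissions instead of 100 speech frames.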
FIG. 1 shows an example of initial jitter buffer means operation according to the method of the adaptive jitter buffer means size with improved interactivity. The upper plot shows the jitter buffer fullness level and the lower plot shows the frame size. The play-out is started as soon as the first packet is received, at about 0.5 seconds. Time scaling is performed to increase the size of the generated frames and thereby consume frames from the jitter buffer means at a slower than normal pace. The early start of the play-out gives a feeling of improved interactivity, which increases the perceived conversational quality. At the end of the talk burst, at about 3 seconds, the last speech frames are shortened and played out at a faster pace than normal. This further improves the interactivity.
Note that the adaptation of the target jitter buffer means level (60 ms) during the non-DTX period is not shown in FIG. 1. This functionality will, however, exist in a typical implementation of the adaptive jitter buffer means size with improved interactivity.
There are, however, several drawbacks with the three methods described above. The fixed jitter buffer means size gives quite a long delay, since a number of packets are always buffered before the play-out starts. This reduces the perceived interactivity.
The adaptive jitter buffer means may adjust the fullness level in order to introduce less delay on average, at least if the channel is varying slowly. The problem of poor interactivity due to a long initial buffering time nevertheless remains, since the purpose of the adaptation is to adapt within an ongoing packet flow during active speech, once the flow has started up after a DTX period. It should be noted that this problem occurs if the jitter buffer fullness level is reset to a default level at every speech onset (i.e. at the switch from DTX to speech).
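One conceivable way of adjusting the fullness level from an analysis of the incoming packets is sketched below. The description above does not specify the analysis; the percentile rule, the function name target_level_ms, the tolerated late-loss fraction and the 60 ms default level are all assumptions made for the example.

```python
import math

FRAME_MS = 20

def target_level_ms(delays_ms, late_loss_fraction=0.01, default_ms=60):
    """Pick a buffer level covering all but late_loss_fraction of the delay variation.

    delays_ms are observed per-packet transmission delays. With no
    observations (e.g. right after a reset at speech onset) the default
    level is returned.
    """
    if not delays_ms:
        return default_ms
    base = min(delays_ms)
    variation = sorted(d - base for d in delays_ms)
    index = math.ceil((1 - late_loss_fraction) * len(variation)) - 1
    index = min(len(variation) - 1, max(0, index))
    needed = variation[index]
    # Round up to a whole number of frames and keep at least one frame buffered.
    return max(FRAME_MS, math.ceil(needed / FRAME_MS) * FRAME_MS)

observed = [30] * 95 + [70] * 5                        # 5 % of packets arrive 40 ms late
strict = target_level_ms(observed, late_loss_fraction=0.01)   # 40
relaxed = target_level_ms(observed, late_loss_fraction=0.10)  # 20
at_onset = target_level_ms([])                                # 60
```

The last line corresponds to the situation noted above: when the level is reset at every speech onset, the default level applies regardless of what was learned during earlier talk spurts.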
The jitter buffer means initialization used with the adaptive jitter buffer means size with improved interactivity improves the interactivity, as the perceived initial delay will be lower. One problem, however, is that the jitter buffer means level is very low at the beginning of a speech burst, and there is therefore a risk that delay jitter at the beginning of speech bursts results in late losses. Like frame losses, late losses reduce the speech quality, since error concealment is activated for the frame that is lost or is received late.
Additionally, the method of the adaptive jitter buffer means size with improved interactivity implies that the time scaling used to bring the buffer level up to the normal fullness level must be done quite fast, since the adaptation period must be short enough to avoid being hit by multiple delay spikes. A delay spike occurs when the delay increases substantially from a first packet to a subsequent packet. This means that the time scaling must be quite aggressive. Aggressive time scaling increases the risk that the time scaling itself introduces distortions. The distortions may be of different kinds: clicks, plops and bursts of noise, but also “funny sounding sound” like “unnatural talking amount”.
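The definition of a delay spike may be illustrated as follows. The threshold of 40 ms for what counts as a substantial increase, the function name delay_spikes and the delay values are assumptions made for the example.

```python
def delay_spikes(delays_ms, threshold_ms=40):
    """Return the indices of packets whose delay exceeds that of the
    preceding packet by more than threshold_ms."""
    return [n for n in range(1, len(delays_ms))
            if delays_ms[n] - delays_ms[n - 1] > threshold_ms]

delays = [30, 32, 95, 70, 45, 31]   # hypothetical per-packet delays in ms
spikes = delay_spikes(delays)        # [2]: the delay jumps from 32 ms to 95 ms
```

In this sketch only the sudden increase at packet 2 is reported; the gradual return to the normal delay over the following packets is not a spike.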
For most modern speech codecs (GSM-EFR, GSM-AMR, ITU-T G.729, EVRC, etc.), which use inter-frame prediction to be able to encode the signal at a lower bit rate with maintained quality, there is an additional problem. Both frame losses and late losses give distortions for the current frame and also for subsequent frames, since the error propagates for some time due to the inter-frame prediction. The error propagation time depends on the sound and the codec but may be as long as 5-6 frames (100-120 ms). Late losses are especially critical at the beginning of a speech burst, as these parts often contain voiced onsets, which are later used by the adaptive codebook to build up the voiced waveform. The result of a late loss at the beginning of a speech burst is therefore often very audible and can degrade intelligibility considerably.
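The extent of the error propagation may be illustrated by marking the frames affected by a single loss. The sketch simply assumes that every lost or late frame degrades itself and a fixed number of subsequent frames; in practice the propagation time depends on the sound and the codec, as noted above. The function name degraded_frames is hypothetical.

```python
FRAME_MS = 20

def degraded_frames(lost, total_frames, propagation_frames=5):
    """Return the sorted indices of frames degraded by the given losses.

    Each lost (or late) frame is assumed to degrade itself and the next
    propagation_frames frames through inter-frame prediction.
    """
    affected = set()
    for n in lost:
        affected.update(range(n, min(total_frames, n + 1 + propagation_frames)))
    return sorted(affected)

# A single late loss on the second frame of a 50-frame speech burst.
affected = degraded_frames([1], total_frames=50)         # [1, 2, 3, 4, 5, 6]
propagation_ms = (len(affected) - 1) * FRAME_MS          # 100
```

Under this assumption, one late loss near the onset degrades 120 ms of the decoded speech: the 20 ms frame itself plus 100 ms of propagated error.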
There are a few methods to compensate for the error propagation that would occur if a late loss occurs during the build-up time, but they all have significant drawbacks. One possibility is to reduce the initial buffering time, but not by as much as would be possible in the optimum case. This would, of course, mean that the gain in interactivity is smaller than desired.
Another possibility is to reduce the amount of inter-frame prediction used in the codec. This would, however, either result in a reduced intrinsic speech quality, since the inter-frame correlation is not exploited to its full potential, or require that the signal be encoded at a higher bit rate, or both.
Due to these drawbacks, the method of the adaptive jitter buffer means size with improved interactivity is difficult to use in real systems. For channels that contain very little jitter, and preferably also few packet losses, it may work well, but for channels that contain a lot of jitter and possibly also give packet losses it is very difficult to obtain the full gain in improved interactivity. For most practical cases, it would be preferable to have an initialization time of a few frames before the play-out starts.