A so-called jitter buffer is a key component in any system that receives media, for example, audio and/or video, streamed over a packet-switched network for real-time playback. Examples of such systems include Voice over Internet Protocol (VoIP) systems, which are commonly used for real-time media exchanges such as voice or audiovisual conversations and conferences, for example, VoIP telephony or VoIP conferencing. Other systems include Voice over Long Term Evolution (VoLTE) and Voice over Wi-Fi (VoWiFi) by way of, for example, an IP Multimedia Subsystem (IMS) network. The receiver in a VoIP system attempts to receive voice over a packet network such as the Internet in real time with low latency and high intelligibility. In such a system, packets may arrive out-of-order, packets may be lost, and each packet may experience a variable network delay, causing jitter in the received signal. The purpose of the jitter buffer is to compensate in real time for the jitter introduced by the network and to enable re-ordering of packets, without introducing too much latency and without gaps or stuttering.
Consider voice communication as an example. Voice conversations typically occur in spurts, called talk spurts, between which there is typically silence or only noise. The speech-originating end records an input audio soundwave using at least one microphone, digitizes it via an analog-to-digital converter (ADC), and codes the resulting audio signal to compress the data. It is common to divide the input signal into frames of digitized voice segments, for example, frames of 20 ms, and to packetize the frames into packets that each contain one or more frames together with additional information, including, in some systems, a packet sequence number and/or a frame timestamp, so that a receiver can properly re-order the frames should they arrive out-of-order. Other information also may be included in or with a packet.
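The framing and re-ordering described above can be illustrated with a minimal sketch. The `Packet` fields and the `packetize` and `reorder` helpers are hypothetical names chosen for illustration, and the 20 ms frame duration is taken from the example in the text; real systems typically carry this information in a transport header whose layout is not specified here.

```python
from dataclasses import dataclass

FRAME_MS = 20  # frame duration, assumed from the 20 ms example in the text


@dataclass(frozen=True)
class Packet:
    seq: int           # packet sequence number (hypothetical field name)
    timestamp_ms: int  # timestamp of the first frame carried by the packet
    frames: tuple      # one or more coded frames


def packetize(frames, frames_per_packet=1):
    """Group coded frames into packets carrying a sequence number and timestamp."""
    packets = []
    for seq, start in enumerate(range(0, len(frames), frames_per_packet)):
        chunk = tuple(frames[start:start + frames_per_packet])
        packets.append(Packet(seq, start * FRAME_MS, chunk))
    return packets


def reorder(received):
    """Restore transmit order from packets that arrived out-of-order."""
    return sorted(received, key=lambda p: p.seq)


frames = [b"f0", b"f1", b"f2", b"f3"]
sent = packetize(frames, frames_per_packet=2)
arrived = [sent[1], sent[0]]  # simulate out-of-order arrival
restored = reorder(arrived)
```

Sorting by sequence number is sufficient only for this illustration; a practical receiver would also need to handle sequence-number wraparound and lost packets.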
A common approach for exploiting the fact that there may be low voice activity in conversational speech is to classify the input signal as being either speech or silence (silence including only background noise), for example, by using a Voice Activity Detector (VAD) to determine whether a frame is voice or silence. The frames determined to be silence can then be transmitted at reduced data rates. It will be appreciated that there may be other applications where speech detection within the audio signal is not as important. In such applications, a VAD may be replaced by a Signal Activity Detector (SAD).
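As a rough illustration of frame classification, the following sketch labels a frame as voice when its mean power exceeds a fixed threshold. This energy-threshold rule is an assumption made for illustration only and is not the detector used by any particular codec or standard; the function names and the -40 dB threshold are likewise hypothetical.

```python
import math


def frame_energy_db(samples):
    """Mean power of a frame in dB relative to full scale (samples in [-1, 1])."""
    if not samples:
        return float("-inf")
    power = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(power) if power > 0 else float("-inf")


def is_voice(samples, threshold_db=-40.0):
    """Classify a frame as voice when its energy exceeds the assumed threshold."""
    return frame_energy_db(samples) > threshold_db


# One 20 ms frame at an assumed 8 kHz sampling rate is 160 samples.
loud = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
quiet = [0.001 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
```

Practical detectors usually add noise-floor tracking and hangover logic so that the trailing part of a talk spurt is not clipped.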
The coding of the audio frames may use continuous transmission (CTX), according to which data frames are continuously transmitted, even during periods of speech inactivity, or may use discontinuous transmission (DTX), according to which transmission is discontinued in time intervals of speech inactivity to reduce the overall transmission power. The International Telecommunication Union (ITU) has published several standards for coding and transmission, including the G.722 standard, according to which, in CTX mode, a lower rate coding mode is used to continuously encode the background noise when speech is absent. In DTX systems, the transmitter may be switched off during periods of speech inactivity. At the receiver side, to fill the gaps between talk spurts, a synthetic noise known as “comfort noise” may be generated, for example, using transmitted noise information.
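The DTX behaviour described above can be sketched as follows. The helper names are hypothetical, and the comfort noise here is simply low-level uniform random noise at an assumed fixed level, whereas a real system would typically shape the noise using transmitted noise information.

```python
import random


def dtx_transmit(frames, is_active):
    """Send only the frames marked active; inactive frames are not transmitted."""
    return [(i, f) for i, f in enumerate(frames) if is_active[i]]


def receive_with_comfort_noise(sent, total_frames, frame_len,
                               noise_level=0.01, seed=0):
    """Rebuild the frame sequence, filling untransmitted slots with synthetic noise."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    by_index = dict(sent)
    out = []
    for i in range(total_frames):
        if i in by_index:
            out.append(by_index[i])
        else:
            out.append([rng.uniform(-noise_level, noise_level)
                        for _ in range(frame_len)])
    return out


frames = [[0.5] * 4, [0.0] * 4, [0.4] * 4]
active = [True, False, True]
sent = dtx_transmit(frames, active)
played = receive_with_comfort_noise(sent, total_frames=3, frame_len=4)
```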
In a packetized system, the media frames are packetized for sending between endpoints such that a sequence of packets is sent from a transmitting endpoint. In a DTX system, two consecutive packets may be separated by a period of silence, or may be within the same talk spurt. As media packets traverse the network to an intended receiving endpoint, they experience a delay that depends, for example, on the respective route each may take, such that at a receiving endpoint, the packets arrive with different delays, possibly out-of-order, and with some packets lost or delayed by an amount that exceeds an acceptable level.
Consider a receiving endpoint, for example, a VoIP telephone or a VoIP bridge, that includes a jitter buffer. The packets arrive in a sequence that may or may not correspond to the sequence in which they were transmitted, and with different delays, causing what is known as jitter. The conventional approach to jitter buffering involves keeping a queue of packets to be played and, upon each fetch for playback, picking the next packet from an extraction point at the end of the queue. At the start of each talk spurt, an insertion point into the playback queue is chosen such that the insertion point is some target latency forward of the current fetch point of the buffer (the head of the buffer). That is, silence compression or silence expansion is used to approach the target latency. Silence expansion, for instance, involves adding empty entries into the packet queue when the first packet of a talk spurt is received.
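A minimal sketch of this conventional scheme follows, assuming one frame per packet, a target latency expressed as a whole number of frame slots, and an externally supplied flag marking the first packet of a talk spurt. The class and method names are hypothetical, and the sketch is not a complete jitter buffer: it ignores sequence-number wraparound and late packets from a previous talk spurt.

```python
from collections import deque


class JitterBuffer:
    def __init__(self, target_slots):
        self.target_slots = target_slots  # target latency in frame slots
        self.queue = deque()              # index 0 is the fetch point (head)
        self.fetched = 0                  # total number of fetches so far
        self.first_seq = None             # sequence number that anchors the spurt
        self.base = None                  # absolute slot of the anchoring packet

    def push(self, seq, payload, starts_spurt=False):
        """Insert a packet; return False if it arrived too late to be played."""
        if starts_spurt or self.first_seq is None:
            # Silence compression: drop empty slots at the tail of the queue.
            while self.queue and self.queue[-1] is None:
                self.queue.pop()
            # Silence expansion: add empty slots up to the target latency.
            while len(self.queue) < self.target_slots:
                self.queue.append(None)
            self.first_seq = seq
            self.base = self.fetched + len(self.queue)
        index = self.base + (seq - self.first_seq) - self.fetched
        if index < 0:
            return False  # its playback slot has already been fetched
        while len(self.queue) <= index:
            self.queue.append(None)
        self.queue[index] = payload
        return True

    def fetch(self):
        """Return the payload at the head, or None for an empty slot."""
        self.fetched += 1
        return self.queue.popleft() if self.queue else None


jb = JitterBuffer(target_slots=2)
jb.push(10, "a", starts_spurt=True)
jb.push(12, "c")  # arrives before packet 11
jb.push(11, "b")
played = [jb.fetch() for _ in range(6)]
late_accepted = jb.push(9, "z")
```

In this usage, the two empty slots ahead of packet 10 represent the target latency, and the out-of-order packet 11 still lands in its correct playback slot because the slot position is derived from the sequence number.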
The target latency may be conventionally pre-determined by maintaining statistics, for example, a histogram of observations of network jitter, and pre-setting the target latency to some high percentile of the jitter. For example, the target latency of the jitter buffer may be conventionally configured to track the 95th percentile of network jitter. If only concealments are counted, this means that 5% of packets will arrive too late to be played out, and the playback mechanism will include some signal processing to conceal the resulting gaps in the media stream. Thus, conventionally, it is upon entry into the jitter buffer that a decision is required as to how to carry out silence compression or expansion to approach the pre-set target latency.
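The percentile-based choice of target latency can be illustrated with a short sketch. It assumes that jitter observations are available as a list of per-packet delays in milliseconds and uses the nearest-rank percentile definition; both the function names and that choice of definition are assumptions made for illustration.

```python
import math


def target_latency_ms(delays_ms, percentile=95.0):
    """Nearest-rank percentile of the observed delays, used as target latency."""
    if not delays_ms:
        raise ValueError("at least one delay observation is required")
    ordered = sorted(delays_ms)
    rank = math.ceil(percentile * len(ordered) / 100.0)
    return ordered[max(rank, 1) - 1]


def late_fraction(delays_ms, target_ms):
    """Fraction of packets whose delay exceeds the target latency."""
    late = sum(1 for d in delays_ms if d > target_ms)
    return late / len(delays_ms)


delays = list(range(1, 101))        # synthetic observations: 1 ms to 100 ms
target = target_latency_ms(delays)  # 95th percentile of the observations
late = late_fraction(delays, target)
```

With these synthetic observations, the packets delayed by more than the target are the ones that would have to be concealed at playback.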