Communication networks used by computing devices, such as Internet Protocol (IP) networks, transport data in packets. Packets are bundles of data organized in a specific way for transmission. A packet includes a header and a body. The body contains data, and the header contains control information such as the destination address, the size of the packet, and an error-checking code. Data from a computing device is inserted into a packet, and the packet is transmitted to another computing device that extracts and uses the data. For example, a computing device connected to a microphone may be used to record a spoken message and, using packets, transport the spoken message to a second computing device that plays back the spoken message through a speaker.
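The header-and-body structure described above can be illustrated with a minimal sketch. The header layout used here (a 4-byte destination address, a 2-byte body size, and a CRC-32 of the body) is an assumption chosen for illustration and is not the layout of IP or of any other real protocol. The names `build_packet` and `parse_packet` are likewise hypothetical.

```python
import struct
import zlib
from typing import Tuple

# Hypothetical header layout (illustration only, not a real protocol):
#   4 bytes destination address, 2 bytes body size, 4 bytes CRC-32 of the body.
# "!" selects network byte order with no padding.
HEADER_FORMAT = "!4sHI"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def build_packet(destination: bytes, body: bytes) -> bytes:
    """Prepend a header (address, size, error-checking code) to the body."""
    header = struct.pack(HEADER_FORMAT, destination, len(body), zlib.crc32(body))
    return header + body


def parse_packet(packet: bytes) -> Tuple[bytes, bytes]:
    """Split a packet into (destination, body), verifying size and checksum."""
    destination, size, checksum = struct.unpack(HEADER_FORMAT, packet[:HEADER_SIZE])
    body = packet[HEADER_SIZE:HEADER_SIZE + size]
    if len(body) != size or zlib.crc32(body) != checksum:
        raise ValueError("packet is truncated or corrupted")
    return destination, body
```

A receiver would call `parse_packet` on each packet it reads and use the returned body only if no error is raised.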
To transport a spoken message using packets, the spoken message is recorded as an analog audio signal. An analog-to-digital converter (ADC) converts the audio signal to a digital signal. A coder/decoder (codec) converts the digital signal into coded binary data. Encoding usually involves compressing the data. The binary data is broken into distinct frames and placed in a buffer. A packetizer extracts one or more frames from the buffer and places the frames into one or more packets. The packets are transmitted over a network to the playback computing device. A packet reader reads the packets, extracts one or more frames from the packets, and places the frames into a buffer. The frames are extracted from the buffer, and the encoded binary data included in the frames is decoded and converted into a digital signal by a codec. The digital signal is converted to an analog audio signal by a digital-to-analog converter (DAC). The audio signal drives a speaker, which reproduces the original spoken message.
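The framing, packetizing, and packet-reading steps of this pipeline can be sketched as follows. This is a simplified illustration under stated assumptions: the ADC, codec, and DAC stages are omitted and a byte string stands in for the encoded data, packets are modeled as plain lists of frames without headers, and the frame size, frames-per-packet count, and all function names are hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, List

FRAME_SIZE = 4         # bytes per frame (assumed; real codecs define their own)
FRAMES_PER_PACKET = 2  # frames carried by each packet (assumed)


def split_into_frames(encoded: bytes) -> Deque[bytes]:
    """Break encoded data into frames and place them in a send buffer."""
    return deque(
        encoded[i:i + FRAME_SIZE] for i in range(0, len(encoded), FRAME_SIZE)
    )


def packetize(send_buffer: Deque[bytes]) -> List[List[bytes]]:
    """Drain the send buffer, placing up to FRAMES_PER_PACKET frames in each packet."""
    packets = []
    while send_buffer:
        count = min(FRAMES_PER_PACKET, len(send_buffer))
        packets.append([send_buffer.popleft() for _ in range(count)])
    return packets


def read_packets(packets: Iterable[List[bytes]]) -> Deque[bytes]:
    """Extract the frames from each received packet into a receive buffer."""
    receive_buffer: Deque[bytes] = deque()
    for packet in packets:
        receive_buffer.extend(packet)
    return receive_buffer


def reassemble(receive_buffer: Deque[bytes]) -> bytes:
    """Join buffered frames back into the encoded data handed to the decoder."""
    return b"".join(receive_buffer)
```

In this sketch the packets arrive in order and none are lost, so `reassemble(read_packets(packetize(split_into_frames(data))))` returns the original `data`; a real network provides neither guarantee.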
Because communication networks are assemblies of physical devices, packets that are not lost take a finite amount of time to be delivered. The packet delivery time varies due to various sources of delay, such as, but not limited to, the physical distance packets travel over transmission lines, performance variations of the network routers and switches used to route the packets, and “clock drift,” the timing differences between the computing devices that transmit and receive the packets. Depending on the number and types of delay sources a packet encounters while being transmitted, the duration of the delay varies over time. The variation in the delay of packets is called “statistical dispersion” or, less formally, “jitter.” The more jitter in a network, the more difficult it is to maintain a constant packet delivery rate, which, in turn, makes it more difficult to accurately reproduce an audio signal sent over the network.
Practically, jitter may be defined as the maximum packet delay minus the minimum packet delay over a short time period, e.g., a few milliseconds. The absolute value of the difference between the maximum packet delay and the minimum packet delay, i.e., the jitter, is not as important as having a buffer large enough to contain the number of packets received during the short time period, i.e., the short term. Measuring jitter enables techniques for adapting an audio signal to accurately reproduce the audio output the signal represents. Preferably, signal adaptation is provided over the long term, i.e., in response to changes in the packet delay over a relatively long period of time, e.g., about a second. If the long-term packet delay increases, the audio signal is expanded. If the long-term packet delay decreases, the audio signal is contracted. There are many ways to contract and expand audio signals. For example, to contract an audio signal, small segments of the signal that contain little or no useful information may be removed; to expand an audio signal, small segments of the signal may be copied and repeated.
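The practical jitter definition and the two example adaptation techniques can be sketched as follows. This is a minimal illustration, not a definitive implementation: treating near-zero samples as "little or no useful information," comparing only two long-term delay values, and all function names, parameters, and default values are assumptions made for the example.

```python
from typing import List, Sequence


def short_term_jitter(delays_ms: Sequence[float]) -> float:
    """Jitter over one short window: maximum packet delay minus minimum packet delay."""
    return max(delays_ms) - min(delays_ms)


def contract(samples: List[float], segment: int, silence_threshold: float) -> List[float]:
    """Remove the first segment whose samples are all near zero.

    Near-zero amplitude is an assumed stand-in for "little or no useful
    information". The input is returned unchanged if no such segment exists.
    """
    for start in range(0, len(samples) - segment + 1, segment):
        chunk = samples[start:start + segment]
        if all(abs(s) <= silence_threshold for s in chunk):
            return samples[:start] + samples[start + segment:]
    return list(samples)


def expand(samples: List[float], start: int, segment: int) -> List[float]:
    """Copy samples[start:start + segment] and repeat it immediately after itself."""
    chunk = samples[start:start + segment]
    return samples[:start + segment] + chunk + samples[start + segment:]


def adapt(samples: List[float], previous_delay_ms: float, current_delay_ms: float,
          segment: int = 2, silence_threshold: float = 0.01) -> List[float]:
    """Expand when the long-term delay rises, contract when it falls."""
    if current_delay_ms > previous_delay_ms:
        return expand(samples, 0, segment)
    if current_delay_ms < previous_delay_ms:
        return contract(samples, segment, silence_threshold)
    return list(samples)
```

For example, `short_term_jitter([20.0, 35.0, 22.5])` evaluates to `15.0`. Production systems typically use more careful techniques than raw sample removal or repetition in order to avoid audible artifacts; the sketch only mirrors the two examples given in the text.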
Compensating for jitter by signal contraction or expansion must be done carefully and not to excess. If, for example, the audio signal encodes a person's voice and the audio signal is contracted too much, the resulting speech may sound too fast. If the same audio signal is expanded too much, the resulting speech may sound too slow. Thus, the adjustments made to compensate for jitter must be made slowly enough and carefully enough that the original speech is adequately reproduced.
Traditional methods for determining when to apply jitter compensation techniques, such as signal contraction and expansion, often require that the sources of jitter be measured, quantified, and recorded as values. The values are then used to determine when to apply techniques that compensate for the effects of jitter.