A telephony application enables transmission of real-time audio data over a packet-based network. Applications include, to name a few, voice over private Internet Protocol (IP) backbones, the Internet, or intranets; messaging; and streaming audio playback, such as music or announcements. The most popular application is IP Telephony, that is, any telephony application that enables voice transmission via Internet Protocol (VoIP). This technology allows a device to transmit voice as just another form of data over the same IP network. For the purposes of this patent application, we also consider the audio transmissions in a video conference to be a form of IP Telephony. IP Telephony comprises numerous applications that support connections such as PC-to-PC connections, PC-to-phone connections, and phone-to-phone connections.
The crux of VoIP lies in converting an analog signal to digital IP packets (A/D), transmitting the IP packets over a network, and converting the IP packets back into a playable analog signal (D/A). At the transmitting end, a device generally digitizes the signal at a specific sampling rate, encodes that digital data into frames, converts the frames into IP packets, and transmits the IP packets over an IP network. At the receiving end, a device typically receives the packets, extracts the digital data from the packets, and converts the digital data into analog output at the same sampling rate as that used by the transmitter.
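By way of illustration only, the framing and packetizing steps described above might be sketched as follows. The 20 ms frame length, the four-byte header (sequence number and sample count), and the function names `packetize` and `depacketize` are assumptions made for this sketch; they do not correspond to any particular protocol or codec, and sequence-number wraparound is not handled.

```python
import struct

SAMPLE_RATE = 8000    # samples per second (assumed nominal rate)
FRAME_SAMPLES = 160   # 20 ms of audio per frame at 8000 Hz (assumed)

def packetize(samples):
    """Split 16-bit PCM samples into frames and prefix each with a small header."""
    packets = []
    for seq, start in enumerate(range(0, len(samples), FRAME_SAMPLES)):
        frame = samples[start:start + FRAME_SAMPLES]
        # Hypothetical header: 16-bit sequence number, 16-bit sample count.
        header = struct.pack("!HH", seq & 0xFFFF, len(frame))
        payload = struct.pack(f"!{len(frame)}h", *frame)
        packets.append(header + payload)
    return packets

def depacketize(packets):
    """Reorder packets by sequence number and recover the sample stream."""
    decoded = []
    for pkt in packets:
        seq, count = struct.unpack("!HH", pkt[:4])
        frame = struct.unpack(f"!{count}h", pkt[4:4 + 2 * count])
        decoded.append((seq, frame))
    samples = []
    for _, frame in sorted(decoded):
        samples.extend(frame)
    return samples
```

In this sketch, the receiver recovers the original sample sequence even if the packets arrive out of order, but nothing in it addresses when each frame should be played.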
VoIP has both advantages and disadvantages when compared with traditional (e.g., PSTN) digital telephony systems. As for the advantages, the technology operates on the existing infrastructure, utilizing PSTN switches, customer premises equipment, and Internet connections. IP Telephony also improves the efficiency of bandwidth use for real-time voice transmission. Of particular interest, IP Telephony offers a new line of applications that combine real-time voice communication and data processing.
Regarding the disadvantages, VoIP and packet communication introduce issues of “reassembling” the packets, that is, playing the packets as if the packets were the original, continuous analog signal. Playing the IP packets appears simple; the receiving station could, upon receiving IP packets, convert the IP packets to an analog signal and immediately play the analog signal. Playing the packets upon reception, however, would resemble an accurate reconstruction only if the sender transmits the packets at uniform intervals, the packets traverse the network with consistent delay, and the packets successfully reach the receiver. Each of these premises is often false. At times, starvation periods exist where the receiver has no packet to play, and at other times, burst periods overwhelm the receiver with too many packets to play. This non-uniformity is generally referred to as “jitter.”
Accordingly, to account for this “jitter,” most applications employ a buffer. A buffer loads incoming packets or frames to allow the receiver to retrieve and play the packets or frames at a uniform rate. The number of frames or packets in the buffer can fluctuate up and down with the network jitter. As long as the buffer never empties or overflows, the receiver will be able to play at its uniform rate, without audio disturbances. This buffering technique exists in most real-time media systems that receive audio or video from a network.
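The buffering technique described above might be sketched, for illustration only, as a fixed-capacity first-in, first-out queue that is filled by bursty arrivals and drained at one frame per playback tick. The class name `JitterBuffer`, the function `simulate`, and the arrival pattern used below are hypothetical and are chosen only to show the behavior.

```python
from collections import deque

class JitterBuffer:
    """Hypothetical fixed-capacity FIFO of audio frames."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.frames = deque()
        self.dropped = 0     # frames discarded because the buffer was full
        self.underruns = 0   # playback ticks on which the buffer was empty

    def push(self, frame):
        if len(self.frames) >= self.capacity:
            self.dropped += 1
            return False
        self.frames.append(frame)
        return True

    def pop(self):
        if not self.frames:
            self.underruns += 1
            return None      # the receiver has nothing to play
        return self.frames.popleft()

def simulate(arrivals_per_tick, capacity, prefill):
    """Feed a bursty arrival pattern in; play one frame out per tick."""
    buf = JitterBuffer(capacity)
    seq = 0
    for _ in range(prefill):             # frames buffered before playback starts
        buf.push(seq)
        seq += 1
    played = []
    for arrivals in arrivals_per_tick:
        for _ in range(arrivals):
            buf.push(seq)
            seq += 1
        played.append(buf.pop())
    return buf, played

# Bursty arrivals: two starved ticks, then a burst of three frames, and so on.
buf, played = simulate([0, 0, 3, 1, 0, 2, 1, 1], capacity=8, prefill=2)
```

With two frames buffered before playback begins, this arrival pattern plays every frame in order with no underruns; with `prefill=0`, the same pattern leaves the receiver with nothing to play on the first two ticks.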
The buffer, however, cannot compensate for a mismatch between the sender transmission rate and the receiver playback rate (or buffer output rate). In traditional digital telephony systems, a master clock synchronizes end points to ensure that the D/A and A/D converters at both ends operate at identical sampling rates. Identical sampling rates ensure that, on average, the data transmission rate will equal the receiver output rate. In contrast, in IP Telephony, no master clock exists to synchronize the sampling rates. In VoIP systems, it is common to employ personal computers, or similar hardware, with sound cards that have inaccurate sampling rates. Sound cards set at 8000 samples per second, for example, can actually have sampling rates that vary between 7948 and 8130 samples per second. For PC-based VoIP and videoconferencing systems, the clocks are not necessarily accurate enough to guarantee identical sampling rates. As a result, a receiver that operates at a slightly higher sampling rate will play back data faster than the sender transmits the data, ultimately emptying the buffer and requiring the receiver to play periods of “silence.” A receiver that operates at a slightly lower sampling rate will play data more slowly than the sender transmits the data. With the receiver steadily falling behind, the data will ultimately overwhelm the buffer, requiring the receiver to “discard” periods of playback data (frames or packets). Increasing the buffer size fails to remedy the problem because the concomitant delay between transmission and actual playback becomes unacceptable for real-time audio transmission.
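The effect of such a sampling-rate mismatch can be estimated with simple arithmetic. The following sketch assumes, purely for illustration, a sender running at exactly 8000 samples per second, a buffer with a capacity of 1600 samples (200 ms), and a starting fill of 800 samples; the function name and these buffer figures are hypothetical, while the receiver rates of 8130 and 7948 samples per second are the example extremes given above.

```python
def seconds_until_buffer_fault(sender_rate, receiver_rate,
                               buffered_samples, capacity_samples):
    """Estimate the time until the buffer empties or overflows due to clock drift.

    Rates are in samples per second. Returns None when the rates match.
    Network jitter is ignored; only the steady drift is modeled.
    """
    drift = sender_rate - receiver_rate   # net samples gained per second
    if drift == 0:
        return None
    if drift < 0:
        # Receiver plays faster than the sender transmits: the buffer drains.
        return buffered_samples / -drift
    # Receiver plays slower than the sender transmits: the buffer fills.
    return (capacity_samples - buffered_samples) / drift

# Fast receiver: 800 samples drained at 130 samples/s, roughly 6.2 seconds.
fast = seconds_until_buffer_fault(8000, 8130, 800, 1600)

# Slow receiver: 800 samples of headroom filled at 52 samples/s, roughly 15.4 seconds.
slow = seconds_until_buffer_fault(8000, 7948, 800, 1600)
```

Under these assumptions, a buffer that starts half full fails within seconds in either direction, and doubling the buffer only doubles the time to failure while adding delay.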
A common solution is to insert “silent” periods when the buffer approaches depletion and to remove “silent” periods when the buffer approaches capacity. This solution has numerous flaws. From a hardware perspective, problems include detecting periods of silence and handling the requisite additional processing. From a user perspective, any insertion or deletion of “silent” periods degrades the conversation, as no true periods of silence exist in VoIP applications. Therein lies the rub: the inherent difference between the human eye and ear. While a video frame may be left on display a split second longer than the next frame without human detection, a tone cannot simply be left playing. Accordingly, the prior art focuses on inserting sound periods or removing sound periods, seemingly the only suitable way to manipulate the flow rate of audio data in a real-time environment. See, e.g., U.S. Pat. No. 6,658,027 (“Jitter Buffer Management”).
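For illustration only, the silence insertion and removal approach described above might be sketched as follows. The energy-based silence test, the threshold value, the low and high watermarks, and the function names are all assumptions made for this sketch and are not taken from the cited reference.

```python
def frame_energy(frame):
    """Mean squared amplitude of a frame of PCM samples."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0

def is_silent(frame, threshold=100.0):
    """Hypothetical silence test: energy below an assumed threshold."""
    return frame_energy(frame) < threshold

def adjust(buffer, low_mark, high_mark, frame_len=160, threshold=100.0):
    """Insert or remove one silent frame based on buffer depth.

    `buffer` is a list of frames, oldest first, and is modified in place.
    Returns a string naming the action taken.
    """
    if len(buffer) <= low_mark:
        # Near depletion: play an artificial silent frame next.
        buffer.insert(0, [0] * frame_len)
        return "inserted"
    if len(buffer) >= high_mark:
        # Near capacity: drop the oldest frame that tests as silent.
        for i, frame in enumerate(buffer):
            if is_silent(frame, threshold):
                del buffer[i]
                return "removed"
        return "no_silence_found"
    return "unchanged"
```

The `"no_silence_found"` outcome reflects the difficulty noted above: when the buffer is near capacity but no buffered frame is actually silent, this approach has nothing it can remove without discarding audible sound.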
The foregoing illustrates that, during real-time audio transmission over a network, a need exists to continually monitor the buffer and adjust the playback rate of a receiver to account for variances in sampling rates among transmitters and receivers.