This invention relates to voice-over-Internet-Protocol (VoIP) systems, and more particularly to control of audio data flow to and from a sound card.
Telephone calls can now use the Internet rather than traditional telephone lines. Voice-over-Internet-Protocol (VoIP) applications capture a user's voice, digitize and compress the voice, and transmit the coded voice as data inside Internet-protocol (IP) packets that can be sent over the Internet.
VoIP applications can be installed on personal computers (PCs), other devices connected to the Internet, or on translation servers such as Internet-to-Telephone gateways or Protocol Converters. Each party to a call runs a local copy or client of the VoIP application. When a PC is used, the VoIP application typically uses the existing sound card installed on the PC to play the remote caller's voice on a speaker, and to capture the local user's voice from a microphone plugged into the sound card.
FIG. 1 is a diagram of a prior-art VoIP system. VoIP application A on PC 10 is operated by user A while VoIP application B on PC 12 is operated by user B at different nodes on the Internet. User A's speech is captured by a microphone plugged into a sound card in PC 10. The captured voice is digitized, coded, compressed, and fitted into IP packets by VoIP application A on PC 10. These IP packets containing user A's voice are routed over Internet 16 to VoIP application B on PC 12.
VoIP application B on PC 12 receives these IP packets, extracts and de-compresses the voice data, and sends the voice data to a sound card on PC 12, which generates audio signals to drive a speaker that plays the voice as audio to user B. User B's voice is then captured by a microphone attached to the sound card, converted to digital signals, and coded, compressed, and fitted into IP packets by VoIP application B on PC 12. The IP packets containing user B's voice are also routed over Internet 16 back to VoIP application A on PC 10 for playback to user A, achieving a full-duplex voice call.
A wide variety of sound cards from many different manufacturers may be installed on any given PC. These sound cards often are controlled and driven from the PC by standard software interfaces such as Windows multi-media input-output (MMIO) wave drivers by Microsoft Corp. Originally sound cards were designed for basic (half-duplex) tasks such as playing sound effects in early PC games. Simultaneously capturing voice while playing the speaker was not a design priority. More recently, VoIP applications need full-duplex audio, yet the sound cards and their interfaces are not optimized for such full-duplex tasks.
FIG. 2 shows a prior-art VoIP application using large audio buffers to a sound card during a full-duplex voice call. VoIP application 30′ is running on the local PC that has sound card 20 installed. Incoming voice data is received in IP packets from the Internet from a remote caller. The remote caller's voice data is extracted from these IP packets and decoded as voice data “V”. This remote voice data is loaded into buffers such as buffer 26′ on the PC and then sent to sound card 20 as buffer 26″. Buffer 26″ goes to the top of the first-in-first-out (FIFO) stack of buffers that includes other buffers 38 that should be played before buffer 26″, and next buffer 32, which is to be played once the current buffer has finished playing its voice data on speaker 22.
Once all the voice data in a buffer has been played to speaker 22, the empty buffer 26 can be recycled to the PC and re-loaded with more recent voice data from the remote user. Buffers could be destroyed (deleted) and new buffers generated on the PC, but typically the Windows MMIO re-uses the buffers after playback. The old voice data is typically still in the buffer, but it is over-written with new voice data from VoIP application 30′. Alternatively, pointers to the buffers may be transferred between VoIP application 30′ and the Windows MMIO subsystem.
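The recycling of speaker buffers described above can be illustrated with a minimal sketch. This is a toy model only, not the Windows MMIO interface: the class name SpeakerQueue and its methods submit and play_one are hypothetical names introduced for illustration, and the sketch assumes a fixed pool of buffers that alternate between the application and the sound card.

```python
from collections import deque


class SpeakerQueue:
    """Toy model of the speaker-side buffer FIFO (hypothetical, not MMIO)."""

    def __init__(self, num_buffers):
        # Empty buffers held by the application, ready to be filled.
        self.free = deque(bytearray() for _ in range(num_buffers))
        # Filled buffers queued on the sound card, oldest first.
        self.queued = deque()

    def submit(self, voice_data):
        """Fill a free buffer with voice data and queue it for playback."""
        if not self.free:
            return False  # every buffer is still waiting on the sound card
        buf = self.free.popleft()
        buf[:] = voice_data  # any stale contents are simply over-written
        self.queued.append(buf)
        return True

    def play_one(self):
        """Play the oldest queued buffer, then recycle it to the application."""
        if not self.queued:
            return None  # queue ran dry: playback would pause
        buf = self.queued.popleft()
        played = bytes(buf)
        self.free.append(buf)  # old data stays in place until re-filled
        return played
```

In this sketch a newly submitted buffer is always played after every buffer already queued, which is the source of the queue delay in the prior-art arrangement.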
Microphone 24 captures the local user's voice and writes the digitized microphone (mic) data “M” into a current mic buffer 34 on sound card 20. Sound card 20 has an inventory of empty buffers 36 ready to be filled with microphone data. Once buffer 28 is filled with mic data, buffer 28 is passed back to the Windows MMIO on the PC, and VoIP application 30′ reads the mic data from buffer 28′, processes the mic data, and sends it over the Internet to the remote caller using IP packets.
Once the mic data has been read from buffer 28′ (or a copy of buffer 28′ made), then the empty buffer 28″ can be sent back to sound card 20 and added to the inventory of empty mic buffers. Thus full and empty microphone buffers and voice (speaker) buffers are passed and recycled between the PC and sound card 20.
Most sounds on PCs are produced by loading a digital representation of the sound onto the sound card in large (or entire) chunks, and then the sound card produces the requested sound. Buffers 26, 32, 38 each typically contain 60-200 milliseconds (ms) or more of audio data. Similarly, sounds captured by the microphone are often buffered into large chunks (60-200 ms buffers 28, 34, 36) that can be stored on disk whenever convenient. While such large buffers may be efficient for the PC, the large audio length may cause timing issues such as latency, alignment of incoming and outgoing audio, and clock accuracy, as audio is aligned at the boundaries of lengthy buffers.
The MMIO interface is limited in its ability to determine the exact timing that the sound card is using in playing buffers of audio. Applications hand buffers to the MMIO layer, and at some future time the MMIO layer hands buffers back to be recycled. There is no query in MMIO to determine which buffer is currently being played back, or to determine the number of buffers on the sound card. There is no mechanism that can reliably signal, in very small time increments (below 60 to 100 ms), when the speaker queue on the sound card is about to go empty. Other, more sophisticated interfaces do exist, but are not supported as widely. For example, DirectX 8.0 has more alignment and buffer signal choices, but can only be used on Windows XP.
There may be a significant delay between the time when a buffer 26″ of the remote caller's voice data is loaded into the top of the playback queue and the time when the buffer 26″ is finally played by the speaker, since other buffers 38, 32 must be played first, and these can be long buffers. For example, when 5 buffers of 200 ms of voice data are waiting to be played, the total queue delay is 1 second. A one-second delay in playback can be noticeable and quite annoying in a phone call. The general goal for VoIP is a total delay of no more than 125 to 250 ms for the entire trip from one user to the other, including all the delays across the Internet.
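The queue-delay arithmetic in this example can be written out as a short sketch. The function name queue_delay_ms is a hypothetical name introduced here, and the sketch assumes that all waiting buffers hold the same length of audio and are played back-to-back without gaps.

```python
def queue_delay_ms(buffers_waiting, buffer_ms):
    """Delay before a newly queued buffer starts playing, in milliseconds.

    Assumes every waiting buffer holds buffer_ms of audio and that
    playback proceeds without gaps (an idealization).
    """
    return buffers_waiting * buffer_ms


# Five 200 ms buffers ahead in the queue, as in the example in the text.
print(queue_delay_ms(5, 200))  # 1000 ms, i.e. 1 second

# Five buffers at the small end of the 60-200 ms range.
print(queue_delay_ms(5, 60))   # 300 ms
```

Under these assumptions, even the smaller 60 ms buffers produce a queue delay that alone exceeds the 125 to 250 ms end-to-end goal when five of them are waiting.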
Since the microphone data buffers tend to be sent back to the PC immediately once filled, delays in mic data are less of a problem. The mic queue has empty buffers while the speaker queue has buffers full of voice data, so the speaker queue is especially a problem as it can add audio delays to playback. These delays can be significant when large buffers are used since the worst-case latency includes the delay to fill the mic buffer.
Issues of timing, clock accuracy, full-duplex operation (using both microphone and speaker feeds at the same time), latency, and alignment are not important for many computer sound tasks, and thus the interfaces and designs of sound cards and their drivers on many personal computers do not lend themselves to efficient low-latency full-duplex streaming. Software drivers, operating systems, and other components can further alter timing. The use of large audio buffers compounds these timing problems.
Sound cards vary widely in actual performance. Erratic behavior is sometimes observed in playback rates and transfer timing of the speaker buffers. Empty speaker buffers may be recycled after varying delays rather than precisely in sync with the audio playback timing. If the inventory of speaker buffers becomes empty, playback will pause, noticeably degrading the audio quality heard by the user. Thus the sound card is normally passed all speaker buffers as soon as possible, keeping the inventory of speaker buffers on the sound card as full as possible. This large inventory of speaker buffers increases latency as a large queue is used. Empty speaker buffers are then re-filled and returned to the sound card as soon as possible by the VoIP application.
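The trade-off described above, between playback pauses when the speaker inventory runs empty and added latency when the inventory is kept full, can be illustrated with a small deterministic sketch. The function simulate_speaker_queue, its parameters, and the refill times used below are all hypothetical values chosen for illustration; the sketch assumes equal-length buffers and a sound card that plays continuously whenever data is queued.

```python
def simulate_speaker_queue(refill_times_ms, buffer_ms, prefill):
    """Return (idle_ms, max_wait_ms) for a toy speaker queue.

    refill_times_ms: sorted times at which the application hands one
        more full buffer to the sound card (hypothetical input).
    buffer_ms: length of audio in each buffer.
    prefill: number of full buffers already queued at time 0.
    """
    play_until = prefill * buffer_ms  # time at which queued audio runs out
    idle_ms = 0      # total time the speaker sat silent waiting for data
    max_wait_ms = 0  # longest time a new buffer waited before playing
    for t in refill_times_ms:
        if t > play_until:
            # The inventory ran empty before this buffer arrived.
            idle_ms += t - play_until
            play_until = t
        max_wait_ms = max(max_wait_ms, play_until - t)
        play_until += buffer_ms
    return idle_ms, max_wait_ms


# A late third refill (350 ms instead of 300 ms) with 100 ms buffers.
print(simulate_speaker_queue([100, 200, 350], 100, prefill=1))  # (50, 0)
print(simulate_speaker_queue([100, 200, 350], 100, prefill=2))  # (0, 100)
```

With one buffer queued in advance, the late refill causes a 50 ms pause; with two queued in advance, the pause disappears but each new buffer waits up to 100 ms before it is played.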
What is desired is a VoIP system that more efficiently buffers audio to and from the sound card. Improved reliability and performance of streaming full-duplex audio to and from the multimedia sound subsystem of a computer such as a Windows PC is desirable. Reduction of the number of buffers in the speaker queue and the use of smaller audio buffers in the speaker queue are also desirable. A more tightly-coupled and adaptive full-duplex audio-buffer management scheme is desired.