The invention relates to managing audio packets in a packet-based audio system with multiple speakers and, more particularly, to adaptively selecting which audio packets to mix together as an audio signal depending on the active status of the multiple speakers.
In audio applications, there is often the need to talk with more than one speaker at a time. Functions such as conferencing and three-way calling require a receiver to separately handle multiple simultaneous and serial audio streams. In a circuit-switched system, such as the Public Switched Telephone Network (PSTN), these functions are typically handled either by an edge switch or a special purpose device called an "audio bridge" or Multipoint Control Unit (MCU). In a packet audio system, there are better solutions to the transport of audio, such as using multicast transmission. These techniques, however, require the receivers to perform many of the processing functions of an MCU.
The processing-intensive task of handling calls from multiple speakers cannot reasonably be assigned to the limited processing resources of individual receivers. For example, an interactive conference call conducted for a seminar might include hundreds of callers. Individual receivers do not have the processing resources to even track the state information for every caller at the seminar, much less process the packets for each participant in the call.
Receiver-based systems do exist that process calls from multiple speakers. However, most receiver-based systems can only unintelligently select one of the speakers for playout at a time. Alternatively, receiver-based systems attempt to mix all speakers together until receiver resources are exhausted, which ends up producing an unintelligible playout. Therefore, MCUs must be used even though the packet transmission system can deliver the audio packets directly to all the receivers using multicast. Using one MCU introduces a single point of failure, along with additional overhead and delay in the audio system as a whole.
Accordingly, a need remains for a receiver-based audio packet management system that intelligently selects which audio packets to mix together.
A receiver manages multiple speakers in a packet network. A packet gateway receives audio packets from the multiple speakers over the packet network. Memory in the receiver stores the audio packets and information about the multiple speakers in the telephone call. A processor selects which speaker audio packets and speaker information to retain in memory. The processor determines which of the selected audio packets to store in memory and mix together to produce an audio output signal by determining from the stored speaker information which of the multiple speakers are actively talking.
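The mixing step described above can be illustrated with a minimal sketch. It is not taken from the specification: the 16-bit signed PCM sample format, the saturating sum, and the names `mix_active`, `frames`, and `active` are all assumptions made for illustration. The sketch only shows that audio from speakers judged inactive is left out of the output signal.

```python
from typing import Dict, List, Set

# Assumed sample format: 16-bit signed linear PCM (not stated in the text).
PCM_MIN, PCM_MAX = -32768, 32767


def mix_active(frames: Dict[str, List[int]], active: Set[str]) -> List[int]:
    """Sum one frame per active speaker, sample by sample, with saturation.

    `frames` maps a hypothetical speaker identifier to that speaker's
    decoded samples; `active` holds the speakers judged to be talking.
    """
    selected = [f for sid, f in frames.items() if sid in active]
    if not selected:
        return []  # nobody is talking, so there is nothing to mix
    length = max(len(f) for f in selected)
    mixed = []
    for i in range(length):
        # Frames shorter than the longest one simply contribute nothing.
        total = sum(f[i] for f in selected if i < len(f))
        # Clamp to the sample range instead of letting the sum wrap around.
        mixed.append(max(PCM_MIN, min(PCM_MAX, total)))
    return mixed
```

For example, with frames from a professor, one student, and an idle participant, only the first two would be passed in `active`, and the idle participant's samples would never reach the output.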
The speaker information is kept in an indexed data array that identifies the speakers, a talking status for the speakers, and a pointer to buffers in memory that retain the audio packets for the speakers. Speaker entries in the data array also include a Least Recently Used (LRU) time indicating a local time the last audio packet was received for the speaker. The processor uses the LRU time to determine which speakers are actively talking and which speakers have stopped talking. A Talkspurt Packet Count indicates a single connected segment of audio. The processor uses the Talkspurt Packet Count to distinguish audio packets coming from a speaker who is actively talking from audio packets containing background noise.
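One possible layout for a speaker entry, and for the two tests that use the LRU time and the Talkspurt Packet Count, is sketched below. The field names, the `SILENCE_TIMEOUT` and `MIN_TALKSPURT_PACKETS` thresholds, and the way a new talkspurt is signalled are hypothetical; the text above does not give concrete values or rules.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical thresholds, chosen only to make the example concrete.
SILENCE_TIMEOUT = 0.5        # seconds without a packet before "stopped talking"
MIN_TALKSPURT_PACKETS = 3    # shorter bursts are treated as background noise


@dataclass
class SpeakerEntry:
    speaker_id: str = ""                 # identifies the speaker
    talking: bool = False                # talking status
    lru_time: float = 0.0                # local time of the last received packet
    talkspurt_packet_count: int = 0      # packets in the current talkspurt
    buffer: List[bytes] = field(default_factory=list)  # retained audio packets


def record_packet(entry: SpeakerEntry, payload: bytes, now: float,
                  new_talkspurt: bool) -> None:
    """Update one entry when a packet arrives for its speaker."""
    if new_talkspurt:
        entry.talkspurt_packet_count = 0  # a new connected segment begins
    entry.talkspurt_packet_count += 1
    entry.lru_time = now
    entry.buffer.append(payload)


def has_stopped_talking(entry: SpeakerEntry, now: float) -> bool:
    """LRU test: no packet has arrived within the assumed silence timeout."""
    return now - entry.lru_time > SILENCE_TIMEOUT


def is_real_speech(entry: SpeakerEntry) -> bool:
    """Talkspurt test: enough consecutive packets to rule out a noise burst."""
    return entry.talkspurt_packet_count >= MIN_TALKSPURT_PACKETS
```

A receiver following this sketch would keep a fixed-length list of `SpeakerEntry` objects as the indexed data array and run the two tests whenever it decides which entries to mix.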
The receiver identifies the status of the speaker entries in the data array as actively talking (A), not actively talking but valid (V) or not in use (F). Depending on available processing resources, the speaker status, LRU time, and Talkspurt Packet Count, speaker entries are stored, discarded or changed in the data array and audio packets from speakers are either stored or discarded in memory.
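The status handling described above can be sketched as a small table-management policy. This is only one plausible policy, written under stated assumptions: the fixed table length stands in for the available receiver resources, and the names `Entry`, `age_entries`, `on_packet`, `silence_timeout`, and `min_talkspurt` are hypothetical, as are their default values.

```python
from dataclasses import dataclass
from typing import List, Optional

ACTIVE, VALID, FREE = "A", "V", "F"   # status codes named in the text


@dataclass
class Entry:
    speaker_id: Optional[str] = None
    status: str = FREE
    lru_time: float = 0.0
    talkspurt_packet_count: int = 0


def age_entries(table: List[Entry], now: float,
                silence_timeout: float = 0.5) -> None:
    """Demote active speakers that have been silent too long (A -> V)."""
    for e in table:
        if e.status == ACTIVE and now - e.lru_time > silence_timeout:
            e.status = VALID
            e.talkspurt_packet_count = 0


def on_packet(table: List[Entry], speaker_id: str, now: float,
              min_talkspurt: int = 3) -> bool:
    """Return True if the packet is stored, False if it is discarded."""
    entry = next((e for e in table
                  if e.status != FREE and e.speaker_id == speaker_id), None)
    if entry is None:
        # Unknown speaker: prefer a free entry, else reuse the stalest V entry.
        entry = next((e for e in table if e.status == FREE), None)
        if entry is None:
            candidates = [e for e in table if e.status == VALID]
            if not candidates:
                return False  # every entry is actively talking: discard
            entry = min(candidates, key=lambda e: e.lru_time)
        entry.speaker_id = speaker_id
        entry.status = VALID
        entry.talkspurt_packet_count = 0
    entry.lru_time = now
    entry.talkspurt_packet_count += 1
    # Promote V -> A once the talkspurt is long enough to count as speech.
    if entry.status == VALID and entry.talkspurt_packet_count >= min_talkspurt:
        entry.status = ACTIVE
    return True
```

With a two-entry table, a professor who sends three packets in a row becomes active, a newly heard student takes the remaining entry, and a further new speaker is dropped only once both entries are active.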
The invention solves the multi-speaker problem by using an adaptive speaker management scheme to intelligently select which speaker states and audio to retain and process. The receiver-based system is especially effective for demanding applications such as distance learning, where a professor is speaking most of the time, but an occasional question from a few of many listening students could arrive at any time. By judiciously using both processing resources and memory in the receiver, audio packets from multiple speakers are handled with fewer processing resources while at the same time improving audio quality.