1. Field of the Invention
The present invention relates generally to computer-based telephony networks and more particularly to servers that manage telephony conferencing.
2. Related Art
In today's technological environment, there exist many ways for several people who are in multiple geographic locations to communicate with one another simultaneously. One such way is audio conferencing. Audio conferencing applications serve both the needs of business users (e.g., national sales force meeting) and leisure users (e.g., audio chat room participants) who are geographically distributed.
Traditional audio conferencing involved a central conferencing server which hosted an audio conference. Participants would use their telephones and dial in to the conferencing server over the Public Switched Telephone Network (PSTN) (also called the Plain Old Telephone Service (POTS)).
In recent years, the possibility of transmitting voice (i.e., audio) over the worldwide public Internet has been recognized. As will be appreciated by those skilled in the relevant art(s), the connectivity achieved by the Internet is based upon a common protocol suite utilized by those computers connecting to it. Part of the common protocol suite is the Internet Protocol (IP), defined in Internet Standard (STD) 5, Request for Comments (RFC) 791 (Internet Architecture Board). IP is a network-level, packet (i.e., a unit of transmitted data) switching protocol.
Transmitting voice over IP (VoIP) began with computer scientists experimenting with exchanging voice using personal computers (PCs) equipped with microphones, speakers, and sound cards. VoIP has further developed with the adoption of the H.323 Internet Telephony Standard, developed by the International Telecommunications Union-Telecommunications sector (ITU-T), and the Session Initiation Protocol (SIP), developed within the Internet Engineering Task Force (IETF) Multiparty Multimedia Session Control (MMUSIC) Working Group.
Conferencing servers (also called multipoint control units (MCUs)) were developed to host audio conferences where participants are connected to a central MCU using PC-based equipment and the Internet, or using a telephone through a gateway, rather than traditional telephone equipment over the PSTN.
One common problem, however, exists in both MCUs that support Internet-based telephony and conferencing servers that support traditional PSTN-based telephony. This problem is now described (with conferencing servers and MCUs being referred to generally herein as MCUs).
MCUs, in general, enable multipoint communications between two or more participants in a voice conference. An MCU may support many conferences at one time, each of which has many participants. Each participant in a given conference will hear a mix of up to n active speakers, except for the active speakers themselves, who hear the mix minus themselves (this is, in essence, an “echo suppression” function so that a party will not “hear themselves speak” during the audio conference). For ease of explanation herein, and as will be appreciated by those skilled in the relevant art(s), the module in an MCU that does the active speaker detection, mixing or multiplexing, switching and streaming of the audio is referred to herein as the “Mixer.”
In the case where the Mixer needs to do mixing of multiple audio streams or accept different packet sizes from different participants, the Mixer needs a buffer (i.e., a memory storage area) in which to receive audio data. This buffer may be large if it also needs to accommodate jitter (the random variation in the delivery time) in packet arrival times. From a memory standpoint, it would be most efficient to assign buffers only to the active speakers rather than to all participants in a conference, and to reassign the buffers as the active speakers change. However, there is a drawback to collecting data only for the active speakers. Often, the active speaker update event within a Mixer does not detect a new active speaker until enough “loud” packets have gone by to trigger the selection of the speaker as a new active speaker. This can cause the first word to be partially lost in the new active speaker's audio stream.
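By way of illustration only, the “mix minus themselves” behavior described above can be sketched as follows. The function name `mix_for`, the use of 16-bit PCM sample lists, and the simple clipping step are assumptions made for this sketch and are not taken from any particular Mixer implementation.

```python
from typing import Dict, List

PCM_MIN, PCM_MAX = -32768, 32767  # assumed 16-bit linear PCM range


def mix_for(listener: str, active_frames: Dict[str, List[int]]) -> List[int]:
    """Return one mixed frame for `listener`.

    `active_frames` maps each active speaker to one frame of samples
    (all frames are assumed to have the same length). An active speaker
    receives the mix without their own samples; any other participant
    receives the mix of all active speakers.
    """
    others = [frame for speaker, frame in active_frames.items()
              if speaker != listener]
    if not others:
        return []  # nobody else is speaking
    # Sum sample-by-sample across speakers, then clip to the PCM range.
    mixed = [sum(samples) for samples in zip(*others)]
    return [max(PCM_MIN, min(PCM_MAX, s)) for s in mixed]
```

For example, with active speakers `a` and `b`, speaker `a` would receive only the samples of `b`, while a silent participant `c` would receive the sum of both.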
Therefore, given the above, what is needed is a method and computer program product for the efficient allocation of buffers for current and predicted active speakers in voice conferencing systems.
The present invention is directed to a method and computer program product for the efficient first-in, first-out (FIFO) (i.e., queue) allocation for current and predicted active speakers in voice conferencing systems, that meets the above-identified needs.
The method and computer program product of the present invention receive a packet from a speaker participating in a conference, wherein the speaker is not currently designated as an “active” speaker nor as a “predicted active” speaker. Then, a first test is applied to determine whether the speaker should now be designated as a “predicted active” speaker. The test is a comparison between the energy measurement of the packet (or the speaker's energy averaged over some pre-determined time period and including such packet) and any one of numerous possible functions of the energies of the current “active” or “predicted active” speakers. The method and computer program product of the present invention discard the packet when the packet fails the first test. If the packet passes the first test, the steps described below are performed.
First, a determination is made as to whether there is an unallocated buffer from among a set of p “predicted active” speaker buffers. If so, the packet is stored in the unallocated buffer. If not, a determination is made, by using a second test on the packet, whether the speaker should now be designated as a “predicted active” speaker, thereby replacing a current predicted active speaker using one of the set of p “predicted active” speaker buffers. The second test, like the first, is a comparison between the energy measurement of the packet (or the speaker's energy averaged over some pre-determined time period including such packet) and any one of numerous possible functions of the energies of the current “active” or “predicted active” speakers, although with a higher threshold than the first test.
Next, the packet is discarded if it fails the second test. If it passes the second test, a buffer from the set of p “predicted active” speaker buffers that can be reassigned is identified and the packet is then stored in the identified buffer. At this point the speaker is considered a “predicted active speaker” and data received from that speaker will be received into their predicted active speaker buffer.
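The two-test admission flow described above can be sketched, under stated assumptions, as follows. The class name `PredictedPool`, the threshold factors `t1` and `t2`, the use of the mean energy of the current active speakers as the comparison function, and the choice of the lowest-energy predicted speaker as the buffer to reassign are all hypothetical choices for this sketch; the description above leaves the comparison function and the reassignment rule open.

```python
from collections import deque
from typing import Deque, Dict, List


class PredictedPool:
    """A set of p "predicted active" speaker buffers (illustrative only)."""

    def __init__(self, p: int, t1: float = 0.5, t2: float = 0.8) -> None:
        if p < 1 or not t2 > t1:
            raise ValueError("need p >= 1 and a second threshold above the first")
        self.p, self.t1, self.t2 = p, t1, t2
        self.buffers: Dict[str, Deque[bytes]] = {}
        self.energy: Dict[str, float] = {}  # last energy seen per speaker

    def offer(self, speaker: str, packet: bytes, energy: float,
              active_energies: List[float]) -> bool:
        """Store or discard one packet; return True if it was stored."""
        if speaker in self.buffers:
            # Already a predicted active speaker: keep collecting their data.
            self.buffers[speaker].append(packet)
            self.energy[speaker] = energy
            return True
        # Assumed comparison function: mean energy of the active speakers
        # (zero when nobody is active, so every packet passes).
        reference = (sum(active_energies) / len(active_energies)
                     if active_energies else 0.0)
        if energy < self.t1 * reference:
            return False  # fails the first test: discard
        if len(self.buffers) >= self.p:
            # No unallocated buffer: apply the stricter second test.
            if energy < self.t2 * reference:
                return False  # fails the second test: discard
            # Assumed reassignment rule: evict the quietest predicted speaker.
            victim = min(self.energy, key=lambda s: self.energy[s])
            del self.buffers[victim]
            del self.energy[victim]
        self.buffers[speaker] = deque([packet])
        self.energy[speaker] = energy
        return True
```

With a single buffer (`p=1`) and one active speaker of energy 100, a packet of energy 60 would claim the free buffer, a later packet of energy 70 from another speaker would be discarded by the second test, and one of energy 90 would take over the buffer.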
Once that speaker becomes an “active speaker,” some of the data from their predicted active speaker buffer will be used as their active speaker data. (One way of doing this is to make that speaker's predicted active speaker buffer an active speaker buffer.) In an embodiment, the portion of the data used is equal to M-J packets, where M is a pre-determined desired jitter buffer depth and J is the current jitter buffer depth, assuming M > J. If M ≤ J, none (i.e., zero packets) of the data from that speaker's predicted active speaker buffer is used. This minimizes the loss of audio data for speakers as they switch from “non-active” to “active” status and ensures that the delay introduced by first using the speaker's data that has been saved into their predicted active speaker buffer is never more than the desired jitter buffer depth M.
An advantage of the present invention is that it minimizes the loss of audio data for speakers as they switch from “non-active” to “active” status by collecting audio data from those speakers before they are actually active. This is done in a memory-efficient manner and without introducing additional delay.
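As a minimal sketch of the M-J rule in the embodiment above, the number of carried-over packets can be computed as shown below. The function name `packets_to_carry_over` is hypothetical, and the choice to take the most recent packets (rather than some other M-J packets) is an assumption of this sketch.

```python
from typing import List


def packets_to_carry_over(predicted: List[bytes],
                          desired_depth_m: int,
                          current_depth_j: int) -> List[bytes]:
    """Return the packets moved into the active speaker data.

    At most M - J packets are used when M > J, and none when M <= J,
    so the added delay never exceeds the desired jitter depth M.
    """
    count = max(0, desired_depth_m - current_depth_j)
    if count == 0:
        return []
    # Assumption: the newest packets are the ones worth keeping.
    return predicted[-count:]
```

For instance, with M = 5 and J = 3, two packets would be carried over; with M = 3 and J = 5, none would be.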
Another advantage of the present invention is that it provides a method of predicting future active speakers to limit the amount of non-active speaker data collected by an MCU.
Another advantage of the present invention is that it provides a method for maintaining a collection of the most recent x packets or m milliseconds of “non-active” speaker audio data in single or multiple buffers, and using this data in the event that the non-active speaker becomes an active speaker.
Yet another advantage of the present invention is that the x packets or m milliseconds of stored “non-active” speaker audio data can be used only up to a pre-determined jitter buffer fill-level in order to avoid introducing additional audio packet delivery delay.
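One possible way to maintain only the most recent x packets for a non-active speaker is a bounded FIFO that silently drops its oldest entry when full. The sketch below uses Python's `collections.deque` with `maxlen` for this purpose; the helper name `keep_recent` is hypothetical and the approach is offered only as an illustration, not as the structure used by any particular embodiment.

```python
from collections import deque
from typing import Iterable, List


def keep_recent(packets: Iterable[bytes], x: int) -> List[bytes]:
    """Feed packets through a bounded FIFO and return what remains.

    A deque created with maxlen=x discards the oldest item each time
    a new one is appended to a full queue, so only the newest x stay.
    """
    history: deque = deque(maxlen=x)
    for packet in packets:
        history.append(packet)
    return list(history)
```

Feeding five packets through a queue of depth three would leave only the last three in arrival order.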
Further features and advantages of the invention as well as the structure and operation of various embodiments of the present invention are described in detail below with reference to the accompanying drawings.