1. Field of the Invention
The invention relates generally to audio communication over a network. In particular, the invention relates to audio conferencing over a network.
2. Background Art
A conference call links multiple parties over a network in a common call. Conference calls were originally carried out over a circuit-switched network such as a plain old telephone system (POTS) or public switched telephone network (PSTN). Conference calls are now also carried out over packet-switched networks, such as local area networks (LANs) and the Internet. Indeed, the emergence of voice over the Internet systems (also called Voice over IP or VOIP systems) has increased the demand for conference calls over networks.
Conference bridges connect participants in conference calls. Different types of conference bridges have been used depending in part upon the type of network and how voice is carried over the network to the conference bridge. One type of conference bridge is described in U.S. Pat. No. 5,436,896 (see the entire patent). This conference bridge 10 operates in an environment where voice signals are digitally encoded in a 64 Kbps data stream (FIG. 1, col. 1, lns. 21-26). Conference bridge 10 has a plurality of inputs 12 and outputs 14. Inputs 12 are connected through respective speech detectors 16 and switches 18 to a common summing amplifier 20. Speech detector 16 detects speech by sampling an input data stream and determining the amount of energy present over time (col. 1, lns. 36-39). Each speech detector 16 controls a switch 18. When no speech is present, switch 18 is held open to reduce noise. During a conference call, inputs 12 of all participants who are speaking are coupled through summing amplifier 20 to each of the outputs 14. Subtractors 24 subtract each participant's own voice data stream from the summed signal delivered to that participant. A number of participants 1-n then can speak and hear each other in the connections made through conference bridge 10. See, '896 patent, col. 1, ln. 12-col. 2, ln. 16.
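The gate, sum, and subtract arrangement described above can be illustrated with a short sketch. This is not code from the '896 patent; it assumes each input arrives as a frame of linear PCM samples of equal length, and the names frame_energy and mix_minus, as well as the energy threshold value, are hypothetical choices made only for illustration.

```python
from typing import List, Sequence


def frame_energy(frame: Sequence[int]) -> float:
    """Mean squared amplitude of one frame (a stand-in for a speech detector)."""
    if not frame:
        return 0.0
    return sum(s * s for s in frame) / len(frame)


def mix_minus(frames: List[List[int]], energy_threshold: float = 100.0) -> List[List[int]]:
    """Return one output frame per participant.

    Inputs whose energy is below the (assumed) threshold are gated out,
    the remaining inputs are summed, and each participant's own gated
    input is subtracted from the sum sent back to that participant.
    """
    n = len(frames[0])
    # Gate: an "open switch" is modeled as replacing the frame with silence.
    gated = [
        f if frame_energy(f) >= energy_threshold else [0] * n
        for f in frames
    ]
    # Sum: sample-wise total of all gated inputs.
    total = [sum(column) for column in zip(*gated)]
    # Subtract: remove each participant's own contribution.
    return [[t - own for t, own in zip(total, g)] for g in gated]


frames = [[100, -100], [50, 50], [1, -1]]  # participant 3 is near-silent
outputs = mix_minus(frames)
```

Here the third participant's low-energy input is gated out, so the other two hear only each other, while the third hears the sum of both speakers.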
Digitized voice is now also being carried in packets over packet-switched networks. The '896 patent describes one example of asynchronous transfer mode (ATM) packets (also called cells). To support a conference call in this networking environment, conference bridge 10 converts input ATM cells to network packets. Digitized voice is extracted from the packets and processed in conference bridge 10 as described above. At the summed output, digitized voices are re-converted from network packets back to ATM cells prior to being sent to participants 1-n. See, '896 patent, col. 2, ln. 17-col. 2, ln. 36.
The '896 patent also describes a conference bridge 238, shown in FIGS. 2 and 3, which processes ATM cells without converting and re-converting the ATM cells to network packets as in conference bridge 10. Conference bridge 238 has inputs 302-306, one from each of the participants, and outputs 308-312, one to each of the participants. Speech detectors 314-318 analyze input data aggregated in sample and hold buffers 322-326. Speech detectors 314-318 report the detected speech and/or volume of detected speech to controller 320. See, '896 patent, col. 4, lns. 16-39.
Controller 320 is coupled to a selector 328, gain control 329 and replicator 330. Controller 320 determines which of the participants is speaking based on the outputs of speech detectors 314-318. When one speaker (such as participant 1) is talking, controller 320 sets selector 328 to read data from buffer 322. The data moves through automatic gain control 329 to replicator 330. Replicator 330 replicates the data in the ATM cell selected by selector 328 for all participants except the speaker. See, '896 patent, col. 4, ln. 40-col. 5, ln. 5. When two or more speakers are speaking, the loudest speaker is selected in a given selection period. The next loudest speaker is then selected in a subsequent selection period. The appearance of simultaneous speech is kept up by scanning speech detectors 314-318 and reconfiguring selector 328 at an appropriate interval, such as six milliseconds. See, '896 patent, col. 5, lns. 6-65.
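The select-and-replicate behavior described above can be sketched as follows. This is an illustrative sketch only, not the scheduling logic of the '896 patent: cycling through active speakers in descending volume order by taking the selection period modulo the number of active speakers is an assumption, and the names select_speaker and replicate and the threshold parameter are hypothetical.

```python
from typing import Dict, List, Optional


def select_speaker(volumes: Dict[int, float], period: int,
                   threshold: float = 0.0) -> Optional[int]:
    """Pick the participant whose data is read in this selection period.

    Active speakers (volume above the assumed threshold) are ranked from
    loudest to quietest; successive periods step through that ranking so
    that the loudest is chosen first and the next loudest afterwards.
    """
    active = sorted(
        (p for p, v in volumes.items() if v > threshold),
        key=lambda p: (-volumes[p], p),  # loudest first, ties by participant id
    )
    if not active:
        return None
    return active[period % len(active)]


def replicate(cell: bytes, speaker: int, participants: List[int]) -> Dict[int, bytes]:
    """Copy the selected cell to every participant except the speaker."""
    return {p: cell for p in participants if p != speaker}


volumes = {1: 0.9, 2: 0.5, 3: 0.0}  # participants 1 and 2 are talking
first = select_speaker(volumes, period=0)
second = select_speaker(volumes, period=1)
copies = replicate(b"cell", first, [1, 2, 3])
```

With a short selection period, alternating between the two active speakers in this way is what gives listeners the impression of simultaneous speech even though only one input is forwarded at a time.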
Another type of conference bridge is described in U.S. Pat. No. 5,983,192 (see the entire patent). In one embodiment, a conference bridge 12 receives compressed audio packets through a real-time transport protocol (RTP/RTCP). See, '192 patent, col. 3, ln. 66-col. 4, ln. 40. Conference bridge 12 includes audio processors 14a-14d. Exemplary audio processor 14c associated with a site C (i.e., a participant C) includes a switch 22 and selector 26. Selector 26 includes a speech detector which determines which of other sites A, B, or D has the highest likelihood of speech. See, '192 patent, col. 4, lns. 40-67. Alternatives include selecting more than one site and using an acoustic energy detector. See, '192 patent, col. 5, lns. 1-7. In another embodiment described in the '192 patent, the selector 26/switches 22 output a plurality of loudest speakers in separate streams to local mixing end-point sites. The loudest streams are sent to multiple sites. See, '192 patent, col. 5, lns. 8-67. Configurations of mixer/encoders are also described to handle multiple speakers at the same time, referred to as “double-talk” and “triple-talk.” See, '192 patent, col. 7, ln. 20-col. 9, ln. 29.
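The per-site selection described above, in which the processor for one site chooses among the other sites, can be sketched as below. This is a minimal illustration and not the method of the '192 patent: the speech-likelihood scores are assumed to be supplied by some detector, and the function name select_for_site and its parameter n are hypothetical.

```python
import heapq
from typing import Dict, List


def select_for_site(listener: str, likelihood: Dict[str, float], n: int = 1) -> List[str]:
    """Return the n other sites with the highest speech likelihood.

    The listener's own site is excluded, so a site never receives its
    own audio. With n > 1 this models selecting more than one site.
    """
    others = {site: score for site, score in likelihood.items() if site != listener}
    return heapq.nlargest(n, others, key=others.get)


# Assumed detector scores for sites A-D.
likelihood = {"A": 0.2, "B": 0.7, "C": 0.9, "D": 0.4}
single = select_for_site("C", likelihood)
double = select_for_site("C", likelihood, n=2)
```

Although site C has the highest score here, the processor serving site C selects only from sites A, B, and D.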
Voice-over-the-Internet (VOIP) systems continue to require an improved conference bridge. For example, a Softswitch VOIP architecture may use one or more media servers having a media gateway control protocol such as MGCP (RFC 2705). See, D. Collins, Carrier Grade Voice over IP, McGraw-Hill: United States, Copyright 2001, pp. 234-244, the entire book of which is incorporated herein by reference in its entirety. Such media servers are often used to process audio streams in VOIP calls. These media servers are often endpoints where audio streams are mixed in a conference call. These endpoints are also referred to as “conference bridge access points” since the media server is an endpoint where media streams from multiple callers are mixed and provided again to some or all of the callers. See, D. Collins, p. 242.
As the popularity of and demand for IP telephony and VOIP calls increase, media servers are expected to handle conference call processing with carrier grade quality. Conference bridges in a media server need to be able to scale to handle different numbers of participants. Audio in packet streams, such as RTP/RTCP packets, needs to be processed efficiently in real time.