1. Field of the Invention
The invention relates generally to audio communication over a network.
2. Background Art
Audio has long been carried in telephone calls over networks. Traditional circuit-switched time division multiplexing (TDM) networks, including the public switched telephone network (PSTN) and the plain old telephone system (POTS), have carried this audio. These circuit-switched networks establish a circuit across the network for each call. Audio is carried in analog and/or digital form across the circuit in real-time.
The emergence of packet-switched networks, such as local area networks (LANs) and the Internet, now requires that audio be carried digitally in packets. Audio can include but is not limited to voice, music, or other types of audio data. Voice over Internet Protocol systems (also called Voice over IP or VOIP systems) transport the digital audio data belonging to a telephone call in packets over packet-switched networks instead of traditional circuit-switched networks. In one example, a VOIP system forms two or more connections using Transmission Control Protocol/Internet Protocol (TCP/IP) addresses to accomplish a connected telephone call. Devices that connect to a VOIP network must follow standard TCP/IP packet protocols in order to interoperate with other devices within the VOIP network. Examples of such devices are IP phones, integrated access devices, media gateways, and media servers.
A media server is often an endpoint in a VOIP telephone call. The media server is responsible for ingress and egress audio streams, that is, audio streams which enter and leave a media server respectively. The type of audio produced by a media server is controlled by the application that corresponds to the telephone call such as voice mail, conference bridge, interactive voice response (IVR), speech recognition, etc. In many applications, the produced audio is not predictable and must vary based on end user responses. Words, sentences, and whole audio segments such as music must be assembled dynamically in real time as they are played out in audio streams.
Packet-switched networks, however, can impart delay and jitter in a stream of audio carried in a telephone call. A real-time transport protocol (RTP) is often used to control delays, packet loss and latency in an audio stream played out of a media server. The audio stream can be played out using RTP over a network link to a real-time device (such as a telephone) or a non-real-time device (such as an email client in unified messaging). RTP operates on top of a protocol such as the User Datagram Protocol (UDP), which is part of the IP family. RTP packets include, among other things, a sequence number and a timestamp. The sequence number allows a destination application using RTP to detect the occurrence of lost packets and to ensure that packets are presented to a user in the correct order. The timestamp corresponds to the time at which the packet was assembled. The timestamp allows a destination application to ensure synchronized play-out to a destination user and to calculate delay and jitter. See, D. Collins, Carrier Grade Voice over IP, McGraw-Hill: United States, Copyright 2001, pp. 52-72, the entire book of which is incorporated herein by reference.
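By way of illustration only, the following Python sketch shows one way a destination application might use the sequence number and timestamp described above. It counts packets missing from a received stream and computes a running interarrival jitter estimate in the style of RFC 3550. The names RtpPacket and analyze are hypothetical, and the sketch assumes that arrival times are expressed in the same clock units as the RTP timestamp and that duplicate or reordered packets may simply be ignored.

```python
from dataclasses import dataclass


@dataclass
class RtpPacket:
    seq: int        # 16-bit RTP sequence number
    timestamp: int  # RTP timestamp, in media clock ticks
    arrival: int    # local arrival time, same units as timestamp (assumption)


def analyze(packets):
    """Return (lost_count, jitter) for packets given in arrival order."""
    lost = 0
    jitter = 0.0
    prev = None
    for p in packets:
        if prev is not None:
            # Sequence numbers wrap at 2**16, so compare modulo 65536.
            gap = (p.seq - prev.seq) & 0xFFFF
            if gap == 0 or gap >= 0x8000:
                continue  # duplicate or reordered packet: ignored in this sketch
            lost += gap - 1
            # Difference between arrival spacing and sender timestamp spacing.
            d = (p.arrival - prev.arrival) - (p.timestamp - prev.timestamp)
            # Running estimate: J = J + (|D| - J) / 16, as in RFC 3550.
            jitter += (abs(d) - jitter) / 16.0
        prev = p
    return lost, jitter
```

For example, a stream carrying sequence numbers 1, 2 and 4 would be reported as having one lost packet, and the jitter estimate grows only when packet spacing at the receiver differs from the spacing implied by the timestamps.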
A media server at an endpoint in a VOIP telephone call uses protocols such as RTP to improve communication quality for a single audio stream. Such media servers, however, have been limited to outputting a single audio stream of RTP packets for a given telephone call.
A conference call links multiple parties over a network in a common call. Conference calls were originally carried out over a circuit-switched network such as a plain old telephone system (POTS) or public switched telephone network (PSTN). Conference calls are now also carried out over packet-switched networks, such as local area networks (LANs) and the Internet. Indeed, the emergence of voice over the Internet systems (also called Voice over IP or VOIP systems) has increased the demand for conference calls over networks.
Conference bridges connect participants in conference calls. Different types of conference bridges have been used depending in part upon the type of network and how voice is carried over the network to the conference bridge. One type of conference bridge is described in U.S. Pat. No. 5,436,896 (see the entire patent). This conference bridge 10 operates in an environment where voice signals are digitally encoded in a 64 Kbps data stream (FIG. 1, col. 1, lns. 21-26).
Conference bridge 10 has a plurality of inputs 12 and outputs 14. Inputs 12 are connected through respective speech detectors 16 and switches 18 to a common summing amplifier 20. Speech detector 16 detects speech by sampling an input data stream and determining the amount of energy present over time. (col. 1, lns. 36-39). Each speech detector 16 controls a switch 18. When no speech is present switch 18 is held open to reduce noise. During a conference call, inputs 12 of all participants who are speaking are coupled through summing amplifier 20 to each of the outputs 14. Subtractors 24 subtract each participant's own voice data stream. A number of participants 1-n then can speak and hear each other in the connections made through conference bridge 10. See, '896 patent, col. 1, ln. 12-col. 2, ln. 16.
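By way of illustration only, the summing and subtracting arrangement described above can be sketched in Python as follows. This is not the implementation of the '896 patent. The function names, the mean-square energy measure, and the threshold value are assumptions, and the sketch operates on frames of linear integer samples rather than on an encoded 64 Kbps stream.

```python
def frame_energy(frame):
    """Mean-square energy of one frame of linear samples."""
    return sum(s * s for s in frame) / len(frame)


def mix_minus(frames, threshold=1000.0):
    """Mix one frame per participant; each output omits that participant's own input.

    frames: list of equal-length, non-empty sample lists, one per participant.
    threshold: hypothetical energy level below which an input is treated as silent.
    """
    n = len(frames[0])
    # Speech detector plus switch: silent inputs are held out of the sum.
    gated = [f if frame_energy(f) >= threshold else [0] * n for f in frames]
    # Summing amplifier: add all gated inputs sample by sample.
    total = [sum(column) for column in zip(*gated)]
    # Subtractors: remove each participant's own contribution from the sum.
    return [[t - s for t, s in zip(total, g)] for g in gated]
```

With three inputs of which only the first and third carry speech, the first participant hears only the third, the third hears only the first, and the silent second participant hears both.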
Digitized voice is now also being carried in packets over packet-switched networks. The '896 patent describes one example of asynchronous transfer mode (ATM) packets (also called cells). To support a conference call in this networking environment, conference bridge 10 converts input ATM cells to network packets. Digitized voice is extracted from the packets and processed in conference bridge 10 as described above. At the summed output, digitized voices are re-converted from network packets back to ATM cells prior to being sent to participants 1-n. See, '896 patent, col. 2, ln. 17-col. 2, ln. 36.
The '896 patent also describes a conference bridge 238 shown in FIGS. 2 and 3 which processes ATM cells without converting and re-converting the ATM cells to network packets as in conference bridge 10. Conference bridge 238 has inputs 302-306, one from each of the participants, and outputs 302-306, one to each of the participants. Speech detectors 314-318 analyze input data aggregated in sample and hold buffers 322-326. Speech detectors 314-318 report the detected speech and/or volume of detected speech to controller 320. See, '896 patent, col. 4, lns. 16-39.
Controller 320 is coupled to a selector 328, gain control 329 and replicator 330. Controller 320 determines which of the participants is speaking based on the outputs of speech detectors 314-318. When one speaker (such as participant 1) is talking, controller 320 sets selector 328 to read data from buffer 322. The data moves through automatic gain control 329 to replicator 330. Replicator 330 replicates the data in the ATM cell selected by selector 328 for all participants except the speaker. See, '896 patent, col. 4, ln. 40-col. 5, ln. 5. When two or more speakers are speaking, the loudest speaker is selected in a given selection period. The next loudest speaker is then selected in a subsequent selection period. The appearance of simultaneous speech is kept up by scanning speech detectors 314-318 and reconfiguring selector 328 at an appropriate interval, such as six milliseconds. See, '896 patent, col. 5, lns. 6-65.
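By way of illustration only, the selection and replication behavior described above can be sketched in Python as follows. The function names and the threshold parameter are hypothetical. The rule used to alternate between speakers (pick the next loudest speaker whenever the loudest speaker was selected in the previous period) is an assumption made for this sketch and is not taken from the '896 patent.

```python
def select_speaker(volumes, last_selected=None, threshold=0.0):
    """Choose the participant whose data is read in this selection period.

    volumes: dict mapping participant id to detected volume.
    last_selected: participant chosen in the previous period, if any.
    Returns a participant id, or None when nobody is speaking.
    """
    # Participants whose detected volume exceeds the threshold, loudest first.
    active = sorted(
        (p for p, v in volumes.items() if v > threshold),
        key=lambda p: volumes[p],
        reverse=True,
    )
    if not active:
        return None
    # Assumed alternation rule: give the next loudest speaker a turn.
    if len(active) > 1 and active[0] == last_selected:
        return active[1]
    return active[0]


def replicate(cell, speaker, participants):
    """Copy the selected cell to every participant except the speaker."""
    return {p: cell for p in participants if p != speaker}
```

Calling select_speaker once per selection period, and passing in the previous result, alternates between the two loudest speakers while both remain active.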
Another type of conference bridge is described in U.S. Pat. No. 5,983,192 (see the entire patent). In one embodiment, a conference bridge 12 receives compressed audio packets through a real-time transport protocol (RTP/RTCP). See, '192 patent, col. 3, ln. 66-col. 4, ln. 40. Conference bridge 12 includes audio processors 14a-14d. Exemplary audio processor 14c associated with a site C (i.e., a participant C) includes a switch 22 and selector 26. Selector 26 includes a speech detector which determines which of other sites A, B, or D has the highest likelihood of speech. See, '192 patent, col. 4, lns. 40-67. Alternatives include selecting more than one site and using an acoustic energy detector. See, '192 patent, col. 5, lns. 1-7. In another embodiment described in the '192 patent, the selector 26/switches 22 output a plurality of loudest speakers in separate streams to local mixing end-point sites. The loudest streams are sent to multiple sites. See, '192 patent, col. 5, lns. 8-67. Configurations of mixer/encoders are also described to handle multiple speakers at the same time, referred to as “double-talk” and “triple-talk.” See, '192 patent, col. 7, ln. 20-col. 9, ln. 29.
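By way of illustration only, the per-site selection described above can be sketched in Python as follows. For a given site, the sketch picks the one or more other sites with the highest speech score. The function name, the score dictionary, and the parameter n are hypothetical, and how the scores are derived (speech likelihood or acoustic energy) is left outside the sketch.

```python
import heapq


def loudest_other_sites(scores, site, n=1):
    """Return up to n sites, other than `site`, with the highest scores.

    scores: dict mapping site name to a speech-likelihood or energy score.
    """
    # A site never receives its own audio, so exclude it from the candidates.
    others = {s: v for s, v in scores.items() if s != site}
    return heapq.nlargest(n, others, key=others.get)
```

For site C in a four-site call, calling loudest_other_sites(scores, "C") considers only sites A, B and D, and a larger n corresponds to the alternative in which more than one site is selected.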
Voice over IP (VOIP) systems continue to require an improved conference bridge. For example, a Softswitch VOIP architecture may use one or more media servers having a media gateway control protocol such as MGCP (RFC 2705). See, D. Collins, Carrier Grade Voice over IP, McGraw-Hill: United States, Copyright 2001, pp. 234-244, the entire book of which is incorporated herein by reference. Such media servers are often used to process audio streams in VOIP calls. These media servers are often endpoints where audio streams are mixed in a conference call. These endpoints are also referred to as “conference bridge access points” since the media server is an endpoint where media streams from multiple callers are mixed and provided again to some or all of the callers. See, D. Collins, p. 242.
As the popularity of and demand for IP telephony and VOIP calls increase, media servers are expected to handle conference call processing with carrier grade quality. Conference bridges in a media server need to be able to scale to handle different numbers of participants. Audio in packet streams, such as RTP/RTCP packets, needs to be processed efficiently in real-time.