The present invention is directed to an audio conferencing server and, more specifically, to an audio conferencing server for the internet.
An internet audio conferencing server allows computer users at remote locations to speak to and hear groups of other computer users and to carry on free form multi-party conversations in real time.
The term “computer user” is generally meant to be at least one person, but may have other meanings, such as at least one automated program, at least one device acting on the person's behalf, or any combination of the above (e.g. two or more people, at least one automated program, and/or at least one device). For example, when it is stated that the computer user provides an audio stream, the human computer user may be providing audio that the automated program and/or the device is “translating” (e.g. converting) into an audio stream suitable for transmission. Another example is that for some matters (e.g. technical matters), an automated program and/or a device (e.g. a computer or other processing device) could act on behalf of the human computer user with or without prior instructions from the human computer user.
“Audio conferencing” has a slightly different meaning than “free form multi-party conversations.” “Audio conferencing” is meant to include any type of multi-party audio conferencing. “Free form multi-party conversations” are more dynamic than audio conferencing. An example of free form multi-party conversations might be found in a 3D virtual world where computer users represented by graphical representations (e.g. avatars) move around and hear ambient sounds, have conversations with other computer users, and otherwise have a dynamic audio experience. The free form multi-party conversation may occur in an audio conference.
Exemplary free form multi-party conversations and audio conferencing are described in U.S. patent Ser. No. 11/233,773 (the '773 reference), which is assigned to the assignee of the present application, the disclosure of which is incorporated herein by reference. The '773 reference describes an advanced voice server to which a plurality of clients (computer users) may connect. For each client, the advanced voice server is able to perform processing functions, with real-time-updated processing parameters, uniquely on each client voice (audio input to an audio input device). Each client has a unique mix-list of the processing functions and their respective processing parameters, which the advanced voice server uses to produce a unique voice mix for that client to be heard on an audio output device. The processing parameters may be supplied by the client, by a system administrator, or by an automated process acting on behalf of the client. In addition, exemplary audio conferences and/or free form multi-party conversations are described in U.S. Pat. No. 6,125,115 to Smits, U.S. Pat. No. 4,650,929 to Boerger et al., U.S. Pat. No. 5,539,741 to Barraclough et al., and U.S. Pat. No. 5,113,431 to Horn, the disclosures of which are incorporated herein by reference.
Audio conferencing server architecture is the system over which audio conferencing and/or free form multi-party conversations are implemented. There are three primary known prior art versions of audio conferencing server architectures: a “centralized server” audio conferencing server architecture (FIG. 1), a “central/off-loaded” audio conferencing server architecture (FIG. 2), and a “chained” audio conferencing server architecture (FIGS. 3 and 4).
FIG. 1 is directed to a first version of existing audio conferencing server architecture and, more specifically, to a “centralized” audio conferencing server architecture system (also referred to herein as the “centralized server system”). The centralized server system has a one-stage audio stream between a centralized server and computer users (shown as Users 1-10). The centralized server system is the most basic version of an existing audio conferencing server that allows computer users to form connections between their local computers and a centralized server and thereby to define free form multi-party conversations. The centralized server receives a real time input audio stream from each computer user, mixes an output audio stream for each computer user, and sends each output audio stream to the respective computer user. The generated “output” audio stream may simply contain a generic mix of all the other computer users' input audio streams, or it may be modified in various ways, such as by varying the gain (volume) of the various audio stream inputs and/or applying various audio effects to the various audio stream inputs to clarify the input and/or allow the listener to distinguish between input sources.
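The per-listener mixing described above can be illustrated with a minimal sketch. It is not taken from any referenced system: the frame representation (a list of float samples), the function names `mix_for_listener` and `mix_all`, and the gain table layout are all hypothetical and are chosen only to show one output mix being generated per computer user from the other computer users' inputs.

```python
from typing import Dict, List

Frame = List[float]  # hypothetical: one frame of audio samples per user


def mix_for_listener(inputs: Dict[str, Frame], listener: str,
                     gains: Dict[str, float]) -> Frame:
    """Sum every input except the listener's own, scaled by a per-source gain."""
    frame_len = len(next(iter(inputs.values())))
    out = [0.0] * frame_len
    for speaker, frame in inputs.items():
        if speaker == listener:
            continue  # the mix contains only the *other* users' inputs
        gain = gains.get(speaker, 1.0)  # assumed default: unity gain
        for i, sample in enumerate(frame):
            out[i] += gain * sample
    return out


def mix_all(inputs: Dict[str, Frame],
            gain_table: Dict[str, Dict[str, float]]) -> Dict[str, Frame]:
    """Build one output frame per listener, as a centralized server would."""
    return {
        listener: mix_for_listener(inputs, listener, gain_table.get(listener, {}))
        for listener in inputs
    }
```

Because every output is computed from every other input, the work grows roughly with the square of the number of connected computer users, which is consistent with the capacity limitation discussed for this architecture.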
This centralized server system offers the advantage of simplicity of implementation. One limitation of such a centralized server system, however, is a capacity limitation in that a centralized server lacks the ability to scale beyond a certain number of computer users. This capacity limitation becomes a problem when the level of usage desired by a given population of computer users is greater than any existing centralized server hardware can support. Supporting the same computer user population on multiple servers in order to circumvent the capacity limitation is unsatisfactory. One reason that multiple servers are unsatisfactory for this centralized server system is that multiple servers require a multiplication of administrative effort. Another reason is that computer users have to perform extra work (or steps) to determine which server has available capacity at any given time. Yet another reason is that multiple servers require a means for allowing computer users to agree on which server to use at the moment they form their free form multi-party conversation. Still another reason is that this approach divides a large 3D virtual world into discontiguous audio spaces.
FIG. 2 is directed to a second version of existing audio conferencing server architecture and, more specifically, to a “central/off-loaded” audio conferencing server architecture system (also referred to herein as the “central/off-loaded server system”). The central/off-loaded server system has a two-stage audio stream between a central server, compression gateways, and computer users (shown as Users 1-12). The central/off-loaded server system uses a central server that is connected to at least one compression gateway. The compression gateways provide some of the functions (e.g. compression, decompression, and jitter buffering) normally performed by the central server. This off-loading leaves the central server with more computational capacity available to service audio mixing. To define free form multi-party conversations, computer users form connections between their local computers and a compression gateway that, in turn, connects to the central server. Compressed audio streams are received from the internet and decompressed by the compression gateways. The compression gateways are also responsible for compressing the mixed output audio streams and sending them back out to the internet at the end of the mixing process. Furthermore, the compression gateways are responsible for repairing the temporal state of the audio input data if the temporal state gets damaged between the computer user's computer and the compression gateway. Correcting the temporal state is accomplished through use of a “jitter buffer” feature that trades latency for smoothness in the arrival rate of the audio streams by buffering arriving audio streams and metering the buffered audio streams out to the mixing function smoothly.
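The trade of latency for smoothness made by a jitter buffer can be sketched as follows. This is only an illustrative sketch under stated assumptions: the class name `JitterBuffer`, the use of packet sequence numbers for reordering, and the fixed `depth` threshold are hypothetical and are not described in the text above. The buffer holds back playout until `depth` packets have arrived, then releases one packet per mixing tick in sequence order.

```python
import heapq
from typing import List, Optional, Tuple


class JitterBuffer:
    """Hypothetical fixed-depth jitter buffer keyed by sequence number."""

    def __init__(self, depth: int) -> None:
        self.depth = depth          # packets to accumulate before playout starts
        self._heap: List[Tuple[int, bytes]] = []
        self._started = False

    def push(self, seq: int, payload: bytes) -> None:
        """Store an arriving packet; arrival order and timing may be irregular."""
        heapq.heappush(self._heap, (seq, payload))

    def pop(self) -> Optional[bytes]:
        """Called once per mixing tick; returns the next packet or None."""
        if not self._started:
            if len(self._heap) < self.depth:
                return None         # still pre-filling: this is the added latency
            self._started = True
        if not self._heap:
            return None             # underrun: nothing available for this tick
        return heapq.heappop(self._heap)[1]
```

A larger `depth` absorbs more variation in arrival times at the cost of more delay before the audio reaches the mixing function.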
Whereas the centralized server system version of the audio conferencing server architecture performs functions (e.g. compression, decompression, and jitter buffering) on the centralized server, the central/off-loaded server system version off-loads the performance of these functions to other server computers (i.e. the compression gateways), and the raw audio streams are transmitted over a reliable internal server site network (e.g. LAN) to the central server for mixing. Ultimately, however, the central/off-loaded server system still has the same limitation as the centralized server system: its capacity is still limited by the computationally intensive task of mixing audio streams. Even a central server aided by many compression gateways can only support a finite number of computer users.
FIG. 3 is directed to a third version of existing audio conferencing server architecture and, more specifically, to a “chained” audio conferencing server architecture system (referred to herein as the “chained server system”). This chained server system has a two-stage audio stream between any of the chained servers, another chained server, and computer users (shown as Users 1-12) where the audio stream is mixed two times. Using this chained server system, computer users connect to any of the plurality of chained servers that are in a communicative relationship with each other. This chained server system attempts to solve the problems associated with the capacity limitations of the first two versions by utilizing multiple servers that are chained (e.g. networked) together. Computer users form connections to any chained server that has available capacity. The chained servers then pass audio streams between one another to bridge distributed conferences on high-speed networks located at the server site. In order to reduce the network bandwidth required between chained servers, the inputs from computer users in a free form multi-party conversation that are collocated on the same chained server are pre-mixed on that chained server. Then, the pre-mixed outputs, which carry the audio from computer users in the same conference in mixed form, are passed to at least some of the other chained servers. Each pre-mixed output is then mixed as necessary with the pre-mixed output from other chained servers and with any mixed output from computer users located on the final mixing server. The final mixed output is then transmitted to at least one computer user in the free form multi-party conversation that is directly connected to that chained server. An exemplary flow of a free form multi-party conversation using this chained server architecture is shown in FIG. 4 and discussed below.
The pre-mixing is an essential feature of this chained server system. Without the pre-mixing, the amount of bandwidth required between the various chained servers would be equal to the bandwidth taken up by all computer users, which would then become a hard limit on the number of computer users that could be supported in the architecture; this is the very limitation that the chained server system is attempting to avoid. The pre-mixing also saves considerable CPU cycles on the server(s) receiving the pre-mix.
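The bandwidth saving can be made concrete with a small counting sketch. It rests on assumptions that are not in the text above: the function name `inter_server_streams` is hypothetical, every chained server is assumed to forward to every other chained server (a full mesh), and each stream is assumed to cost the same bandwidth. Under those assumptions the function counts how many audio streams cross the inter-server network for one conference.

```python
from typing import Sequence


def inter_server_streams(users_per_server: Sequence[int], premix: bool) -> int:
    """Count streams sent between chained servers for a single conference.

    Assumes a full mesh: each server with at least one user sends to every
    other server. With pre-mixing, one combined stream is sent per
    destination; without it, one stream per local user is sent.
    """
    server_count = len(users_per_server)
    total = 0
    for users in users_per_server:
        if users == 0:
            continue  # a server with no participants sends nothing
        outgoing = 1 if premix else users
        total += outgoing * (server_count - 1)
    return total
```

For three chained servers with four computer users each, this count is 24 streams without pre-mixing and 6 streams with pre-mixing; the gap widens as more computer users connect to each server.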
FIG. 4 is a simplified block diagram of an example of a free form multi-party conversation flow using the “chained” audio conferencing server architecture of FIG. 3. In this example, computer users 1-4 are connected to chained server A, computer users 5-8 are connected to chained server B, and computer users 9-12 are connected to chained server C. In this example, chained servers A and B mix the audio from their respective computer users and then pass the pre-mixed audio stream data to chained server C, thereby bridging the conference. At chained server C, the pre-mixed outputs from chained servers A and B are mixed with the audio from computer users 9-11 (shown as being sent individually, but alternatively being a pre-mixture) to form a final mixed output that is transmitted to computer user 12.
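The flow of FIG. 4 can be traced with a short sketch. The helper names (`mix`, `user_frame`, `final_mix_for_user_12`) and the one-sample frames are hypothetical and serve only to make the two mixing stages visible; the grouping of computer users onto chained servers A, B, and C follows the example above.

```python
from typing import List, Sequence

Frame = List[float]  # hypothetical: one frame of audio samples


def mix(frames: Sequence[Frame]) -> Frame:
    """Sum equal-length frames sample by sample (unity gain, an assumption)."""
    return [sum(samples) for samples in zip(*frames)]


def user_frame(user_id: int) -> Frame:
    """Stand-in audio: a one-sample frame whose value is the user number."""
    return [float(user_id)]


def final_mix_for_user_12() -> Frame:
    # Stage 1: chained servers A and B pre-mix their own users' audio.
    premix_a = mix([user_frame(u) for u in range(1, 5)])   # users 1-4
    premix_b = mix([user_frame(u) for u in range(5, 9)])   # users 5-8
    # Stage 2: chained server C mixes both pre-mixes with users 9-11.
    local_c = [user_frame(u) for u in range(9, 12)]
    return mix([premix_a, premix_b] + local_c)
```

Note that chained server C only ever sees two combined streams from A and B, not the eight individual inputs behind them.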
By allowing free form multi-party conversations to be formed across chained server boundaries, the chained server system solves the problem discussed above (in connection with a multiple server embodiment of a centralized server) of computer users being required to agree on a server at the moment they form their free form multi-party conversation. The chained server system also does a reasonable job of increasing server capacity where pre-mixing can be leveraged to save network bandwidth and CPU cycles.
One limitation of the chained server system, however, is that it becomes impossible to guarantee that individual users can receive (or control) volume and/or effects for any given input audio stream when mixed to any given output audio stream, because once multiple input audio streams are pre-mixed, they cannot be separated and individually re-mixed at the destination server. Thus, a selected gain level or effect applied to a given input audio stream in the pre-mix must be received by all computer users who get that same pre-mix in their output audio stream. Applying a gain level or effect to a pre-mix at the destination server would allow each individual's output audio stream to vary according to his wishes, but this scheme would require that the same gain level or effect be applied to all of the input audio streams in a pre-mix received by the destination server. One way or another, there is no way to guarantee that individual gains and effects can be applied to individual input audio streams for any individual output audio stream as long as the technique of pre-mixing is used. Without pre-mixing, however, the capacity of the servers and the intervening network bandwidth would be exhausted far too quickly to make the structure worthwhile to pursue.
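The loss of per-input control can be shown with a two-input sketch. The sample values, the helper names `weighted_mix` and `scale`, and the chosen gains are hypothetical. A listener who wants the first input at half volume and the second at full volume can be served from the separate inputs, but not from a unity-gain pre-mix, because any gain applied at the destination scales both inputs together.

```python
from typing import List, Sequence

Frame = List[float]  # hypothetical: one frame of audio samples


def weighted_mix(frames: Sequence[Frame], gains: Sequence[float]) -> Frame:
    """Mix frames sample by sample, applying one gain per input stream."""
    return [sum(g * s for g, s in zip(gains, samples)) for samples in zip(*frames)]


def scale(frame: Frame, gain: float) -> Frame:
    """Apply a single gain to an already mixed frame."""
    return [gain * s for s in frame]


# Two inputs that are active on different samples, so each stays identifiable.
user_1 = [1.0, 0.0]
user_2 = [0.0, 1.0]

# What the listener asked for: user 1 at half volume, user 2 at full volume.
desired = weighted_mix([user_1, user_2], [0.5, 1.0])

# What the destination server actually receives: a unity-gain pre-mix.
premix = weighted_mix([user_1, user_2], [1.0, 1.0])

# Every destination-side gain moves both inputs by the same factor.
attempts = [scale(premix, g) for g in (0.5, 0.75, 1.0)]
```

Each entry in `attempts` has equal values in both sample positions, so none of them can match `desired`, whose two positions differ.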
Pre-mixing restricts the ability to provide free form multi-party conversations because pre-mixing forces users to hear the pre-mixed audio stream substantially as it is pre-mixed (although there might be a variation of overall volume). Accordingly, the resulting audio stream is not “free form.”