Audio teleconferences using monaural audio reproduction suffer from several problems. First, when multiple participants are speaking simultaneously there can be a loss of intelligibility. Second, it is difficult to identify the talker unless the listener is familiar with the timbre of the talker's voice. Spatial teleconferencing using binaural or stereo audio reproduction solves these problems by reproducing spatial localization cues. Hence, the listener can use his localization abilities to attend to a single talker in the presence of interfering conversations, an ability commonly called the “cocktail party effect.” Also, the listener can more easily identify the talker on the basis of the talker's location.
There are two basic architectures for teleconferencing: client-client and client-server. In a client-client (also called peer-to-peer) architecture, each endpoint client terminal makes a network connection to every other terminal in the conference; hence, there is no centralized server. Client-client architectures are conceptually simple but require increasing network bandwidth at each terminal as each new participant is added to the conference. As a result, they are typically effective for only a small number of participants (e.g., three to four).
In a client-server architecture, by contrast, each endpoint client terminal makes a bidirectional connection to a server. Accordingly, the bandwidth requirements for each terminal do not depend on the number of participants; only the server needs a high-bandwidth connection to the network. Furthermore, only a single bidirectional connection is required to add a new participant to the conference. Conventional client-server architectures are appropriate for small to medium-size conferences. A possible disadvantage of client-server architectures compared with client-client systems is the additional audio latency caused by receiving the audio at the server and retransmitting it from the server to the clients.
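The scaling difference between the two architectures can be illustrated by counting audio streams. The following is a minimal sketch, assuming each participant sends one audio stream per connection and, in the client-server case, that the server returns a single mixed stream to each client. The function names and the one-stream-per-link assumption are illustrative and are not taken from any particular system.

```python
def client_client_streams(participants: int) -> dict:
    """Stream counts for a client-client (full mesh) conference.

    Assumption: every terminal sends its audio to, and receives audio
    from, every other terminal over a separate connection.
    """
    others = max(participants - 1, 0)
    return {
        "uplinks_per_terminal": others,
        "downlinks_per_terminal": others,
        "connections_total": participants * others // 2,
    }


def client_server_streams(participants: int) -> dict:
    """Stream counts for a client-server conference.

    Assumption: each terminal holds one bidirectional connection to the
    server, which returns one mixed stream per client.
    """
    per_terminal = 1 if participants > 0 else 0
    return {
        "uplinks_per_terminal": per_terminal,
        "downlinks_per_terminal": per_terminal,
        "connections_total": participants,
    }


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        mesh = client_client_streams(n)
        star = client_server_streams(n)
        print(n, mesh["uplinks_per_terminal"], star["uplinks_per_terminal"])
```

Under these assumptions, the per-terminal load in the client-client case grows linearly with the number of participants, while in the client-server case it stays constant and only the server's load grows.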
One emerging application for teleconferencing is three-dimensional (3-D) interactive games, where the player is given a first person viewpoint into a virtual world. These games use 3-D graphics to render a realistic world image, and employ 3-D audio techniques to render spatialized sound with environmental effects to complete the illusion of being immersed in a virtual world. These games may also allow multiple remote players, connected via a network, to compete or collaborate in the virtual world. Each player controls a virtual representation of himself, called an avatar, and can navigate in the virtual world and perform other actions. Recently, massively multiplayer online role-playing games (MMORPGs) that allow large numbers of simultaneous players have emerged.
Techniques for reproducing 3-D audio including spatial localization cues and environmental audio effects are fairly well understood; see, e.g., Gardner, “3-D Audio and Acoustic Environment Modeling,” Wave Arts white paper, 1999, available at <www.harmony-central.com/Computer/Programming/3d-audiop.pdf>. Spatial localization cues are reproduced by convolving the sound with a pair of head-related transfer functions (HRTFs), creating a binaural (stereo) signal which is presented to the listener over headphones. If the binaural signal is to be presented to the listener over loudspeakers, it is processed with a crosstalk canceller. Room reverberation can be rendered efficiently using systems of delays with feedback connections, or can be rendered less efficiently but more accurately by convolution with a sampled room response. The distance cue is rendered by varying the level of the sound with respect to the sound of the room reverberation. Discrete echoes off walls can be rendered using a delay to model the air propagation time, a digital equalizer to model the absorption of the wall, and convolution with HRTFs to spatialize the echo. Other environmental audio effects such as source directivity, object occlusion, and air absorption can be modeled using digital equalizers. The Doppler motion effect can be modeled using a variable delay. Three-dimensional interactive games use these techniques to render sounds, reproducing the spatial location, reverberation, and other environmental effects so as to recreate a completely realistic listening situation.
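Three of the techniques described above (convolution with an HRTF pair, reverberation from a delay with feedback, and a distance cue set by the direct-to-reverberant level) can be sketched in a few lines. This is a toy illustration under stated assumptions, not a production renderer: the "HRIR" pair is a placeholder consisting of an interaural delay and attenuation rather than measured HRTF data, the reverberator is a single feedback delay line fed identically to both ears, and the 1/distance gain law and all parameter values are assumptions chosen for clarity.

```python
def convolve(signal, impulse_response):
    """Direct-form convolution; output length is len(signal) + len(ir) - 1."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def toy_hrir_pair(itd_samples, far_ear_gain):
    """Placeholder impulse responses for a source on the listener's left.

    Assumption: the near (left) ear gets a unit impulse; the far (right)
    ear gets a delayed, attenuated impulse. Real HRTFs are measured.
    """
    left = [1.0] + [0.0] * itd_samples
    right = [0.0] * itd_samples + [far_ear_gain]
    return left, right


def feedback_delay(signal, delay, feedback, tail):
    """Delay line with feedback: y[n] = x[n - delay] + feedback * y[n - delay]."""
    out = []
    for n in range(len(signal) + tail):
        x_delayed = signal[n - delay] if delay <= n < delay + len(signal) else 0.0
        y_delayed = out[n - delay] if n >= delay else 0.0
        out.append(x_delayed + feedback * y_delayed)
    return out


def render(mono, distance, itd_samples=3, far_ear_gain=0.5, reverb_gain=0.2):
    """Return (left, right) signals for a mono source at the given distance."""
    # Distance cue: the direct sound falls with distance while the
    # reverberation level stays fixed (assumed 1/distance law).
    direct_gain = 1.0 / max(distance, 1.0)
    dry = [direct_gain * s for s in mono]

    hrir_left, hrir_right = toy_hrir_pair(itd_samples, far_ear_gain)
    direct_left = convolve(dry, hrir_left)
    direct_right = convolve(dry, hrir_right)

    wet = feedback_delay(mono, delay=5, feedback=0.5, tail=20)

    length = max(len(direct_left), len(direct_right), len(wet))

    def mix(direct):
        return [
            (direct[i] if i < len(direct) else 0.0)
            + reverb_gain * (wet[i] if i < len(wet) else 0.0)
            for i in range(length)
        ]

    return mix(direct_left), mix(direct_right)


if __name__ == "__main__":
    left, right = render([1.0], distance=2.0)
    print(left[:12])
    print(right[:12])
```

Moving the source farther away in this sketch lowers only the direct component, so the ratio of direct to reverberant energy drops, which is the distance cue the paragraph describes. The Doppler effect, wall echoes, and equalizer-based effects are omitted to keep the example short.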
It would be highly advantageous for participants in virtual worlds and interactive games to have the ability to talk with other participants, in essence forming a teleconference with them. However, application of existing teleconferencing technology falls short of a desirable solution. As discussed earlier, monaural teleconferencing suffers from intelligibility and talker identification problems because all talkers are necessarily located at the same position. Furthermore, monaural conferencing is unable to match the perceived location of talkers with their corresponding locations in the virtual world. Spatial teleconferencing techniques have the ability to locate talkers at different positions chosen a priori by a conference administrator, but there is no way to have the positions update dynamically as the listener changes orientation or as the participants move in the virtual space. Furthermore, reverberation, distance cues, and environmental audio effects, which are essential for conveying the sense of a realistic auditory scene, are not provided. Existing techniques do not provide methods for conferences to be created on the basis of proximity in the virtual world. Finally, there is no way to handle a large number of simultaneous participants.