1. Field of the Invention
This invention relates to implementing spatial sound in systems that enable a person to participate in an audio conference with other people across a network. Specifically, this invention relates to a system that increases the comprehensibility of one or more speakers. The system enhances a participant's ability to follow a specific speaker when several people are talking, helps the participant identify a speaker by using spatial location cues, and reduces the perceived background noise. This invention also relates to providing spatial sound in an audio or audiovisual conference, a distance learning system, or a virtual reality environment.
2. Discussion of the Related Technology
Spatial sound can be produced using a head-related transfer function. Head-related transfer functions have been estimated using dummy heads that replicate a human head. Because of the shape of the pinnae and the head, microphones placed at the ear locations of a dummy head pick up slightly different sound signals. The differences between these signals, such as differences in arrival time and level at the two ears, provide spatial location cues for locating a sound source. Several dummy heads, some complete with ears, eyes, nose, mouth, and shoulders, are pictured in Durand R. Begault, 3-D Sound for Virtual Reality and Multimedia, 148–53 (1994) (Chapter 4: Implementing 3-D Sound). U.S. Pat. No. 5,031,216 to Görike, et al. proposes a partial dummy head having only two pinna replicas mounted on a rotate/tilt mechanism. These dummy heads are used in recording studios to produce binaural stereo recordings; they are not used in teleconference environments.
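As an illustration only, the rendering step that a head-related transfer function makes possible can be sketched as convolving a monaural signal with a left-ear and a right-ear head-related impulse response (HRIR), the time-domain form of the transfer function. The four-tap HRIR values below are hypothetical placeholders chosen so that a source on the listener's right reaches the left ear later and quieter; HRIRs measured with a dummy head are much longer and vary with frequency and direction.

```python
from typing import List, Tuple


def convolve(signal: List[float], impulse: List[float]) -> List[float]:
    """Direct-form linear convolution; output has len(signal) + len(impulse) - 1 samples."""
    out = [0.0] * (len(signal) + len(impulse) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse):
            out[i + j] += s * h
    return out


def render_binaural(mono: List[float], hrir_left: List[float],
                    hrir_right: List[float]) -> Tuple[List[float], List[float]]:
    """Filter one monaural source through a left/right HRIR pair to get ear signals."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Hypothetical toy HRIRs for a source to the listener's right: the right ear
# hears the sound at once at full level, the left ear 3 samples later at half level.
HRIR_LEFT = [0.0, 0.0, 0.0, 0.5]
HRIR_RIGHT = [1.0, 0.0, 0.0, 0.0]

left, right = render_binaural([1.0, 0.5], HRIR_LEFT, HRIR_RIGHT)
print(left)   # [0.0, 0.0, 0.0, 0.5, 0.25]
print(right)  # [1.0, 0.5, 0.0, 0.0, 0.0]
```

The left-ear output is the same waveform delayed and attenuated, which are the interaural time and level cues that the dummy-head microphones capture.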
In teleconference environments, integrated services digital network (ISDN) facilities are increasingly being implemented. ISDN provides a completely digital network for integrating computer, telephone, and communications technologies. ISDN is based partially on the standardized structure of digital protocols as developed by the International Telegraph and Telephone Consultative Committee (CCITT, now ITU-T), so that, despite implementations of multiple networks within national boundaries, from a user's point of view there is a single uniformly accessible worldwide network capable of handling a broad range of telephone, facsimile, computer, data, video, and other conventional and enhanced telecommunications services.
An ISDN customer premises can be connected through a local exchange (local telephone company) to an ISDN switch. At the customer premises, an “intelligent” device, such as a digital PBX, terminal controller, or local area network, can be connected to an ISDN termination. Non-ISDN terminals may be connected to an ISDN termination through a terminal adapter, which performs D/A and A/D conversions and converts non-ISDN protocols to ISDN protocols. Basic rate ISDN provides several channels to each customer premises, namely a pair of B-channels that each carry 64 kilobits per second (kbps) of data, and a D-channel that carries 16 kbps of data. Generally, the B-channels are used to carry digital data such as pulse code modulated digital voice signals. Usually, data on the D-channel includes call signalling information to and from the central office switch regarding the status of the customer telephone, e.g., that the telephone has gone off-hook, control information for the telephone ringer, caller identification data, or data to be shown on an ISDN telephone display.
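The channel arithmetic of basic rate ISDN (often written 2B+D) can be checked in a few lines. The sketch also shows why one B-channel suits telephone voice: pulse code modulation at the standard 8 kHz sampling rate with 8-bit samples produces exactly 64 kbps. The function names are illustrative only, and the framing and maintenance overhead carried on the physical line is deliberately not modeled.

```python
B_CHANNEL_KBPS = 64  # each bearer (B) channel
D_CHANNEL_KBPS = 16  # basic rate signalling (D) channel


def bri_capacity_kbps(b_channels: int = 2, d_kbps: int = D_CHANNEL_KBPS) -> int:
    """Combined B- and D-channel capacity of a basic rate (2B+D) interface.

    Line framing and maintenance overhead are not included.
    """
    return b_channels * B_CHANNEL_KBPS + d_kbps


def pcm_voice_kbps(sample_rate_hz: int = 8000, bits_per_sample: int = 8) -> int:
    """Bit rate of telephone-band PCM voice in kbit/s."""
    return sample_rate_hz * bits_per_sample // 1000


print(bri_capacity_kbps())  # 144
print(pcm_voice_kbps())     # 64, exactly one B-channel
```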
Additionally, an Advanced Intelligent Network (AIN) has been developed that overlays ISDN facilities and provides a variety of service features to customers. Because an AIN is independent of ISDN switch capabilities, AIN services can easily be customized for individual users. U.S. Pat. Nos. 5,418,844 and 5,436,957, the disclosures of which are incorporated by reference herein, describe many features and services of the AIN.
In a teleconference environment, several methods have been suggested to transmit sound with varying degrees of sound source location information. U.S. Pat. No. 4,734,934 to Boggs, et al. proposes a binaural teleconferencing system for participants situated at various locations. Each participant has a single microphone and a stereo headset, and a conference bridge connects the participants together. A monaural audio signal from each participant's microphone is transmitted to the conference bridge. The conference bridge adds time delays to the audio signal to produce an artificial sound source location ambience. The time delays added to each incoming monaural signal simulate the location of conference participants as being in a semi-circle around a single listener. The conference bridge then transmits the delayed signals to the conference participants. This system uses a simple time delay to simulate different locations for conference participants; it does not use head-related transfer functions to create spatial sound signals representing the virtual location of each conference participant.
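The time-delay approach described for Boggs can be illustrated with a simple path-difference model in which a source at azimuth θ reaches the far ear later by d·sin θ / c. The sketch below spreads N participants evenly over a semicircle from -90° (left) to +90° (right) and converts each participant's interaural delay into whole samples at an assumed 8 kHz telephone sampling rate. The ear spacing, speed of sound, and delay model are assumptions for illustration; the description above does not specify the delays the Boggs conference bridge actually applies.

```python
import math
from typing import List, Tuple

SPEED_OF_SOUND_M_S = 343.0  # assumed, air at about 20 degrees C
EAR_SPACING_M = 0.18        # assumed distance between the ears
SAMPLE_RATE_HZ = 8000       # assumed telephone-band sampling rate


def semicircle_azimuths(n: int) -> List[float]:
    """Evenly spaced azimuths in degrees from -90 (left) to +90 (right)."""
    if n == 1:
        return [0.0]
    return [-90.0 + 180.0 * k / (n - 1) for k in range(n)]


def ear_delays_samples(azimuth_deg: float) -> Tuple[int, int]:
    """(left_delay, right_delay) in samples; only the far ear is delayed."""
    itd_s = EAR_SPACING_M * math.sin(math.radians(azimuth_deg)) / SPEED_OF_SOUND_M_S
    lag = round(abs(itd_s) * SAMPLE_RATE_HZ)
    # Positive azimuth means the source is on the right, so the left ear lags.
    return (lag, 0) if azimuth_deg > 0 else (0, lag)


def delayed_stereo(mono: List[float], azimuth_deg: float) -> Tuple[List[float], List[float]]:
    """Make a left/right pair from one monaural signal using delay alone."""
    d_left, d_right = ear_delays_samples(azimuth_deg)
    left = [0.0] * d_left + list(mono) + [0.0] * d_right
    right = [0.0] * d_right + list(mono) + [0.0] * d_left
    return left, right


for az in semicircle_azimuths(3):
    print(az, ear_delays_samples(az))
```

Because only a delay is applied, the two ear signals keep identical spectra and levels, which is the limitation noted above: no head-related transfer function shapes the signals.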
U.S. Pat. No. 5,020,098 to Celli proposes using left and right microphones for each participant that transmit a digitized audio signal and a phase location information signal to a conference bridge across ISDN facilities. The conference bridge then uses the transmitted location information to control the relative audio signal strengths of loudspeakers at the other participants' stations to simulate a position in the station for each remote participant. Again, this system does not use head-related transfer functions to place conference participants in different virtual locations.
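Controlling the relative signal strengths of loudspeakers to place a remote participant, as in the Celli system, is commonly done with a pan law. As a hedged sketch, using a standard constant-power pan law that is not necessarily the rule Celli uses, a participant's position p in [-1, 1] is mapped to an angle from 0 to π/2, and the left and right gains are the cosine and sine of that angle, so the total radiated power stays constant as the position changes. The function names are hypothetical.

```python
import math
from typing import List, Tuple


def pan_gains(position: float) -> Tuple[float, float]:
    """Constant-power pan: -1 = full left, 0 = center, +1 = full right."""
    if not -1.0 <= position <= 1.0:
        raise ValueError("position must lie in [-1, 1]")
    theta = (position + 1.0) * math.pi / 4.0  # maps [-1, 1] onto [0, pi/2]
    return math.cos(theta), math.sin(theta)


def pan(mono: List[float], position: float) -> Tuple[List[float], List[float]]:
    """Scale one monaural signal into left and right loudspeaker feeds."""
    g_left, g_right = pan_gains(position)
    return [g_left * s for s in mono], [g_right * s for s in mono]


left, right = pan([1.0, -1.0], 0.0)
# At center both gains are about 0.707, and g_left**2 + g_right**2 equals 1
# (up to rounding) at every position.
```

Like the delay-only approach, panning changes only level between loudspeakers; it does not reproduce the spectral cues a head-related transfer function provides.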
U.S. Pat. No. 4,815,132 to Minami proposes a system for transmitting sound having location information in a many-to-many teleconferencing situation. This system includes right and left microphones that receive audio signals at a first location. Based on the differences between the right and left audio signals received by the microphones, the system transmits a single channel and an estimated transfer function across ISDN facilities. At a receiving location, the right and left signals are reproduced based on the single channel signal and the transfer function. Afterwards, the reproduced signals are transmitted to right and left loudspeakers at the receiving station. This system also does not use head-related transfer functions to create a virtual location for each conference participant.
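The idea of sending one channel plus a transfer function, as described for Minami, can be sketched with a deliberately simplified transfer function consisting of a single integer delay and a gain. In this hypothetical sketch the left signal is the transmitted channel; the sender estimates the delay by cross-correlation and the gain by least squares, and the receiver rebuilds the right signal from the left one. It assumes equal-length channels with the right channel lagging the left; the actual estimation method of the Minami patent is not described here and may differ.

```python
from typing import List, Tuple


def estimate_delay_gain(left: List[float], right: List[float],
                        max_lag: int) -> Tuple[int, float]:
    """Find the lag maximizing cross-correlation, then the least-squares gain."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        score = sum(left[n - lag] * right[n] for n in range(lag, len(right)))
        if score > best_score:
            best_lag, best_score = lag, score
    energy = sum(left[n - best_lag] ** 2 for n in range(best_lag, len(right)))
    gain = best_score / energy if energy else 0.0
    return best_lag, gain


def reconstruct_right(left: List[float], lag: int, gain: float) -> List[float]:
    """Receiver side: apply the transmitted delay and gain to the single channel."""
    return [gain * left[n - lag] if n >= lag else 0.0 for n in range(len(left))]


left = [0.0, 1.0, -0.5, 0.25, 0.0, 0.0]
right = [0.0, 0.0, 0.0, 0.5, -0.25, 0.125]  # left delayed 2 samples, halved
lag, gain = estimate_delay_gain(left, right, max_lag=3)
print(lag, gain)  # 2 0.5
```

A practical system would estimate a frequency-dependent filter rather than one delay and gain; the reduction to two numbers only shows why one channel plus a transfer function can stand in for two channels.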
None of the systems described above uses head-related transfer functions in a teleconference environment. Thus, these systems do not truly produce spatial sound that places conference participants in virtual locations, which would make it easier to identify speakers and distinguish their speech.