In a conventional face-to-face meeting, the participants are normally seated around a table and are able to turn their heads towards a speaking participant, both in order to see that participant and to maximize the correlation of the speech reaching each ear, which will maximize the signal-to-noise ratio.
When more than one person is talking at the same time, a human listener is able to separate the speech from the different sound sources, based on the spatial distribution of the sound, and may concentrate on listening to a specific person. This ability is commonly referred to as the “cocktail party effect”.
However, in a conventional teleconference system, a mono-microphone will capture the speech in each of the different participating rooms, and the speech signals will be added and returned to the participating rooms through loudspeakers or headphones. Thus, in a virtual meeting, a listener may have difficulty identifying a speaking participant, and distinguishing an individual speaking participant when several participants are talking at the same time, since all the participants will appear to have the same spatial position relative to the listening participant, i.e. the position of the loudspeaker.
Adding video to the teleconference will enable the participants to see who is talking, but the problem of distinguishing an individual speaking participant, when several of the participants are talking simultaneously, will remain. However, using three-dimensional (3D) positional audio will solve this problem, enabling a participant to perceive the sound as in the real world, i.e. to “hear” the direction of and the distance to a sound source. When 3D positional audio is used in a teleconference, a virtual room is reproduced with each of the participants located at a different virtual position, by rendering the speech of each participant as a 3D-positioned virtual sound source.
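The rendering of a speech signal as a virtual sound source in a given direction can be illustrated by a minimal sketch. Constant-power amplitude panning is used here purely as a simplified stand-in for the binaural (e.g. HRTF-based) rendering a real 3D positional audio system would employ; the function name and parameters are illustrative, not taken from any described system.

```python
import math

def pan_mono_to_stereo(samples, azimuth_deg):
    """Place a mono speech signal at an azimuth using constant-power
    amplitude panning (a crude stand-in for full HRTF rendering).
    azimuth_deg: -90.0 (hard left) .. +90.0 (hard right)."""
    # Map the azimuth to a pan angle in [0, pi/2].
    theta = (azimuth_deg + 90.0) / 180.0 * (math.pi / 2.0)
    left_gain = math.cos(theta)
    right_gain = math.sin(theta)
    left = [s * left_gain for s in samples]
    right = [s * right_gain for s in samples]
    return left, right
```

A source panned to 0° reaches both ears with equal gain, while a source panned to −90° reaches only the left channel, which is how the listener "hears" a direction.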
FIG. 1 illustrates an exemplary conventional 3D positional audio system comprising a conference bridge 1, provided with a mixer 2 and a number of user channels 3, to which the participants of a teleconference are able to connect with different types of user terminals 4a, 4b, 4c. The conference bridge will mix the audio signals depending on the capabilities of the user terminals and their connections, and a virtual room can be created either centrally in the conference bridge or locally in the user terminals. Further, the conference bridge may communicate control data, including positional information and source identification, in addition to the audio.
In local rendering, the main task of the conference bridge is to decide which participants' speech signals should be redirected to which local-rendering user terminals, i.e. the encoded speech signals of all the participants or of only a few actively speaking participants; the control of the virtual room, as well as the 3D positional audio rendering, will be performed in the user terminal of each participant. If no transcoding is needed in the conference bridge, i.e. all the user terminals support each other's codec formats, the function of the conference bridge is computationally inexpensive, since the conference bridge only has to redirect the incoming bitstreams and does not perform any encoding or audio rendering of the 3D positional audio environments.
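The redirection decision under local rendering can be sketched as follows. The `route_bitstreams` helper, its parameters, and the active-speaker cap are illustrative assumptions; the point is only that the bridge forwards opaque encoded bitstreams without decoding or rendering them.

```python
def route_bitstreams(encoded, active_speakers, max_forwarded=None):
    """For each listening participant, select which other participants'
    encoded speech bitstreams the bridge should forward for local rendering.
    encoded: dict participant_id -> encoded bitstream (opaque bytes)
    active_speakers: ids the bridge currently judges to be talking
    max_forwarded: optional cap on forwarded streams per listener."""
    routes = {}
    for listener in encoded:
        # Forward only active talkers, and never the listener's own stream;
        # the bitstreams are passed through untouched (no transcoding).
        selected = [p for p in active_speakers if p != listener]
        if max_forwarded is not None:
            selected = selected[:max_forwarded]
        routes[listener] = {p: encoded[p] for p in selected}
    return routes
```

Because the loop only copies references to incoming bitstreams, the per-frame cost is linear in the number of participants, which is what makes local rendering computationally inexpensive for the bridge.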
However, in conventional central rendering, the conference bridge will handle essentially everything, including audio processing, such as noise suppression and sound level adjustment of the input signals, the rendering of the 3D positional audio environments, as well as the encoding of the created 3D positional audio environment signals. The user terminal of each participant will only decode its respective encoded signal, and possibly present a GUI (Graphical User Interface) showing the simulated virtual room. For each participating user terminal, the conference bridge will create a virtual 3D positional audio environment, which requires 3D audio rendering of the incoming speech signals from all the participants. Since a unique 3D positional audio environment signal is created for each participant as a listening participant, the number of output signals to encode will correspond to the number of participants.
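The per-listener work of a centrally rendering bridge can be sketched as below. The helper name, the position table, and the injected `pan` function are assumptions for illustration; the sketch only shows that one unique stereo environment is mixed per listener, from the streams of all the other participants.

```python
def central_render(speech, positions, pan):
    """Render one stereo 3D audio environment per listening participant.
    speech: dict id -> list of mono samples (all equal length)
    positions: dict listener_id -> dict speaker_id -> azimuth (degrees)
    pan: function (samples, azimuth_deg) -> (left, right) channel lists."""
    environments = {}
    n = len(next(iter(speech.values())))
    for listener in speech:
        mix_l = [0.0] * n
        mix_r = [0.0] * n
        for speaker, samples in speech.items():
            if speaker == listener:
                continue  # a listener does not hear their own stream
            l, r = pan(samples, positions[listener][speaker])
            mix_l = [a + b for a, b in zip(mix_l, l)]
            mix_r = [a + b for a, b in zip(mix_r, r)]
        # One unique stereo environment per listener, still to be encoded.
        environments[listener] = (mix_l, mix_r)
    return environments
```

The nested loop makes the cost structure visible: every listener/speaker pair is rendered, and every listener's mix must then be stereo-encoded.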
The conventional positioning of the participants in a virtual room, i.e. a 3D audio environment 20, is evenly spaced around a round table, as illustrated in FIG. 2, in which the dashed lines reflect the directions of the speech from the respective participants, indicated by U2 to U8, to a listening participant, indicated by U1.
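The geometry of such a round-table placement can be sketched as follows, assuming (purely for illustration) seats evenly spaced on a unit circle with the listener facing the table centre; the function name and sign convention are not taken from the described system.

```python
import math

def round_table_azimuths(num_participants):
    """Seats U1..UN evenly spaced on a unit circle; U1 is the listener,
    facing the table centre. Returns the azimuth (degrees) of each other
    seat as heard from U1: 0 = straight ahead, negative = one side,
    positive = the other. A simplified geometric sketch."""
    seats = [(math.cos(2 * math.pi * k / num_participants),
              math.sin(2 * math.pi * k / num_participants))
             for k in range(num_participants)]
    lx, ly = seats[0]
    # The listener faces the table centre, i.e. the direction (-lx, -ly).
    facing = math.atan2(-ly, -lx)
    azimuths = {}
    for k in range(1, num_participants):
        dx, dy = seats[k][0] - lx, seats[k][1] - ly
        rel = math.degrees(math.atan2(dy, dx) - facing)
        # Normalise to (-180, 180].
        rel = (rel + 180.0) % 360.0 - 180.0
        azimuths[f"U{k + 1}"] = rel
    return azimuths
```

For eight participants, the seat opposite the listener lies straight ahead, and the remaining seats come in symmetric left/right pairs — but at unequal distances, which motivates the arc placement discussed below in connection with FIGS. 6a and 6b.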
The conference bridge will perform a 3D positional audio rendering of the speech signals in order to simulate the relative position of each speaking participant with respect to a listening participant. Conventionally, the relative position of a certain participant as a speaking participant will be different with respect to each listening participant, but the absolute position will be the same, as in a non-virtual meeting.
The patent application PCT/SE2007/050344 describes enhanced methods for positioning the different participants in a virtual room, in order to improve the experience of a virtual meeting. These include placing the participants on an arc relative to a listening participant in a 3D positional audio environment created for the listening participant, and adaptively changing the positions in order to achieve symmetry, or in order to spatially separate the active talkers. Positioning the virtual sound sources corresponding to the participants on an arc is advantageous, since the distances to all the other participants will be equal, the maximum angles to the left and right will become smaller, and the sound will be more pleasant. FIG. 6a illustrates a round-table 3D positional audio environment 60, including seven participants, with a listening position 13 for the listening participant, and FIG. 6b shows this round-table environment 60 transformed into an arc environment 61.
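The arc placement can be sketched as below. The maximum angle of ±60° and the unit radius are assumed values chosen for illustration; the cited application does not fix these numbers here.

```python
def place_on_arc(num_sources, max_angle_deg=60.0, radius=1.0):
    """Place virtual sound sources evenly on an arc in front of the
    listener, spanning [-max_angle_deg, +max_angle_deg], all at the
    same distance. Returns a list of (azimuth_deg, distance) pairs.
    Illustrative sketch with assumed parameter values."""
    if num_sources == 1:
        angles = [0.0]
    else:
        step = 2.0 * max_angle_deg / (num_sources - 1)
        angles = [-max_angle_deg + i * step for i in range(num_sources)]
    # Equal radius for every source: equal perceived distance to all
    # other participants, with a bounded maximum angle left and right.
    return [(a, radius) for a in angles]
```

Compared with the round-table geometry, every source now sits at the same distance and the extreme left/right angles are capped, which is the stated advantage of the arc environment of FIG. 6b.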
A problem with the existing solutions is that central rendering is computationally expensive, since the conference bridge not only has to process the input signal from each participant, e.g. perform decoding, noise suppression and sound level adjustment, but also has to create an individual virtual 3D positional audio environment for each participant as a listening participant. Further, in order to simulate a virtual room, involving a spatial positioning of the participants with 3D audio rendering, the speech signals may have to be re-sampled to a different sampling rate, depending on the type of the user terminals. Both the re-sampling and the 3D audio rendering are costly tasks, and since a unique individual 3D positional audio environment is created for each participant as a listener, and each participant is included in the 3D positional audio environments for all the other participants, these costs will grow rapidly with an increasing number of participants.
When the audio signal simulating the 3D positional audio environment has been rendered, the signal has to be encoded before being transmitted to the user terminal of a participant. Normally, the rendered 3D positional audio environment is represented by a stereo signal, which means that a stereo codec is required for the encoding. The encoding of a stereo signal is an expensive task in terms of computational complexity, and since a unique individual 3D positional audio environment is rendered for each participant as a listener, the complexity may be very high, depending on the number of participants. Further, since the number of required encoders will correspond to the number of rendered individual 3D positional audio environments, the computational complexity will grow rapidly with an increasing number of participants.
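The growth of the central-rendering costs described above can be made concrete with a simple counting sketch. The cost model (one input-processing pass per participant, one rendered source per listener/speaker pair, one stereo encode per listener) follows the text; the function itself is an illustrative assumption.

```python
def central_rendering_costs(num_participants):
    """Count the per-frame operations a centrally rendering conference
    bridge performs, under the simple model in the text: each listener
    hears every other participant, and each listener's unique stereo
    environment must be encoded separately."""
    n = num_participants
    return {
        "input_processing": n,             # decode, noise-suppress, level-adjust
        "source_renderings": n * (n - 1),  # one per (listener, speaker) pair
        "stereo_encodes": n,               # one unique environment per listener
    }
```

The quadratic `n * (n - 1)` term for the 3D renderings, together with the linear but per-operation expensive stereo encodes, is why the complexity grows rapidly with an increasing number of participants.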