Various audio and video conferencing services have been available for a long time, particularly in circuit-switched telecommunications networks. Teleconferencing systems can be divided into distributed and centralized systems, of which the latter have proved more advantageous for providing teleconferencing services, from the standpoint of both the service providers and the implementation of terminals.
FIG. 1 illustrates a prior art design for implementing a centralized audio conference service. The teleconferencing system comprises a conference bridge CB and several terminals UE that communicate with it. Each terminal UE receives the terminal user's speech by a microphone and encodes the speech signal with a speech codec known per se. The encoded speech is transmitted to the conference bridge CB, which decodes the speech signal from the received signal. The conference bridge CB combines the speech signals received from different terminals in an audio processing unit APU using a prior art processing method, after which the combined signal comprising several speech signals is encoded by a speech codec known per se and transmitted back to the terminals UE, which decode the combined speech signal from the received signal. An audible audio signal is produced from the combined speech signal by loudspeakers or headphones. To avoid harmful echo phenomena, the audio signal transmitted to the conference bridge by a terminal is typically removed from the combined audio signal to be transmitted to that terminal.
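The combining step in the audio processing unit, including the removal of each terminal's own signal, can be illustrated with a minimal Python sketch. It assumes that the speech frames have already been decoded into equal-length lists of samples and that combining is a plain sample-wise sum; the function name `mix_minus` and the terminal labels are hypothetical and are not taken from any of the cited systems.

```python
from typing import Dict, List


def mix_minus(frames: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Return, for each terminal, the sum of all other terminals' frames.

    `frames` maps a terminal label to one decoded speech frame.
    Assumption: combining is a plain sample-wise sum (no gain control).
    """
    if not frames:
        return {}
    length = len(next(iter(frames.values())))
    if any(len(frame) != length for frame in frames.values()):
        raise ValueError("all frames must have the same length")

    # Sum every terminal's frame once, sample by sample.
    total = [sum(samples) for samples in zip(*frames.values())]

    # Subtract each terminal's own contribution so that it does not
    # hear its own speech echoed back in the combined signal.
    return {
        terminal: [t - s for t, s in zip(total, own)]
        for terminal, own in frames.items()
    }


decoded = {
    "UE1": [1.0, 2.0],
    "UE2": [10.0, 20.0],
    "UE3": [100.0, 200.0],
}
combined = mix_minus(decoded)
```

In this sketch the total is computed once and each terminal's own frame is subtracted from it, so the cost grows linearly with the number of terminals. A practical bridge would additionally encode each resulting signal before transmission.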
The combined signal is typically produced in the conference bridge as a single-channel (monophonic) audio signal or as a two-channel (stereophonic) audio signal. In the conference bridge, a spatial effect, known as spatialization, can be created artificially in a two-channel audio signal. The audio signal is then processed to give the listeners the impression that the conference call participants are at different locations in the conference room. Consequently, the audio signals to be reproduced on different audio channels differ from one another. When a single-channel audio signal is used, all speech signals (i.e. the combined signal) are reproduced as mixed on the same audio channel.
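The difference between a two-channel spatialized signal and a monophonic mix can be illustrated with a minimal Python sketch. It assumes simple constant-power amplitude panning, which is only one possible way of creating a spatial impression and is not stated to be the method of the prior art systems; the names `pan_gains` and `spatialize` and the azimuth convention (-90 degrees for full left, +90 degrees for full right) are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def pan_gains(azimuth_deg: float) -> Tuple[float, float]:
    """Constant-power (left, right) gains for an azimuth in [-90, 90]."""
    clamped = max(-90.0, min(90.0, azimuth_deg))
    # Map [-90, 90] degrees onto a pan angle in [0, pi/2].
    theta = (clamped + 90.0) / 180.0 * (math.pi / 2)
    return math.cos(theta), math.sin(theta)


def spatialize(
    frames: Dict[str, List[float]], azimuths: Dict[str, float]
) -> Tuple[List[float], List[float]]:
    """Mix per-participant frames into a left and a right channel.

    Assumption: each participant is placed by amplitude panning only.
    """
    length = len(next(iter(frames.values()))) if frames else 0
    left = [0.0] * length
    right = [0.0] * length
    for name, frame in frames.items():
        gain_left, gain_right = pan_gains(azimuths[name])
        for i, sample in enumerate(frame):
            left[i] += gain_left * sample
            right[i] += gain_right * sample
    return left, right


left, right = spatialize(
    {"UE1": [0.5, 0.5], "UE2": [0.2, -0.2]},
    {"UE1": -45.0, "UE2": 45.0},
)
```

Because each participant receives a different pair of gains, the left and right channels differ from one another, whereas a monophonic mix would apply the same gain to every participant on a single channel.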
The spatialization, if properly implemented, improves the speech intelligibility of the conference call participants, since the listener is able to sense the speech of each participant coming from a different direction. Accordingly, spatialization is a desired feature in conference call systems. Prior art teleconferencing systems including spatialization are described e.g. in WO 99/53673, U.S. Pat. No. 6,125,115 and U.S. Pat. No. 5,991,385.
However, these prior art arrangements have significant disadvantages. To create a spatialization effect, the receiving terminal requires information as to which participant is speaking at each moment. In most cases, the teleconference bridge is capable of determining this information, but the information has to be included in the output signal of the teleconference bridge to be transmitted to each participating terminal. There is no standardized way to include this additional information in the signal to be transmitted. Moreover, including this additional information increases the bandwidth used in data transmission, which is a further disadvantage.
An alternative known prior art method for creating a spatialization effect is to provide a spatialization unit within the conference bridge. All input channels are spatialized in the spatialization unit and the spatialized signal is transmitted to each participating terminal. This, in turn, increases the complexity of the conference bridge. The signal including the spatialization information also requires greater bandwidth.
Furthermore, in certain cases even the teleconference bridge is not capable of defining which participant is speaking at each moment. For example, it is possible to use the teleconference bridge as a gateway between a monophonic conference network and a 3D-capable (stereo/n-phonic) conference network. In such a situation, the gateway teleconference bridge receives, from a teleconference bridge of the monophonic conference network, a combined signal comprising all speech signals of the participants of the monophonic conference network. Again, additional information defining which participant is speaking at each moment should be included in the combined signal in order to enable the gateway teleconference bridge to separate the speakers from each other for further spatialization processing.