Conventionally, a technical implementation of a multi-party call, e.g. a voice conference, comprises a central mixing device that mixes the media streams originating from the participants in the conference into a single media stream per media type, which is delivered to every participating client. For a voice conference, this corresponds to one mono media stream or one artificial stereo media stream. One reason for delivering only one media stream to each participant has been the limited access bandwidth.
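The central mixing described above can be illustrated by a minimal sketch, assuming that each participant contributes one frame of 16-bit PCM samples per mixing interval. The function name `mix_mono` and the clipping strategy are assumptions made for illustration only, and do not reflect any particular conference bridge.

```python
def mix_mono(frames):
    """Mix one frame per participant into a single mono frame.

    frames: list of equally timed sample lists (16-bit PCM values, assumed).
    The samples are summed position by position and clipped to the
    16-bit range; a real mixer would typically also exclude each
    listener's own voice and apply gain control.
    """
    if not frames:
        return []
    length = min(len(frame) for frame in frames)  # tolerate uneven frames
    mixed = []
    for i in range(length):
        total = sum(frame[i] for frame in frames)
        mixed.append(max(-32768, min(32767, total)))  # hard clipping
    return mixed
```

In this sketch every participant receives the same mixed frame, which is why only one stream per media type has to be delivered over the access link.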
However, recent technologies, e.g. VDSL2 (Very high speed Digital Subscriber Line 2), provide a much larger access bandwidth, which removes the bandwidth limitation, at least in applications with a low or moderate bandwidth requirement, e.g. a voice conference.
In order to provide true stereo or 3D (Three-Dimensional) positional audio to each participant in a multi-party call, a unique media stream has to be rendered for each client, based on the position and orientation of the client, and on the position and orientation of the other participants in the call. Thus, the central rendering framework needs information regarding the position and orientation of each participant, and has to implement one rendering engine for each client. Further, these rendering engines have to be constantly updated with the position and orientation of each participant. This is especially challenging in large and highly dynamic conference calls, such as in virtual world gaming. In such an advanced audio mixing scenario, involving a large number of participants in a multi-party call, central voice mixing will lead to a complicated system architecture for the media rendering, requiring a very large processing capacity in the central voice mixing device.
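The dependence of the rendered stream on the position and orientation of both the listener and the other participants can be sketched as follows. This is a minimal two-dimensional illustration, assuming a simple constant-power stereo pan with 1/distance attenuation instead of a full 3D positional audio renderer; the names `Participant`, `relative_azimuth` and `stereo_gains` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Participant:
    name: str
    x: float
    y: float
    heading: float  # radians; direction the participant faces, 0 = +x axis


def relative_azimuth(listener, source):
    """Angle of the source seen from the listener, relative to the
    listener's facing direction, wrapped to the range [-pi, pi]."""
    angle = math.atan2(source.y - listener.y, source.x - listener.x)
    angle -= listener.heading
    return math.atan2(math.sin(angle), math.cos(angle))


def stereo_gains(listener, source):
    """Return (left, right) gains for one source as heard by one listener.

    Assumed model: constant-power pan plus 1/distance attenuation.
    Sources in front of and behind the listener are not distinguished.
    """
    az = relative_azimuth(listener, source)
    distance = max(1.0, math.hypot(source.x - listener.x,
                                   source.y - listener.y))
    # Positive azimuth is counter-clockwise, i.e. to the listener's left.
    pan = (1.0 - math.sin(az)) / 2.0  # 0 = full left, 1 = full right
    left = math.cos(pan * math.pi / 2.0) / distance
    right = math.sin(pan * math.pi / 2.0) / distance
    return left, right
```

Since the gains depend on the listener as well as on the source, the same talker is rendered differently for every listener, and any change in position or orientation of either party requires the gains to be recomputed.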
In a conventional central rendering, a media server, typically a conference bridge comprising a mixer, handles basically everything, including the audio processing, the rendering of 3D positional audio, as well as the encoding of the created 3D positional audio signals for each client. The client user equipment of each participant will only decode its respective encoded signal, and possibly present a GUI (Graphical User Interface) to the user. For each participating client user equipment, the conference bridge will create a 3D positional audio signal, which requires 3D positional audio rendering of the incoming voice signals from all the participants. Since a unique 3D positional audio signal is created for each participant, the number of output signals to encode will correspond to the number of participants.
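The growth of the workload in the conference bridge can be made concrete by a small sketch. It assumes that the participant's own voice is excluded from the signal rendered for that participant, which is an assumption made here for illustration; the function name `central_rendering_load` is hypothetical.

```python
def central_rendering_load(num_participants, include_own_voice=False):
    """Count the per-interval work of a central 3D audio renderer.

    One output signal is rendered and encoded per participant, and each
    output requires positional rendering of the incoming voice signals.
    """
    n = max(0, num_participants)
    sources_per_output = n if include_own_voice else max(0, n - 1)
    return {
        "outputs_to_encode": n,
        "source_renderings": n * sources_per_output,
    }
```

Under this assumption a call with 10 participants requires 10 encodings and 90 positional source renderings per interval, and the number of source renderings grows quadratically with the number of participants.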
Further, in a conventional central rendering, the latency of the positional information in highly interactive applications may make a faithful voice rendering impossible, and thus deteriorate the user experience.
In a local rendering, on the contrary, the main task of the central media server is to decide which media streams of a multi-party call should be redirected to which client user equipment for local rendering, e.g. the media streams from all the participants, or alternatively from only a few actively speaking participants. Upon receiving the selected media streams from the media server, the client user equipment of each participant will perform local media rendering. If no transcoding is needed in the media server, i.e. if all the client user equipments support the codecs of every other client user equipment, the media server only has to redirect the incoming media streams, and does not have to perform any encoding or audio rendering.
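The stream selection performed by the media server in a local rendering can be sketched as follows. The sketch assumes that the server has a speech energy estimate per participant and that a client does not receive its own stream; both are assumptions made for illustration, and the function name `select_streams` is hypothetical.

```python
def select_streams(recipient, speech_energy, max_active=None):
    """Choose which participants' streams to redirect to one client.

    recipient:     identifier of the receiving client
    speech_energy: dict mapping participant identifier to a current
                   speech energy estimate (assumed to be available)
    max_active:    None to forward all other streams, or the number of
                   most active talkers to forward
    """
    candidates = [p for p in speech_energy if p != recipient]
    if max_active is None:
        return sorted(candidates)
    # Rank by descending energy; break ties by identifier for stability.
    ranked = sorted(candidates, key=lambda p: (-speech_energy[p], p))
    return ranked[:max_active]
```

The server in this sketch only decides on the routing of the streams; no mixing, rendering or encoding takes place centrally, which corresponds to the case described above where no transcoding is needed.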
WO2009/092060 describes a system for rendering of the media locally in the client. A local rendering of 3D positional audio requires less processing capacity in the central device, i.e. a media server, and reduces the latency in the positional information. In the system described in WO2009/092060, each media stream contains the media data (i.e. the voice) and the positional information (i.e. the location and energy of the media component). Furthermore, WO2009/092060 discloses a per-participant “filter component” (see e.g. 111 in FIG. 1) that accesses the positional information and the media streams of all participants, as well as local information related to the participants. However, the implementation of the system is comparatively complicated.