In conferencing applications, it is known to use one or more multipoint control units (MCUs) to control audio from a plurality of audio sources. In general, each audio source is represented by a client, and a plurality of clients are connected to one MCU. A plurality of MCUs may be interconnected as a mesh, as a tree, or in a hybrid mesh/tree structure. When large voice conferences are established, the requirements on the MCU grow with the number N of attached clients. In particular, the performance and bandwidth requirements for the multipoint processor (MP), which processes the voice itself, grow with each additional client. Interconnecting several MCUs provides some scalability. However, each MCU introduces a payload delay which cannot be reduced below a few tens of milliseconds. Thus, scalability is limited.
To be more specific, FIG. 2 shows a prior-art example in which a multipoint control unit (MCU) 10 controls a plurality of clients 30 (client 1 . . . client N). Here, a number N of clients 30 is assumed, N denoting the total number of clients to be controlled. In this configuration, as mentioned above, both the performance of the MCU and its network connection are bottlenecks.
FIG. 3 shows another prior-art example in which multiple MCUs 10, each controlling a plurality of clients 30, are interconnected in the form of a full mesh. In this configuration, a good voice delay can be attained, as the voice delay is limited to 2×MCU_delay. It will be noted, however, that the number of MCU interconnections grows quickly, following the relation m*=(M×(M−1))/2, where m* is the number of MCU-MCU interconnections among all MCUs in the mesh and M is the number of MCUs in the mesh. With m** representing the maximum number of possible MCU-MCU and MCU-Client connections, the maximum number Nmax of clients is limited to about Nmax=(m**+1)²/4.
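The full-mesh relations can be checked numerically. The following is a minimal sketch, not part of the cited prior art. It assumes that m** is a per-MCU connection limit, that each MCU in a mesh of M MCUs spends M−1 connections on its peers and the remainder on clients, and that the Nmax relation is read as (m**+1) squared, divided by four. The function names are hypothetical.

```python
from typing import Tuple


def mesh_interconnections(num_mcus: int) -> int:
    """MCU-MCU links in a full mesh of M MCUs: m* = M*(M-1)/2."""
    return num_mcus * (num_mcus - 1) // 2


def mesh_clients(num_mcus: int, max_conn: int) -> int:
    """Clients servable when every MCU has max_conn connections in total.

    Assumption: each MCU uses M-1 connections for its peers, so only
    max_conn - (M-1) connections remain for clients.
    """
    per_mcu = max_conn - (num_mcus - 1)
    return num_mcus * max(per_mcu, 0)


def mesh_max_clients(max_conn: int) -> Tuple[int, int]:
    """Brute-force the mesh size M that maximises the total client count."""
    best_m = max(range(1, max_conn + 2),
                 key=lambda m: mesh_clients(m, max_conn))
    return best_m, mesh_clients(best_m, max_conn)


if __name__ == "__main__":
    for limit in (9, 10, 49):
        m, n = mesh_max_clients(limit)
        closed_form = (limit + 1) ** 2 / 4
        print(f"m**={limit}: best M={m}, links={mesh_interconnections(m)}, "
              f"clients={n}, (m**+1)^2/4={closed_form}")
```

Under these assumptions the brute-force maximum coincides with the closed form when m** is odd and is its integer part when m** is even, which is consistent with the word "about" in the relation.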
FIG. 4 shows another prior-art example in which multiple MCUs 10, each controlling a plurality of clients 30, are interconnected in the form of a two-level tree structure. In this configuration, the voice delay grows with the number of levels and is, in this example, 3×MCU_delay, which may be at the edge of acceptability or may be unacceptable on lower-quality networks. With m** again representing the maximum number of MCU connections, the number of clients N is limited to about Nmax=m**×(m**−1).
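The trade-off between capacity and delay in the two topologies can be illustrated with a short calculation. This is a sketch under stated assumptions: m** is taken as a per-MCU connection limit, the root MCU of the tree serves only child MCUs, each child MCU spends one connection on the root, and the mesh limit is read as (m**+1) squared, divided by four. The value of 30 ms per MCU is an illustrative figure chosen within "a few tens of milliseconds", and the function names are hypothetical.

```python
def tree_max_clients(max_conn: int) -> int:
    """Two-level tree: the root uses all max_conn links for child MCUs;
    each child keeps one link for the root and max_conn - 1 for clients."""
    return max_conn * (max_conn - 1)


def mesh_max_clients(max_conn: int) -> int:
    """Full-mesh limit, (m**+1)^2 / 4, rounded down to whole clients."""
    return (max_conn + 1) ** 2 // 4


def worst_case_delay_ms(mcus_on_path: int, mcu_delay_ms: float) -> float:
    """Payload delay contributed by the MCUs a voice stream traverses."""
    return mcus_on_path * mcu_delay_ms


if __name__ == "__main__":
    max_conn = 9          # assumed per-MCU connection limit
    mcu_delay_ms = 30.0   # assumed delay per MCU, for illustration only
    print("mesh:", mesh_max_clients(max_conn), "clients,",
          worst_case_delay_ms(2, mcu_delay_ms), "ms")
    print("tree:", tree_max_clients(max_conn), "clients,",
          worst_case_delay_ms(3, mcu_delay_ms), "ms")
```

With these example values, the tree serves more clients than the mesh, but at the cost of one additional MCU delay on the longest path.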
The MCU configurations described above do not scale well because each MCU contains a jitter buffer and a mixing unit, which introduce a significant delay that cannot be reduced. Moreover, mixing algorithms do not actually use all input streams; they select just some of them, and only the selected streams are mixed.
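The select-then-mix behaviour can be sketched as follows. This is an illustration only, not the algorithm of any particular MCU. It assumes that streams are selected by per-frame energy, that all frames have equal length with samples in the range −1.0 to 1.0, and that mixing is a clipped sample-wise sum. Jitter buffering and the exclusion of a listener's own stream are omitted. All names are hypothetical.

```python
from typing import Dict, List, Tuple


def frame_energy(frame: List[float]) -> float:
    """Sum of squared samples, used here as the (assumed) selection criterion."""
    return sum(sample * sample for sample in frame)


def select_and_mix(frames: Dict[str, List[float]],
                   k: int = 3) -> Tuple[List[str], List[float]]:
    """Pick the k most energetic input frames and mix only those.

    frames maps a client identifier to one audio frame of that client.
    Returns the selected client identifiers and the mixed frame.
    """
    selected = sorted(frames,
                      key=lambda cid: frame_energy(frames[cid]),
                      reverse=True)[:k]
    if not selected:
        return [], []
    length = len(frames[selected[0]])
    mixed = []
    for i in range(length):
        total = sum(frames[cid][i] for cid in selected)
        mixed.append(max(-1.0, min(1.0, total)))  # clip to the sample range
    return selected, mixed


if __name__ == "__main__":
    inputs = {
        "a": [0.5, 0.5],
        "b": [0.1, 0.1],
        "c": [-0.25, 0.25],
        "d": [0.0, 0.0],
    }
    print(select_and_mix(inputs, k=2))
```

In this sketch the mixing cost depends on k rather than on the number of attached clients, while the selection step still has to inspect every input stream.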
From WO2012120240 or US2013342639 it is known to distribute the mixing of (video) audio streams. There are one main media server and one or more secondary media servers, and clients are connected to these servers. The main media server comprises a selection module to select a plurality of incoming streams and a global mixing unit to create an aggregated stream comprising the selected streams. A secondary server comprises a local mixing unit to mix input streams, which are selected by the main server's selection module. As a result, two planes of mixing units are provided.
According to EP2285106, which is similar to US2013342639, the distributed mixing units are controlled by a common application server. As above, distributed mixing is provided.
U.S. Pat. No. 8,437,281 discloses that the mixing process is distributed across nodes in a network and may even take place in an end node (i.e., a terminal). Payload (or session) paths between the various nodes are kept free of loops by establishing a tree hierarchy with one root node and a number of leaf nodes. Tree establishment depends on the sequence in which the nodes enter the conference.