Tele- and videoconferencing servers enabling a spatial meeting experience have been the subject of a number of the applicant's previous patent applications and other disclosures. Efficiency and high scalability in such conferencing servers can be achieved for instance by their ability to forward data with minimal processing, so that a given mixing strategy—a desired composition of the output signals—can be realized at small computational cost and the resulting signals can be distributed with moderate bandwidth. The sense of a continuous and plausible audio scene is achieved with only a few simultaneous outgoing audio streams (possibly combined with output video or other media data) sent to any endpoint for rendering. Further, if “mixing” is seen as an operation that creates a specific signal or concatenation of data that arises from multiple input data sources, then the required amount of actual mixing—adding signals or channels together as opposed to simply forwarding—can be significantly reduced by using a family of layered spatial audio formats across the conferencing system.
It is desirable to allow each endpoint to decide on its preferred input and output layered audio formats, possibly during operation as well, to account for network jitter, packet loss rate and similar temporary variations. It is further desirable to concentrate server resources on participants deemed to be playing a prominent role in the conference (e.g., by mixing their data at an improved fidelity compared to participants interjecting into the discussion) and to avoid processing audio streams that currently do not carry any meaningful information, such as background noise rather than speech by a participant. These two aims together give rise to a considerable number of possible mixing configurations, depending on:
    the spread of rendering capabilities of the endpoints and choices made by users of the endpoints, which influences the number of unique output signals;
    the variations in available momentary bandwidth between the server and the endpoints, which influences the set of suitable output formats;
    the spread of different input signal formats, which influences the amount of format conversions required in connection with mixing; and
    the number of simultaneously active input signals, which determines both the nature of the mixing and, if server-mediated side tone is being avoided (i.e., the server avoids echoing each client's own media stream back to that client), the number of unique output signals.
More precisely, in existing conferencing systems of this type, a sending endpoint may elect to send layers up to its full capability. At times it may, either on its own initiative or as directed by a central server or other system component, send a reduced set of layers and/or a signal with a reduced degree of continuity. In general, such a reduction of upstream transmission of the capture would be associated with an endpoint not being particularly active or important in a given conference. Furthermore, the server may accept a large set of incoming streams with different sets of layers. Generally, there will be more information in the incoming layers (in terms of functional and spatial layers) than the server would combine or forward on to other output endpoints. Therefore, it is a general design aim that the server can manage this varying set of input layers from a set of devices, as well as the varying mixing strategy and actions required to strip, combine, mix and/or forward the media streams represented in the layered format. Generally, the output format for a given endpoint will be set by the device capability or user selection (e.g., use of headphones or speakers). The media data are sent out to each endpoint in some format which may range from the forwarded component media streams through to the actual device audio signals, with associated metadata, such that the endpoint can reconstruct the desired audio scene. In this way, at any point in time, the count and format of the layered audio media streams sent to the output client can change dynamically and are decided by the server against some criteria that may be imposed by any given endpoint.
In this setting, each mixing configuration—the combination of the number and formats of the input signals and the number and formats of the output signals—may be realized by a series of operations including unpacking and packing of media data (e.g., converting between transport formats and internal formats), operations on data values (e.g., applying gain/equalizing, adding signals together, removing reverb, gating, adding comfort noise, applying virtualization based on head-related transfer functions), conversions between different layered formats, different specific standard or proprietary coding formats, memory management etc. Different implementations of a same mixing configuration may differ in performance, which the programmer may improve by trying to explicitly and predictively locate and eliminate redundant instructions, reusing intermediate results, evaluating different orderings of the operations and changing the point of operation between the server and client or other networked computational resource (e.g., a slave server which mixes streams on behalf of the master server). As an example, the task of inputting several signals in different input formats A and B, and outputting a mix of these in an output format C can be achieved by each of the following tactics: conversion into C followed by C-mixing; conversion into B followed by B-mixing followed by conversion into C; conversion into A followed by A-mixing followed by conversion into C; separate A-mixing and B-mixing followed by conversion of one of the partial mixes into B followed by B-mixing of the partial mixes followed by conversion into C, etc. While performance can typically be measured or predicted for a concrete implementation (e.g., by a clock cycle count), it is not clear from the outset which tactic will be the most promising one and it may be a tedious task to explore all important candidates. 
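The comparison of tactics described above can be illustrated with a toy cost model. The following is a minimal sketch only: the per-signal conversion and mixing costs, the function names and the tactic labels are all hypothetical and chosen for illustration; in practice such figures would have to be measured or predicted for a concrete implementation (e.g., as clock cycle counts).

```python
# Hypothetical costs in arbitrary units (e.g., clock cycles); not taken from any real system.
CONVERT_COST = {("A", "B"): 5, ("A", "C"): 9, ("B", "A"): 6, ("B", "C"): 4}
MIX_COST = {"A": 1, "B": 2, "C": 3}  # cost of adding one more signal into a running mix


def convert(src, dst):
    """Cost of converting one signal from format src to dst (zero if already in dst)."""
    return 0 if src == dst else CONVERT_COST[(src, dst)]


def mix(fmt, count):
    """Cost of mixing `count` signals that all share format `fmt`."""
    return MIX_COST[fmt] * max(count - 1, 0)


def tactic_costs(n_a, n_b):
    """Cost of four tactics for mixing n_a A-signals and n_b B-signals into format C.

    Assumes at least one signal of each input format.
    """
    total = n_a + n_b
    return {
        "convert all to C, mix in C":
            n_a * convert("A", "C") + n_b * convert("B", "C") + mix("C", total),
        "convert A to B, mix in B, convert to C":
            n_a * convert("A", "B") + mix("B", total) + convert("B", "C"),
        "convert B to A, mix in A, convert to C":
            n_b * convert("B", "A") + mix("A", total) + convert("A", "C"),
        "partial mixes in A and B, merge in B, convert to C":
            mix("A", n_a) + mix("B", n_b) + convert("A", "B") + mix("B", 2) + convert("B", "C"),
    }


if __name__ == "__main__":
    costs = tactic_costs(n_a=3, n_b=2)
    for name, cost in sorted(costs.items(), key=lambda item: item[1]):
        print(f"{cost:4d}  {name}")
```

Under these assumed costs the tactic using partial mixes happens to be cheapest for three A-signals and two B-signals, but a different cost table or a different number of inputs can change the ranking, which is why the most promising tactic is not clear from the outset.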
Additionally, where there are many simultaneous users and endpoints participating in a related conference, the re-use of these intermediate formats, manipulations, sub-mixes and conversions can be optimized across a large set of desired output mixes.
A routine approach to the problem outlined above has been to consider each mixing configuration separately and have one or more programmers implement it ‘as optimally’ as the circumstances permit, after which the result is stored as computer-readable code ready for execution by the conferencing server. Such code will generally be written to include a large set of tests and conditional branches which are constructed to achieve the desired outcome with some sense of efficiency, ordering and scalability. The conferencing server may sense the number and formats of the active input signals and respond by determining a relevant code portion (or script), loading the code portion from memory and executing this in a media-enabled processor. In a rule-based conferencing server of this type, the steps of sensing, determining and loading can typically be made very fast, so that it matches the sudden speaker changes typical of a human conversation. As already noted, however, the number of mixing configurations to implement may be very large, which has a direct impact on the costs in the design phase or a later re-design phase unless some mixing configurations are dropped. For instance, if the designer accepts deviations from the desirable aim of not processing silent (or inactive) input signals, it may be sufficient to implement only the relatively more versatile mixing configurations and omit more specialized ones. For instance, a routine implementing a mixing configuration with four inputs can be utilized for mixing three active signals if the routine is additionally fed with a fourth signal, either the signal having been deemed inactive or a dummy signal with placeholder values.
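The rule-based selection and the dummy-signal workaround described above might look roughly as follows. This is a simplified sketch under stated assumptions: the routine table `ROUTINES`, the helper names and the choice to key routines by input count only (ignoring input formats) are hypothetical, and the "mixing" is reduced to a plain sample-wise sum.

```python
def make_routine(arity):
    """Build a hypothetical pre-implemented routine that mixes exactly `arity` inputs."""
    def routine(signals):
        if len(signals) != arity:
            raise ValueError(f"routine expects {arity} inputs, got {len(signals)}")
        # Sample-wise sum stands in for a hand-optimized mixing implementation.
        return [sum(samples) for samples in zip(*signals)]
    return routine


# Only the more versatile configurations are implemented; 3-input mixing is omitted.
ROUTINES = {2: make_routine(2), 4: make_routine(4)}


def select_and_mix(active, frame_len):
    """Pick the smallest implemented routine that fits and pad with dummy signals.

    Returns the arity of the chosen routine together with the mixed frame.
    """
    arity = min((n for n in ROUTINES if n >= len(active)), default=None)
    if arity is None:
        raise LookupError("no implemented mixing configuration fits the active inputs")
    dummy = [0.0] * frame_len  # placeholder values standing in for an inactive input
    padded = list(active) + [dummy] * (arity - len(active))
    return arity, ROUTINES[arity](padded)


if __name__ == "__main__":
    three_active = [[1.0, 2.0], [0.5, 0.5], [0.25, 0.25]]
    print(select_and_mix(three_active, frame_len=2))
```

In this sketch three active signals are served by the four-input routine, so the server spends cycles on a silent dummy input; that is the deviation from the aim of not processing inactive signals that the designer accepts in exchange for implementing fewer configurations.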
As the number of potential formats and mixing possibilities increases, there is a geometric increase in the number of potential routes that could be followed. Whilst there is often a limited number of high-impact optimizations, as the complexity grows with formats and mixing strategies, code to reliably find and optimize the underlying operations becomes difficult to manage and validate. When such systems develop iteratively, additional code to achieve a desired optimization can have an unexpected, undesirable impact on another aspect or operating condition. In the context of this invention, at the point where multiple layered formats covering spatial and functional audio properties were combined with a desire for novel and dynamic mixing strategies, the system was no longer feasible to manage as static conditional code.
Rather than repeating the trade-off between the conflicting requirements outlined above, it would be an attractive option to approach the real-time mixing problem encountered in conferencing servers from a different direction. This is a purpose of the present invention.
All the figures are schematic and generally only show parts which are necessary in order to elucidate the invention, whereas other parts may be omitted or merely suggested. Unless otherwise indicated, like reference numerals refer to like elements in different figures.