Audio conference systems allow a plurality of parties at a plurality of different terminals to communicate with one another. The plurality of terminals (which are also referred to as endpoints) may have different capabilities. By way of example, one or more terminals may be monophonic endpoints which capture a single mono audio stream. Examples of such monophonic endpoints are a traditional telephone, a device with a headset and a boom microphone, or a laptop computer with an in-built microphone. On the other hand, one or more terminals may be soundfield endpoints which capture a multi-channel representation of the soundfield incident at a microphone array. An example of a soundfield endpoint is a conferencing telephone equipped with a soundfield microphone (e.g. an array of microphones).
The present document sets out a general framework and several embodiments for achieving a plausible and consistent spatial conference experience in use cases where there are multiple endpoints or sources, in particular endpoints with spatial audio capture. It has been observed that too many active soundfields can create undesirable noise and spatial scene background complexity. The present document proposes several approaches to achieving a sense of presence and immersion whilst avoiding an unnatural and densely layered soundfield. The goal of the mixing schemes described in the present document is to establish what is termed ‘perceptual continuity’, i.e. a user experience of a reasonably consistent conference in which transitions and unnatural shifts in the voice activity and spatial soundfield are reduced (and possibly minimized).
Specifically, the present document provides several schemes for achieving the above-stated goals. One approach to presenting a mixed or reduced soundfield is based on the selection of, and transitions between, a limited number of component soundfields at any point in time. Using the methods described in the present document, a sense of spatial presence may be maintained, e.g. by sustaining the mix output of a single soundfield related to the endpoint which has been most recently active and significant in the conference activity, even at moments where there is no significant conference activity.
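By way of illustration only, the selection of a limited number of component soundfields, and the sustaining of the most recently active soundfield during periods without significant activity, may be sketched as follows. The class name `SoundfieldSelector`, the parameters `max_active` and `threshold`, and the per-endpoint activity levels in the range 0 to 1 are assumptions made for this sketch and are not taken from the described embodiments. Transitions (e.g. cross-fades) between selected soundfields are not modelled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SoundfieldSelector:
    """Hypothetical helper: picks which endpoint soundfields enter the mix."""

    max_active: int = 1                # assumed limit on component soundfields
    last_active: Optional[str] = None  # endpoint whose soundfield is sustained

    def select(self, activity: Dict[str, float], threshold: float = 0.5) -> List[str]:
        # Endpoints whose activity level is deemed significant (assumed
        # criterion: level >= threshold), ordered from most to least active.
        significant = sorted(
            (ep for ep, level in activity.items() if level >= threshold),
            key=lambda ep: activity[ep],
            reverse=True,
        )
        chosen = significant[: self.max_active]
        if chosen:
            # Remember the most active endpoint for later quiet periods.
            self.last_active = chosen[0]
            return chosen
        # No significant activity: sustain the soundfield of the endpoint
        # that was most recently active, if there has been one.
        return [self.last_active] if self.last_active is not None else []


selector = SoundfieldSelector(max_active=1)
for frame in (
    {"A": 0.9, "B": 0.2},  # endpoint A is active
    {"A": 0.1, "B": 0.1},  # quiet period
    {"A": 0.1, "B": 0.8},  # endpoint B becomes active
):
    print(selector.select(frame))
```

In this sketch the quiet frame still yields the soundfield of endpoint A, so that the rendered scene does not collapse to silence between talk spurts. A practical scheme would likely also smooth the activity levels over time before applying the threshold.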