One of the deficiencies of current multi-party voice conferences is that the voices are typically all rendered to the listeners as a monaural audio stream, i.e. essentially overlaid on top of each other and, when headphones are used, usually presented to the listeners “within the head”. Spatialisation techniques, used e.g. to simulate different people talking from different rendered locations, can improve the intelligibility of speech in a voice conference, in particular when multiple people are speaking. The present document addresses the technical problem of designing appropriate two-dimensional (2D) or three-dimensional (3D) scenes for an audio conference which allow a listener to easily distinguish the different talkers of the audio conference. Furthermore, schemes for populating a 2D or 3D conference scene with participants are described, which reduce the impact on an ongoing audio conference when new participants are added to the conference scene. In addition, appropriate schemes for emphasizing dominant talkers within a conference scene are described.
According to an aspect, a conference controller configured to place a plurality of upstream audio signals associated with a plurality of conference participants within a 2D or 3D conference scene is described. The conference scene is to be rendered to a listener. Typically, the listener is positioned at a central position of the conference scene (e.g. at the center of a circle or a sphere, if the conference scene is modeled as a circle or a sphere). The plurality of upstream audio signals may be audio signals generated at the terminals (e.g. computing devices or telephone devices) of the corresponding plurality of conference participants. As such, the plurality of upstream audio signals typically comprises the speech signals of the plurality of conference participants. For this reason, the upstream audio signals may also be referred to as talker audio signals. The conference controller may be positioned (at a central position) within a communication network (e.g. in a so-called centralized conference architecture) and/or the conference controller may be positioned at a terminal of a conference participant (e.g. in a so-called distributed conference architecture). The conference controller may also be referred to as a scene manager, in the instance of using a 2D or 3D rendering system. The conference controller may be implemented using a computing device (e.g. a server).
The conference controller may be configured to set up an X-point conference scene with X different spatial talker locations within the conference scene, X being an integer, X>0 (e.g. X>1, in particular X=1, 2, 3, 4, 5, 6, 7, 8 or 10). In this context, the conference controller may be configured to calculate the X-point conference scene with X different spatial talker locations based on one or more of the conference scene design rules described in the present document. One such design rule may e.g. be that the X talker locations are positioned within a cone around a midline in front of the head of a listener. Other design rules may relate to an angular separation of the X talker locations. Alternatively or in addition, the conference controller may be configured to select the X-point conference scene with the X different spatial talker locations from a set of pre-determined conference scenes comprising pre-determined talker locations. By way of example, the set may comprise one or more pre-determined X-point conference scenes with X different pre-determined spatial talker locations. As such, the X-point conference scene may be a pre-determined X-point conference scene with X pre-determined talker locations.
The conference controller may be configured to set up different conference scenes (e.g. different X-point conference scenes with differently placed talker locations and/or conference scenes with different values of X). The X talker locations of the X-point conference scene may be positioned within a cone around a midline in front of the head of the listener. The midline may be an imaginary line starting at a mid point on an imaginary line between the ears of the listener and extending perpendicularly to the imaginary line between the ears of the listener in front of the head of the listener. A generatrix of the cone and the midline may form an (absolute) angle which is smaller than or equal to a pre-determined maximum cone angle. The maximum cone angle may be preferably 30°, or narrower such as 20°, or even 15°, depending on the population of the cone.
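By way of illustration only, the cone condition may be sketched as follows. The sketch assumes that a talker location is given by an azimuth angle and an inclination angle in degrees, that the inclination is measured as an elevation from the horizontal plane, and that the midline lies at azimuth 0° and inclination 0°. This angle convention and the names angle_from_midline and within_cone are assumptions made for the example and are not taken from the present document.

```python
import math


def angle_from_midline(azimuth_deg: float, inclination_deg: float = 0.0) -> float:
    """Absolute angle (degrees) between a talker direction and the midline.

    Assumed convention: the midline is the unit vector (1, 0, 0) and a talker
    direction is (cos(inc)*cos(az), cos(inc)*sin(az), sin(inc)), so the cosine
    of the angle between them is cos(inc) * cos(az).
    """
    az = math.radians(azimuth_deg)
    inc = math.radians(inclination_deg)
    cos_angle = math.cos(inc) * math.cos(az)
    # Clamp against rounding errors before calling acos.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))


def within_cone(azimuth_deg: float,
                inclination_deg: float = 0.0,
                max_cone_angle_deg: float = 30.0) -> bool:
    """True if the talker location lies within the cone around the midline."""
    # A small tolerance keeps locations exactly on the cone surface inside.
    return angle_from_midline(azimuth_deg, inclination_deg) <= max_cone_angle_deg + 1e-9
```

Under these assumptions, a location at azimuth 25° and inclination 25° lies outside a 30° cone, although each angle taken alone is below 30°.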
The conference controller may be configured to assign the plurality of upstream audio signals to respective ones of the X talker locations. By assigning the plurality of upstream audio signals to particular talker locations within the conference scene, the conference controller enables a rendering device (e.g. a terminal of the listener of the conference scene) to render the plurality of upstream audio signals as if the upstream audio signals emanate from the respective particular talker locations. For this purpose, the conference controller is configured to generate metadata identifying the assigned talker locations and enabling an audio processing unit (at a listener's terminal) to generate a spatialized audio signal based on the plurality of upstream audio signals. When rendering the spatialized audio signal to the listener, the listener perceives the plurality of upstream audio signals as coming from the assigned talker locations. The audio processing unit may be positioned within the terminal of the listener, or in the central audio server handling the audio streams. The spatialized audio signal may e.g. be a binaural audio signal which is rendered on headphones or loudspeakers at the terminal of the listener. Alternatively or in addition, the spatialized audio signal may be a multi-channel (surround sound) signal, e.g. a 5.1 or a 7.1 multi-channel signal.
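A minimal sketch of how such metadata might be structured is given below. The present document does not prescribe a metadata format; the JSON layout, the field names and the helper build_scene_metadata are hypothetical and serve only to illustrate the assignment of upstream audio signals to talker locations.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class TalkerPlacement:
    signal_id: str                 # hypothetical identifier of an upstream audio signal
    azimuth_deg: float             # azimuth of the assigned talker location
    inclination_deg: float = 0.0   # inclination of the assigned talker location


def build_scene_metadata(signal_ids: List[str],
                         talker_azimuths_deg: List[float]) -> str:
    """Assign each signal to one talker location and serialise the result."""
    if len(signal_ids) > len(talker_azimuths_deg):
        raise ValueError("more upstream audio signals than talker locations")
    placements = [TalkerPlacement(sid, az)
                  for sid, az in zip(signal_ids, talker_azimuths_deg)]
    return json.dumps({"placements": [asdict(p) for p in placements]})
```

An audio processing unit could parse such a record to decide from which direction each upstream audio signal is to be rendered.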
The X talker locations may be placed on a circle or a sphere, with the listener being placed at the center of the circle or sphere. Alternative conference scenes may comprise talker locations which are placed on an ellipse or ellipsoid. The listener does not necessarily need to be placed in the center. By way of example, in order to simulate a meeting around a table, wherein the meeting comprises the conference participants and the listener, the listener may be placed at an edge of the geometrical shape forming the conference scene, e.g. at an edge of the circle or sphere, or the ellipse or ellipsoid. In the latter case (as well as in the case where the listener is placed in the center of an ellipse or ellipsoid), the distance between the X talker locations and the listener would be different depending on the talker location.
Two adjacent talker locations of the X talker locations may be separated by at least a minimum angular distance. The minimum angular distance may be 5° or more. The above mentioned condition may be fulfilled by all pairs of adjacent talker locations of the X talker locations. The minimum angular distance allows the listener to clearly distinguish upstream audio signals which are rendered from the different talker locations. The angular distance between adjacent talker locations of the X talker locations may differ for different talker locations. By way of example, the angular distance between adjacent talker locations of the X talker locations may increase with increasing distance of the adjacent talker locations from the midline. By doing this, the varying capability of a listener to distinguish the source of sounds coming from different angles may be taken into account.
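The following sketch illustrates one possible way of computing talker azimuths whose angular separation grows with the distance from the midline. The geometric growth of the gaps, the parameter names and the function talker_azimuths are assumptions made for the example; the present document only requires a minimum angular distance and permits larger distances away from the midline.

```python
from typing import List


def talker_azimuths(x: int,
                    min_sep_deg: float = 5.0,
                    growth: float = 1.5,
                    max_cone_angle_deg: float = 30.0) -> List[float]:
    """Return x azimuths (degrees), symmetric about the midline at 0 degrees.

    Adjacent locations are separated by at least min_sep_deg, and each gap is
    multiplied by `growth` for every step it lies away from the central gap.
    """
    if x < 1:
        raise ValueError("x must be a positive integer")
    if growth < 1.0:
        raise ValueError("growth must be >= 1 so that gaps never shrink")
    centre_gap = (x - 2) / 2.0  # (fractional) index of the gap nearest the midline
    gaps = [min_sep_deg * growth ** abs(i - centre_gap) for i in range(x - 1)]
    half_span = sum(gaps) / 2.0
    if half_span > max_cone_angle_deg:
        raise ValueError("talker locations do not fit within the cone")
    azimuths = [0.0 - half_span]
    for gap in gaps:
        azimuths.append(azimuths[-1] + gap)
    return azimuths
```

For example, talker_azimuths(4, 5.0, 1.5) yields the azimuths -10°, -2.5°, 2.5° and 10°, i.e. a gap of 5° across the midline and gaps of 7.5° further out.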
The conference controller may be configured to determine a degree of activity of the plurality of upstream audio signals at a time instant. The degree of activity of an upstream audio signal at the time instant may be determined by determining an energy (e.g. a mean squared energy value of the samples) of the upstream audio signal at the time instant. Furthermore, the conference controller may be configured to determine a dominant one of the plurality of upstream audio signals at the time instant based on the degrees of activity of the plurality of upstream audio signals at the time instant. A dominant one of the plurality of upstream audio signals may be determined by determining an upstream audio signal having the highest degree of activity at the time instant. The dominant upstream audio signal may fulfill the criterion that a ratio of the degree of activity of the dominant upstream audio signal to the degree of activity of another upstream audio signal exceeds a pre-determined threshold. In particular, the dominant upstream audio signal may fulfill the criterion that the ratios of the degree of activity of the dominant upstream audio signal to the degrees of activity of all other upstream audio signals exceed the pre-determined threshold.
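The determination of a dominant upstream audio signal may be sketched as follows. The sketch assumes that each upstream audio signal is available as a short frame of samples around the time instant; the function names, the dictionary layout and the default threshold of 2.0 are assumptions made for illustration only.

```python
from typing import Dict, Optional, Sequence


def mean_square_energy(samples: Sequence[float]) -> float:
    """Degree of activity: mean squared value of the samples in the frame."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def dominant_signal(frames: Dict[str, Sequence[float]],
                    ratio_threshold: float = 2.0) -> Optional[str]:
    """Return the id of the dominant signal, or None if no signal dominates.

    A signal is dominant if it has the highest energy and the ratio of its
    energy to the energy of every other signal exceeds ratio_threshold.
    """
    energies = {sid: mean_square_energy(frame) for sid, frame in frames.items()}
    if not energies:
        return None
    candidate = max(energies, key=energies.get)
    top = energies[candidate]
    if top <= 0.0:
        return None  # all signals are silent
    for sid, energy in energies.items():
        # Written without a division so that silent signals need no special case.
        if sid != candidate and top <= ratio_threshold * energy:
            return None
    return candidate
```

With this sketch, a talker at amplitude 0.5 dominates a talker at amplitude 0.1, but is not reported as dominant once a third talker at amplitude 0.4 is active, since the energy ratio 0.25/0.16 does not exceed the threshold.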
The conference controller may be configured to emphasize the dominant upstream audio signal at the time instant, thereby enabling the listener of the conference scene to focus on the dominant upstream signal (i.e. on the dominant talker within the audio conference). For this purpose, the conference controller may be configured to generate (or initiate the generation of) a set of downstream audio signals, as well as metadata. The set of downstream audio signals may comprise the dominant upstream audio signal, as well as some or all of the other upstream audio signals. The metadata may identify the talker locations of the plurality of upstream audio signals. The set of downstream audio signals and the metadata may enable the audio processing unit at the listener's terminal to generate a spatialized audio signal, such that when rendering the spatialized audio signal to the listener, the listener perceives the dominant upstream audio signal in an emphasized manner.
The conference controller may be configured to assign the dominant upstream audio signal to a first of the X talker locations. In this case, the dominant upstream audio signal may be emphasized at the time instant by re-assigning the dominant upstream audio signal to a center location within the 2D or 3D conference scene. The center location may be closer to the midline in front of the head of the listener than the first talker location. In other words, the conference controller may be configured to emphasize the dominant upstream audio signal by moving the spatial location of the dominant upstream audio signal to the center of the conference scene. The center location may lie between the two talker locations closest to the midline. Alternatively, the center location may correspond to the talker location closest to the midline. In this case, the conference controller may be configured to re-assign an upstream audio signal already assigned to the talker location closest to the midline to another talker location within the conference scene.
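A minimal sketch of the re-assignment to the talker location closest to the midline is given below. It assumes that the current assignment is held as a mapping from a signal identifier to an azimuth in degrees, with the midline at 0°, and that the displaced upstream audio signal takes over the former location of the dominant upstream audio signal. This swap is only one possible choice of the "other talker location", and the name emphasize_by_centering is hypothetical.

```python
from typing import Dict


def emphasize_by_centering(assignment: Dict[str, float],
                           dominant: str) -> Dict[str, float]:
    """Swap the dominant signal with the signal nearest the midline.

    `assignment` maps a signal id to its azimuth in degrees; a new mapping is
    returned and the input is left unchanged.
    """
    if dominant not in assignment:
        raise KeyError(f"unknown upstream audio signal: {dominant}")
    updated = dict(assignment)
    # Signal currently occupying the talker location closest to the midline.
    central = min(updated, key=lambda sid: abs(updated[sid]))
    updated[dominant], updated[central] = updated[central], updated[dominant]
    return updated
```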
The conference controller may be configured to emphasize the dominant upstream audio signal at the time instant by increasing a rendering volume of the dominant upstream audio signal at the time instant. Alternatively or in addition, the conference controller may be configured to emphasize the dominant upstream audio signal at the time instant by moving the first talker location (i.e. the talker location assigned to the dominant upstream audio signal) closer to the listener. This may be achieved by modifying the reverberation parameters, in particular the ratio of the direct and reverberant components of the processed signal, of the conference scene for the dominant upstream audio signal.
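The relation between a target direct-to-reverberant ratio and the gains applied to the two components may be illustrated as follows. The sketch assumes that the direct and reverberant components have equal energy before the gains are applied and that the overall energy is to be kept constant; both assumptions, as well as the function name direct_reverb_gains, are introduced for the example only.

```python
import math
from typing import Tuple


def direct_reverb_gains(drr_db: float) -> Tuple[float, float]:
    """Gains (direct, reverberant) realising a direct-to-reverberant ratio.

    The gains satisfy g_direct**2 / g_reverb**2 == 10**(drr_db / 10) and
    g_direct**2 + g_reverb**2 == 1 (constant overall energy).
    """
    ratio = 10.0 ** (drr_db / 10.0)
    g_direct = math.sqrt(ratio / (1.0 + ratio))
    g_reverb = math.sqrt(1.0 / (1.0 + ratio))
    return g_direct, g_reverb
```

Raising drr_db for the dominant upstream audio signal increases the direct gain and lowers the reverberant gain, which is typically perceived as the talker moving closer to the listener.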
The conference controller may be configured to emphasize the dominant upstream audio signal at the time instant by rotating the 2D or 3D conference scene around the head of the listener. Rotating the conference scene typically comprises rotating the talker locations to yield updated talker locations. Subsequent to rotating the conference scene, the upstream audio signals may be placed at one or more respective updated talker locations. The conference controller may be configured to rotate the conference scene such that the updated talker location of the dominant upstream signal is the updated talker location closest to a midline in front of the head of the listener. By rotating the conference scene in such a manner, the rotation of the head of the listener towards the dominant talker is simulated.
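For a 2D conference scene described by azimuth angles, such a rotation may be sketched as follows. The sketch assumes that the midline lies at 0° and rotates the scene so that the dominant talker location is placed exactly on the midline, which is one way of making it the location closest to the midline; the function names are hypothetical.

```python
from typing import List


def wrap_deg(angle: float) -> float:
    """Wrap an angle in degrees into the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def rotate_scene(azimuths: List[float], dominant_index: int) -> List[float]:
    """Rotate all talker locations by the same angle so that the dominant
    talker location ends up on the midline (azimuth 0 degrees)."""
    offset = azimuths[dominant_index]
    return [wrap_deg(a - offset) for a in azimuths]
```

For example, rotating the scene [-20°, 0°, 20°, 40°] towards the talker at 40° yields the updated talker locations [-60°, -40°, -20°, 0°].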
The conference controller may be configured to emphasize the dominant upstream audio signal at the time instant by modifying a height of the first talker location relative to the others of the X spatial talker locations. In particular, the conference controller may be configured to increase the height of the first talker location relative to the others of the X spatial talker locations. As outlined in the present document, the X talker locations may be defined using respective azimuth angles and inclination angles. As such, the conference controller may be configured to modify (e.g. increase) an inclination angle of the first talker location relative to the inclination angles of the others of the X spatial talker locations.
The conference controller may be configured to rotate the conference scene such that all updated talker locations are positioned within the above mentioned cone around the midline. Therefore, the amount of rotation within the scene will be different for each of the upstream audio signals. For this purpose, the conference controller may be configured to reduce an angular distance between adjacent talker locations, in order to determine the updated talker locations.
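One possible way of keeping all updated talker locations within the cone is sketched below. The sketch assumes azimuth angles in degrees with the midline at 0° and original talker locations within the frontal cone, so that no angle wrapping is required. The uniform scaling of the rotated azimuths and the name rotate_into_cone are assumptions made for illustration and are not the only way of reducing the angular distance between adjacent talker locations.

```python
from typing import List


def rotate_into_cone(azimuths: List[float],
                     dominant_index: int,
                     max_cone_angle_deg: float = 30.0) -> List[float]:
    """Move the dominant talker onto the midline and keep all talkers in the cone.

    After shifting the scene, all azimuths are scaled by a common factor if any
    of them would leave the cone; this reduces the spacing between adjacent
    talker locations and makes the effective rotation differ per talker.
    """
    offset = azimuths[dominant_index]
    rotated = [a - offset for a in azimuths]
    widest = max(abs(a) for a in rotated)
    if widest <= max_cone_angle_deg:
        return rotated
    scale = max_cone_angle_deg / widest
    return [a * scale for a in rotated]
```

For the scene [-30°, -10°, 10°, 30°] with the dominant talker at 30°, the sketch yields [-30°, -20°, -10°, 0°]: the individual talker locations are moved by 0°, 10°, 20° and 30°, respectively.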
Typically, the conference controller is configured to repeat the determination of a dominant upstream audio signal for a plurality of succeeding time instants. At each time instant the dominant upstream audio signal may be determined. If the dominant upstream audio signal remains unchanged, the emphasis of the current dominant upstream audio signal may be maintained. On the other hand, if a new dominant upstream audio signal is determined, the former dominant upstream audio signal may be de-emphasized (by removing or reversing any of the above mentioned emphasizing schemes) and the new dominant upstream audio signal may be emphasized (according to any of the above mentioned emphasizing schemes). As such, the conference controller may be configured to determine a different new dominant one of the plurality of upstream audio signals at a second time instant after the time instant. In such a situation, the former dominant upstream audio signal may be de-emphasized at the second time instant, and the new dominant upstream audio signal may be emphasized at the second time instant.
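The repeated determination over succeeding time instants may be sketched as a small state holder. The class name DominanceTracker is hypothetical, and the behavior when no dominant upstream audio signal is found at a time instant (here: the current emphasis is maintained) is an assumption, since the present document does not specify this case.

```python
from typing import Optional, Tuple


class DominanceTracker:
    """Remembers the currently emphasized signal across time instants."""

    def __init__(self) -> None:
        self.current: Optional[str] = None

    def update(self, new_dominant: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
        """Return (signal to de-emphasize, signal to emphasize).

        Both entries are None if the emphasis is to be left unchanged.
        """
        if new_dominant is None or new_dominant == self.current:
            return None, None
        previous, self.current = self.current, new_dominant
        return previous, new_dominant
```

The returned pair may then be passed to whichever emphasizing scheme is in use, e.g. to reverse the re-assignment of the former dominant upstream audio signal and to apply it to the new one.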
The conference controller may be configured to classify the X spatial talker locations into a plurality of clusters, wherein a first of the plurality of clusters comprises at least two spatial talker locations. The spatial talker locations comprised within the first cluster may be directly adjacent. The clustering of spatial talker locations may be used to group the plurality of upstream audio signals according to clusters (e.g. according to departments or functions of a company). The conference controller may be configured to classify the X spatial talker locations into a plurality of clusters dependent upon classification metadata. The classification metadata may comprise an identifier associated with an electronic means of communication of a conference participant. The identifier may comprise an electronic mail address of a conference participant. The classification metadata may comprise an identifier associated with a physical location of a conference participant. The identifier may be encoded using dual-tone multi-frequency (DTMF) signaling. One or more of the plurality of upstream audio signals may comprise the classification metadata. The conference controller may be configured to extract the classification metadata from one or more of the plurality of upstream audio signals. The conference controller may be configured to facilitate input of the classification metadata by a conference participant.
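By way of illustration, a clustering based on electronic mail addresses may be sketched as follows. The sketch assumes that the domain part of the address serves as the cluster key and that clusters are placed on directly adjacent talker locations in order of first appearance; both choices and the function names are assumptions made for the example.

```python
from typing import Dict, List


def cluster_by_email_domain(emails: List[str]) -> Dict[str, List[str]]:
    """Group participants by the domain part of their e-mail address."""
    clusters: Dict[str, List[str]] = {}
    for email in emails:
        domain = email.rsplit("@", 1)[-1].lower()
        clusters.setdefault(domain, []).append(email)
    return clusters


def assign_clusters_to_locations(clusters: Dict[str, List[str]],
                                 azimuths: List[float]) -> Dict[str, float]:
    """Place the members of each cluster on directly adjacent talker locations."""
    ordered = [member for members in clusters.values() for member in members]
    if len(ordered) > len(azimuths):
        raise ValueError("more participants than talker locations")
    # Sorting the azimuths makes consecutive entries spatially adjacent.
    return dict(zip(ordered, sorted(azimuths)))
```

With the participants ann@sales.example, bob@eng.example and cy@sales.example and the talker locations -10°, 0° and 10°, the two participants of the sales cluster are placed at -10° and 0°, and the remaining participant at 10°.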
According to another aspect, an audio conferencing system is described. The audio conferencing system comprises a plurality of talker terminals configured to generate a plurality of upstream audio signals associated with a plurality of conference participants, respectively (e.g. using microphones at the talker terminals). Furthermore, the audio conferencing system comprises a conference controller according to any of the aspects described in the present document. The conference controller is configured to assign the plurality of upstream audio signals to respective talker locations within a 2D or 3D conference scene, and to determine and to emphasize a dominant one of the plurality of upstream audio signals. In addition, the audio conferencing system comprises a listener terminal configured to render the dominant upstream audio signal to a listener, such that the listener perceives the dominant upstream audio signal in an emphasized manner.
According to a further aspect, a method for placing a plurality of upstream audio signals associated with a plurality of conference participants within a 2D or 3D conference scene to be rendered to a listener is described. The method comprises setting up an X-point conference scene with X different spatial talker locations within the conference scene, X being an integer, X>0. Furthermore, the method comprises assigning the plurality of upstream audio signals to respective (different) ones of the talker locations. The method further comprises determining a degree of activity of the plurality of upstream audio signals at a time instant, and determining a dominant one of the plurality of upstream audio signals at the time instant based on the degrees of activity of the plurality of upstream audio signals at the time instant. Furthermore, the method comprises emphasizing the dominant upstream audio signal at the time instant.
According to a further aspect, a software program is described. The software program may be adapted for execution on a processor and for performing the method steps outlined in the present document when carried out on the processor.
According to another aspect, a storage medium is described. The storage medium may comprise a software program adapted for execution on a processor and for performing the method steps outlined in the present document when carried out on a computing device.
According to a further aspect, a computer program product is described. The computer program may comprise executable instructions for performing the method steps outlined in the present document when executed on a computer.
It should be noted that the methods and systems, including their preferred embodiments, as outlined in the present patent application may be used stand-alone or in combination with the other methods and systems disclosed in this document. Furthermore, all aspects of the methods and systems outlined in the present patent application may be arbitrarily combined. In particular, the features of the claims may be combined with one another in an arbitrary manner.