Field of the invention
The present disclosure relates generally to a communication system for audio conferencing (also called teleconferencing or voice conferencing). More specifically, embodiments of the present disclosure relate to audio capture and render devices that include a display of the state of the conference and a user interface for manipulating the audio and the display, for use in audio or video conferencing systems, and to methods for use in endpoints for audio or video conferencing.
Background
A communication system operating as an audio or voice conferencing system allows a possibly large number of participants (also commonly called communicants and conferees, with these terms used interchangeably herein) to communicate by voice simultaneously. The term voice conferencing system is also used to denote the voice portion of a video conferencing system such as a telepresence system. The term “conference system” is to be understood to mean an audio communication system in which a communicant receives at a node (an endpoint) audio captured at other endpoints and sends audio captured at the node to other endpoints. A conference system may be the audio portion of a video conferencing system, unless otherwise noted.
Communicants may join a conference via their respective endpoints. The endpoints are generally provided with one or more microphones for audio input and one or more loudspeakers or headphones for audio output. The endpoints may access the conference system via a communication link, such a link including one or more of: network connections, wired telecommunication connections, wireless connections, and so forth. The phrase “endpoints coupled by a network” includes these possibilities, and also includes direct connection.
Audio capture and render devices are known for use in audio-only conference endpoints and in video conference endpoints. In the remainder of this disclosure, the term “endpoint” will be used for such an audio capture and render device, typically acting as a client in a client-server system architecture and/or as a peer in a peer-to-peer system architecture. An endpoint may also include video capture and display capabilities.
Some conferencing systems include processing to make use of spatial properties of audio at an endpoint, called a soundfield-capturing endpoint, that has spatial processing capabilities. As an example, each remote communicant may be given different spatial properties at the soundfield-capturing endpoint. Such soundfield-capturing endpoints are known to help create a sense of presence for a listening communicant, including providing help in differentiating between different speaking communicants, and may provide for more than one remote communicant to be speaking at the same time and still be heard. Leveling signal processing may be used to provide the same perceived loudness level to the remote speaking communicants.
A soundfield-capturing endpoint includes a microphone array and processing thereof, and may include a plurality of loudspeakers, e.g., in headphones or as two or more spatially separated loudspeakers not in headphones, together with a rendering engine to render audio data spatially to provide a listener with a sense of space.
By an auditory scene (or simply scene if the context is clear) is meant a representation of discrete sound objects present in an acoustical environment, e.g., in a room, whether a real environment, or an artificial environment, e.g., a virtual room. A sound object is also commonly called an auditory object and an audio object, and these terms are used interchangeably in the present disclosure. The acoustical environment of a typical conference room may be represented as an auditory scene of a set of sound objects, including the communicants, and so-called nuisance objects, such as background sounds, whether background human speech, background music, or other background noise that may interfere with any of the communicants. Each object has associated properties, such as, in the case of a communicant, the location or direction of the communicant, whether or not the communicant is speaking, the loudness of the speech, noise parameters, and so forth.
It is also known to include a microphone array in a soundfield-capturing endpoint to capture a soundfield. It is also known to carry out auditory scene analysis (ASA) on the captured soundfield to determine parameters of the local auditory scene, such as detecting and distinguishing the sound objects that are part of the scene, and determining one or more object parameters such as one or more of: whether a communicant sound object is speaking, the loudness, the location, the reverberation of the sound from the sound object, the harmonicity, the noise level, gains that are usable for leveling, and, for one or more of these parameters, the levels of confidence in the measure.
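As a purely illustrative sketch, and not a description of any particular known system, an auditory scene of the kind described above might be represented in software as a set of sound objects, each carrying its associated properties. All names below (ObjectKind, SoundObject, AuditoryScene, active_talkers) and the choice of units are hypothetical and are introduced only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ObjectKind(Enum):
    """Hypothetical classification of a sound object in the scene."""
    COMMUNICANT = "communicant"
    NUISANCE = "nuisance"  # e.g., background speech, music, or other noise


@dataclass
class SoundObject:
    """One sound object and its associated properties (all assumed fields)."""
    kind: ObjectKind
    direction_deg: Optional[float] = None   # direction of the object, if known
    is_speaking: bool = False               # meaningful for communicants
    loudness_db: Optional[float] = None     # loudness of the speech or sound
    noise_level_db: Optional[float] = None  # an example noise parameter


@dataclass
class AuditoryScene:
    """A representation of the discrete sound objects in an environment."""
    objects: List[SoundObject] = field(default_factory=list)

    def active_talkers(self) -> List[SoundObject]:
        """Return the communicant objects that are currently speaking."""
        return [
            obj for obj in self.objects
            if obj.kind is ObjectKind.COMMUNICANT and obj.is_speaking
        ]


# Example: a room with two communicants and one nuisance object.
scene = AuditoryScene(objects=[
    SoundObject(ObjectKind.COMMUNICANT, direction_deg=30.0, is_speaking=True),
    SoundObject(ObjectKind.COMMUNICANT, direction_deg=200.0),
    SoundObject(ObjectKind.NUISANCE, direction_deg=120.0, noise_level_db=-45.0),
])
```

In this sketch, the nuisance object is kept in the same collection as the communicants, so that a single scene description covers everything present in the acoustical environment.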
It is also known to encode these parameters, to encode captured audio, and to send the parameters with the audio to a remote endpoint, e.g., via a conference server.
It is also known to include a spatial sound renderer coupled to a plurality of loudspeakers or a set of headphones so that different sound objects are perceived as sound sources emitting from respective discrete locations, either the same locations as captured at the remote endpoint, or synthetically located.
In business meetings in which audio signals (e.g., audio signals delivered by communication systems) indicative of communicant speech are reproduced, an important component of the audio processing of the signals is leveling of segments of the signals which are indicative of speech of different talkers. People speak at various levels in a meeting and it is typically necessary for an audio processing system to actively adjust the levels of different segments of an audio signal to ensure that the perceived loudness of each communicant's speech is consistent. How to carry out leveling is known in the art.
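By way of a simplified illustration only, leveling can be thought of as computing, for each segment of speech, a gain that moves the measured level of the segment toward a common target. The sketch below uses root-mean-square (RMS) level as a crude stand-in for perceived loudness, which is an assumption made for brevity; practical leveling typically relies on perceptual loudness measures. The function names, the target of -20 dB relative to full scale, and the 20 dB gain limit are all hypothetical.

```python
import math
from typing import List, Sequence


def rms_dbfs(samples: Sequence[float]) -> float:
    """RMS level in dB relative to full scale (samples assumed in [-1, 1])."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)


def leveling_gain_db(samples: Sequence[float],
                     target_dbfs: float = -20.0,
                     max_gain_db: float = 20.0) -> float:
    """Gain in dB that moves the segment toward the target level.

    The gain is limited to +/- max_gain_db so that very quiet segments
    (for example, near silence) are not boosted without bound.
    """
    level = rms_dbfs(samples)
    if level == float("-inf"):
        return 0.0  # silent segment: leave unchanged
    gain = target_dbfs - level
    return max(-max_gain_db, min(max_gain_db, gain))


def apply_gain(samples: Sequence[float], gain_db: float) -> List[float]:
    """Scale the samples by a gain expressed in dB."""
    factor = 10.0 ** (gain_db / 20.0)
    return [s * factor for s in samples]


# Example: a quiet talker and a loud talker are brought to the same level.
quiet_segment = [0.05, -0.05] * 50
loud_segment = [0.5, -0.5] * 50
leveled_quiet = apply_gain(quiet_segment, leveling_gain_db(quiet_segment))
leveled_loud = apply_gain(loud_segment, leveling_gain_db(loud_segment))
```

After leveling, both example segments have approximately the same RMS level, which is the behavior described above for speech of different talkers, subject to the limitations of RMS as a loudness proxy.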
A visual display of one or more parameters for each local communicant in a local room would be useful. A simultaneous display of parameters related to the remote communicants also would have utility. Furthermore, it would be useful to provide capability for a communicant in the local room to interact with and affect the visual display. Further, it would be useful to have a visual display of information for both the remote communicants and the local communicants.
An example of a relatively expensive endpoint is an endpoint of a telepresence system that includes several life-size display screens, and one or more additional screens for control and for display of information. A spatial audio conferencing endpoint need not be a relatively expensive device. For example, a simple “speakerphone” conference endpoint is known. Simple voice endpoints such as the familiar triangular or circular devices are popular and will remain so because of their ease of use, size, simplicity, and cost. The user interfaces of such relatively inexpensive known endpoints may not include all the information that would be beneficial for communicants. It would be beneficial to include spatial processing in such an endpoint and to include a simple compact display that simultaneously and efficiently displays a plurality of properties to local communicants regarding the state of the conference, in particular, information regarding the remote and/or local communicants.