Conventional circuit-switched (CS) teleconferencing systems typically employ monophonic (“mono”) codecs. Examples of conventional CS conferencing systems of this type are the well-known Global System for Mobile Communications (GSM) and Universal Mobile Telecommunications System (UMTS) CS networks. Each monophonic encoded audio signal transmitted between nodes of such a system can be decoded and rendered to generate a single speaker feed for driving a speaker set (typically a single loudspeaker or a headset). However, the speaker feed cannot drive the speaker set to emit sound perceivable by a listener as originating at apparent source locations distinct from the actual location(s) of the loudspeaker(s) of the speaker set.
Even when a participant in a multi-participant telephone call implemented by a conventional CS conferencing system of this type uses an endpoint (e.g., a mobile phone) coupled to a multi-transducer headset or pair of headphones, if the endpoint generates a single speaker feed to drive the headset or headphones, the participant is unable to benefit from any spatial voice rendering technology that might otherwise improve the user's experience by providing better intelligibility via spatial separation of the voices of different participants. This is because the endpoint of such a conventional CS system cannot generate (in response to a received mono audio signal) multiple speaker feeds for driving multiple speakers to emit sound perceivable by a listener as originating from different conference participants, each participant at a different apparent source location (e.g., apparent source locations distinct from the actual locations of the speakers).
Conventional packet-switched (PS) conferencing systems can be configured to send to an endpoint a multichannel audio signal (e.g., with different channels of audio sent in different predetermined slots or segments within a packet, or in different packets) and optionally also metadata (e.g., in different packets, or different predetermined slots or segments within packets, than those in which the audio is sent). For example, UK Patent Application GB 2,416,955 A, published on Feb. 8, 2006, describes conferencing systems configured to send to endpoints a multichannel audio signal (with each channel comprising speech uttered at a different endpoint) and metadata (a tagging identifier for each channel) identifying the endpoint at which each channel's content originated, with each receiving endpoint configured to implement spatial voice rendering technology to generate multiple speaker feeds in response to the transmitted audio and metadata. Conventional PS conferencing systems could also be configured to send a mono audio signal and associated metadata, with the audio and metadata in different packets (or in different predetermined slots or segments within a packet), where the mono signal together with the metadata are sufficient to enable generation of a multichannel audio signal in response to the mono signal. Each receiving endpoint of such a system could be configured to implement spatial voice rendering technology to generate multiple speaker feeds (in response to transmitted multichannel audio, or mono audio with metadata of the above-noted type) for driving multiple speakers to emit sound perceivable by a listener as originating from different conference participants, each participant at a different apparent source location. Of course, each node (endpoint or server) of the system would need to share a protocol for interpretation of the transmitted data. 
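The arrangement described above can be illustrated with a minimal sketch of one possible packet layout. The layout below is purely hypothetical (neither GB 2,416,955 A nor the present text specifies a wire format): a one-byte channel count, then, for each channel, a tagging identifier (the metadata identifying the originating endpoint) followed by a fixed-size slot of 16-bit PCM samples. The constant `SAMPLES_PER_SLOT` (20 ms at an assumed 8 kHz sample rate) and both function names are illustrative assumptions.

```python
import struct

SAMPLES_PER_SLOT = 160  # assumed frame size: 20 ms of audio at 8 kHz

def pack_conference_packet(channels):
    """channels: list of (endpoint_id, samples), each samples list of
    SAMPLES_PER_SLOT signed 16-bit PCM values. Hypothetical layout only."""
    payload = struct.pack("!B", len(channels))  # header: channel count
    for endpoint_id, samples in channels:
        payload += struct.pack("!H", endpoint_id)  # metadata: tag identifying origin endpoint
        payload += struct.pack(f"!{SAMPLES_PER_SLOT}h", *samples)  # audio slot for this channel
    return payload

def unpack_conference_packet(payload):
    """Inverse of pack_conference_packet; both ends must share this protocol."""
    (count,) = struct.unpack_from("!B", payload, 0)
    offset, channels = 1, []
    for _ in range(count):
        (endpoint_id,) = struct.unpack_from("!H", payload, offset)
        offset += 2
        samples = list(struct.unpack_from(f"!{SAMPLES_PER_SLOT}h", payload, offset))
        offset += 2 * SAMPLES_PER_SLOT
        channels.append((endpoint_id, samples))
    return channels
```

The sketch makes concrete the final point of the preceding paragraph: a receiver that does not implement this (shared) layout cannot separate the channel slots from the tagging metadata.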
Thus, a conventional decoder (which does not implement the protocol required to identify and distinguish between different channels of transmitted multi-channel audio, or between metadata and monophonic audio transmitted in different packets or different slots or segments of a packet) could not be used in a receiving endpoint which renders the transmitted audio as an output soundfield. Rather, a special decoder (which implements the protocol required to distinguish between different channels of transmitted multi-channel audio, or between transmitted metadata and monophonic audio) would be needed.
In contrast, a conventional teleconferencing system (e.g., a conventional CS teleconferencing system) can be modified in accordance with typical embodiments of the present invention to become capable of generating mixed monophonic audio and metadata (meta information) regarding conference participants, and encoding the mixed monophonic audio and metadata for transmission over a link (e.g., a mono audio channel of the link) of the system, without any need for modifying the encoding scheme (e.g., a standardized encoding scheme) or decoding scheme (e.g., a standardized decoding scheme) implemented by any node of the system. A conventional decoder could decode the encoded, transmitted signal to recover the mixed monophonic audio and metadata, and simple processing would typically then be performed on the recovered mixed monophonic audio and metadata to identify the metadata (and typically also to remove, e.g., by notch filtering, the metadata from the monophonic audio).
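The "simple processing" referred to above can be sketched as follows. The sketch assumes a single metadata tone at an agreed in-band frequency (the 2000 Hz value, the 8 kHz sample rate, and the notch-filter Q below are illustrative assumptions, not values specified by the text): the receiver detects the tone's presence with the Goertzel algorithm, then removes it from the decoded monophonic audio with a second-order IIR notch filter.

```python
import math

FS = 8000       # assumed sample rate (Hz)
TONE_HZ = 2000  # assumed metadata-tone frequency; any agreed in-band frequency works

def goertzel_power(x, freq, fs=FS):
    """Signal power at a single frequency (Goertzel algorithm)."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / fs)
    s1 = s2 = 0.0
    for v in x:
        s0 = v + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

def notch(x, freq, fs=FS, q=30.0):
    """Second-order IIR notch (RBJ biquad) removing a narrow band around freq."""
    w0 = 2.0 * math.pi * freq / fs
    alpha = math.sin(w0) / (2.0 * q)
    b0, b1, b2 = 1.0, -2.0 * math.cos(w0), 1.0
    a0, a1, a2 = 1.0 + alpha, -2.0 * math.cos(w0), 1.0 - alpha
    y, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for v in x:
        out = (b0 * v + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0
        y.append(out)
        x1, x2, y1, y2 = v, x1, out, y1
    return y

# Mixed mono signal: a 400 Hz sine stands in for speech, plus the metadata tone.
n = 800
mixed = [math.sin(2 * math.pi * 400 * i / FS)
         + 0.5 * math.sin(2 * math.pi * TONE_HZ * i / FS) for i in range(n)]
tone_present = goertzel_power(mixed, TONE_HZ) > 1000.0  # identify the metadata tone
cleaned = notch(mixed, TONE_HZ)                         # remove it from the audio
```

After the notch, the metadata tone is strongly attenuated while the stand-in speech component passes essentially unchanged, illustrating why a conventional decoder followed by this post-processing suffices at the receiving endpoint.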
Typical embodiments of the invention employ the simple but efficient idea of in-band signaling using tones, in the context of transmitting metadata tones (indicative of a dominant teleconference participant) mixed with monophonic audio, to enable rendering of the monophonic audio as a soundfield. An example of conventional use of in-band signaling using tones is the transmission of Dual-Tone Multi-Frequency (DTMF) tones, widely implemented in current telecommunications systems (although not for the purpose of carrying spatial audio information, or metadata enabling the rendering of monophonic teleconference audio as a soundfield).
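The conventional DTMF signaling mentioned above can be sketched briefly. Each symbol is the sum of one tone from a low-frequency group and one from a high-frequency group (the frequency table below is the standard DTMF assignment); the receiver identifies the symbol by measuring power at each candidate frequency. The tone duration, amplitudes, and the single-bin DFT detector are illustrative choices (a Goertzel detector would be the usual practical implementation).

```python
import cmath
import math

FS = 8000  # assumed sample rate (Hz)

# Standard DTMF assignment: (low-group Hz, high-group Hz) per symbol.
DTMF = {
    "1": (697, 1209), "2": (697, 1336), "3": (697, 1477),
    "4": (770, 1209), "5": (770, 1336), "6": (770, 1477),
    "7": (852, 1209), "8": (852, 1336), "9": (852, 1477),
    "*": (941, 1209), "0": (941, 1336), "#": (941, 1477),
}

def dtmf_tone(symbol, duration=0.05, fs=FS):
    """Synthesize a DTMF symbol as the sum of its two component sinusoids."""
    lo, hi = DTMF[symbol]
    n = int(duration * fs)
    return [0.5 * math.sin(2 * math.pi * lo * i / fs)
            + 0.5 * math.sin(2 * math.pi * hi * i / fs) for i in range(n)]

def power_at(x, freq, fs=FS):
    """Power of x at a single frequency via a one-bin DFT."""
    acc = sum(v * cmath.exp(-2j * math.pi * freq * i / fs) for i, v in enumerate(x))
    return abs(acc) ** 2

def detect_dtmf(x, fs=FS):
    """Pick the strongest low-group and high-group tone, then look up the symbol."""
    lows, highs = (697, 770, 852, 941), (1209, 1336, 1477)
    lo = max(lows, key=lambda f: power_at(x, f, fs))
    hi = max(highs, key=lambda f: power_at(x, f, fs))
    return {pair: sym for sym, pair in DTMF.items()}[(lo, hi)]
```

The same tone-based mechanism, repurposed to carry metadata about the dominant participant rather than dialed digits, is the kind of in-band signaling the embodiments described above exploit.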