Any discussion of the background art throughout the specification should in no way be considered as an admission that such art is widely known or forms part of common general knowledge in the field.
Video and audio teleconferencing systems, in which multiple parties interact remotely to carry out a conference, are an important resource. Many such systems are known. Most rely on a central or distributed server resource to ensure each participant is able to hear and/or see the other participants using, for example, dedicated teleconferencing devices, standard computer resources with audio input/output facilities or smartphone-type devices. The central or distributed server resource is responsible for appropriately mixing together the uplinked audio signals from each conference participant and downlinking the mixed audio signals for playback by each audio output device.
By way of background, in a typical (known) teleconferencing system a mixer receives a respective ‘uplink stream’ from each of the telephone endpoints, which carries an audio signal captured by that telephone endpoint, and sends a respective ‘downlink stream’ to each of the telephone endpoints; thus each telephone endpoint receives a downlink stream which is able to carry a mixture of the respective audio signals captured by the other telephone endpoints. Accordingly, when two or more participants in a telephone conference speak at the same time, the other participant(s) can hear all of those participants speaking.
It is known (and usually desirable) for the mixer to employ an adaptive approach whereby it changes the mixing in response to perceiving certain variations in one or more of the audio signals. For example, an audio signal may be omitted from the mixture in response to determining that it contains no speech (i.e. only background noise).
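By way of illustration only, the adaptive behaviour described above might be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular known mixer: the function name `adaptive_mix`, the dictionary layout and the per-endpoint `has_speech` flags are hypothetical, and each frame is assumed to be an equal-length list of samples.

```python
def adaptive_mix(frames, has_speech):
    """Sum only those frames flagged as containing speech.

    frames:     dict mapping endpoint id -> list of samples (equal lengths assumed)
    has_speech: dict mapping endpoint id -> bool (hypothetical speech decision)
    """
    if not frames:
        return []
    length = len(next(iter(frames.values())))
    mix = [0.0] * length
    for endpoint, frame in frames.items():
        # A stream judged to contain only background noise is omitted.
        if not has_speech.get(endpoint, False):
            continue
        for i, sample in enumerate(frame):
            mix[i] += sample
    return mix


frames = {"a": [1.0, 2.0], "b": [0.5, 0.5]}
flags = {"a": True, "b": False}
mixed = adaptive_mix(frames, flags)  # endpoint "b" is left out of the mixture
```

In this sketch the speech decision is supplied from outside; how such a decision is reached is not specified here.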
Consider a teleconferencing system in which telephone endpoints each send an uplink audio stream to a teleconferencing mixer. In such a system, the uplinks and downlinks may be encoded digitally and transmitted via a suitable packet-switched network, such as a voice over internet protocol (VoIP) network, or they may travel over a circuit-switched network, such as the public switched telephone network (PSTN). Either way, it is the mixer's responsibility to produce a downlink audio stream to send back to each endpoint such that, in general, each participant hears every other participant but not himself or herself.
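The mixer's responsibility described above, namely that each endpoint receives every other endpoint's audio but not its own, can be illustrated with a short sketch. The function name `mix_minus` and the data layout are assumptions made for illustration only; frames are taken to be equal-length lists of decoded samples, and encoding, transport and level control are ignored.

```python
def mix_minus(uplinks):
    """Build one downlink frame per endpoint from all uplink frames.

    uplinks: dict mapping endpoint id -> list of samples (equal lengths assumed)
    returns: dict mapping endpoint id -> mixture of every *other* endpoint
    """
    if not uplinks:
        return {}
    length = len(next(iter(uplinks.values())))
    # Sum all uplinks once, then subtract each endpoint's own contribution.
    total = [sum(frame[i] for frame in uplinks.values()) for i in range(length)]
    return {
        endpoint: [total[i] - frame[i] for i in range(length)]
        for endpoint, frame in uplinks.items()
    }


downlinks = mix_minus({
    "a": [1.0, 0.0],
    "b": [0.0, 2.0],
    "c": [3.0, 3.0],
})
# downlinks["a"] carries only the audio captured by "b" and "c"
```

Summing once and subtracting is one of several equivalent ways to form such a mixture; summing the other endpoints directly for each downlink would give the same result.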
One class of endpoint in such a system employs discontinuous transmission (DTX) on the uplink. Such an endpoint attempts to maximise intelligibility while minimising the use of network resources by one or more of: employing microphone placements close to the talkers' mouths; employing noise suppression signal processing which removes background noise; and only sending the uplink stream when human speech is present.
This strategy may result in less aberrant noise being heard by the listener, but it may also result in a less natural-sounding experience: firstly, because noise suppression signal processing typically results in the introduction of disturbing dynamic artefacts when the background noise is non-stationary; secondly, because the noise suppression affects the equalisation of the speech; and thirdly, because the binary transmit/don't-transmit decision, based on imperfect information from a voice activity detector (VAD), will sometimes lead to speech being cut off and at other times lead to residual noise being transmitted as speech. Thus, an audio stream received from a DTX device is an example of an audio input stream which is expected to include no more than a negligible amount of human-perceivable background noise.
A second class of endpoint employs continuous transmission (CTX) on the uplink. That is, a CTX endpoint sends an audio stream regardless of whether the VAD (if present) determines that speech is present or not. Here the intention is often to maximise the naturalness of the listening experience and to allow a remote listener to perform the binaural processing associated with the well-known cocktail party problem, just as if he or she were present in person. Accordingly, a CTX endpoint may employ multiple microphones to retain spatial diversity and so allow binaural release from masking. The designer of a CTX device may also seek to limit the amount of noise suppression processing that the device performs in order to minimise the potential for disturbing dynamic artefacts and spectral colouration. Thus, an audio stream received from a CTX device is an example of an audio input stream which is expected to include more than a negligible amount of human-perceivable background noise.
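The difference between the two classes of endpoint, as far as the uplink transmit decision is concerned, can be summarised in a short sketch. The names `Mode` and `frames_to_send` are hypothetical, and the per-frame VAD flags are assumed to be supplied by some detector that is not modelled here; noise suppression and microphone arrangement are likewise outside the scope of the sketch.

```python
from enum import Enum


class Mode(Enum):
    DTX = "dtx"  # discontinuous transmission
    CTX = "ctx"  # continuous transmission


def frames_to_send(frames, vad_flags, mode):
    """Return the uplink frames an endpoint would transmit.

    frames:    sequence of captured frames
    vad_flags: sequence of bools, True where the VAD reports speech
    mode:      Mode.DTX or Mode.CTX
    """
    if mode is Mode.CTX:
        # A CTX endpoint sends every frame, whatever the VAD reports.
        return list(frames)
    # A DTX endpoint sends a frame only when speech is reported.
    return [frame for frame, speech in zip(frames, vad_flags) if speech]


captured = ["f0", "f1", "f2"]
vad = [False, True, False]
dtx_uplink = frames_to_send(captured, vad, Mode.DTX)  # only "f1" is sent
ctx_uplink = frames_to_send(captured, vad, Mode.CTX)  # all three frames are sent
```

Because the DTX decision rests on the VAD flags alone, any error in those flags translates directly into clipped speech or transmitted noise, as noted above.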