Conference calls are a common way of holding virtual meetings between remote participants. During a conference call between remote participants on a variety of devices, the quality of the audio signal is critical, regardless of whether the conference call is audio-only, a video conference, or a combination of both.
With improving technology, the variety of devices used by different participants to access virtual meetings has increased. Different participants of a single virtual meeting or conference call may for example use one or more of a smartphone, tablet, laptop, video endpoint or Lync client to access the meeting.
The purpose of a network audio mixer is to enable audio conferencing functionality between the participants.
Each participant of the meeting will contribute an audio stream via the microphone of their chosen device. This audio stream will be compressed locally, resulting in a stream of Real-time Transport Protocol (RTP) packets.
The compression is usually performed by a standard audio codec such as G.722 or AAC-LD. However, different audio generating/receiving devices are likely to use different audio sample rates. For example, a high-end video conference suite is likely to be configured to send and receive higher sample rates than a mobile phone. Typically, audio compression standards in voice over Internet Protocol (VoIP) use sample rates such as 8, 16, 32 and 48 kHz.
The audio mixer will mix the packets from all of the participants, and will send back to each participant a stream of compressed audio packets which enable the participant to hear every other participant in the conference apart from themselves.
A schematic diagram of an example of a conventional audio network mixer 10 is shown in FIG. 1. Each participant generates an audio stream which must be decoded by a decoder 1a, 1b, 1c for that audio stream before the audio signals can be mixed. Each audio stream may use a different sample rate and different audio codec.
Because the streams may differ in sample rate, each decoded audio signal generally also needs to be resampled to a common format prior to mixing. Each audio stream therefore passes through a resampler 2a, 2b, 2c. Once decoded and resampled into the common format, the signals are mixed together in a single mixer 3. Usually, the common format used for mixing corresponds to the highest sample rate used by any of the participants.
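The choice of a common format and the resampling step can be sketched as follows. This is a minimal illustration only: the function names `common_rate` and `resample_linear` are hypothetical, and linear interpolation is used purely for brevity. A practical resampler would typically use band-limited (for example polyphase) filtering, which is where the computational cost arises.

```python
def common_rate(rates_hz):
    """Pick the mixing rate as the highest rate used by any participant."""
    return max(rates_hz)


def resample_linear(samples, src_rate, dst_rate):
    """Resample a list of float samples by linear interpolation.

    Illustrative only: no anti-aliasing filter is applied, so this is
    not suitable for production-quality downsampling.
    """
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in source samples
        left = int(pos)
        right = min(left + 1, len(samples) - 1)  # clamp at the final sample
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


# Three participants at 8, 16 and 48 kHz are all mixed at 48 kHz.
mix_rate = common_rate([8000, 16000, 48000])
upsampled = resample_linear([0.0, 2.0, 4.0, 6.0], 8000, 16000)
```

Under these assumptions, a participant already sending at the mixing rate passes through unchanged, while every other stream is converted before reaching the mixer 3.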
For each participant, their own resampled input signal is then subtracted from the mixed signal to produce an output which must then be converted back (resampled) into a suitable format to be encoded by a separate encoder 6a, 6b, 6c for each participant.
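The subtraction step described above can be sketched as follows, assuming all input signals have already been decoded and resampled to the common format and are of equal length. The function name `mix_minus` is hypothetical, and the sketch omits gain control and clipping protection that a real mixer would need.

```python
def mix_minus(inputs):
    """Return one output signal per participant.

    inputs: list of equal-length lists of float samples, one list per
    participant, all at the common mixing sample rate.
    Each output is the sum of all inputs minus that participant's own input.
    """
    if not inputs:
        return []
    length = len(inputs[0])
    if any(len(signal) != length for signal in inputs):
        raise ValueError("all inputs must have the same length")
    # Single mix of every participant (the role of mixer 3).
    mixed = [sum(samples) for samples in zip(*inputs)]
    # Subtract each participant's own contribution from the full mix.
    return [[m - x for m, x in zip(mixed, signal)] for signal in inputs]


outputs = mix_minus([[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]])
```

Each entry of `outputs` would then be resampled to that participant's own rate and passed to their encoder.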
In the example shown in FIG. 1, each decoder 1a, 1b, 1c may be located after a jitter buffer 7a, 7b, 7c.
For a conference with N participants, a conventional audio mixer such as that of FIG. 1 requires up to N decodes, 2N audio resamplings, and N encodes. Thus, each “audio path” through the mixer (from the audio signal sent out by a participant to the audio signal they receive back) is typically resampled twice, which can lead to a loss of quality. In addition, resamplers are computationally expensive components.
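The worst-case operation counts stated above can be tabulated with a short sketch. The function name `conventional_mixer_ops` is hypothetical, and the figures are upper bounds, since a participant already using the common format would not need resampling.

```python
def conventional_mixer_ops(n_participants):
    """Upper-bound operation counts for a conventional mixer as in FIG. 1."""
    return {
        "decodes": n_participants,
        "resamplings": 2 * n_participants,  # one before and one after mixing
        "encodes": n_participants,
    }


ops = conventional_mixer_ops(5)
```

For a five-party conference, this gives up to 5 decodes, 10 resamplings and 5 encodes.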