Conferencing allows group communication and collaboration among geographically dispersed participants (also called users below). Historically, conferencing has been achieved in the Public Switched Telephone Network (PSTN) by means of a centralized conference bridge. In large-scale audio and video conferencing systems, media mixers are cascaded to support a large number of users connecting from different locations. To maintain a low end-to-end delay, each intermediate node must minimize the delay it introduces. At the same time, each node must ensure that the delay it introduces is sufficient to generate almost degradation-free media.
In a circuit-switched network such as the PSTN, the mixing of real-time media streams from several users can usually be performed without causing any substantial additional delay. In, e.g., a voice teleconference, the individual audio samples from the participants are synchronized and arrive at regular time intervals. This means that the samples can be scheduled for processing at regular time intervals, and no delay is added beyond the time needed for the processing itself. The processing for a voice teleconference usually consists of determining which talkers are active and summing the speech contributions from the active talkers.
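The processing described above can be illustrated with a minimal sketch. It assumes 16-bit PCM frames of equal length, one per participant, and uses a simple frame-energy threshold to decide which talkers are active. The names `frame_energy`, `mix_frames`, and `ENERGY_THRESHOLD`, as well as the threshold value, are hypothetical and chosen only for illustration; a real conference bridge may use a different activity detector.

```python
from typing import Dict, List

# Hypothetical per-frame energy threshold used to decide whether a talker is active.
ENERGY_THRESHOLD = 1_000_000


def frame_energy(frame: List[int]) -> int:
    """Sum of squared samples, used here as a crude activity measure."""
    return sum(s * s for s in frame)


def mix_frames(frames: Dict[str, List[int]],
               threshold: int = ENERGY_THRESHOLD) -> List[int]:
    """Sum the frames of all active talkers, clipping to the 16-bit range."""
    active = [f for f in frames.values() if frame_energy(f) >= threshold]
    if not active:
        # Nobody is talking: emit silence of the same frame length.
        length = len(next(iter(frames.values()), []))
        return [0] * length
    mixed = []
    for samples in zip(*active):
        total = sum(samples)
        mixed.append(max(-32768, min(32767, total)))  # saturate instead of wrapping
    return mixed
```

Because the frames arrive synchronously in the circuit-switched case, such a function could be called once per frame interval without any buffering in front of it.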
Current trends point towards the migration of voice communication services from the circuit-switched PSTN to non-synchronous packet-based Internet Protocol (IP) networks. This shift is motivated by a desire to provide data and voice services on a single, packet-based network infrastructure. In a non-synchronous packet network, the audio samples (or coded parameters representing the audio samples) from the participants in, e.g., a voice teleconference usually do not arrive at regular time intervals, due to jitter in the transport network. Also, the speech data from the individual participants might not be sampled with exactly the same sampling frequency, which introduces a drift between the data from the participants.
In order to synchronize the speech contributions from the participants, and thus make it possible to mix samples corresponding to temporally related packets from all participants, jitter buffers are typically implemented in the conference bridge on the incoming speech to absorb the varying delay of the packets. With conventional jitter buffers, the size of each buffer must be at least as large as the jitter in order to avoid late losses. Cascaded mixers buffer the incoming media several times, causing the end-to-end delay to grow.
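The relation between buffer size, jitter, and late losses can be made concrete with a small sketch. It assumes a static jitter buffer whose playout deadline is the smallest observed network delay plus the buffer size, which is a simplification made only for illustration; the function names `late_losses` and `cascaded_buffer_delay` and all delay values are hypothetical.

```python
from typing import List


def late_losses(network_delays_ms: List[float], buffer_ms: float) -> int:
    """Count packets that arrive after their playout deadline.

    Assumption: the deadline equals the smallest observed network delay
    plus the static buffer size.
    """
    base = min(network_delays_ms)
    return sum(1 for d in network_delays_ms if d - base > buffer_ms)


def cascaded_buffer_delay(buffer_sizes_ms: List[float]) -> float:
    """Total buffering delay when every mixer in a cascade buffers the media again."""
    return sum(buffer_sizes_ms)


# Hypothetical per-packet network delays; the jitter here is 60 - 20 = 40 ms.
delays = [20, 35, 22, 60, 25]
losses_full = late_losses(delays, buffer_ms=40)   # buffer as large as the jitter
losses_small = late_losses(delays, buffer_ms=20)  # buffer smaller than the jitter
total_delay = cascaded_buffer_delay([40, 40, 40])  # three cascaded mixers
```

Under these assumptions, a buffer as large as the jitter loses no packets, a smaller buffer drops the most delayed packet, and three cascaded mixers with 40 ms buffers add 120 ms of buffering delay in total.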
Another solution, called early mixing, does not utilize jitter buffers at all. Instead, packets are mixed as soon as all temporally related packets have arrived at the mixer. This includes setting a waiting time that takes the delay into account. Even though early mixing solutions may decrease the end-to-end delay compared to conventional static jitter buffers, they do not provide an optimized solution.
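A minimal sketch of early mixing, under stated assumptions, is given here. Packets are grouped by a media timestamp, a group is mixed as soon as every participant has contributed, and an incomplete group is mixed once a waiting time has elapsed since its first packet arrived. The class `EarlyMixer`, its methods, and the parameter `max_wait_ms` are hypothetical names; handling of packets that arrive after their group has already been mixed is omitted for brevity.

```python
from typing import Dict, List, Optional, Set


class EarlyMixer:
    """Mixes temporally related packets as soon as all of them have arrived."""

    def __init__(self, participants: Set[str], max_wait_ms: float):
        self.participants = set(participants)
        self.max_wait_ms = max_wait_ms
        self.pending: Dict[int, Dict[str, List[int]]] = {}  # timestamp -> sender -> samples
        self.first_arrival: Dict[int, float] = {}           # timestamp -> arrival time of first packet

    def on_packet(self, now_ms: float, sender: str, timestamp: int,
                  samples: List[int]) -> Optional[List[int]]:
        """Store a packet; return the mix if its group is now complete."""
        slot = self.pending.setdefault(timestamp, {})
        self.first_arrival.setdefault(timestamp, now_ms)
        slot[sender] = samples
        if set(slot) == self.participants:
            return self._mix(timestamp)
        return None

    def on_tick(self, now_ms: float) -> List[List[int]]:
        """Mix every incomplete group whose waiting time has expired."""
        expired = [ts for ts, t0 in self.first_arrival.items()
                   if now_ms - t0 >= self.max_wait_ms]
        return [self._mix(ts) for ts in sorted(expired)]

    def _mix(self, timestamp: int) -> List[int]:
        slot = self.pending.pop(timestamp)
        del self.first_arrival[timestamp]
        # Plain summation; activity detection and clipping are left out here.
        return [sum(column) for column in zip(*slot.values())]
```

In this sketch, the choice of `max_wait_ms` is the trade-off the paragraph above points to: a short waiting time lowers delay but mixes more groups without all contributions, while a long waiting time approaches the delay of a static jitter buffer.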
Due to the above-mentioned disadvantages, there is a need for a mixing solution that improves the mixing in a packet-based network without introducing unnecessary delay.