Embodiments according to the present invention relate to apparatuses for mixing a plurality of input data streams to obtain an output data stream, which may for instance be used in the field of conferencing systems including video conferencing systems and teleconferencing systems.
In many applications more than one audio signal is to be processed in such a way that from the number of audio signals, one signal, or at least a reduced number of signals is to be generated, which is often referred to as “mixing”. The process of mixing of audio signals, hence, may be referred to as bundling several individual audio signals into a resulting signal. This process is used for instance when creating pieces of music for a compact disc (“dubbing”). In this case, different audio signals of different instruments along with one or more audio signals comprising vocal performances (singing) are typically mixed into a song.
Further fields of application, in which mixing plays an important role, are video conferencing systems and teleconferencing systems. Such a system is typically capable of connecting several spatially distributed participants in a conference by employing a central server, which appropriately mixes the incoming video and audio data of the registered participants and sends to each of the participants a resulting signal in return. This resulting signal or output signal comprises the audio signals of all the other conference participants.
In modern digital conferencing systems a number of partially contradicting goals and aspects compete with each other. The quality of a reconstructed audio signal, as well as applicability and usefulness of some coding and decoding techniques for different types of audio signals (e.g. speech signals compared to general audio signals and musical signals), have to be taken into consideration. Further aspects that may have to be considered also when designing and implementing conferencing systems are the available bandwidth and delay issues.
For instance, when balancing quality on the one hand and bandwidth on the other hand, a compromise is in most cases inevitable. However, improvements concerning the quality may be achieved by implementing modern coding and decoding techniques such as the AAC-ELD technique (AAC=Advanced Audio Codec; ELD=Enhanced Low Delay). However, the achievable quality may be negatively affected in systems employing such modern techniques by more fundamental problems and aspects.
To name just one challenge to be met, all digital signal transmissions face the problem of an essential quantization, which may, at least in principle, be avoidable under ideal circumstances in a noiseless analog system. Due to the quantization process inevitably a certain amount of quantization noise is introduced into the signal to be processed. To counteract possible and audible distortions, one might be tempted to increase the number of quantization levels and, hence, increase the quantization resolution accordingly. This, however, leads to a greater number of signal values to be transmitted and, hence, to an increase of the amount of data to be transmitted. In other words, improving the quality by reducing possible distortions introduced by quantization noise might under certain circumstances increase the amount of data to be transmitted and may eventually violate bandwidth restrictions imposed on a transmission system.
In the case of conferencing systems, the challenges of improving a trade-off between quality, available bandwidth and other parameters may be even further complicated by the fact that typically more than one input audio signal is to be processed. Hence, boundary conditions imposed by more than one audio signal may have to be taken into consideration when generating the output signal or resulting signal produced by the conferencing system.
Especially in view of the additional challenge of implementing conferencing systems with a sufficiently low delay to enable a direct communication between the participants of a conference without introducing substantial delays which may be considered unacceptable by the participants, further increases the challenge.
In low delay implementations of conferencing systems, sources of delay are typically restricted in terms of their number, which on the other hand might lead to the challenge of processing the data outside the time-domain, in which mixing of the audio signals may be achieved by superimposing or adding the respective signals.
For improving the trade-off between quality and bitrate in the case of general audio signals, a significant number of techniques exist which are capable of further improving a trade-off between such contradicting parameters such as quality of a reconstructed signal, bitrate, delay, computational complexity and further parameters.
A highly flexible tool to improve the previously mentioned trade-off is the so-called spectral band representation tool (SBR). The SBR-module is typically not implemented to be part of a central encoder, such as the MPEG-4 AAC encoder, but is rather an additional encoder and decoder. SBR utilizes a correlation between higher and lower frequencies within an audio signal. SBR is based on the assumption that higher frequencies of a signal are merely integer multiples of a ground oscillation so that the higher frequencies can be replicated on the basis of the lower spectrum. Since the audible resolution of the human ear in the case of higher frequencies logarithmically, the low differences concerning higher frequency ranges may furthermore only be realized by very experienced listeners so that inaccuracies introduced by the SBR encoder will, most probably, be unnoticed by the vast majority of listeners.
The SBR encoder preprocesses the audio signal provided to the MPEG-4 encoder and separates the input signal into frequency ranges. The lower frequency range or frequency band is separated from an upper frequency band or frequency range by a so-called cross-over frequency, which can be set variably, depending on the available bitrate and further parameters. The SBR encoder utilizes a filterbank for analyzing the frequency, which is typically implemented to be a quadrature mirror filter band (QMF).
The SBR encoder extracts from the frequency representation of the upper frequency range energy values, which will later be used for reconstructing this frequency range based on the lower frequency band.
The SBR encoder, hence, provides SBR-data or SBR parameters along with a filtered audio signal or filtered audio data to a core encoder, which is applied to the lower frequency band based on half the sampling frequency of the original audio signal. This provides the opportunity of processing significantly less sample values so that the individual quantization levels may be more accurately set. The additional data provided by the SBR encoder, namely the SBR parameters, will be stored into a resulting bit stream by the MPEG-4 encoder or any other encoder as side information. This may be achieved by using an appropriate bit multiplexer.
On the decoder side, the incoming bit streams is first de-multiplexed by a bit demultiplexer, which separates at least the SBR-data and provides same to a SBR decoder. However, before the SBR decoder processes the SBR parameters, the lower frequency band will first be decoded by a core decoder to reconstruct the audio signal of the lower frequency band. The SBR decoder itself calculates, based on the SBR energy values (SBR parameters) and the spectral information of the lower frequency range, the upper part of the spectrum of the audio signal. In other words, the SBR decoder replicates the upper spectral band of the audio signal based on the lower band as well as the SBR parameters transmitted in the previously described bit stream. Apart from the previously described possibility of the SBR-module, to enhance the overall audio perception of the reconstructed audio signal, SBR furthermore offers the possibility of encoding additional noise sources as well as individual sinusoids.
SBR, hence, represents a very flexible tool to improve the trade-off between quality and bitrate which also makes SBR an interesting candidate for applications in the field of conferencing systems. However, due to the complexity and vast number of possibilities and options, SBR-encoded audio signals have only been so far mixed in the time-domain by completely decoding the respective audio signals into time-domain signals to perform the actual mixing process in this domain and, afterwards, re-encode the mixed signal into an SBR-encoded signal. Apart from the additional delay introduced due to encoding the signals into the time-domain, also the reconstruction of the spectral information of the encoded audio signal may necessitate a significant computational complexity which may, for instance, be unattractive in the case of portable or other energy-efficient or computational complexity efficient applications.