Embodiments according to the present invention relate to mixing a plurality of input data streams to obtain an output data stream and generating an output data stream by mixing first and second input data streams, respectively. The output data stream may, for instance, be used in the field of conferencing systems including video conferencing systems and teleconferencing systems.
In many applications, more than one audio signal is to be processed in such a way that, from a number of audio signals, one signal, or at least a reduced number of signals, is generated, which is often referred to as “mixing”. The process of mixing audio signals may, hence, be referred to as bundling several individual audio signals into a resulting signal. This process is used, for instance, when creating pieces of music for a compact disc (“dubbing”). In this case, different audio signals of different instruments, along with one or more audio signals comprising vocal performances (singing), are typically mixed into a song.
Further fields of application, in which mixing plays an important role, are video conferencing systems and teleconferencing systems. Such a system is typically capable of connecting several spatially distributed participants in a conference by employing a central server, which appropriately mixes the incoming video and audio data of the registered participants and sends to each of the participants a resulting signal in return. This resulting signal or output signal comprises the audio signals of all the other conference participants.
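The server behavior described above, in which each participant receives the signals of all the other participants, can be sketched as follows. This is a minimal time-domain illustration only: the function name `mix_for_participants`, the participant names and the sample values are hypothetical, and computing the total once and subtracting each participant's own signal is merely one convenient way to form the per-participant mixes.

```python
from typing import Dict, List


def mix_for_participants(inputs: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Return, for each participant, the sum of all OTHER participants' samples."""
    if not inputs:
        return {}
    length = len(next(iter(inputs.values())))
    if any(len(samples) != length for samples in inputs.values()):
        raise ValueError("all input frames must have the same length")
    # Sum of all inputs per sample position, computed once.
    total = [sum(samples) for samples in zip(*inputs.values())]
    # Each output is the total minus the participant's own contribution.
    return {
        name: [t - x for t, x in zip(total, own)]
        for name, own in inputs.items()
    }


# Hypothetical two-sample frames for three participants.
frames = {
    "alice": [1.0, 2.0],
    "bob": [10.0, 20.0],
    "carol": [100.0, 200.0],
}
outputs = mix_for_participants(frames)
```

In this sketch, the entry for each participant contains only the contributions of the two others, so no participant hears his or her own signal in return.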
In modern digital conferencing systems, a number of partially contradicting goals and aspects compete with each other. The quality of a reconstructed audio signal, as well as applicability and usefulness of some coding and decoding techniques for different types of audio signals (e.g. speech signals compared to general audio signals and musical signals), have to be taken into consideration. Further aspects that may also have to be considered when designing and implementing conferencing systems are the available bandwidth and delay issues.
For instance, when balancing quality on the one hand and bandwidth on the other hand, a compromise is in most cases inevitable. Improvements concerning the quality may be achieved by implementing modern coding and decoding techniques such as the AAC-ELD technique (AAC=Advanced Audio Coding; ELD=Enhanced Low Delay). However, in systems employing such modern techniques, the achievable quality may still be negatively affected by more fundamental problems and aspects.
To name just one challenge to be met, all digital signal transmissions face the problem of a necessary quantization, which may, at least in principle, be avoidable under ideal circumstances in a noiseless analog system. Due to the quantization process, a certain amount of quantization noise is inevitably introduced into the signal to be processed. To counteract possible audible distortions, one might be tempted to increase the number of quantization levels and, hence, increase the quantization resolution accordingly. This, however, leads to a greater number of distinct signal values to be represented and, hence, to an increase of the amount of data to be transmitted. In other words, improving the quality by reducing possible distortions introduced by quantization noise might under certain circumstances increase the amount of data to be transmitted and may eventually violate bandwidth restrictions imposed on a transmission system.
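The trade-off between quantization resolution and the amount of data can be illustrated with a small numerical sketch. It assumes a uniform quantizer on the range from -1 to 1 and a full-scale sine test signal; the function names `quantize` and `snr_db`, the test frequency and the frame length are hypothetical choices made for illustration and are not taken from any particular codec.

```python
import math


def quantize(x: float, bits: int) -> float:
    """Uniform mid-rise quantizer on [-1, 1] with 2**bits levels."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = math.floor(x / step)
    # Clamp so that x == 1.0 still maps to the highest valid level.
    index = max(-levels // 2, min(levels // 2 - 1, index))
    return (index + 0.5) * step


def snr_db(bits: int, n: int = 4096) -> float:
    """Signal-to-quantization-noise ratio, in dB, for a full-scale sine."""
    signal = [math.sin(2.0 * math.pi * 37 * k / n) for k in range(n)]
    signal_power = sum(s * s for s in signal) / n
    noise_power = sum((s - quantize(s, bits)) ** 2 for s in signal) / n
    return 10.0 * math.log10(signal_power / noise_power)


# Each additional bit doubles the number of levels and costs one more bit
# per transmitted sample.
results = {bits: snr_db(bits) for bits in (4, 8, 12, 16)}
for bits, snr in results.items():
    print(f"{bits:2d} bits/sample -> SNR {snr:6.1f} dB, {bits * 4096} bits/frame")
```

Under these assumptions, the signal-to-noise ratio rises by roughly 6 dB per additional bit, while the number of bits per frame grows linearly with the resolution, which is the conflict with bandwidth restrictions described above.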
In the case of conferencing systems, the challenges of improving a trade-off between quality, available bandwidth and other parameters may be even further complicated by the fact that typically more than one input audio signal is to be processed. Hence, boundary conditions imposed by more than one audio signal may have to be taken into consideration when generating the output signal or resulting signal produced by the conferencing system.
The additional challenge of implementing conferencing systems with a sufficiently low delay, so as to enable direct communication between the participants of a conference without introducing delays that the participants may consider unacceptable, complicates matters further.
In low delay implementations of conferencing systems, sources of delay are typically restricted in terms of their number, which on the other hand might lead to the challenge of processing the data outside the time-domain, in which mixing of the audio signals may be achieved by superimposing or adding the respective signals.
Generally speaking, it is favorable to carefully choose a trade-off between quality, available bandwidth and other parameters that is suitable for conferencing systems, in order to cope with the processing overhead for mixing in real time, reduce the amount of hardware needed, and keep the costs in terms of hardware and transmission overhead reasonable without compromising the audio quality.
To reduce an amount of data transmitted, modern audio codecs often utilize highly sophisticated tools to describe spectral information concerning spectral components of a respective audio signal. By utilizing such tools, which are based on psycho-acoustic phenomena and examination results, an improved trade-off between partially contradicting parameters and boundary conditions such as the quality of the reconstructed audio signal from the transmitted data, computational complexity, bitrate, and further parameters can be achieved.
Examples of such tools are perceptual noise substitution (PNS), temporal noise shaping (TNS), and spectral band replication (SBR), to name but a few. All these techniques are based on describing at least part of the spectral information with a reduced number of bits so that, compared to a data stream that does not use these tools, more bits can be allocated to important parts of the spectrum. As a consequence, while maintaining the bitrate, the perceived level of quality may be improved by using such tools. Naturally, a different trade-off may be selected, namely to reduce the number of bits transmitted per frame of audio data while maintaining the overall audio impression. Trade-offs lying between these two extremes may equally well be realized.
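Of the tools named above, perceptual noise substitution is perhaps the easiest to illustrate. The following toy sketch only conveys the general idea of describing a noise-like band by a single energy value instead of its individual spectral values; it is not the bitstream syntax or the decision logic of any actual codec, and the function names, the band width of 16 values and the random test data are hypothetical.

```python
import math
import random
from typing import List


def encode_noise_band(coeffs: List[float]) -> float:
    """Encoder side: keep only the band's energy instead of every value."""
    return sum(c * c for c in coeffs)


def decode_noise_band(energy: float, width: int, rng: random.Random) -> List[float]:
    """Decoder side: synthesize random values scaled to the given energy."""
    noise = [rng.uniform(-1.0, 1.0) for _ in range(width)]
    noise_energy = sum(v * v for v in noise)
    if noise_energy == 0.0:
        return [0.0] * width
    scale = math.sqrt(energy / noise_energy)
    return [v * scale for v in noise]


rng = random.Random(0)
band = [rng.gauss(0.0, 1.0) for _ in range(16)]  # hypothetical noise-like band
energy = encode_noise_band(band)                 # one value instead of 16
rebuilt = decode_noise_band(energy, len(band), rng)
rebuilt_energy = sum(v * v for v in rebuilt)
```

The rebuilt band differs from the original value by value but carries the same energy, which is why such a description needs far fewer bits, and also why the individual spectral values are not directly available from a data stream using such a tool.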
These tools may also be used in telecommunication applications. However, when more than two participants are present in such a communication situation, it may be very advantageous to employ a conferencing system for mixing two or more bit streams of the participants. Situations like these occur both in purely audio-based or teleconferencing situations and in video conferencing situations.
A conferencing system operating in the frequency domain is, for instance, described in US 2008/0097764 A1. It performs the actual mixing in the frequency domain, thereby omitting the retransformation of the incoming audio signals back into the time-domain.
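Why mixing can be carried out in the frequency domain at all may be seen from the linearity of the transform. The sketch below uses a plain, unwindowed MDCT merely as a stand-in for whatever linear transform a given codec employs; it is not taken from the cited document, and the function name `mdct`, the frame length of 16 samples and the two test signals are hypothetical.

```python
import math
from typing import List


def mdct(frame: List[float]) -> List[float]:
    """Plain O(N^2) MDCT: 2N input samples yield N spectral values."""
    n_half = len(frame) // 2
    return [
        sum(
            x * math.cos(math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
            for n, x in enumerate(frame)
        )
        for k in range(n_half)
    ]


# Two hypothetical input frames of 16 samples each.
a = [math.sin(0.3 * n) for n in range(16)]
b = [0.5 * math.cos(1.1 * n) for n in range(16)]

# Mixing by adding samples and then transforming ...
mixed_in_time = mdct([x + y for x, y in zip(a, b)])
# ... gives the same spectrum as adding the two spectra directly.
mixed_in_freq = [x + y for x, y in zip(mdct(a), mdct(b))]

max_diff = max(abs(p - q) for p, q in zip(mixed_in_time, mixed_in_freq))
```

In this sketch, the two results agree up to floating-point rounding, so the spectral values of the incoming streams may be added without returning to the time-domain, provided that these values are available as plain spectral values.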
However, the conferencing system described therein does not take into account the possibilities of tools as described above, which enable a description of spectral information of at least one spectral component in a more condensed manner. As a result, such a conferencing system necessitates additional transformation steps to reconstruct the audio signals provided to the conferencing system at least to such a degree that the respective audio signals are present in the frequency domain. Moreover, the resulting mixed audio signal also needs to be retransformed based on the additional tools mentioned above. These retransformation and transformation steps, however, necessitate the application of complex algorithms, which may lead to an increased computational complexity and, for instance, in the case of portable, energy-critical applications, to an increased energy consumption and, hence, to a limited operational time.