Teleconferencing is widely used, e.g., as an alternative to meeting in person. It reduces the need to travel to a certain location to attend a meeting, which saves both time and money and is, furthermore, environmentally friendly. However, a high perceived sound quality is important for teleconferencing to be a satisfactory alternative to a meeting in person.
Below, an example of a teleconferencing system according to the prior art is described with reference to FIGS. 1 and 2. FIG. 1 shows a schematic view of a multi-party teleconferencing system with N users 102:1-N, here represented by UEs (User Equipments), N channels 106:1-N, and a conferencing bridge in the form of a Multipoint Control Unit (MCU) 104. Each UE has a microphone, a loudspeaker, and signal processing capabilities for, e.g., signal capture, coding and transmission, signal reception, decoding and playback. The UEs 102:1-N send speech or audio signals, recorded and encoded at their respective ends, to the MCU 104, which decodes the signals from all channels into a PCM (Pulse Code Modulation) representation. The PCM signals are then digitally mixed, re-encoded and finally transmitted to the connected UEs.
This principle is further illustrated in FIG. 2, which shows the specific signal processing flow for an exemplary UE “K”, 202:K. The UE 202:K comprises an encoder 206:K, for encoding a signal, received by a microphone 216:K and typically subject to some signal processing, to be sent to an MCU 204. In the MCU 204, the encoded signal from UE 202:K is decoded using a decoder 208:K. The MCU 204 comprises a set of decoders, 208:1-N, for decoding the respective signals arriving from the different parties taking part in a teleconference. The decoded signals, which are in PCM-representation, are then mixed, e.g. added together, in a Mixer 210. Then, the mixed signal, which is to be provided to the participant UE K, is encoded in an encoder 212:K, and is, when received by UE 202:K, decoded using a decoder 214:K.
For certain reasons, e.g. to reduce the background noise level of the transmitted signal, some implementations of multi-party bridges only mix the incoming signals from a fixed-size subset of the parties, e.g. 3 or 4. The subset of parties is typically selected on the basis of signal level and speaker activity of the different parties, where the signals of the most recently active speakers are retained in the subset if no speaker activity is present from any other party. Another possible modification to the basic operation illustrated in FIG. 2 is that the signal coming from party K may be excluded from the sum of signals transmitted back to party K. The reason for this is that, since there is a significant transmission delay present in the system, the microphone signal, transmitted to the MCU and back, would be perceived as an undesirable echo when emitted from the loudspeaker 218:K. Instead, the microphone signal from party K is typically presented in the loudspeaker 218:K of UE K as the so-called side-tone, which is generated locally in the UE.
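The mixing variants described above may be illustrated by the following minimal Python sketch. It is not taken from any particular bridge implementation: the function names, the per-party level measure, the default subset size, and the simple "loudest parties" selection rule are assumptions made for illustration only (an actual bridge would typically also take speaker activity history into account). The sketch sums the decoded PCM frames of the selected subset while leaving out the signal of the receiving party K.

```python
from typing import Dict, List


def select_active(levels: Dict[int, float], max_mixed: int = 3) -> List[int]:
    """Return the IDs of the max_mixed loudest parties.

    Hypothetical selection rule: rank by signal level only,
    breaking ties in favor of the lower party ID.
    """
    ranked = sorted(levels, key=lambda party: (-levels[party], party))
    return sorted(ranked[:max_mixed])


def mix_for_party(frames: Dict[int, List[int]], active: List[int], k: int) -> List[int]:
    """Sum the PCM frames of the active parties, excluding party k's own signal."""
    length = len(next(iter(frames.values())))
    mixed = [0] * length
    for party in active:
        if party == k:
            continue  # party k hears itself only via the local side-tone
        for i, sample in enumerate(frames[party]):
            mixed[i] += sample
    # Clip to the 16-bit PCM range (assumed sample format).
    return [max(-32768, min(32767, sample)) for sample in mixed]
```

With four parties and a subset size of three, the quietest party is left out of every mix, and each party in the subset receives the sum of the other two.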
There are certain types of speech codecs that allow mixing of the signals received from the different channels in the coded speech domain or the speech codec parameter domain. For this class of codecs the decoders 208 and encoders 212 can be omitted or at least reduced to mappings between coded speech and speech codec parameter domains.
Scalable Codecs
Scalable, or embedded, coding is a coding paradigm in which the coding of signals is done in layers. A block diagram illustrating the basic principle of scalable codecs is shown in FIG. 3. In a base, or core, layer 306, the signal is encoded at a low bit rate, while additional layers 308, each on top of the previous layer, provide some enhancement relative to the coding which is achieved in all layers from the core up to the respective previous layer. Each layer adds some additional bit rate. The generated bit stream is embedded, meaning that the bit stream of lower-layer encoding is embedded into the bit streams of higher layers. This property makes it possible, anywhere in the transmission or in a receiver, to drop the bits belonging to one or more higher layers. Such a “stripped” bit stream can still be decoded up to the highest layer whose bits are retained. Therefore, scalable coding is suitable for use in bandwidth-limited services involving multiple parties with different requirements, such as teleconferencing, and especially over wireless links of limited and/or potentially varying bandwidth.
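The stripping property may be illustrated by the following minimal Python sketch. It assumes, for illustration only, that one coded frame is available as an ordered list of per-layer byte strings with the core layer first; the function names, the layer sizes and the 20 ms frame length are hypothetical and do not correspond to any specific codec.

```python
from typing import List


def strip_layers(layers: List[bytes], keep: int) -> List[bytes]:
    """Retain the core layer and the first keep - 1 enhancement layers."""
    if keep < 1:
        raise ValueError("the core layer must be retained")
    return layers[:keep]


def bit_rate_kbps(layers: List[bytes], frame_ms: float = 20.0) -> float:
    """Bit rate of one frame in kbit/s (bits per millisecond)."""
    total_bits = 8 * sum(len(layer) for layer in layers)
    return total_bits / frame_ms


# Assumed frame: a 20-byte core layer and two 10-byte enhancement layers.
frame = [bytes(20), bytes(10), bytes(10)]
full_rate = bit_rate_kbps(frame)                   # all three layers
core_rate = bit_rate_kbps(strip_layers(frame, 1))  # core layer only
```

Because higher layers are only dropped, never rewritten, a network node can reduce the bit rate in this way without decoding the signal.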
One example of using scalable speech codecs in multi-party conferencing systems is described in [7]. According to said publication, it is foreseen to use the scalable wideband extended codec according to ITU-T (International Telecommunication Union-Telecommunication Standardization Sector) recommendation G.711.1 [8] for a low complexity partial narrowband (NB) mixing and a selective switching of the wideband extension signal from the dominant channel. This principle is illustrated in FIG. 4. Here, the coded signal from each channel, or location, comprises the NB core layer, denoted “primary” in the figure, and the wideband (WB) enhancement layer, denoted “secondary” in the figure. The MCU carries out conventional mixing, i.e. addition, of only the G.711 NB core layer signals, while the enhancement layer of only the most active location is retained. The advantage of this concept lies in the low complexity required to mix G.711 core layer encoded signals, since they are in PCM format, and to switch through the wideband enhancement layer of the active channel, to avoid decoding and re-encoding of that layer. However, this solution is only beneficial in implementations where the mixing of the core layer is performed in the coded speech domain or speech codec parameter domain, or when using G.711, where the coding is PCM.
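The partial mixing principle of FIG. 4 may be sketched in Python as follows. This is an illustrative sketch only, not the method of [7]: the `Frame` structure, the `activity` measure and the function name are assumptions, and the core-layer samples are assumed to have already been expanded to linear PCM. The core layers of all channels are added, while the enhancement-layer bits of the most active channel are forwarded untouched.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Frame:
    core_pcm: List[int]  # NB core layer samples, assumed to be linear PCM
    enhancement: bytes   # WB enhancement layer bits, treated as opaque
    activity: float      # hypothetical per-frame activity measure


def partial_mix(frames: Dict[int, Frame]) -> Tuple[List[int], bytes, int]:
    """Mix all core layers and switch through one enhancement layer.

    Returns the mixed core samples, the enhancement bits of the most
    active channel, and the ID of that channel.
    """
    # Highest activity wins; ties go to the lower channel ID.
    dominant = max(frames, key=lambda ch: (frames[ch].activity, -ch))
    mixed = [0] * len(frames[dominant].core_pcm)
    for frame in frames.values():
        for i, sample in enumerate(frame.core_pcm):
            mixed[i] += sample
    # The enhancement layer is neither decoded nor re-encoded.
    return mixed, frames[dominant].enhancement, dominant
```

The enhancement layer is copied rather than processed, which is where the complexity saving of this scheme comes from.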
Problems with Existing Solutions
Typically, teleconference systems involving mixing of a plurality of channels require decoding of the signals of the various incoming channels to make them available in the PCM domain, in which they can be mixed. The mixed PCM signal is then re-encoded such that it is suitable for transmission to a receiving terminal K. This means that there are at least two speech codecs in tandem configuration: The first codec is operated with encoding at the sending parties, A through K, and with decoding in the MCU; the second codec is operated with encoding of the mixed PCM signal in the MCU and decoding of that signal at the receiving terminal K.
One problem associated with this kind of processing is a quality degradation that arises from the tandem configuration of codecs. Each stage of decoding and re-encoding increases the coding distortions in the finally decoded output signal.
A further quality problem arises from the fact that speech codecs are typically designed to work well with a single speech signal, since the speech codecs are built upon a speech production model that mimics the human vocal tract. When a mixed signal to be encoded comprises speech from a plurality of speakers talking simultaneously, or the active speaker signal together with a significant amount of background noise from the other channels, both of which are typical situations in teleconferencing, this speech production model no longer applies. Consequently, the quality of the decoded mixed signal at terminal K may be poor due to significant coding distortions.
Other Techniques Avoiding Degradations Due to Codec Tandeming
There are examples of speech/audio codecs that allow the mixing operation to be performed in the coded domain. Hence, referring to FIG. 2, the decoder and encoder blocks in the MCU are essentially not required in such a case. Examples of codecs allowing mixing in the coded domain are frequency domain codecs such as MPEG-4 AAC (Moving Picture Experts Group Advanced Audio Coding) [5] and also MPEG SAOC (Spatial Audio Object Coding) [6], presently under standardization. However, as these codecs are not based on a speech production model, they are less suitable for teleconferencing in many communication systems, and especially in mobile communication systems, which require very bit rate efficient operation in order to save limited transmission capacity.
Further, a compressed domain conference bridge is described in [9], where the incoming signals of one or two of the most active channels are re-encoded through a compressed domain transcoder. Whether one or two simultaneous channels are encoded in the bridge depends on the capability of the codec supported by the receiving terminal. This kind of bridge avoids tandem coding artifacts to some extent by performing the transcoding in the speech codec parameter domain rather than in the decoded speech (PCM) domain, and through the use of a special speech codec that is designed to cope with two simultaneous speaker signals. However, as with the codecs described in [5] and [6], the constraint of having a codec for teleconferencing use that allows transcoding or mixing of the signals from the conference participants in the codec parameter domain is a severe limitation and is generally prohibitive for achieving high coding efficiency. It is hence undesirable to use specially designed codecs for multi-party conference use, since the bit rate required by such codecs is typically much higher than that of highly efficient state-of-the-art codecs, which often follow the analysis-by-synthesis principle with an assumed speech production model.