Efficient encoding of audio signals is critical for an increasing number of applications and systems. For example, mobile communications use efficient speech encoders to reduce the amount of data that needs to be transmitted over the air interface.
The International Telecommunication Union (ITU), for instance, is standardizing a speech encoder known as the Embedded Variable Bit Rate Codec (EV-VBR), which can encode a speech signal at high quality with data rates ranging from 8 to 64 kbps. This encoder, like many other efficient speech encoders, uses Code Excited Linear Prediction (CELP) techniques to achieve a high compression ratio at its lower bit rates of operation.
In some applications, more than one audio signal may be captured and in particular a stereo signal may be recorded in audio systems using two microphones. For example, stereo recording may typically be used in audio and video conferencing as well as broadcasting applications.
In many multi-channel encoding systems, and in particular in many multi-channel speech encoding systems, the low level encoding is based on encoding of a single channel. In such systems, the multi-channel signal may be converted to a mono signal for the lower layers of the coder to encode. The generation of this mono signal is referred to as down-mixing. Such down-mixing may be associated with parameters that describe aspects of the stereo signal relative to the mono signal. Specifically, the down-mixing may generate inter-channel time difference (ITD) information which characterises the timing difference between the left and right channels. For example, if the two microphones are located at a distance from each other, the signal from a speaker located closer to one microphone than to the other will reach the farther microphone with a delay relative to the nearer one. This ITD may be determined and may be used in the decoder to recreate the stereo signal from the mono signal. The ITD may significantly improve the quality of the recreated stereo perspective, since ITD has been found to be the dominant perceptual influence on stereo location for frequencies below approximately 1 kHz. Thus, it is critical that ITD is also estimated.
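As an illustration of how such a timing difference might be measured, the following minimal sketch picks the lag that maximises the cross-correlation between the two channels. This is only one common approach and is not taken from any of the cited codecs; the function name `estimate_itd`, the parameter `max_lag` and the toy signals are assumptions made for the example.

```python
def estimate_itd(left, right, max_lag):
    """Return the lag in samples that maximises the cross-correlation.

    A positive result means the right channel lags the left channel.
    Hypothetical helper for illustration only.
    """
    n = len(left)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        # Keep the first lag that gives a strictly larger correlation.
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy input: the right channel is the left channel delayed by 3 samples.
left = [0.0, 1.0, 0.5, -0.3, 0.0, 0.0, 0.0, 0.0]
right = [0.0] * 3 + left[:-3]
itd_samples = estimate_itd(left, right, max_lag=4)
```

Dividing the resulting lag by the sampling rate would give the ITD in seconds.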
Conventionally, the mono signal is generated by summing the stereo signals together. The mono signal is then encoded and transmitted to the decoder together with the ITD.
For example, the European Telecommunication Standards Institute has in their Technical Specification ETSI TS 126 290 “Extended Adaptive Multi-Rate - Wideband (AMR-WB+) Codec; Transcoding Functions” defined a stereo signal down-mixing where the mono signal is simply determined as the average of the left and right channels as follows:

$$x_{ML}(n) = 0.5\,\bigl(x_{LL}(n) + x_{RL}(n)\bigr)$$

where $x_{ML}(n)$ represents the nth sample of the mono signal, $x_{LL}(n)$ represents the nth sample of the left channel signal and $x_{RL}(n)$ represents the nth sample of the right channel signal.
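The averaging down-mix described above can be sketched in a few lines. The function name `downmix_average` and the sample values are illustrative assumptions and are not part of the ETSI specification.

```python
def downmix_average(left, right):
    """Sample-wise average of the left and right channels (hypothetical helper)."""
    if len(left) != len(right):
        raise ValueError("channels must have the same number of samples")
    return [0.5 * (l + r) for l, r in zip(left, right)]


# Toy three-sample stereo frame.
mono = downmix_average([1.0, 0.0, -1.0], [0.0, 0.5, -1.0])
```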
Another example of a downmix is provided in H. Purnhagen, “Low Complexity Parametric Stereo Coding in MPEG-4”, Proceedings 7th International Conference on Digital Audio Effects (DAFx '04), Naples, Italy, Oct. 5-8, 2004, pp. 163-168. In this document, a down-mixing method is described which obtains an output mono signal as a weighted sum of the incoming channels on a band-by-band frequency basis using information obtained about the inter-channel intensity difference (IID). Specifically:

$$M[k,i] = g_l\,L[k,i] + g_r\,R[k,i]$$

where $M[k,i]$ represents the ith sample of the kth frequency bin of the mono signal, $L[k,i]$ represents the ith sample of the kth frequency bin of the left channel signal, $R[k,i]$ represents the ith sample of the kth frequency bin of the right channel signal, $g_l$ is the left channel weight and $g_r$ is the right channel weight.
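A minimal sketch of such a weighted down-mix is given below. It assumes that the frequency-domain samples are already available as nested lists indexed as `[k][i]` and that one weight pair is supplied per frequency bin. How the weights are derived from the IID is not stated here, so they are simply passed in; the function name and the values are hypothetical.

```python
def downmix_weighted(L, R, g_l, g_r):
    """Weighted sum M[k][i] = g_l[k] * L[k][i] + g_r[k] * R[k][i].

    L, R: lists of frequency bins, each a list of samples (real or complex).
    g_l, g_r: per-bin weights (an assumption made for this sketch).
    """
    return [
        [g_l[k] * L[k][i] + g_r[k] * R[k][i] for i in range(len(L[k]))]
        for k in range(len(L))
    ]


# Toy input: two frequency bins with two samples each.
L = [[1.0, 2.0], [0.0, 4.0]]
R = [[3.0, 2.0], [2.0, 0.0]]
M = downmix_weighted(L, R, g_l=[0.5, 0.25], g_r=[0.5, 0.75])
```

The frequency analysis that produces `L` and `R`, and the synthesis that turns `M` back into a time signal, are omitted from the sketch.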
A characteristic of such approaches is that they either result in mono signals having a high reverberation time or else have high complexity and/or delay. For example, the AMR-WB+ method of down-mixing provides an output whose reverberation time is approximately that of the room plus the flight time between the two microphones. The downmix provided in Purnhagen is of high complexity and imposes a delay due to the frequency analysis and reconstruction.
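The lengthening effect of a plain average can be illustrated with a toy calculation. The sketch assumes, as an idealisation, that both microphones see the same room response and that the second microphone receives it delayed by the inter-microphone flight time expressed in samples. The function name and the numbers are hypothetical.

```python
def averaged_response(room_response, delay_samples):
    """Average a response with a copy of itself delayed by delay_samples."""
    left = room_response + [0.0] * delay_samples
    right = [0.0] * delay_samples + room_response
    return [0.5 * (l + r) for l, r in zip(left, right)]


# Toy three-tap room response and a two-sample inter-microphone delay.
room = [1.0, 0.5, 0.25]
mono_response = averaged_response(room, delay_samples=2)
```

Under these assumptions the averaged response spans the length of the room response plus the delay, which matches the behaviour described for the averaging down-mix.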
However, many mono encoders provide the best results for signals with low reverberation times. For example, low bit rate CELP speech coders, and other encoders which employ pulse-based excitation to represent speech and audio signals, perform best when presented with signals having short reverberation times. Accordingly, when a mono signal generated by such down-mixing is encoded, the performance of the encoder and the quality of the resulting encoded signal tend to be suboptimal.
Hence, an improved system would be advantageous and in particular a system allowing increased flexibility, facilitated implementation, improved encoding quality, improved encoding efficiency, reduced delay and/or improved performance would be advantageous.