Speech codecs that encode a monaural speech signal are currently the norm. Such monaural codecs are commonly used in communication equipment, such as mobile phones and teleconferencing equipment, where the signal usually comes from a single source, for example, human speech.
In the past, such monaural signals were used because of limitations on the transmission bandwidth and on the processing speed of DSPs. However, as technology progresses and bandwidth improves, this constraint is slowly becoming less important, while speech quality is becoming a more important factor to be considered. One drawback of monaural speech is that it does not provide spatial information such as sound imaging or the positions of the speakers. Therefore, a factor to be considered is achieving good stereo speech quality at the lowest possible bit rate so as to realize better sound.
One method of encoding a stereo speech signal uses a signal prediction or estimation technique. That is, one channel is encoded using a known audio coding technique, and the other channel is predicted or estimated from the encoded channel using side information that is analyzed and extracted from that other channel.
Such a method can be found in Patent Document 1 as part of the binaural cue coding system (for example, see Non-Patent Document 1), where it is applied to the computation of the inter-channel level difference (ILD) for the purpose of adjusting the level of one channel with respect to a reference channel.
Frequently, the predicted or estimated signal is not as accurate as the original signal. Therefore, the predicted or estimated signal needs to be enhanced so that it is as similar to the original as possible.
Audio signals and speech signals are commonly processed in the frequency domain. This frequency domain data is generally referred to as the “spectral coefficients in the transformed domain.” Therefore, such prediction and estimation can be performed in the frequency domain. For example, the left and right channel spectrum data can be estimated by extracting some of the side information and applying the result to the monaural channel (see Patent Document 1). Other variations include estimating one channel from the other channel, for example, estimating the left channel from the right channel.
One area in audio and speech processing where such enhancement is applied is spectrum energy estimation, which can also be referred to as “spectrum energy prediction” or “scaling.” In a typical spectrum energy estimation computation, the time domain signal is transformed to a frequency domain signal. This frequency domain signal is usually partitioned into frequency bands according to critical bands. This is done for both channels, that is, the reference channel and the channel which is to be estimated. For each frequency band of both channels, the energy is computed, and scale factors are calculated using the energy ratios between the two channels. These scale factors are transmitted to the receiving apparatus, where the reference signal is scaled using these scale factors to retrieve the estimated signal in the transformed domain for each frequency band. Then, an inverse frequency transform is applied to obtain the equivalent time domain signal of the estimated transformed domain spectrum data.
Patent Document 1: International Publication No. 03/090208 pamphlet
Non-Patent Document 1: C. Faller and F. Baumgarte, “Binaural cue coding: A novel and efficient representation of spatial audio”, Proc. ICASSP, Orlando, Fla., October 2002.
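The typical spectrum energy estimation computation described above can be illustrated with a minimal Python sketch. It is not taken from Patent Document 1 or Non-Patent Document 1. The function names, the naive DFT used as the frequency transform, the band edges, and the choice of the square root of the per-band energy ratio as the scale factor are all assumptions made for illustration; a practical codec would use a fast transform and band edges derived from critical bands.

```python
import cmath
import math


def rdft(x):
    """Naive DFT of a real signal; returns bins 0..n/2 (n assumed even)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]


def irdft(half, n):
    """Inverse of rdft: rebuild the full spectrum by conjugate symmetry."""
    full = list(half) + [half[n - k].conjugate() for k in range(n // 2 + 1, n)]
    return [sum(full[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def band_energies(spectrum, band_edges):
    """Energy (sum of squared magnitudes) in each band [lo, hi)."""
    return [sum(abs(c) ** 2 for c in spectrum[lo:hi])
            for lo, hi in zip(band_edges[:-1], band_edges[1:])]


def scale_factors(ref_spectrum, target_spectrum, band_edges, eps=1e-12):
    """Encoder side: one scale factor per band from the energy ratio.

    Assumption: the factor is sqrt(E_target / E_reference), so that the
    scaled reference band has the same energy as the target band.
    """
    e_ref = band_energies(ref_spectrum, band_edges)
    e_tgt = band_energies(target_spectrum, band_edges)
    return [math.sqrt(t / (r + eps)) for r, t in zip(e_ref, e_tgt)]


def apply_scale_factors(ref_spectrum, factors, band_edges):
    """Decoder side: scale each band of the reference spectrum."""
    out = list(ref_spectrum)
    for g, lo, hi in zip(factors, band_edges[:-1], band_edges[1:]):
        for k in range(lo, hi):
            out[k] = g * ref_spectrum[k]
    return out


if __name__ == "__main__":
    n = 32
    # Hypothetical band edges over the n/2 + 1 = 17 half-spectrum bins.
    edges = [0, 2, 4, 8, 17]
    reference = [math.sin(0.3 * t) + 0.5 * math.cos(1.1 * t) for t in range(n)]
    target = [0.5 * v for v in reference]  # channel to be estimated

    # Encoder: transform both channels and compute per-band scale factors.
    factors = scale_factors(rdft(reference), rdft(target), edges)

    # Decoder: scale the reference spectrum, then inverse-transform.
    estimated = irdft(apply_scale_factors(rdft(reference), factors, edges), n)
    print(factors)
    print(max(abs(a - b) for a, b in zip(estimated, target)))
```

In this sketch the target channel is an exact scaled copy of the reference, so the scaled reference matches it closely. For real stereo signals the two channels differ in more than per-band level, so the estimate only matches the band energies, which is why the estimated signal generally needs further enhancement.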