Mobile devices are becoming multi-functional devices on which various applications are used. In particular, today's cellular phones also serve as digital cameras, TV/radio receivers, and music playback devices.
Mixed speech and music content is recorded and played on mobile devices. The content itself is streamed or broadcast to the devices. In mobile applications, highly efficient low-rate coding is in demand for both speech and music content.
The performance of current speech and audio codecs tends to depend on the type of content. State-of-the-art speech and audio codecs are tailored and optimized for either speech or music. Speech and audio codecs have in fact evolved independently of each other in terms of target bit rates and corresponding applications. However, recent applications on mobile devices make the two approaches face the same type of requirements in terms of bit rates and quality.
There have been attempts to standardize codecs that are capable of handling both speech and audio content. One such effort has been conducted in 3GPP with the standardization of AMR-WB+ and E-AAC+. Although the resulting codecs outperform codecs targeted specifically at either speech or music, their quality still tends to depend on the type of audio content. That is, music content is best coded by an audio codec such as E-AAC+, and speech content is best coded by a speech codec such as AMR-WB+.
The MPEG community has also initiated work on unified speech and audio coding (USAC), targeting mainly mobile applications. Such work has resulted in the adoption of a scheme that switches between a time-domain coding paradigm and a frequency-domain coding paradigm, as described in Neuendorf, M.; Gournay, P.; Multrus, M.; Lecomte, J.; Bessette, B.; Geiger, R.; Bayer, S.; Fuchs, G.; Hilpert, J.; Rettelbach, N.; Salami, R.; Schuller, G.; Lefebvre, R.; Grill, B. “Unified speech and audio coding scheme for high quality at low bit rates” ICASSP 2009. IEEE International Conference on Acoustics, Speech and Signal Processing, 2009. 19-24 Apr. 2009. Page(s): 1-4.
Using two fundamentally different coding paradigms in one unified system poses a series of problems at the transition points where one core codec switches over to the other: the risk of blocking artifacts, the possible overhead of information required by transitions, and the need for constant framing. In a framework similar to the Unified Speech and Audio Coder (USAC) as described in Jeremie Lecomte, Philippe Gournay, Ralf Geiger, Bruno Bessette, Max Neuendorf. “Efficient cross-fade windows for transitions between LPC-based and non-LPC based audio coding”, Audio Engineering Society Convention Paper, Presented at the 126th Convention 2009 May 7-10 Munich, Germany, all this is particularly challenging because the frequency domain core codec uses a Modified Discrete Cosine Transform (MDCT). The MDCT allows adjacent blocks to overlap by up to 50% without introducing additional overhead. This is particularly helpful for smoothing blocking artifacts, but it introduces Time-domain Aliasing (TDA), which can be cancelled out during synthesis as described in J. Princen and A. Bradley, “Analysis/Synthesis Filter Bank Design Based on Time-domain Aliasing Cancellation”, IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 34 n. 5, October 1986. Time-domain Aliasing Cancellation (TDAC) is performed by an appropriate overlap-add operation of adjacent MDCT blocks on the synthesis side.
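As an illustration of TDAC by overlap-add, the following minimal sketch runs a toy signal through a windowed MDCT and inverse MDCT with 50% overlap. The sine window, the block size N = 8, the test signal and all function names are assumptions chosen for illustration; they are not taken from USAC or from the cited papers.

```python
import math

def sine_window(N):
    """Length-2N sine window; it satisfies w[n]^2 + w[n+N]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]

def mdct(block, N):
    """Map 2N windowed samples to N coefficients."""
    return [sum(block[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

def imdct(coeffs, N):
    """Map N coefficients back to 2N samples that still contain time-domain aliasing."""
    return [(2.0 / N) * sum(coeffs[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                            for k in range(N))
            for n in range(2 * N)]

def mdct_round_trip(signal, N):
    """Analysis and synthesis with 50% overlap; the overlap-add cancels the aliasing."""
    w = sine_window(N)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * N + 1, N):
        block = [signal[start + n] * w[n] for n in range(2 * N)]   # analysis windowing
        aliased = imdct(mdct(block, N), N)                         # contains TDA
        for n in range(2 * N):
            out[start + n] += aliased[n] * w[n]                    # synthesis windowing + overlap-add
    return out

N = 8
signal = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(5 * N)]
rebuilt = mdct_round_trip(signal, N)

# Samples covered by two overlapping blocks are reconstructed; the first N samples
# are covered by one block only, so their aliasing is left uncancelled.
interior_error = max(abs(signal[n] - rebuilt[n]) for n in range(N, 4 * N))
edge_error = max(abs(signal[n] - rebuilt[n]) for n in range(N))
print(interior_error, edge_error)
```

The uncancelled error at the edge is the situation that arises at a codec switch, where the neighbouring block does not supply the matching aliasing term.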
In USAC, however, adjacent blocks may be coded using the Time-domain (TD) coder, which either has Time-domain Aliasing (TDA) in a weighted LPC domain rather than in the signal domain, or has no TDA at all.
In order to allow proper aliasing cancellation with the Frequency Domain (FD) mode (which introduces aliasing in the signal domain), the required aliasing components are either converted into the signal domain (case a) or introduced artificially by simulating the MDCT operations of analysis windowing, folding, unfolding and synthesis windowing (case b). Another solution to this problem is the design of MDCT analysis/synthesis windows without a TDAC region. The overlap-add operation is then equivalent to a simple cross-fade over the range of the window slope. Both methods are used in USAC RM0. In order to obtain the necessary and appropriate overlap areas for cross-fade and TDAC, a slightly different time alignment between the two coding modes had to be introduced.
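Case b can be sketched in the time domain without computing any transform: for one overlap region, folding followed by unfolding amounts to adding (on the right side of a block) or subtracting (on the left side) the time-reversed windowed segment. The sketch below assumes sine window slopes and an overlap length of 16 samples; the function names are hypothetical and the code is not taken from USAC RM0.

```python
import math

def sine_slopes(N):
    """Rising and falling window slopes of length N (an assumed sine window)."""
    rising = [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(N)]
    return rising, rising[::-1]

def add_artificial_tda(segment, falling):
    """Simulate the right-hand overlap of an MDCT block on a time-domain segment."""
    N = len(segment)
    windowed = [segment[n] * falling[n] for n in range(N)]          # analysis windowing
    aliased = [windowed[n] + windowed[N - 1 - n] for n in range(N)] # folding + unfolding
    return [aliased[n] * falling[n] for n in range(N)]              # synthesis windowing

def fd_left_overlap(segment, rising):
    """Left-hand overlap as it would come out of the following MDCT-coded block."""
    N = len(segment)
    windowed = [segment[n] * rising[n] for n in range(N)]
    aliased = [windowed[n] - windowed[N - 1 - n] for n in range(N)] # mirror term has opposite sign
    return [aliased[n] * rising[n] for n in range(N)]

N = 16
segment = [math.sin(0.4 * n) + 0.25 * n for n in range(N)]
rising, falling = sine_slopes(N)

td_part = add_artificial_tda(segment, falling)   # decoded TD output with simulated aliasing
fd_part = fd_left_overlap(segment, rising)       # aliased output of the FD block
overlap = [t + f for t, f in zip(td_part, fd_part)]

max_error = max(abs(o - s) for o, s in zip(overlap, segment))
print(max_error)
```

Because the artificially introduced mirror term has the opposite sign to the one produced by the FD block, the two cancel in the overlap-add and only the original segment remains.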
According to the USAC scheme, a modified start window without any time aliasing on its right side was designed. The right part of this window, which is represented in FIG. 10, finishes before the centre of the TDA (i.e. the folding point) of the MDCT. Consequently, the modified start window is free of time-domain aliasing on its right side. Compared to the standard short window, which has an overlap of 128 samples (including TDA), the overlap region of the modified start window is reduced to 64 samples. This overlap region is, however, still sufficient to smooth the blocking effect. Furthermore, it reduces the impact of the inaccuracy due to the start of the time-domain coder by feeding it with a faded-in input. Note that this transition requires an overhead of 64 samples, i.e. 64 samples are coded by both the TD codec and the FD codec. This results in a small difference in alignment between the TD and the FD core codecs. This small misalignment is compensated when the codec switches back again to the FD codec, as explained in section 4.4.2 of [2]. Note also that the standard start window with its 128-sample overlap region would have introduced twice as many overhead samples.
One of the most important aspects in speech coding, especially in wireless networks, is to keep a constant bit rate and a constant framing. This is because the radio interfaces have been designed and optimized for legacy speech codecs, which have a constant frame length and a constant bit rate. For instance, an important scheduling mode in the 3GPP Long Term Evolution (LTE) radio access system is the so-called semi-persistent scheduling, which optimizes radio resources under the assumption that VoIP packets have a constant size and a constant frame rate. Dynamic scheduling is also possible; however, it comes at an increased cost in terms of radio resources spent on signalling.
Because of these requirements of constant bit rate and constant frame rate, schemes such as USAC are impractical since switching back and forth between TD and FD coding modes would lead to de-synchronization.
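The bookkeeping behind this de-synchronization can be tallied with a deliberately simplified sketch. It assumes a frame-level model in which every FD-to-TD switch shifts the alignment by the overlap length and every switch back to FD removes the shift, as described for the modified start window; the mode sequence and function names are hypothetical.

```python
def alignment_offsets(modes, overlap=64):
    """Alignment offset in samples between the TD and FD core codecs, per frame."""
    offsets = []
    offset = 0
    previous = modes[0]
    for mode in modes:
        if previous == "FD" and mode == "TD":
            offset = overlap      # overlap samples are coded by both codecs
        elif previous == "TD" and mode == "FD":
            offset = 0            # misalignment compensated on switching back
        offsets.append(offset)
        previous = mode
    return offsets

def total_overhead(modes, overlap=64):
    """Total number of samples coded twice across all FD-to-TD switches."""
    switches = sum(1 for a, b in zip(modes, modes[1:]) if a == "FD" and b == "TD")
    return switches * overlap

modes = ["FD", "FD", "TD", "TD", "FD", "TD", "FD"]   # hypothetical switching pattern
print(alignment_offsets(modes))          # [0, 0, 64, 64, 0, 64, 0]
print(total_overhead(modes, 64))         # 128
print(total_overhead(modes, 128))        # 256
```

The offset is non-zero whenever the TD codec is active, so the frame grid seen by the transport layer is not constant across mode switches.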
Similar problems may in general also occur when switching between two different signal processing modes or codecs, and may also occur in other signal processing areas, e.g. image or video processing or coding.