1. Field of the Invention
Embodiments of the present invention relate, in general, to the field of surround sound recording and compression for transmission or storage purposes and particularly to those recording and compression devices involving low power.
2. Relevant Background
Surround sound recording typically requires a complex multi-microphone setup with large inter-microphone spacing. However, there are scenarios in which such a complex setup is not possible. As an example, a video recorder with surround sound recording capability can be integrated as a feature in mobile phones. Obviously, the surround microphone array has to be very compact due to the limited mounting area. One means to integrate surround microphone recording in a limited mounting area is by using coincident microphone techniques. Such techniques utilize the psychoacoustic principle of Inter-aural Level Differences (“ILD”) to record and recreate the audio scene during surround sound playback. Coincident microphones require a minimum of three first-order directional microphones arranged so that the polar patterns of these microphones coincide on a horizontal plane. Some of the popular microphone setups for coincident surround recording are:
1. Double Mid/Side (“DMS”) array which consists of front-facing cardioid (mid-front), side-facing bidirectional (side) and rear-facing cardioid (mid-rear) microphones,
2. FLRB array which consists of front (F), left (L), right (R), and rear (B) facing cardioid microphones, and
3. B-format microphone array which consists of three or four microphones and additional signal processing to produce coincident B-format signals with omnidirectional (W), front-facing bidirectional (X) and side-facing bidirectional (Y) responses required for horizontal surround sound production.
FIGS. 1(a) and (b) show the polar patterns of DMS and B-format microphone array signals, respectively, as known in the prior art. Each microphone produces directional signals that, when weighted, can be combined to form a virtual microphone signal. By properly designing the weighting factors, an unlimited number of virtual microphone signals can be derived having first-order directivity pointing in any direction around the horizontal plane. Surround sound is obtained by deriving one virtual microphone signal for each surround sound channel. In this context, the weighting factors to derive each surround audio channel's signal are designed such that the resulting virtual microphone points in the direction which corresponds to the location of the speaker in the surround playback configuration. This set of weighting factors will be referred to herein as channel coefficients. For example, a surround channel Ci can be derived from the B-format signals and its channel coefficients (αi, βi, γi) according to the equation Ci = αiW + βiX + γiY.
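As an illustration of how channel coefficients for a single virtual microphone might be chosen, the following minimal sketch assumes a first-order polar response of the form p + (1 − p)·cos(angle), where p is the omnidirectional fraction (0.5 for a cardioid). It also assumes idealized B-format signals in which W has unit gain (no 1/√2 scaling), X = cos(azimuth), Y = sin(azimuth), and positive azimuths lie to the left. The function names and these conventions are assumptions made for illustration and are not taken from the prior art described above.

```python
import math


def channel_coefficients(azimuth_deg, pattern):
    """Return (alpha, beta, gamma) for one first-order virtual microphone.

    pattern is the omnidirectional fraction p of the assumed polar response
    p + (1 - p) * cos(angle): 1.0 is omni, 0.5 is cardioid, 0.0 is bidirectional.
    Assumes W has unit gain, X = cos(azimuth) and Y = sin(azimuth).
    """
    theta = math.radians(azimuth_deg)
    alpha = pattern                              # weight on W (omnidirectional)
    beta = (1.0 - pattern) * math.cos(theta)     # weight on X (front-facing)
    gamma = (1.0 - pattern) * math.sin(theta)    # weight on Y (side-facing)
    return alpha, beta, gamma


def virtual_mic_gain(coeffs, source_deg):
    """Gain of the virtual microphone for a plane wave arriving from source_deg."""
    alpha, beta, gamma = coeffs
    phi = math.radians(source_deg)
    w, x, y = 1.0, math.cos(phi), math.sin(phi)  # idealized B-format pickup
    return alpha * w + beta * x + gamma * y      # Ci = alpha*W + beta*X + gamma*Y
```

Under these assumptions, a cardioid aimed at 110° (`channel_coefficients(110, 0.5)`) has unit gain for a source at 110° and a null for a source at 290°, the opposite direction.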
FIG. 2 shows the typical virtual-microphone polar pattern for a standard International Telecommunication Union (ITU) 5.0 surround sound signal as known in the prior art. In this example, the channel coefficients have been designed such that the virtual microphones for the center (C) 210, left-front (L) 220 and right-front (R) 230 surround channels possess supercardioid directivity and point to 0° and ±30°, respectively, while the virtual microphones for the left-surround (Ls) 240 and right-surround (Rs) 250 surround channels possess cardioid directivity and point to ±110°, respectively.
In practice, the coincident-to-virtual microphone processing is implemented as a hardware matrix which attenuates and combines the microphone array signals according to a channel-coefficients matrix. The resulting signals thereafter are stored for distribution or playback. Due to the multi-channel signal representation, a significant amount of memory space and transmission bandwidth is required. This requirement scales up linearly with the number of surround sound channels. To achieve efficient storage and transmission, signal compression needs to be employed. State-of-the-art perceptual or hybrid audio compression schemes such as Moving Picture Experts Group (“MPEG”)-1 Layer 3 and Advanced Audio Coding (“AAC”) compress monaural or stereo audio signals very efficiently. However, for multi-channel signals, the required data rate scales up with the number of surround sound channels, making efficient compression challenging.
Recently, MPEG Surround (“MPS”) has been standardized as a multi-channel audio compression scheme which represents surround sound by a set of downmix signals (with a lower number of channels than the surround sound, e.g., a monaural or stereo downmix) and low-overhead spatial parameters that describe its spatial properties. A decoder is able to reconstruct the original surround sound channels from the downmix signals and transmitted spatial parameters. When combined with perceptual audio coders to compress the monaural or stereo downmix signals, MPS enables an efficient representation of surround sound that is compatible with the existing mono or stereo infrastructure. A generic MPS multi-channel audio encoding structure, as known in the prior art, is shown in FIG. 3.
Time/Frequency (“T/F”) analysis 310 consists of exponentially modulated Quadrature Mirror Filterbank (“QMF”) filtering followed by low-frequency filtering to increase the frequency resolution for the lower subbands. Together, this filtering scheme is referred to as hybrid analysis filtering. The filtering is performed on each surround sound channel to convert the time-domain audio signals into subband-domain signal representations. The multi-channel subband signals are then passed to a spatial encoding stage 320 that calculates the spatial parameters 340 and downmixes the signals into a lower number of audio signals. The output-downmix signals are synthesized back into the time domain 330 and can be further compressed using any audio compression scheme, as known to one skilled in the relevant art. Spatial parameters 340 are quantized and formatted 350 according to the spatial audio syntax and typically appended to the downmix-audio bitstream. Optionally, a set of residual signals can be derived and coded according to AAC low-complexity syntax. These coded signals then can be transmitted in the spatial parameter bitstream to enable full waveform reconstruction at the decoder side.
The spatial encoding stage 320 is realized as a tree structure, which comprises a series of Two-to-One (TTO) and Three-to-Two (TTT) encoding blocks. Representative depictions of a typical TTO and TTT encoding scheme as known to one skilled in the relevant art are shown in FIGS. 4a and 4b. A TTO encoding block 430 takes a subband-domain signal pair 450 as input, calculates the signal energy and cross-correlation, and groups these values into several parameter bands with non-linear frequency bandwidth. At each parameter band, spatial parameters 460 and downmix scalefactors are calculated. The subband-domain signal pair is thereafter mixed to derive the monaural 465 and residual signals 460. The monaural (summed) signal is subsequently scaled by the downmix scalefactor, which is required to ensure overall energy preservation in the downmix signal. The residual (subtracted) signal 460 is either discarded or coded for transmission in the spatial parameter bitstream. TTT performs similar operations but with three input signals and stereo output-downmix signals. As shown, a TTT encoding block 440 produces a stereo downmix from a left, center and right signal combination.
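The operations of a TTO encoding block on a single parameter band can be sketched as follows. This is a simplified illustration rather than the normative MPS procedure. The function name `tto_encode`, the choice of a level difference in decibels and a normalized correlation as the spatial parameters, the unnormalized difference used as the residual, and the particular scalefactor formula (chosen so that the downmix energy equals the sum of the input energies) are all assumptions made for the example.

```python
import math


def tto_encode(ch1, ch2, eps=1e-12):
    """Sketch of a Two-to-One block for one parameter band.

    ch1 and ch2 are equal-length lists of complex subband samples.
    Returns (mono, residual, params). Simplified and non-normative.
    """
    # Signal energies and cross-correlation over the parameter band.
    e1 = sum(abs(s) ** 2 for s in ch1)
    e2 = sum(abs(s) ** 2 for s in ch2)
    xcorr = sum(a * b.conjugate() for a, b in zip(ch1, ch2))

    # Assumed spatial parameters: level difference (dB) and normalized correlation.
    level_diff_db = 10.0 * math.log10((e1 + eps) / (e2 + eps))
    correlation = xcorr.real / math.sqrt((e1 + eps) * (e2 + eps))

    # Mix the pair into a summed signal and a subtracted (residual) signal.
    summed = [a + b for a, b in zip(ch1, ch2)]
    residual = [a - b for a, b in zip(ch1, ch2)]

    # Assumed downmix scalefactor: makes the mono energy equal e1 + e2.
    e_sum = sum(abs(s) ** 2 for s in summed)
    scale = math.sqrt((e1 + e2) / (e_sum + eps))
    mono = [scale * s for s in summed]

    params = {"level_diff_db": level_diff_db,
              "correlation": correlation,
              "scale": scale}
    return mono, residual, params
```

For two identical inputs the sketch yields a level difference of 0 dB, a correlation of approximately 1, and an all-zero residual, which is the case in which discarding the residual loses nothing.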
In the stereo-based encoding mode, the MPS coding scheme provides the possibility to transmit matrix-compatible or 3D-stereo downmixes 470 instead of the standard stereo downmix. The transmission of a matrix-compatible stereo downmix provides backward compatibility with legacy matrixed surround decoders, while a 3D-stereo downmix provides the advantage of binaural listening for existing stereo playback systems. In generic encoding schemes, these downmixes are created by applying a 2×2 post-processing matrix that modifies the energy and phase of the standard stereo downmix signal. Upon receiving these downmixes, a standard MPS decoder is able to revert to the standard stereo downmixes by applying the inverse of the post-processing matrix.
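The relationship between the 2×2 post-processing matrix and its inverse can be illustrated with a short sketch. The helper names and the example matrix values used in the note that follows are hypothetical; an actual encoder would derive the matrix entries from the spatial parameters, which is not shown here.

```python
def apply_2x2(matrix, left, right):
    """Apply a 2x2 (possibly complex) matrix to one stereo subband sample pair."""
    (a, b), (c, d) = matrix
    return a * left + b * right, c * left + d * right


def invert_2x2(matrix, tol=1e-12):
    """Return the inverse of a 2x2 matrix, as a decoder would need it."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if abs(det) < tol:
        raise ValueError("post-processing matrix is singular and cannot be inverted")
    return ((d / det, -b / det), (-c / det, a / det))
```

With a hypothetical matrix such as `((1.0, 0.3j), (-0.3j, 1.0))`, applying `apply_2x2` with the matrix and then again with `invert_2x2` of the matrix returns the original stereo sample pair, up to floating-point rounding. The inversion step only works when the matrix is non-singular, which is why the sketch checks the determinant.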
Due to the structure of the encoder, the memory and computational requirements of an MPS encoder are highly dependent on the number of surround audio channels. The computational requirement is magnified by the subband samples having a complex-number representation. MPS hybrid analysis filtering is a computationally intensive scheme and it has to be performed on each of the surround audio channels. This implies that the memory and computational requirements of the encoder scale up linearly with the number of surround audio channels. Furthermore, in the spatial encoding stage, the energy and cross-correlation calculation and subband signal downmixing consume substantial computational power as they have to be performed at each encoding block. As the number of surround sound channels is increased, more TTO and/or TTT blocks are required to encode the extra channels, which increases the overall computational requirement of the encoder. Such dependency is highly inefficient for the encoding of coincident surround sound recording and might become a bottleneck in applications with limited processing power.
In a coincident surround sound recording scheme, the number of the required microphone array signals is less than the number of the derived virtual microphone signals. Furthermore, the same microphone array signals can be used to derive different surround audio signals for different playback configurations simply by changing the size and coefficients of the channel-coefficients matrix. For example, a 5.0 and a 7.0 surround sound signal can be derived from B-format signals by designing the corresponding 3-to-5 and 3-to-7 channel-coefficients matrices, respectively. It can be seen, therefore, that the required number of coincident microphone signals is independent of the number of surround channels; yet encoding and compression of these channels remains a challenge.
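To make the independence between the number of microphone signals and the number of surround channels concrete, the sketch below builds a 3-to-5 and a 3-to-7 channel-coefficients matrix and applies either one to the same three B-format signals. The 5.0 angles and directivities follow the example of FIG. 2. The 7.0 angles, the value 0.37 used as the supercardioid omnidirectional fraction, the unit-gain W convention, and all function and constant names are assumptions made for illustration.

```python
import math

SUPERCARDIOID = 0.37  # assumed omnidirectional fraction, approximate
CARDIOID = 0.5

# (azimuth in degrees, pattern) per channel; positive azimuth is to the left.
LAYOUT_5_0 = [(0, SUPERCARDIOID), (30, SUPERCARDIOID), (-30, SUPERCARDIOID),
              (110, CARDIOID), (-110, CARDIOID)]
# Hypothetical 7.0 layout, chosen only for illustration.
LAYOUT_7_0 = [(0, SUPERCARDIOID), (30, SUPERCARDIOID), (-30, SUPERCARDIOID),
              (90, CARDIOID), (-90, CARDIOID), (150, CARDIOID), (-150, CARDIOID)]


def coefficient_matrix(layout):
    """Build an N-by-3 matrix of (alpha, beta, gamma) rows, one per channel."""
    rows = []
    for azimuth_deg, p in layout:
        theta = math.radians(azimuth_deg)
        rows.append((p, (1.0 - p) * math.cos(theta), (1.0 - p) * math.sin(theta)))
    return rows


def derive_channels(matrix, w, x, y):
    """Apply the matrix to B-format sample lists w, x, y; returns N channel lists."""
    return [[a * wi + b * xi + g * yi for wi, xi, yi in zip(w, x, y)]
            for a, b, g in matrix]


# The same three input signals feed both playback configurations.
w, x, y = [1.0], [1.0], [0.0]  # one sample of a source directly in front
five_channels = derive_channels(coefficient_matrix(LAYOUT_5_0), w, x, y)
seven_channels = derive_channels(coefficient_matrix(LAYOUT_7_0), w, x, y)
```

Only the layout table changes between the two configurations. The three input signals and the matrix operation are identical in both cases.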