New 3D channel-based audio formats provide audio mixes for loudspeaker channels that not only surround the listening position, but also include channels positioned above (height) and below the listening position (sweet spot). The mixes are suited for a specific positioning of these speakers. Common formats are 22.2 (i.e. 22 channels) or 11.1 (i.e. 11 channels).
FIG. 1 shows two examples of ideal speaker positions in different speaker setups: a 22-channel speaker setup (left) and a 12-channel speaker setup (right). Every node shows the virtual position of a loudspeaker. Real speaker positions that differ in distance to the sweet spot are mapped to the virtual positions by gain and delay compensation.
A renderer for channel-based audio receives L1 digital audio signals w1 and processes them into L2 output signals w2. FIG. 2 shows, in an embodiment, the integration of a renderer 21 into a reproduction chain. The renderer output signal w2 is converted to an analog signal in a D/A converter 22, amplified in an amplifier 23 and reproduced by loudspeakers 24.
The renderer 21 uses the position information of the input speaker setup and the position information of the output loudspeaker 24 setup as input to initialize the chain of processing. This is shown in FIG. 3. Two main processing blocks are a Mixing & Filtering block 31 and a Delay & Gain Compensation block 32.
The speaker position information can be given e.g. in Cartesian or spherical coordinates. The positions for the output configuration $R_2$ may be entered manually, derived via microphone measurements with special test signals, or obtained by any other method. The positions of the input configuration $R_1$ can be signaled with the content by a table entry, e.g. an indicator for 5-channel surround. Ideal standardized loudspeaker positions [9] are then assumed. The positions might also be signaled directly using spherical angle positions. A constant radius is assumed for the input configuration. Let $R_2 = [\mathbf{r}_{2,1}, \mathbf{r}_{2,2}, \dots, \mathbf{r}_{2,L_2}]$ with $\mathbf{r}_{2,l} = [r_{2,l}, \theta_{2,l}, \phi_{2,l}]^T = [r_{2,l}, \hat{\Omega}_l^T]^T$ be the positions of the output configuration in spherical coordinates. The origin of the coordinate system is the sweet spot (i.e. the listening position). $r_{2,l}$ is the distance between the listening position and a speaker $l$, and $\theta_{2,l}$, $\phi_{2,l}$ are the related spherical angles that indicate the spatial direction of the speaker $l$ relative to the listening position.
Delay and Gain Compensation
The distances are used to derive delays $d_l$ and gains $g_l$ that are applied to the loudspeaker feeds by amplification/attenuation elements and a delay line with $d_l$ unit sample delay steps. First, the maximal distance between a speaker and the sweet spot is determined:
$$r_{2,\max} = \max([r_{2,1}, \dots, r_{2,L_2}]).$$
For each speaker feed the delay is calculated by:
$$d_l = \left\lfloor (r_{2,\max} - r_{2,l})\, f_s / c + 0.5 \right\rfloor \qquad (1)$$
with sampling rate $f_s$ and speed of sound $c$ ($c \approx 343$ m/s at a temperature of 20 °C), where $\lfloor x + 0.5 \rfloor$ indicates rounding to the nearest integer. The loudspeaker gains $g_l$ are determined by
$$g_l = \frac{r_{2,l}}{r_{2,\max}} \qquad (2)$$
The task of the Delay and Gain Compensation building block 32 is to attenuate and delay speakers that are closer to the listener than other speakers, so that these closer speakers do not dominate the perceived sound direction. The speakers are thus arranged on a virtual sphere, as shown in FIG. 1. The Mix & Filter block 31 can now use virtual speaker positions $\hat{R}_2 = [\hat{\mathbf{r}}_{2,1}, \hat{\mathbf{r}}_{2,2}, \dots, \hat{\mathbf{r}}_{2,L_2}]$ with $\hat{\mathbf{r}}_{2,l} = [r_{2,\max}, \hat{\Omega}_l^T]^T$, i.e. with a constant speaker distance.
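The delay and gain rules of eq. (1) and eq. (2) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of block 32: the function name `delay_gain_compensation`, the 48 kHz sampling rate and the three speaker distances are assumptions chosen for the example.

```python
import math

def delay_gain_compensation(distances_m, fs_hz=48000.0, c_mps=343.0):
    """Return (delays in samples, gains) following eq. (1) and eq. (2).

    distances_m: speaker-to-sweet-spot distances in metres.
    fs_hz and c_mps are assumed example values (sampling rate, speed of sound).
    """
    r_max = max(distances_m)
    # Eq. (1): closer speakers are delayed; floor(x + 0.5) rounds to nearest.
    delays = [math.floor((r_max - r) * fs_hz / c_mps + 0.5) for r in distances_m]
    # Eq. (2): closer speakers are attenuated; the farthest one keeps gain 1.
    gains = [r / r_max for r in distances_m]
    return delays, gains

# Hypothetical setup: three speakers at 2.0 m, 2.5 m and 3.0 m.
delays, gains = delay_gain_compensation([2.0, 2.5, 3.0])
print(delays)  # [140, 70, 0]
print(gains)
```

The farthest speaker receives no delay and unity gain, so after compensation all feeds behave as if the speakers sat at the common distance of the farthest speaker.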
Mix & Filter
In an initialization phase, the speaker positions of the input and idealized output configurations $R_1$, $\hat{R}_2$ are used to derive an $L_2 \times L_1$ mixing matrix $G$. During the process of rendering, this mixing matrix is applied to the input signals to derive the speaker output signals. As shown in FIG. 4, two general approaches exist. In the first approach, shown in FIG. 4a), the mixing matrix is independent of the audio frequency and the output is derived by:
$$W_2 = G\, W_1, \qquad (3)$$
where $W_1 \in \mathbb{R}^{L_1 \times \tau}$, $W_2 \in \mathbb{R}^{L_2 \times \tau}$ denote the input and output signals of $L_1$, $L_2$ audio channels and $\tau$ time samples in matrix notation. The most prominent method is Vector Base Amplitude Panning (VBAP) [1].
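Applying a frequency-independent mixing matrix as in eq. (3) is a plain matrix product over a block of samples. The sketch below assumes signals stored as nested lists (one list per channel); the helper name `apply_mixing_matrix` and the gain values in `G` are purely illustrative and are not derived by VBAP or any method from the text.

```python
def apply_mixing_matrix(G, W1):
    """Compute W2 = G W1 for an L2 x L1 matrix G and an L1 x tau block W1."""
    L1 = len(W1)
    if any(len(row) != L1 for row in G):
        raise ValueError("each row of G needs one gain per input channel")
    tau = len(W1[0])
    # Each output channel j is a weighted sum of all input channels i.
    return [[sum(G[j][i] * W1[i][t] for i in range(L1)) for t in range(tau)]
            for j in range(len(G))]

# Hypothetical example: L1 = 3 inputs (left, centre, right), L2 = 2 outputs.
a = 0.5 ** 0.5  # illustrative gain that splits the centre channel equally
G = [[1.0, a, 0.0],
     [0.0, a, 1.0]]
W1 = [[1.0, 0.0],   # left channel, tau = 2 samples
      [0.0, 2.0],   # centre channel
      [0.0, 0.0]]   # right channel
W2 = apply_mixing_matrix(G, W1)
print(W2)
```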
In the second approach, the mixing matrix becomes frequency dependent (G(f)), as shown in FIG. 4b). Then, a filter bank of sufficient resolution is needed, and a mixing matrix is applied to every frequency band sample according to eq. (3).
Examples of the latter approach are known [2],[3],[4]. For deriving the mixing matrix, the following approach is used: A virtual microphone array 51, as depicted in FIG. 5, is placed around the sweet spot. The microphone signals $\mathcal{M}_1$ of sound received from the input configuration (the original directions, left-hand side) are compared to the microphone signals $\mathcal{M}_2$ of sound received from the desired speaker configuration (right-hand side). Let $\mathcal{M}_1 \in \mathbb{C}^{M \times \tau}$ denote $M$ microphone signals receiving the sound radiated from the input configuration, and $\mathcal{M}_2 \in \mathbb{C}^{M \times \tau}$ be $M$ microphone signals of the sound from the output configuration. They can be derived by
$$\mathcal{M}_1 = H_{M,L_1} W_1 \qquad (4)$$
and
$$\mathcal{M}_2 = H_{M,L_2} W_2 \qquad (5)$$
with $H_{M,L_1} \in \mathbb{C}^{M \times L_1}$, $H_{M,L_2} \in \mathbb{C}^{M \times L_2}$ being the complex transfer functions of the ideal sound radiation in the free field, assuming spherical wave or plane wave radiation. The transfer functions are frequency dependent. Selecting a mid-frequency $f_m$ related to a filter bank, eq. (4) and eq. (5) can be equated using eq. (3). For every $f_m$ the following equation needs to be solved to derive $G(f_m)$:
$$H_{M,L_1} W_1 = H_{M,L_2}\, G\, W_1 \qquad (6)$$
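To make the transfer matrices of eq. (4) and eq. (5) concrete, the following sketch builds one such matrix under the plane-wave assumption for a single mid-frequency. Several choices here are assumptions not fixed by the text: unit-amplitude plane waves, the phase sign convention exp(+i k u·x), inclination/azimuth angles, and the hypothetical names `unit_vector` and `plane_wave_matrix`.

```python
import cmath
import math

def unit_vector(theta, phi):
    """Cartesian unit vector; theta = inclination from +z, phi = azimuth
    (assumed angle convention)."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))

def plane_wave_matrix(mic_positions, speaker_angles, f_hz, c_mps=343.0):
    """M x L matrix with H[m][l] = exp(i k u_l . x_m), where k = 2 pi f / c."""
    k = 2.0 * math.pi * f_hz / c_mps  # wavenumber at the chosen mid-frequency
    H = []
    for x in mic_positions:
        row = []
        for theta, phi in speaker_angles:
            u = unit_vector(theta, phi)
            # Phase at microphone x relative to the origin (sweet spot).
            row.append(cmath.exp(1j * k * sum(a * b for a, b in zip(u, x))))
        H.append(row)
    return H

# Hypothetical array: one microphone at the sweet spot, one 10 cm along x.
mics = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
# Two hypothetical speaker directions in the horizontal plane.
speakers = [(math.pi / 2, 0.0), (math.pi / 2, math.pi / 2)]
H = plane_wave_matrix(mics, speakers, f_hz=1000.0)
```

Calling the function once with the input directions and once with the output directions would give the two matrices that enter eq. (6) for that frequency band.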
A solution that is independent of the input signals and that uses the pseudo-inverse matrix of $H_{M,L_2}$ can be derived as:
$$G = H_{M,L_2}^{+} H_{M,L_1}. \qquad (7)$$
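A short least-squares argument shows why the pseudo-inverse appears in eq. (7). The following is a sketch, assuming that eq. (6) is required to hold for arbitrary input signals and that the residual is measured in the Frobenius norm (an assumption not stated in the text):

```latex
\begin{align*}
H_{M,L_1} W_1 = H_{M,L_2}\, G\, W_1 \quad \text{for all } W_1
  \;&\Longrightarrow\; H_{M,L_1} = H_{M,L_2}\, G, \\
\left\lVert H_{M,L_1} - H_{M,L_2}\, G \right\rVert_F^2
  \;&\text{is minimized by}\; G = H_{M,L_2}^{+} H_{M,L_1}, \\
H_{M,L_2}^{+} &= \left( H_{M,L_2}^{H} H_{M,L_2} \right)^{-1} H_{M,L_2}^{H}
  \quad \text{if } H_{M,L_2} \text{ has full column rank } (M \ge L_2).
\end{align*}
```

When $H_{M,L_2}$ does not have full column rank, the minimizer is not unique and the pseudo-inverse selects the one with the smallest norm.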
Usually this produces unsatisfactory results, and [2] and [5] present more sophisticated approaches to solve eq. (6) for $G$.
Further, there is a completely different approach, signal adaptive rendering, in which the directional signals of the incoming audio content are extracted and rendered like audio objects. The residual signal is panned and de-correlated to the output speakers. This kind of audio rendering is much more expensive in terms of computational complexity, and often not free from artifacts. Signal adaptive rendering is not used here and is only mentioned for completeness.
One problem is that a consumer's home setup is very likely to use a different placement of speakers due to the real-world constraints of a living room. The number of speakers may also be different. The task of a renderer is thus to adapt the channel-based audio signals to a new setup such that the perceived sound, loudness, timbre and spatial impression come as close as possible to the original channel-based audio as replayed on its original speaker setup, e.g. in the mixing room.