When listening to music or watching a video with audio, it is desirable for the audience to have a high degree of audio envelopment, so that they have a better sensation of the audio and video scene. The sense of audio envelopment includes immersive 3D audio and accurate audio localization. Immersive 3D audio means that the audio system is able to virtualize sound sources at any position in space. Accurate audio localization means that the audio system is able to place the sound sources so that they align precisely with the original audio scene, in terms of both direction and distance [1].
The sense of audio envelopment can be provided by a 3D audio system, which uses a large number of loudspeakers. The loudspeakers may surround the audience and be situated at high, mid and low vertical positions.
Three types of input signals and formats are commonly used in 3D audio systems: channel-based input, object-based input and Higher-Order Ambisonics.
Channel-based input is commonly used in today's 2D and 3D audio signal production processes and media (e.g. 22.2, 9.1, 8.1, 7.1, 5.1 etc), where each produced audio signal channel is intended to directly drive a loudspeaker in a designated position.
For object-based input, each produced audio signal channel represents an audio source that is intended to be rendered at a designated spatial position, independent of the number and location of actually available loudspeakers.
For Higher-Order Ambisonics (HOA), each produced audio signal channel is part of an overall description of the entire sound scene, independent of the number and location of actually available loudspeakers.
Among the three formats, the HOA format is a representation of the audio scene that does not depend on the loudspeaker layout, so it is possible to render the ambisonic signals to any playback setup, including non-standard speaker layouts.
In the prior art, such as the model for MPEG-H 3D audio standardization, the decoder for the HOA format first reconstructs the HOA signal from the decoded core signals and then renders it to the speaker setup.
FIG. 1 illustrates the decoder in the model of MPEG-H 3D audio standardization for the HOA format.
Firstly, the input bit stream is de-multiplexed (101) into N bit streams originally created by the AAC-family mono encoders plus the parameters required to recompose the full HOA representation from these bit streams.
In the multi-channel perceptual decoding component (102, 103 and 104), the N bit streams are individually decoded by AAC-family mono decoders to produce N signals.
In the subsequent spatial decoding component, the actual value range of these signals is first reconstructed by the inverse gain control processing (105). In the next step, the N signals are re-distributed to provide the M predominant signals and the (N−M) HOA coefficient signals representing the more ambient HOA components (105).
A fixed subset of the (N−M) HOA coefficient signals is re-correlated; that is, the decorrelation applied at the HOA encoding stage is reversed (107).
Next, all of the (N−M) HOA coefficient signals are used to create the ambient HOA components (107).
The predominant HOA components are synthesized from the M predominant signals and the corresponding parameters (106).
Finally, the predominant and the ambient HOA components are composed into the desired full HOA representation (108), which is then rendered to a given loudspeaker setup (109).
The detailed process of the predominant sound synthesis, ambience synthesis, HOA composition and rendering is explained below.
In the Predominant Sound Synthesis (PSS) block (106), the HOA representation of the predominant sound component is computed by either of two methods. These methods are referred to as ‘directional based’ and ‘vector based’.
In vector based PSS, the predominant sound is computed from the vector based signals $X_{\mathrm{VEC}}(k)$. The $X_{\mathrm{VEC}}(k)$ signals represent time domain audio signals that have been decoupled from their spatial characteristics. The reconstructed HOA coefficients are computed by multiplying the vector based signals $X_{\mathrm{VEC}}(k)$ with the corresponding transformation vectors (represented by multiple vectors in $M_{\mathrm{VEC}}(k)$). $M_{\mathrm{VEC}}(k)$ thus contains the spatial characteristics (such as directionality and width) of the corresponding $X_{\mathrm{VEC}}(k)$ time domain audio signals. The computation is as follows:
$$C_{\mathrm{VEC}}(k) = \left( X_{\mathrm{VEC}}(k) \, \left( M_{\mathrm{VEC}}(k) \right)^{T} \right)^{T} \tag{1}$$
where
    $X_{\mathrm{VEC}}(k)$ denotes the decoded vector based predominant sound
    $M_{\mathrm{VEC}}(k)$ denotes the matrix to reconstruct the HOA coefficients from the vector based predominant sound
    $C_{\mathrm{VEC}}(k)$ denotes the reconstructed HOA coefficients from the vector based predominant sound
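The matrix product in equation (1) can be illustrated with a minimal NumPy sketch. The array layouts are assumptions made for this illustration only and are not taken from the source: the vector based signals are stored as one column per signal (frame length × M), and the transformation vectors as one column per signal (number of HOA coefficients × M). The function name `vector_based_pss` and the toy values are hypothetical.

```python
import numpy as np


def vector_based_pss(x_vec: np.ndarray, m_vec: np.ndarray) -> np.ndarray:
    """Sketch of equation (1): C_VEC(k) = (X_VEC(k) M_VEC(k)^T)^T.

    x_vec: (frame_len, M) vector based signals (assumed layout).
    m_vec: (num_coeffs, M) transformation vectors, one column per
           signal (assumed layout).
    Returns an array of shape (num_coeffs, frame_len).
    """
    return (x_vec @ m_vec.T).T


# Toy frame: HOA order 1 gives (1 + 1)^2 = 4 coefficients,
# M = 2 predominant signals, 3 samples per frame.
x_vec = np.array([[1.0, 0.0],
                  [0.0, 2.0],
                  [1.0, 1.0]])
m_vec = np.array([[1.0, 0.5],
                  [0.0, 1.0],
                  [2.0, 0.0],
                  [0.0, 0.0]])

c_vec = vector_based_pss(x_vec, m_vec)
print(c_vec.shape)  # (4, 3)
```

Each output sample of each HOA coefficient costs M multiply-accumulate operations under this layout, which is where the factor M*(number of HOA coefficients) per sample comes from.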
In directional based PSS, the HOA coefficients are computed from all direction based predominant sound signals $X_{\mathrm{PS}}(k)$, using the tuple set $M_{\mathrm{DIR}}(k)$. The computation is as follows:
$$C_{\mathrm{DIR}}(k) = \left( X_{\mathrm{PS}}(k) \, \left( M_{\mathrm{DIR}}(k) \right)^{T} \right)^{T} \tag{2}$$
where
    $X_{\mathrm{PS}}(k)$ denotes the decoded direction based predominant sound
    $M_{\mathrm{DIR}}(k)$ denotes the matrix to reconstruct the HOA coefficients from the direction based predominant sound
    $C_{\mathrm{DIR}}(k)$ denotes the reconstructed HOA coefficients from the direction based predominant sound
In Ambient Synthesis, the ambient HOA component frame $C_{\mathrm{AMB}}(k)$ is obtained as follows, according to reference [2]:
1) The first $O_{\mathrm{MIN}}$ coefficients of the ambient HOA component are obtained by
$$\begin{bmatrix} c_{\mathrm{AMB},1}(k) \\ c_{\mathrm{AMB},2}(k) \\ \vdots \\ c_{\mathrm{AMB},O_{\mathrm{MIN}}}(k) \end{bmatrix} = \Psi_{\mathrm{MIN}} \cdot \begin{bmatrix} c_{I,\mathrm{AMB},1}(k) \\ c_{I,\mathrm{AMB},2}(k) \\ \vdots \\ c_{I,\mathrm{AMB},O_{\mathrm{MIN}}}(k) \end{bmatrix}, \tag{3}$$
where
    $O_{\mathrm{MIN}}$ denotes the minimum number of ambient HOA coefficients
    $\Psi_{\mathrm{MIN}}$ denotes the mode matrix with respect to some fixed predefined directions
    $c_{I,\mathrm{AMB},n}(k)$ denotes the decoded ambient sound signal
2) The sample values of the remaining coefficients of the ambient HOA component are computed according to
$$c_{\mathrm{AMB},n}(k) = \begin{cases} c_{I,\mathrm{AMB},n}(k) & \text{if } n \in \mathcal{J}_{\mathrm{AMB,ACT}}(k) \setminus \{1, \ldots, O_{\mathrm{MIN}}\} \\ 0 & \text{else} \end{cases} \tag{4}$$
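The two ambient synthesis steps can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the decoded ambient signals are held in a dense array whose row n−1 carries the signal with index n, the active index set is passed as a Python set of 1-based indices, and the function name `ambient_synthesis` and all toy values are hypothetical. A real decoder would also have to map the (N−M) transmitted channels onto these indices, which is not shown.

```python
import numpy as np


def ambient_synthesis(c_i_amb: np.ndarray,
                      psi_min: np.ndarray,
                      active_indices: set,
                      o_min: int) -> np.ndarray:
    """Sketch of equations (3) and (4).

    c_i_amb: (num_coeffs, frame_len) decoded ambient signals; row n-1
             holds the signal with index n (assumed layout).
    psi_min: (o_min, o_min) mode matrix.
    active_indices: 1-based indices of the active ambient coefficients.
    o_min: minimum number of ambient HOA coefficients.
    """
    c_amb = np.zeros_like(c_i_amb, dtype=float)
    # Equation (3): the first o_min coefficients go through the mode matrix.
    c_amb[:o_min] = psi_min @ c_i_amb[:o_min]
    # Equation (4): remaining coefficients are copied if active, else stay zero.
    for n in active_indices:
        if n > o_min:
            c_amb[n - 1] = c_i_amb[n - 1]
    return c_amb


# Toy frame: 4 HOA coefficients, 2 samples, o_min = 2, coefficient 3 inactive.
c_i_amb = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 6.0],
                    [7.0, 8.0]])
psi_min = np.array([[1.0, 1.0],
                    [1.0, -1.0]])
c_amb = ambient_synthesis(c_i_amb, psi_min, {1, 2, 4}, o_min=2)
print(c_amb)
```

In the toy frame, the third coefficient is not in the active set, so its row remains zero, while the fourth is copied unchanged.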
Finally, in the HOA Composition the ambient HOA component and the predominant sound HOA component are superposed to provide the decoded HOA frame. If the prediction is not activated for the direction based predominant synthesis, the decoded HOA frame $C(k)$ is computed by
$$C(k) = C_{\mathrm{AMB}}(k) + C_{\mathrm{DIR}}(k) \quad \text{for direction based synthesis} \tag{5}$$
$$C(k) = C_{\mathrm{AMB}}(k) + C_{\mathrm{VEC}}(k) \quad \text{for vector based synthesis} \tag{6}$$
where
    $C_{\mathrm{VEC}}(k)$ denotes the reconstructed HOA coefficients from the vector based predominant sound
    $C_{\mathrm{DIR}}(k)$ denotes the reconstructed HOA coefficients from the direction based predominant sound
    $C_{\mathrm{AMB}}(k)$ denotes the reconstructed HOA coefficients from the ambient signal
    $C(k)$ denotes the final reconstructed HOA coefficients
If the near field compensation is not applied, the decoded HOA coefficients $C(k)$ are converted to the representation of loudspeaker signals $W(k)$ by multiplication with the rendering matrix $D$:
$$W(k) = D \, C(k) \tag{7}$$
where
    $C(k)$ denotes the final reconstructed HOA coefficients
    $W(k)$ denotes the loudspeaker signals
    $D$ denotes the rendering matrix
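The composition and rendering steps reduce to one matrix addition and one matrix multiplication per frame. The NumPy sketch below illustrates this under assumed array layouts (HOA frames as number of coefficients × frame length, rendering matrix as number of loudspeakers × number of coefficients); the function name `compose_and_render` and the toy values are hypothetical, and near field compensation and prediction are not modelled.

```python
import numpy as np


def compose_and_render(c_amb: np.ndarray,
                       c_pred: np.ndarray,
                       d: np.ndarray):
    """Sketch of HOA composition followed by rendering.

    c_amb:  (num_coeffs, frame_len) ambient HOA component.
    c_pred: (num_coeffs, frame_len) predominant HOA component
            (direction based or vector based).
    d:      (num_speakers, num_coeffs) rendering matrix (assumed layout).
    Returns the decoded HOA frame C(k) and the loudspeaker signals W(k).
    """
    c = c_amb + c_pred  # superposition of ambient and predominant parts
    w = d @ c           # W(k) = D C(k)
    return c, w


# Toy frame: 4 HOA coefficients, 3 samples, 2 loudspeakers.
c_amb = np.ones((4, 3))
c_pred = 2.0 * np.ones((4, 3))
d = np.array([[1.0, 0.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0]])

c, w = compose_and_render(c_amb, c_pred, d)
print(w.shape)  # (2, 3)
```

With this layout, each loudspeaker sample requires one multiply-accumulate per HOA coefficient, so the rendering cost grows with the product of the number of loudspeakers and the number of HOA coefficients.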
In order to calculate the complexity of the above process, the following notations are defined:
1) the order of the HOA signal is $O_{\mathrm{HOA}}$, so the number of HOA coefficients is $(O_{\mathrm{HOA}}+1)^2$
2) the number of playback speakers is $L$
3) the total number of core signal channels is $N$
4) the number of predominant sound channels is $M$
5) the number of ambient sound channels is $N-M$
The complexity for Predominant Sound Synthesis is
$$COM_{\mathrm{PSS}} = F_s * M * (O_{\mathrm{HOA}}+1)^2 \tag{8}$$
where
    $COM_{\mathrm{PSS}}$ denotes the complexity for predominant sound synthesis
    $M$ denotes the number of predominant sound channels
    $O_{\mathrm{HOA}}$ denotes the order of HOA
    $F_s$ denotes the sampling frequency
The complexity for Rendering is
$$COM_{\mathrm{RENDER}} = F_s * L * (O_{\mathrm{HOA}}+1)^2 \tag{9}$$
where
    $COM_{\mathrm{RENDER}}$ denotes the complexity for rendering
    $L$ denotes the number of playback speakers
    $O_{\mathrm{HOA}}$ denotes the order of HOA
    $F_s$ denotes the sampling frequency
The number of HOA coefficients is large in typical HOA formats. For example, if $O_{\mathrm{HOA}}=4$, the number of HOA coefficients is $(4+1)^2=25$.
In order to provide a better sensation of 3D audio, the number of playback channels is also large. For example, a 22.2 setup has a total of 24 speakers.
The sampling frequency for audio signals is normally 44.1 kHz or 48 kHz.
As an example, the complexity of the predominant sound synthesis and rendering is estimated for $M=4$, $O_{\mathrm{HOA}}=4$, $L=24$ and $F_s=48$ kHz:
$$\begin{aligned}
COM_{\mathrm{PSS}} &= F_s * M * (O_{\mathrm{HOA}}+1)^2 \\
&= 48\,\mathrm{k} * 4 * (4+1)^2 \\
&= 4.8\ \mathrm{MOPS} \\
COM_{\mathrm{RENDER}} &= F_s * L * (O_{\mathrm{HOA}}+1)^2 \\
&= 48\,\mathrm{k} * 24 * (4+1)^2 \\
&= 28.8\ \mathrm{MOPS}
\end{aligned}$$
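The arithmetic of this estimate can be reproduced with a few lines of Python. The helper names `num_hoa_coeffs` and `complexity_mops` are hypothetical and exist only for this illustration; the formula is the same product of sampling frequency, channel count and number of HOA coefficients used in equations (8) and (9).

```python
def num_hoa_coeffs(order: int) -> int:
    """Number of HOA coefficients for a given HOA order: (order + 1)^2."""
    return (order + 1) ** 2


def complexity_mops(fs_hz: int, channels: int, order: int) -> float:
    """Complexity in MOPS: fs * channels * (order + 1)^2 / 10^6.

    'channels' is M for predominant sound synthesis, L for rendering.
    """
    return fs_hz * channels * num_hoa_coeffs(order) / 1e6


com_pss = complexity_mops(48_000, channels=4, order=4)      # M = 4
com_render = complexity_mops(48_000, channels=24, order=4)  # L = 24
print(f"COM_PSS = {com_pss:.1f} MOPS, COM_RENDER = {com_render:.1f} MOPS")
# prints "COM_PSS = 4.8 MOPS, COM_RENDER = 28.8 MOPS"
```

Because both terms scale with $(O_{\mathrm{HOA}}+1)^2$, raising the HOA order increases both costs quadratically.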
From the example, it can be seen that both the synthesis and the rendering processes are computationally expensive, and it is desirable to reduce their complexity.