Binaural rendering, which allows multi-channel signals to be heard in stereo, has the problem that its computational complexity grows as the length of the target filter increases. In particular, when a binaural room impulse response (BRIR) filter that reflects the characteristics of a recording room is used, the length of the BRIR filter may reach 48,000 to 96,000 samples. When the number of input channels is also large, as in the 22.2-channel format, the computational complexity becomes enormous.
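As a rough illustration of the scale involved, assume direct-form filtering, in which each output sample of one filter costs one multiply-accumulate (MAC) per filter tap. Taking the 22.2 format as 24 channels (22 full-band channels plus 2 LFE channels), the shortest BRIR length mentioned above (48,000 taps), and two ears, and further assuming a 48 kHz sampling rate (an assumption, not stated in the text), the cost would be approximately:

$$24 \times 2 \times 48{,}000 = 2{,}304{,}000 \ \text{MACs per output sample}, \qquad 2{,}304{,}000 \times 48{,}000 \approx 1.1 \times 10^{11} \ \text{MACs per second}.$$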
When the input signal of the i-th channel is denoted by $x_i(n)$, the left and right BRIR filters of that channel are denoted by $b_i^L(n)$ and $b_i^R(n)$, respectively, and the output signals are denoted by $y^L(n)$ and $y^R(n)$, binaural filtering can be expressed by the equation given below.
$$y^{m}(n) = \sum_{i} x_{i}(n) * b_{i}^{m}(n), \quad \text{where } m \in \{L, R\} \qquad \text{[Equation 1]}$$
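Equation 1 can be evaluated literally, as in the following minimal sketch. The function names `convolve` and `binaural_render` and the list-based data layout are illustrative assumptions rather than part of the described method. The sketch uses direct-form convolution, so its cost grows with the product of signal length and filter length for every channel and ear, which is the complexity problem described above.

```python
def convolve(x, b):
    """Full linear convolution of two sequences (direct form)."""
    y = [0.0] * (len(x) + len(b) - 1)
    for n, xn in enumerate(x):
        for k, bk in enumerate(b):
            y[n + k] += xn * bk
    return y


def binaural_render(inputs, brirs):
    """Evaluate Equation 1 in the time domain.

    inputs: list of channel signals, inputs[i] is x_i(n)
    brirs:  list of (left, right) filter pairs, brirs[i] is (b_i^L, b_i^R)
    Returns the pair (y^L, y^R).
    """
    out = {"L": [], "R": []}
    for x, (b_left, b_right) in zip(inputs, brirs):
        for ear, b in (("L", b_left), ("R", b_right)):
            y = convolve(x, b)
            acc = out[ear]
            # Grow the accumulator if this channel yields a longer result.
            if len(acc) < len(y):
                acc.extend([0.0] * (len(y) - len(acc)))
            # Sum over channels i, as in Equation 1.
            for n, value in enumerate(y):
                acc[n] += value
    return out["L"], out["R"]
```

For example, calling `binaural_render` with two input channels and two filter pairs returns two output sequences, one per ear, regardless of how many input channels are supplied.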
Herein, * denotes convolution. The above time-domain convolution is generally performed by using a fast convolution based on the fast Fourier transform (FFT). When binaural rendering is performed by using the fast convolution, the FFT needs to be performed as many times as there are input channels, and the inverse FFT needs to be performed as many times as there are output channels. Moreover, since delay needs to be considered in a real-time reproduction environment such as a multi-channel audio codec, block-wise fast convolution needs to be performed, which may consume more computation than performing the fast convolution once over the total length.
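The FFT counts stated above can be seen in the following minimal sketch of a whole-length (not block-wise) fast convolution. It assumes all input signals share one length and all filters share one length. The radix-2 `fft` helper is a small stand-in for a library FFT, and all function names are illustrative assumptions. The filter spectra are recomputed inside the loop only to keep the sketch self-contained; in practice they could be computed once in advance.

```python
import cmath


def fft(a, inverse=False):
    """Recursive radix-2 FFT (unscaled); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def zero_pad(seq, size):
    """Return seq extended with zeros to the given size."""
    return list(seq) + [0.0] * (size - len(seq))


def fast_binaural_render(inputs, brirs):
    """Equation 1 via frequency-domain multiplication.

    Uses one forward FFT per input channel and one inverse FFT per
    output channel (left and right).
    """
    out_len = len(inputs[0]) + len(brirs[0][0]) - 1
    size = 1
    while size < out_len:  # next power of two avoids circular wrap-around
        size *= 2

    acc_left = [0j] * size
    acc_right = [0j] * size
    for x, (b_left, b_right) in zip(inputs, brirs):
        spec_x = fft(zero_pad(x, size))        # one FFT per input channel
        spec_l = fft(zero_pad(b_left, size))   # filter spectrum (precomputable)
        spec_r = fft(zero_pad(b_right, size))  # filter spectrum (precomputable)
        for k in range(size):
            # Summing in the frequency domain defers the inverse FFT
            # until all channels have been accumulated.
            acc_left[k] += spec_x[k] * spec_l[k]
            acc_right[k] += spec_x[k] * spec_r[k]

    # One inverse FFT per output channel, scaled by 1/size.
    y_left = [v.real / size for v in fft(acc_left, inverse=True)][:out_len]
    y_right = [v.real / size for v in fft(acc_right, inverse=True)][:out_len]
    return y_left, y_right
```

A block-wise variant would split each input into short blocks and repeat these transforms per block, which reduces delay at the price of additional computation, as noted above.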
However, most coding schemes operate in the frequency domain, and in some coding schemes (e.g., HE-AAC and USAC), the last step of the decoding process is performed in the QMF domain. Accordingly, when binaural filtering is performed in the time domain as shown in Equation 1 above, a QMF synthesis operation is additionally required for each channel, which is very inefficient. Therefore, it is advantageous to perform binaural rendering directly in the QMF domain.