Higher Order Ambisonics (HOA) offers one way to represent three-dimensional sound. Other known techniques are wave field synthesis (WFS) and channel based approaches such as 22.2. In contrast to channel based methods, the HOA representation has the advantage of being independent of a specific loudspeaker set-up. This flexibility comes at the expense of a decoding process, which is required for playback of the HOA representation on a particular loudspeaker set-up. Compared to the WFS approach, where the number of required loudspeakers is usually very large, HOA may also be rendered to set-ups consisting of only a few loudspeakers. A further advantage of HOA is that the same representation can also be employed without any modification for binaural rendering to headphones.
HOA is based on the representation of the so-called spatial density of complex harmonic plane wave amplitudes by a truncated Spherical Harmonics (SH) expansion. Each expansion coefficient is a function of angular frequency, which can be equivalently represented by a time domain function. Hence, without loss of generality, the complete HOA sound field representation can be assumed to consist of $O$ time domain functions, where $O$ denotes the number of expansion coefficients. These time domain functions are referred to in the following as HOA coefficient sequences or, equivalently, as HOA channels. Usually, a spherical coordinate system is used in which the x axis points to the frontal position, the y axis points to the left, and the z axis points to the top. A position in space $x=(r,\theta,\phi)^T$ is represented by a radius $r>0$ (i.e. the distance to the coordinate origin), an inclination angle $\theta\in[0,\pi]$ measured from the polar axis z and an azimuth angle $\phi\in[0,2\pi]$ measured counter-clockwise in the x−y plane from the x axis. Further, $(\cdot)^T$ denotes the transposition.
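The coordinate convention described above can be illustrated with a short conversion from spherical to Cartesian coordinates. This is a minimal sketch; the function name `spherical_to_cartesian` is hypothetical and is not taken from the cited documents.

```python
import math

def spherical_to_cartesian(r, theta, phi):
    """Convert (radius, inclination, azimuth) to (x, y, z).

    Convention from the text: x points to the front, y to the left,
    z to the top; theta is measured from the z axis, phi is measured
    counter-clockwise in the x-y plane from the x axis.
    """
    if r <= 0:
        raise ValueError("radius must be positive")
    x = r * math.sin(theta) * math.cos(phi)
    y = r * math.sin(theta) * math.sin(phi)
    z = r * math.cos(theta)
    return x, y, z

# A source straight ahead lies on the x axis (theta = pi/2, phi = 0),
# a source to the left on the y axis (theta = pi/2, phi = pi/2),
# and a source overhead on the z axis (theta = 0).
front = spherical_to_cartesian(1.0, math.pi / 2, 0.0)
left = spherical_to_cartesian(1.0, math.pi / 2, math.pi / 2)
top = spherical_to_cartesian(1.0, 0.0, 0.0)
```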
A more detailed description of the HOA coding is provided in the following. The Fourier transform of the sound pressure with respect to time, denoted by $\mathcal{F}_t(\cdot)$, i.e.

$$P(\omega,x)=\mathcal{F}_t\big(p(t,x)\big)=\int_{-\infty}^{\infty}p(t,x)\,e^{-i\omega t}\,dt,$$

with $\omega$ denoting the angular frequency and $i$ indicating the imaginary unit, may be expanded into a series of Spherical Harmonics according to

$$P(\omega=kc_s,r,\theta,\phi)=\sum_{n=0}^{N}\sum_{m=-n}^{n}A_n^m(k)\,j_n(kr)\,S_n^m(\theta,\phi).$$
Here $c_s$ denotes the speed of sound and $k$ denotes the angular wavenumber, which is related to the angular frequency $\omega$ by $k=\omega/c_s$. Further, $j_n(\cdot)$ denote the spherical Bessel functions of the first kind and $S_n^m(\theta,\phi)$ denote the real valued Spherical Harmonics of order $n$ and degree $m$. The expansion coefficients $A_n^m(k)$ depend only on the angular wavenumber $k$. Note that it has been implicitly assumed that the sound pressure is spatially band-limited. Thus, the series is truncated with respect to the order index $n$ at an upper limit $N$, which is called the order of the HOA representation. If the sound field is represented by a superposition of an infinite number of harmonic plane waves of different angular frequencies $\omega$ arriving from all possible directions specified by the angle tuple $(\theta,\phi)$, the respective plane wave complex amplitude function $C(\omega,\theta,\phi)$ can be expressed by the following Spherical Harmonics expansion:

$$C(\omega=kc_s,\theta,\phi)=\sum_{n=0}^{N}\sum_{m=-n}^{n}C_n^m(k)\,S_n^m(\theta,\phi),$$

where the expansion coefficients $C_n^m(k)$ are related to the expansion coefficients $A_n^m(k)$ by $A_n^m(k)=i^n C_n^m(k)$.
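The relation between the two coefficient sets, in which each plane wave coefficient of order n is multiplied by the n-th power of the imaginary unit, can be sketched for a single wavenumber as follows. The function names are hypothetical, and the storage order (order by order, with the degree running from −n to n within each order) is an assumption made for this illustration.

```python
import numpy as np

# i**n cycles with period 4; a lookup table keeps the factors exact.
_I_POWERS = np.array([1, 1j, -1, -1j])

def orders_per_coefficient(N):
    """Order n of every coefficient for n = 0..N, m = -n..n (assumed storage order)."""
    return np.concatenate([np.full(2 * n + 1, n) for n in range(N + 1)])

def pressure_coeffs_from_plane_wave_coeffs(C, N):
    """Return A with A_n^m(k) = i^n * C_n^m(k) for one wavenumber k."""
    C = np.asarray(C, dtype=complex)
    n = orders_per_coefficient(N)
    if C.shape != n.shape:
        raise ValueError("expected (N + 1)**2 coefficients")
    return _I_POWERS[n % 4] * C

# With all plane wave coefficients set to 1, the result shows the factor i^n
# applied to each order: 1 for n = 0, i for n = 1, -1 for n = 2.
A = pressure_coeffs_from_plane_wave_coeffs(np.ones(9), N=2)
```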
Assuming the individual coefficients $C_n^m(\omega=kc_s)$ to be functions of the angular frequency $\omega$, the application of the inverse Fourier transform (denoted by $\mathcal{F}_t^{-1}(\cdot)$) provides time domain functions

$$c_n^m(t)=\mathcal{F}_t^{-1}\big(C_n^m(\omega/c_s)\big)=\frac{1}{2\pi}\int_{-\infty}^{\infty}C_n^m\!\left(\frac{\omega}{c_s}\right)e^{i\omega t}\,d\omega$$

for each order $n$ and degree $m$, which can be collected in a single vector $c(t)$ by

$$c(t)=\big[c_0^0(t)\;\; c_1^{-1}(t)\;\; c_1^0(t)\;\; c_1^1(t)\;\; c_2^{-2}(t)\;\; c_2^{-1}(t)\;\; c_2^0(t)\;\;\ldots\;\; c_N^{N-1}(t)\;\; c_N^N(t)\big]^T.$$

The position index of a time domain function $c_n^m(t)$ within the vector $c(t)$ is given by $n(n+1)+1+m$. The overall number of elements in the vector $c(t)$ is given by $O=(N+1)^2$. The discrete-time versions of the functions $c_n^m(t)$ are referred to as Ambisonic coefficient sequences. A frame-based HOA representation is obtained by dividing all of these sequences into frames $C(k)$ of length $B$ and frame index $k$ as follows:

$$C(k):=\big[c((kB+1)T_S)\;\; c((kB+2)T_S)\;\;\ldots\;\; c((kB+B)T_S)\big],$$

where $T_S$ denotes the sampling period.
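The position index, the coefficient count and the framing rule given above can be sketched in a few lines. The function names are hypothetical. The sketch assumes that the sampled sequences are stored in an array with one row per coefficient sequence, that column 0 holds the sample taken at time 1·T_S, and that the frame index k starts at 0; the text does not state the starting value of k.

```python
import numpy as np

def hoa_num_coeffs(N):
    """Number of coefficient sequences O = (N + 1)^2 for HOA order N."""
    return (N + 1) ** 2

def hoa_position_index(n, m):
    """1-based position n(n + 1) + 1 + m of c_n^m(t) within the vector c(t)."""
    if n < 0 or abs(m) > n:
        raise ValueError("require n >= 0 and -n <= m <= n")
    return n * (n + 1) + 1 + m

def hoa_frame(c, k, B):
    """Return frame C(k): the samples kB + 1 .. kB + B of every sequence.

    c has shape (O, num_samples); column j holds the sample at (j + 1) * T_S
    (assumed layout), so the frame covers columns kB .. kB + B - 1.
    """
    start = k * B
    if start + B > c.shape[1]:
        raise ValueError("not enough samples for a complete frame")
    return c[:, start:start + B]

# The position indices enumerate 1..O without gaps when n and m run
# through all valid orders and degrees.
N = 4
indices = [hoa_position_index(n, m) for n in range(N + 1) for m in range(-n, n + 1)]

# Toy data: 2 sequences of 10 samples, split into frames of length 4.
toy = np.arange(20).reshape(2, 10)
second_frame = hoa_frame(toy, k=1, B=4)
```

Each row of the returned array corresponds to one coefficient sequence, so the row at array index i − 1 is the frame of the sequence with position index i.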
The frame $C(k)$ itself can then be represented as a composition of its individual rows $c_i(k)$, $i=1,\ldots,O$, as

$$C(k)=\begin{bmatrix}c_1(k)\\ c_2(k)\\ \vdots\\ c_O(k)\end{bmatrix}$$

with $c_i(k)$ denoting the frame of the Ambisonic coefficient sequence with position index $i$. The spatial resolution of the HOA representation improves with a growing maximum order $N$ of the expansion. Unfortunately, the number of expansion coefficients $O$ grows quadratically with the order $N$, in particular $O=(N+1)^2$. For example, typical HOA representations using order $N=4$ require $O=25$ HOA (expansion) coefficients. Accordingly, the total bit rate for the transmission of a HOA representation, given a desired single-channel sampling rate $f_s$ and the number of bits $N_b$ per sample, is determined by $O\cdot f_s\cdot N_b$. Consequently, transmitting a HOA representation of order $N=4$ with a sampling rate of $f_s=48$ kHz employing $N_b=16$ bits per sample results in a bit rate of 19.2 Mbit/s, which is very high for many practical applications, such as streaming. Thus, compression of HOA representations is highly desirable.
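The bit rate figure quoted above follows directly from the product of the coefficient count, the sampling rate and the sample width. The arithmetic can be checked with a short sketch; the function name `hoa_bit_rate` is hypothetical.

```python
def hoa_bit_rate(N, fs_hz, bits_per_sample):
    """Uncompressed bit rate O * f_s * N_b in bit/s, with O = (N + 1)^2."""
    num_coeffs = (N + 1) ** 2
    return num_coeffs * fs_hz * bits_per_sample

# Order 4, 48 kHz, 16 bits per sample: 25 * 48000 * 16 bit/s.
rate = hoa_bit_rate(N=4, fs_hz=48_000, bits_per_sample=16)
print(rate / 1e6)  # 19.2 (Mbit/s)
```

Because the coefficient count grows with the square of N + 1, raising the order from 4 to 6 under the same assumptions roughly doubles the rate (49 instead of 25 sequences).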
Previously, the compression of HOA sound field representations was proposed in the European Patent applications EP2743922A, EP2665208A and EP2800401A. These approaches have in common that they perform a sound field analysis and decompose the given HOA representation into a directional and a residual ambient component.
The final compressed representation is assumed to comprise, on the one hand, a number of quantized signals, which result from the perceptual coding of the directional signals, and relevant coefficient sequences of the ambient HOA component. On the other hand, it is assumed to comprise additional side information related to the quantized signals, which is necessary for the reconstruction of the HOA representation from its compressed version.
Further, a similar method is described in ISO/IEC JTC1/SC29/WG11 N14264 (Working draft 1-HOA text of MPEG-H 3D audio, January 2014, San Jose), where the directional component is extended to a so-called predominant sound component. Like the directional component, the predominant sound component is assumed to be partly represented by directional signals, i.e. monaural signals with a corresponding direction from which they are assumed to impinge on the listener, together with some prediction parameters to predict portions of the original HOA representation from the directional signals. Additionally, the predominant sound component is supposed to be represented by so-called vector based signals, meaning monaural signals with a corresponding vector which defines the directional distribution of the vector based signals. The known compressed HOA representation consists of $I$ quantized monaural signals and some additional side information, wherein a fixed number $O_{\mathrm{MIN}}$ of these $I$ quantized monaural signals represent a spatially transformed version of the first $O_{\mathrm{MIN}}$ coefficient sequences of the ambient HOA component $C_{\mathrm{AMB}}(k-2)$. The type of the remaining $I-O_{\mathrm{MIN}}$ signals can vary between successive frames and can be directional, vector based, empty, or an additional coefficient sequence of the ambient HOA component $C_{\mathrm{AMB}}(k-2)$.
A known method for compressing a HOA signal representation with input time frames $C(k)$ of HOA coefficient sequences includes spatial HOA encoding of the input time frames and subsequent perceptual encoding and source encoding. The spatial HOA encoding 100, as shown in FIG. 1A, comprises performing Direction and Vector Estimation processing of the HOA signal in a Direction and Vector Estimation block 101, wherein data comprising first tuple sets DIR(k) for directional signals and second tuple sets VEC(k) for vector based signals are obtained. Each of the first tuple sets comprises an index of a directional signal and a respective quantized direction, and each of the second tuple sets comprises an index of a vector based signal and a vector defining the directional distribution of the signal. A next step is decomposing 103 each input time frame of the HOA coefficient sequences into a frame of a plurality of predominant sound signals $X_{\mathrm{PS}}(k-1)$ and a frame of an ambient HOA component $C_{\mathrm{AMB}}(k-1)$, wherein the predominant sound signals $X_{\mathrm{PS}}(k-1)$ comprise said directional sound signals and said vector based sound signals. The decomposing further provides prediction parameters $\xi(k-1)$ and a target assignment vector $v_{A,T}(k-1)$. The prediction parameters $\xi(k-1)$ describe how to predict portions of the HOA signal representation from the directional signals within the predominant sound signals $X_{\mathrm{PS}}(k-1)$ so as to enrich predominant sound HOA components, and the target assignment vector $v_{A,T}(k-1)$ contains information about how to assign the predominant sound signals to a given number $I$ of channels.
The ambient HOA component $C_{\mathrm{AMB}}(k-1)$ is modified 104 according to the information provided by the target assignment vector $v_{A,T}(k-1)$, wherein it is determined which coefficient sequences of the ambient HOA component are to be transmitted in the given number $I$ of channels, depending on how many channels are occupied by predominant sound signals. A modified ambient HOA component $C_{M,A}(k-2)$ and a temporally predicted modified ambient HOA component $C_{P,M,A}(k-1)$ are obtained. Also, a final assignment vector $v_A(k-2)$ is obtained from information in the target assignment vector $v_{A,T}(k-1)$. The predominant sound signals $X_{\mathrm{PS}}(k-1)$ obtained from the decomposing, and the determined coefficient sequences of the modified ambient HOA component $C_{M,A}(k-2)$ and of the temporally predicted modified ambient HOA component $C_{P,M,A}(k-1)$, are assigned to the given number of channels, using the information provided by the final assignment vector $v_A(k-2)$, wherein transport signals $y_i(k-2)$, $i=1,\ldots,I$ and predicted transport signals $y_{P,i}(k-2)$, $i=1,\ldots,I$ are obtained. Then, gain control (or normalization) is performed on the transport signals $y_i(k-2)$ and the predicted transport signals $y_{P,i}(k-2)$, wherein gain modified transport signals $z_i(k-2)$, exponents $e_i(k-2)$ and exception flags $\beta_i(k-2)$ are obtained.
As shown in FIG. 1B, the perceptual encoding and source encoding comprises perceptual coding of the gain modified transport signals $z_i(k-2)$, wherein perceptually encoded transport signals $\breve{z}_i(k-2)$, $i=1,\ldots,I$ are obtained, and encoding of side information comprising said exponents $e_i(k-2)$ and exception flags $\beta_i(k-2)$, the first and second tuple sets DIR(k), VEC(k), the prediction parameters $\xi(k-1)$ and the final assignment vector $v_A(k-2)$, wherein encoded side information $\breve{\Gamma}(k-2)$ is obtained. Finally, the perceptually encoded transport signals $\breve{z}_i(k-2)$ and the encoded side information are multiplexed into a bitstream.