The present invention relates to the field of audio processing and audio coding, in particular to encoding and decoding slot positions of events in an audio signal frame.
Audio processing and/or coding has advanced in many ways. In particular, spatial audio applications have become increasingly important. Audio signal processing is often used to decorrelate or render signals. Moreover, decorrelation and rendering of signals are employed in mono-to-stereo upmix, mono/stereo to multi-channel upmix, artificial reverberation, stereo widening or user-interactive mixing/rendering.
Several audio signal processing systems employ decorrelators. An important example is the application of decorrelating signals in parametric spatial audio decoders to restore specific decorrelation properties between two or more signals that are reconstructed from one or several downmix signals. The application of decorrelators significantly improves the perceptual quality of the output signal, e.g. when compared to intensity stereo. Specifically, the use of decorrelators enables the proper synthesis of spatial sound with a wide sound image, several concurrent sound objects and/or ambience. However, decorrelators are also known to introduce artifacts like changes in temporal signal structure, timbre, etc.
Other application examples of decorrelators in audio processing are e.g. the generation of artificial reverberation to change the spatial impression or the use of decorrelators in multi-channel acoustic echo cancellation systems to improve the convergence behavior.
One important spatial audio coding scheme is Parametric Stereo (PS). FIG. 1 illustrates the structure of a mono-to-stereo decoder. A single decorrelator generates a decorrelated signal D (a “wet” signal) from a mono input signal M (a “dry” signal). The decorrelated signal D is then fed into a mixer along with the signal M. Then, the mixer applies a mixing matrix H to the input signals M and D to generate the output signals L and R. The coefficients in the mixing matrix H can be fixed, signal dependent or controlled by a user.
Alternatively, the mixing matrix is controlled by side information that is transmitted along with a downmix and contains the parametric description on how to upmix the signals of the downmix to form the desired multi-channel output. The spatial side information is usually generated during the mono downmix process in an accordant signal encoder.
Spatial audio coding as described above is widely applied, e.g., in Parametric Stereo. A typical structure of a parametric stereo decoder is shown in FIG. 2. In FIG. 2, decorrelation is performed in a transform domain. The spatial parameters can be modified by a user or additional tools, e.g. post-processing for binaural rendering/presentation. In this case, the upmix parameters are combined with the parameters from the binaural filters to compute the input parameters for the mixing matrix.
The output L/R of the mixing matrix H is computed from the mono input signal M and the decorrelated signal D.
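\[
\begin{bmatrix} L \\ R \end{bmatrix}
=
\begin{bmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \end{bmatrix}
\begin{bmatrix} M \\ D \end{bmatrix}
\]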
In the mixing matrix, the amount of decorrelated sound fed to the output is controlled on the basis of transmitted parameters, e.g. Inter-Channel Level Differences (ILD), Inter-Channel Correlation/Coherence (ICC) and/or fixed or user-defined settings.
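The mixing step can be illustrated with a minimal sketch. The function name `upmix` and the coefficient values are hypothetical and chosen only for illustration; in a real decoder the coefficients of H would be derived from transmitted parameters such as ILD and ICC.

```python
def upmix(m, d, h):
    """Apply a 2x2 mixing matrix h to the dry signal m and the wet signal d.

    m, d: equally long sequences of samples (mono input and decorrelator output).
    h:    ((h11, h12), (h21, h22)), the mixing matrix H.
    Returns the output signals (left, right).
    """
    (h11, h12), (h21, h22) = h
    left = [h11 * ms + h12 * ds for ms, ds in zip(m, d)]
    right = [h21 * ms + h22 * ds for ms, ds in zip(m, d)]
    return left, right


# Hypothetical coefficients (an assumption): the wet signal is added to L
# and subtracted from R.
H = ((1.0, 1.0), (1.0, -1.0))
left, right = upmix([0.5, 0.25], [0.125, -0.25], H)
```

With these example values, each output sample is simply the weighted sum of the corresponding dry and wet samples, which is what the matrix equation above expresses.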
Conceptually, the decorrelator output D replaces a residual signal that would ideally allow for a perfect decoding of the original L/R signals. Utilizing the decorrelator output D instead of a residual signal in the upmixer saves the bitrate that would otherwise have been required to transmit the residual signal. The aim of the decorrelator is thus to generate, from the mono signal M, a signal D that exhibits properties similar to those of the residual signal it replaces. Reference is made to the document:
[1] J. Breebaart, S. van de Par, A. Kohlrausch, E. Schuijers, “High-Quality Parametric Spatial Audio Coding at Low Bitrates” in Proceedings of the AES 116th Convention, Berlin, Preprint 6072, May 2004.
Considering MPEG Surround (MPS), structures similar to PS termed One-To-Two boxes (OTT boxes) are employed in spatial audio decoding trees. This can be seen as a generalization of the concept of mono-to-stereo upmix to multichannel spatial audio coding/decoding schemes. In MPS, there also exist Two-To-Three upmix systems (TTT boxes) that may apply decorrelators depending on the TTT mode of operation. Details are described in the document:
[2] J. Herre, K. Kjörling, J. Breebaart, et al., “MPEG surround—the ISO/MPEG standard for efficient and compatible multi-channel audio coding,” in Proceedings of the 122nd AES Convention, Vienna, Austria, May 2007.
Directional Audio Coding (DirAC) is a parametric sound field coding scheme that is not bound to a fixed number of audio output channels with fixed loudspeaker positions. DirAC applies decorrelators in the DirAC renderer, i.e., in the spatial audio decoder, to synthesize non-coherent components of sound fields. Directional audio coding is further described in:
[3] Pulkki, Ville: “Spatial Sound Reproduction with Directional Audio Coding”, in J. Audio Eng. Soc., Vol. 55, No. 6, 2007
Regarding state-of-the-art decorrelators, reference is made to documents:
[4] ISO/IEC International Standard “Information Technology—MPEG audio technologies—Part1: MPEG Surround”, ISO/IEC 23003-1:2007.
[5] J. Engdegard, H. Purnhagen, J. Röden, L. Liljeryd, “Synthetic Ambience in Parametric Stereo Coding” in Proceedings of the AES 116th Convention, Preprint, May 2004.
IIR lattice allpass structures are used as decorrelators in spatial audio decoders like MPS [2,4]. Other state-of-the-art decorrelators apply (potentially frequency dependent) delays to decorrelate signals or convolve the input signals e.g. with exponentially decaying noise bursts. For an overview of state-of-the-art decorrelators for spatial audio upmix systems, reference is made to document [5]: “Synthetic Ambience in Parametric Stereo Coding”.
In general, stereo or multichannel applause-like signals coded/decoded in parametric spatial audio coders are known to result in reduced signal quality. Applause-like signals are characterized by containing rather dense mixtures of transients from different directions. Examples for such signals are applause, the sound of rain, galloping horses, etc. Applause-like signals often also contain sound components from distant sound sources that are perceptually fused into a noise-like, smooth background sound field.
Lattice allpass structures employed in spatial audio decoders like MPEG Surround act as artificial reverb generators and are consequently well-suited for generating homogeneous, smooth, noise-like, immersive sounds (like room reverberation tails). However, there are examples of sound fields with a non-homogeneous spatio-temporal structure that still immerse the listener: one prominent example is applause-like sound fields, which create listener envelopment not only by homogeneous noise-like fields, but also by rather dense sequences of single claps from different directions. Hence, the non-homogeneous component of applause sound fields may be characterized by a spatially distributed mixture of transients. These distinct claps are not homogeneous, smooth and noise-like at all.
Due to their reverb-like behavior, lattice allpass decorrelators are incapable of generating immersive sound fields with the characteristics, e.g. of applause. Instead, when applied to applause-like signals, they tend to temporally smear the transients in the signal. The undesired result is a noise-like immersive sound field without the distinctive spatio-temporal structure of applause-like sound fields. Further, transient events like a single handclap might evoke ringing artifacts of the decorrelator filters.
USAC (Unified speech and audio coding) is an audio coding standard for coding of speech and audio and a mixture thereof at different bitrates.
The perceptual quality of USAC can be further improved in stereo coding of applause and applause-like sounds at bitrates in the range of 32 kbps when parametric stereo coding techniques are applicable. USAC coded applause items tend to exhibit a narrow sound stage and a lack of envelopment if no dedicated applause handling is applied within the codec. To a large extent, stereo coding techniques of USAC and their limitations were inherited from MPEG Surround (MPS). However, USAC does offer a dedicated adaption for the requirement of proper applause handling. Said adaption is named Transient Steering Decorrelator (TSD) and is an embodiment of this invention.
Applause signals can be envisioned as being composed of single, distinct nearby claps temporally separated by a few milliseconds, and superimposed noise-like ambience originating from very dense far-off claps. In parametric stereo coding at a sensible side-information rate, the granularity of the spatial parameter sets (inter-channel level difference, inter-channel correlation, etc.) is much too low to ensure a sufficient spatial re-distribution of the single claps, leading to a lack of envelopment. Additionally, the claps are subject to processing by a lattice allpass decorrelator. This inevitably induces a temporal dispersion of the transients and further reduces the subjective quality.
Employing a Transient Steering Decorrelator (TSD) within the USAC decoder results in a modification of MPS processing. The underlying idea of such an approach is to address the applause decorrelation problem as follows:
- Separate the transients in the QMF domain before the lattice allpass decorrelator, i.e., split the decorrelator input signal into a transient stream s2 and a non-transient stream s1.
- Feed the transient stream to a different parameter-controlled decorrelator, which is well-suited for transient mixtures.
- Feed the non-transient stream to the MPS allpass decorrelator.
- Add the outputs of both decorrelators, D1 and D2, to obtain the decorrelated signal D.
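The four steps above can be sketched as follows. This is only an illustration of the signal routing: the function name `tsd_decorrelate` is hypothetical, each slot is reduced to a single value for brevity, and the two decorrelators are passed in as placeholder callables rather than real allpass or transient decorrelators.

```python
def tsd_decorrelate(slots, is_transient, allpass_decorrelator, transient_decorrelator):
    """Route slots through two parallel decorrelators and sum the results.

    slots:        per-slot input values of the decorrelator input signal.
    is_transient: per-slot flags; True marks a slot detected as transient.
    """
    # Step 1: split into non-transient stream s1 and transient stream s2.
    s1 = [0 if t else x for x, t in zip(slots, is_transient)]
    s2 = [x if t else 0 for x, t in zip(slots, is_transient)]
    # Steps 2 and 3: each stream goes to its own decorrelator.
    d1 = allpass_decorrelator(s1)
    d2 = transient_decorrelator(s2)
    # Step 4: add both outputs to obtain the decorrelated signal D.
    return [a + b for a, b in zip(d1, d2)]


# Placeholder decorrelators (assumptions, NOT real decorrelators).
def halve(stream):
    return [0.5 * x for x in stream]


def negate(stream):
    return [-x for x in stream]


d = tsd_decorrelate([1.0, 2.0, 4.0], [False, True, False], halve, negate)
```

Because the two streams are complementary, every slot is processed by exactly one of the two decorrelators.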
FIG. 3 illustrates a One-To-Two (OTT) configuration within the USAC decoder. The U-shaped transient handling box of FIG. 3 comprises a parallel signal path as proposed for the transient handling.
Two parameters that guide the TSD process are transmitted as frequency-independent parameters from the encoder to the decoder (see FIG. 3):
- A binary transient/non-transient decision of a transient detector running in the encoder is used to control the transient separation with QMF time slot granularity in the decoder. An efficient lossless coding scheme is utilized for transmitting the transient QMF slot position data.
- Actual transient decorrelator parameters, which are needed for the transient decorrelator to steer a spatial distribution of transients. The transient decorrelator parameters denote an angle between the downmix and its residual. These parameters are only transmitted for time slots which have been detected at the encoder to contain transients.
In order to assess the quality of the above-described technology, two MUSHRA listening tests were conducted in a controlled listening test environment using high-quality electrostatic STAX headphones. The tests were performed in 32 kbps and 16 kbps stereo configurations. Sixteen expert listeners participated in each of the tests.
Since the USAC test set does not contain applause items, additional applause items have been chosen to demonstrate the benefit of the proposed technology. The items listed in Table 1 have been included in the test:
TABLE 1
Items of the listening test:

Item           Properties
ARL_applause   applause with low to medium density (MPS testset item)
applause4s     very dense applause containing few distinct claps
Applse_2ch     dense multi-channel applause - front channels (MPS testset item)
Applse_st      dense multi-channel applause - stereo downmix (MPS testset item)
Klatschen      sparse applause signal
Regarding the regular twelve MPEG USAC listening test items, TSD is never active. However, these items do not remain exactly bit-identical since the TSD enable bit (indicating that TSD is off) is additionally included in the bitstream and thus slightly affects the bit-budget for the core-coder. Since these differences are very small, these items were not included in the listening test. Data is provided on the size of these differences to show that these changes are negligible and imperceptible.
A codec tool named inter-TES is part of USAC reference model 8 (RM8). Since this technique has been reported to improve the perceptual quality of transients including applause-like signals, inter-TES was switched on in every test condition. In such a setting, the best possible quality is ensured and the orthogonality of inter-TES and TSD is demonstrated.
The system tests have the following configurations:
- RM8: USAC RM8 system
- CE: USAC RM8 system enhanced by the Transient Steering Decorrelator (TSD)
FIGS. 4 and 5 depict the MUSHRA scores along with their 95% confidence intervals for the 32 kbps test scenario. For the test data, Student's t-distribution was assumed. The absolute scores in FIG. 4 show a higher mean score for all items; for four out of five items, the improvement is significant in the 95% confidence sense. No item was degraded versus RM8. The difference scores for USAC+TSD, as evaluated in a TSD core experiment (CE), with respect to USAC RM8 are plotted in FIG. 5. Here, a significant improvement can be seen for all items.
For the 16 kbps test setup, FIGS. 6 and 7 depict the MUSHRA scores along with their 95% confidence intervals. Student's t-distribution of the data was assumed. The absolute scores in FIG. 6 show a higher mean score for every item. For one item, significance in the 95% confidence sense can be seen. No item scored worse than RM8. The difference scores are plotted in FIG. 7. Again, the difference scores demonstrate a significant improvement for all items.
The TSD tool is enabled by a bsTsdEnable flag transmitted in the bitstream. If TSD is enabled, the actual separation of transients is controlled by transient detection flags TsdSepData that are also transmitted in the bitstream and which are encoded in bsTsdCodedPos in case TSD is enabled.
In the encoder, the TSD enable flag bsTsdEnable is generated by a segmental classifier. The transient detection flags TsdSepData are set by a transient detector.
As already pointed out, TSD is not activated for the twelve MPEG USAC test items. For the five additional applause items TSD activation is depicted in FIG. 8, displaying a bsTsdEnable logic state versus time.
If TSD is activated, transients are detected in certain QMF time slots and these are subsequently fed to the dedicated transient decorrelator. For each additional test item, Table 2 lists percentages of slots within TSD activated frames which comprise transients.
TABLE 2
Transient slot percentage (transient slot density in % of all time slots of TSD frames)

Item           Transient slot density (%)
ARL_applause   23.4
Applause4s     20.1
applse_2ch     24.7
applse_st      23.8
Klatschen      21.3
Transmitting transient separation decisions and decorrelator parameters from the encoder to the decoder does necessitate a certain amount of side information. However, this amount is overcompensated by the bitrate savings originating from the transmission of broadband spatial cues within MPS.
In consequence, the mean MPS+TSD side information bitrate is even lower than the plain MPS side information bitrate in plain USAC as listed in Table 3, first column. In the proposed configuration, as utilized for assessment of subjective quality, the mean bitrates listed in Table 3, second column, have been measured for TSD:
TABLE 3
MPS(+TSD) bitrates in bits/second within a 32 kbps stereo codec scenario:

               MPS(+TSD) side information mean bitrate (bits/sec.)
Item           plain USAC RM8   USAC with TSD
ARL_applause   2966             2345
Applause4s     2754             2278
applse_2ch     3000             2544
applse_st      2735             2253
Klatschen      2950             2495
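The relative side information savings implied by Table 3 can be computed directly from the listed mean bitrates. The short sketch below uses only the table values; the variable and function names are illustrative.

```python
# Mean side information bitrates in bits/second, copied from Table 3:
# item -> (plain USAC RM8, USAC with TSD)
TABLE_3 = {
    "ARL_applause": (2966, 2345),
    "Applause4s": (2754, 2278),
    "applse_2ch": (3000, 2544),
    "applse_st": (2735, 2253),
    "Klatschen": (2950, 2495),
}


def saving_percent(plain_bps, tsd_bps):
    """Relative reduction of the side information bitrate in percent."""
    return 100.0 * (plain_bps - tsd_bps) / plain_bps


savings = {item: saving_percent(plain, tsd) for item, (plain, tsd) in TABLE_3.items()}
for item, value in savings.items():
    print(f"{item}: {value:.1f}% lower side information bitrate with TSD")
```

For every listed item the bitrate with TSD is lower than the plain USAC RM8 bitrate, so all computed savings are positive.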
The computational complexity of TSD arises from
- the transient slot position decoding, and
- the transient decorrelator.
Assuming an MPEG Surround spatial frame length of 32 time slots, the slot position decoding necessitates 64 divisions and 80 multiplications per spatial frame in the worst case. Counting each division as 25 operations and each multiplication as one operation, this amounts to 64 × 25 + 80 = 1680 operations per spatial frame.
Ignoring copy operations and conditional statements, the transient decorrelator complexity is given by one complex multiplication per slot and hybrid QMF band.
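A per-slot operation of this cost could look like the following minimal sketch. It assumes, which is not stated above, that the transmitted angle is applied as a phase rotation to the complex samples of all hybrid QMF bands of a transient slot; the function name `rotate_transient_slot` is hypothetical.

```python
import cmath


def rotate_transient_slot(band_samples, angle_rad):
    """Apply one complex multiplication per hybrid QMF band of a single slot.

    band_samples: complex samples of one time slot, one per hybrid QMF band.
    angle_rad:    angle parameter for this slot (assumed to act as a phase
                  rotation; this interpretation is an assumption).
    """
    rotation = cmath.exp(1j * angle_rad)  # computed once per slot
    return [rotation * x for x in band_samples]


rotated = rotate_transient_slot([1 + 0j, 0 + 2j], cmath.pi / 2)
```

Since the rotation factor is computed once per slot, the per-band cost is a single complex multiplication, matching the complexity figure stated above.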
This leads to the following overall complexity numbers of TSD, shown in comparison to the plain USAC complexity numbers in Table 4:
TABLE 4
TSD decoder complexity in MOPS and relative to plain USAC decoder complexity:

Configuration                   Plain USAC   TSD: transient   TSD: slot position   Σ(TSD         Σ(TSD complexity)
                                complexity   decorrelator     decoder complexity   complexity)   relative to
                                in MOPS      complexity       in MOPS              in MOPS       plain USAC
                                             in MOPS
16 kbps stereo (fs = 28.8 kHz)  8.7          0.117            0.024                0.141         1.62%
32 kbps stereo (fs = 40 kHz)    13.2         0.163            0.033                0.196         1.48%
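The slot position decoder column and the relative complexity column of Table 4 can be reproduced with a short calculation. The sketch assumes that one QMF time slot corresponds to 64 input samples (a 64-band QMF bank), which is not stated in the text; under that assumption a spatial frame of 32 slots spans 2048 samples. The transient decorrelator column is not recomputed here because the number of hybrid bands is not given.

```python
OPS_PER_DIVISION = 25    # weight implied by 64 * 25 + 80 = 1680
SLOTS_PER_FRAME = 32     # MPEG Surround spatial frame length assumed in the text
SAMPLES_PER_SLOT = 64    # ASSUMPTION: 64-band QMF, one slot per 64 samples


def slot_decoder_mops(fs_hz):
    """Worst-case slot position decoding load in million operations per second."""
    ops_per_frame = 64 * OPS_PER_DIVISION + 80  # 1680 operations per frame
    frames_per_second = fs_hz / (SLOTS_PER_FRAME * SAMPLES_PER_SLOT)
    return ops_per_frame * frames_per_second / 1e6


def relative_complexity_percent(tsd_mops, usac_mops):
    """TSD complexity relative to the plain USAC decoder, in percent."""
    return 100.0 * tsd_mops / usac_mops


for fs_hz, usac_mops, tsd_total_mops in ((28800, 8.7, 0.141), (40000, 13.2, 0.196)):
    print(fs_hz, round(slot_decoder_mops(fs_hz), 3),
          round(relative_complexity_percent(tsd_total_mops, usac_mops), 2))
```

Under the stated assumption, the rounded results agree with the slot position decoder and relative complexity values listed in Table 4.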
In summary, the listening test data clearly shows a significant improvement of subjective quality of applause signals in the difference scores of all items at both operating points. In terms of absolute scores, all items in the TSD condition exhibit a higher mean score. For 32 kbps, a significant improvement exists for four out of five items. For 16 kbps, one item shows significant improvement. None of the items scored worse than RM8. As the complexity data shows, this improvement is achieved at negligible computational cost. This further emphasizes the benefit of the TSD tool for USAC.
The above-described Transient Steering Decorrelator significantly improves audio processing in USAC. However, as has also been seen above, a Transient Steering Decorrelator necessitates information about the existence or non-existence of transients in a particular slot. In USAC, information about time slots may be transmitted on a frame-by-frame basis. A frame comprises several, e.g., 32 time slots. It is therefore appreciated that an encoder also transmits information about which slots comprise transients on a frame-by-frame basis. Reducing the number of bits to be transmitted is critical in audio signal processing. Since even a single audio recording comprises a vast number of frames, reducing the number of bits to be transmitted for each frame by just a few bits significantly reduces the overall bitrate.
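To give a feeling for the possible savings, the following sketch compares two ways of signalling which slots of a frame contain an event. Both schemes are assumptions chosen purely for illustration and are not taken from the text above: a plain bitmask with one flag per slot, and an index that selects one of all possible position sets for a known number of event slots. In the second scheme the decoder would additionally need to know the number of event slots, which is not counted here.

```python
from math import comb


def bitmask_bits(num_slots):
    """One flag per slot: the naive representation."""
    return num_slots


def index_bits(num_slots, num_events):
    """Bits needed to index one of C(num_slots, num_events) position sets.

    (n - 1).bit_length() equals ceil(log2(n)) for any integer n >= 1.
    """
    return (comb(num_slots, num_events) - 1).bit_length()


# Example: 7 event slots in a 32-slot frame (roughly 22% slot density).
naive = bitmask_bits(32)
indexed = index_bits(32, 7)
print(naive, indexed)
```

For 7 event slots out of 32 there are 3,365,856 possible position sets, so the index needs fewer bits than the 32-bit mask.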
The problem of decoding slot positions of events in an audio signal frame is however not limited to the problem of decoding transients. It would moreover be useful to decode slot positions of other events as well, such as, whether a slot of an audio signal frame is tonal (or not), whether it comprises noise (or whether it doesn't) and the like. In fact, an apparatus for efficiently encoding and decoding slot positions of events in an audio signal frame would be very useful for a large number of different sorts of events.
When this document refers to slots or slot positions of an audio signal frame, slots in this sense may be time slots, frequency slots, time-frequency slots or any other kind of slots. It is furthermore understood that the present invention is not limited to audio processing and audio signal frames in USAC, but instead refers to any kind of audio signal frames and any kind of audio formats, such as MPEG1/2, Layer 3 (“MP3”), Advanced Audio Coding (AAC), and the like. Efficiently encoding and decoding slot positions of events in an audio signal frame would be very useful for any kind of audio signal frame.