Any discussion of the background art throughout the specification should in no way be considered as an admission that such art is widely known or forms part of common general knowledge in the field.
Content creation, coding, distribution and reproduction of audio are traditionally performed in a channel-based format, that is, one specific target playback system is envisioned for content throughout the content ecosystem. Examples of such target playback system audio formats are mono, stereo, 5.1, 7.1, and the like, and we refer to these formats as different presentations of the original content. The above-mentioned presentations are typically played back over loudspeakers, a notable exception being the stereo presentation, which is also commonly played back directly over headphones.
One specific presentation is the binaural presentation, typically targeting playback on headphones. Distinctive to a binaural presentation is that it is a two-channel signal with each signal representing the content as perceived at, or close to, the left and right eardrum respectively. A binaural presentation can be played back directly over loudspeakers, but preferably the binaural presentation is transformed into a presentation suitable for playback over loudspeakers using cross-talk cancellation techniques.
Different audio reproduction systems have been introduced above, namely headphones and loudspeakers in different configurations, for example stereo, 5.1, and 7.1. It is understood from the examples above that a presentation of the original content has a natural, intended, associated audio reproduction system, but can of course be played back on a different audio reproduction system.
If content is to be reproduced on a different playback system than the intended one, a downmixing or upmixing process can be applied. For example, 5.1 content can be reproduced over a stereo playback system by employing specific downmix equations. Another example is playback of stereo encoded content over a 7.1 speaker setup, which may comprise a so-called upmixing process that may or may not be guided by information present in the stereo signal. A system capable of upmixing is Dolby Pro Logic from Dolby Laboratories Inc (Roger Dressler, “Dolby Pro Logic Surround Decoder, Principles of Operation”, www.Dolby.com).
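By way of illustration only, a 5.1-to-stereo downmix of the kind mentioned above can be sketched as follows. The coefficients (a gain of 1/sqrt(2) for the center and surround channels, with the LFE channel discarded) follow a commonly used convention and are an assumption of this sketch; the present text does not specify any particular downmix equations. The channel keys and the function name are hypothetical.

```python
import math

# Assumed gain of 1/sqrt(2) (about -3 dB) for center and surround channels.
# This value is an illustrative convention, not taken from the text above.
G = 1.0 / math.sqrt(2.0)


def downmix_51_to_stereo(frame):
    """Downmix one 5.1 sample frame to a (left, right) stereo pair.

    `frame` is a dict with the hypothetical keys L, R, C, LFE, Ls, Rs.
    The LFE channel is discarded in this sketch.
    """
    left = frame["L"] + G * frame["C"] + G * frame["Ls"]
    right = frame["R"] + G * frame["C"] + G * frame["Rs"]
    return left, right
```

In practice such a downmix may also require level normalization or limiting to avoid clipping, which this sketch omits.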
An alternative audio format system is an audio object format such as that provided by the Dolby Atmos system. In this type of format, objects or components are defined to have a particular location around a listener, which may be time varying. Audio content in this format is sometimes referred to as immersive audio content. It is noted that within the context of this application an audio object format is not considered a presentation as described above, but rather a format of the original content that is rendered to one or more presentations in an encoder, after which the presentation(s) are encoded and transmitted to a decoder.
When multi-channel and object-based content is to be transformed into a binaural presentation as mentioned above, the acoustic scene consisting of loudspeakers and objects at particular locations is simulated by means of head-related impulse responses (HRIRs), or binaural room impulse responses (BRIRs), which simulate the acoustical pathway from each loudspeaker/object to the eardrums, in an anechoic or echoic (simulated) environment, respectively. In particular, audio signals can be convolved with HRIRs or BRIRs to re-instate inter-aural level differences (ILDs), inter-aural time differences (ITDs) and spectral cues that allow the listener to determine the location of each individual loudspeaker/object. The simulation of an acoustic environment (reverberation) also helps to achieve a certain perceived distance. FIG. 1 illustrates a schematic overview of the processing flow for rendering two object or channel signals xi 10, 11, which are read out of a content store 12 for processing by four HRIRs, e.g. 14. The HRIR outputs are then summed 15, 16, for each channel signal, so as to produce headphone speaker outputs for playback to a listener via headphones 18. The basic principle of HRIRs is, for example, explained in Wightman, Frederic L., and Doris J. Kistler. “Sound localization.” Human psychophysics. Springer New York, 1993. 155-192.
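The processing flow described above can be sketched, in simplified form, as a pair of convolutions per input signal followed by a per-ear summation. The sketch below is illustrative only: it uses direct-form time-domain convolution, assumes all input signals have equal length and all impulse responses have equal length, and the function names are hypothetical. A practical implementation would typically use measured HRIRs or BRIRs and a faster convolution method.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sequences of floats."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out


def render_binaural(signals, hrirs):
    """Render input signals to a (left, right) headphone signal pair.

    signals: list of equal-length sample lists, one per object or channel.
    hrirs:   list of (h_left, h_right) impulse-response pairs, one pair per
             input signal; all responses are assumed to have equal length.
    """
    length = len(signals[0]) + len(hrirs[0][0]) - 1
    left = [0.0] * length
    right = [0.0] * length
    for x, (h_left, h_right) in zip(signals, hrirs):
        # Each input is convolved with one HRIR per ear ...
        for n, v in enumerate(convolve(x, h_left)):
            left[n] += v  # ... and the results are summed per ear.
        for n, v in enumerate(convolve(x, h_right)):
            right[n] += v
    return left, right
```

For two input signals the loop performs four convolutions, matching the four HRIRs of FIG. 1; in general, M input signals require 2M convolutions.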
The HRIR/BRIR convolution approach comes with several drawbacks, one of them being the substantial amount of convolution processing that is required for headphone playback. The HRIR or BRIR convolution needs to be applied for every input object or channel separately, and hence complexity typically grows linearly with the number of channels or objects. As headphones are often used in conjunction with battery-powered portable devices, a high computational complexity is not desirable as it may substantially shorten battery life. Moreover, with the introduction of object-based audio content, which may comprise, say, more than 100 simultaneously active objects, the complexity of HRIR convolution can be substantially higher than for traditional channel-based content.
For this purpose, co-pending and non-published U.S. Provisional Patent Application Ser. No. 62/209,735, filed Aug. 25, 2015, describes a dual-ended approach for presentation transformations that can be used to efficiently transmit and decode immersive audio for headphones. The coding efficiency and decoding complexity reduction are achieved by splitting the rendering process across encoder and decoder, rather than relying on the decoder alone to render all objects.
A part of the content which during creation is associated with a specific spatial location is referred to as an audio component. The spatial location can be a point in space or a distributed location. Audio components can be thought of as all the individual audio sources that a sound artist mixes, i.e., positions spatially, into a soundtrack. Typically a semantic meaning (e.g. dialogue) is assigned to the components of interest so that the goal of the processing (e.g. dialogue enhancement) becomes defined. It is noted that audio components that are produced during content creation are typically present throughout the processing chain, from the original content to different presentations. For example, in an object format there can be dialogue objects with associated spatial locations. And in a stereo presentation there can be dialogue components that are spatially located in the horizontal plane.
In some applications, it is desirable to extract dialogue components in the audio signal, in order to e.g. enhance or amplify such components. The goal of dialogue enhancement (DE) may be to modify the speech part of a piece of content that contains a mix of speech and background audio so that the speech becomes more intelligible and/or less fatiguing for an end-user. Another use of DE is to attenuate dialogue that, for example, is perceived as disturbing by an end-user. There are two fundamental classes of DE methods: encoder side and decoder side DE. Decoder side DE (called single ended) operates solely on the decoded parameters and signals that reconstruct the non-enhanced audio, i.e., no dedicated side-information for DE is present in the bitstream. In encoder side DE (called dual ended), dedicated side-information that can be used to do DE in the decoder is computed in the encoder and inserted in the bitstream.
FIG. 2 shows an example of dual ended dialogue enhancement in a conventional stereo context. Here, dedicated parameters 21 are computed in the encoder 20 that enable extraction of the dialogue 22 from the decoded non-enhanced stereo signal 23 in the decoder 24. The extracted dialogue is level modified, e.g. boosted 25 (by an amount partially controlled by the end-user) and added to the non-enhanced output 23 to form the final output 26. The dedicated parameters 21 can be extracted blindly from the non-enhanced audio 27 or exploit a separately provided dialogue signal 28 in the parameter computations.
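The decoder-side steps of FIG. 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the dedicated parameters are modelled as a single hypothetical 2x2 weight matrix applied to the decoded stereo samples, whereas the form of the parameters 21 is not specified above (in practice they may, for example, vary over time and frequency). The function and argument names are likewise hypothetical.

```python
def enhance_dialogue(stereo, params, dialogue_gain):
    """Apply dual ended dialogue enhancement to decoded stereo samples.

    stereo:        list of (l, r) decoded, non-enhanced sample pairs.
    params:        assumed 2x2 weight matrix ((w_ll, w_lr), (w_rl, w_rr))
                   received as side-information in the bitstream.
    dialogue_gain: linear factor applied to the extracted dialogue before
                   it is added back; 0.0 leaves the signal unchanged.
    """
    (w_ll, w_lr), (w_rl, w_rr) = params
    out = []
    for l, r in stereo:
        # Extract the dialogue estimate from the non-enhanced signal.
        d_l = w_ll * l + w_lr * r
        d_r = w_rl * l + w_rr * r
        # Level-modify the estimate and add it to the non-enhanced output.
        out.append((l + dialogue_gain * d_l, r + dialogue_gain * d_r))
    return out
```

A negative `dialogue_gain` would attenuate rather than boost the dialogue, corresponding to the attenuation use case of DE.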
Another approach is disclosed in U.S. Pat. No. 8,315,396. Here, the bitstream to the decoder includes an object downmix signal (e.g. a stereo presentation), object parameters to enable reconstruction of the audio objects, and object based metadata allowing manipulation of the reconstructed audio objects. As indicated in FIG. 10 of U.S. Pat. No. 8,315,396, the manipulation may include amplification of speech related objects. This approach thus requires the reconstruction of the original audio objects on the decoder side, which typically is computationally demanding.
There is a general desire to provide dialogue estimation efficiently also in a binaural context.