Spatial sound reproduction beyond simple stereo has become commonplace through applications such as home cinema systems. Typically, such systems use loudspeakers positioned at specific spatial positions. In addition, systems have been developed that provide a spatial sound perception from headphones. Conventional stereo reproduction over headphones tends to provide sounds that are perceived to originate inside the user's head. However, systems have been developed which provide a full spatial sound perception based on binaural signals provided directly to the user's ears by earphones/headphones. Such systems are often referred to as virtual sound systems as they provide a perception of virtual sound sources at positions where no real sound source exists.
Virtual surround sound is a technology that attempts to create the perception that there are sound sources surrounding the listener which are not physically present. In such systems, the sound does not appear to originate from inside the user's head as is known from conventional headphone reproduction systems. Rather, the sound may be perceived to originate outside the user's head, as is the case in natural listening in absence of headphones. In addition to a more realistic experience, virtual surround audio also tends to have a positive effect on listener fatigue and speech intelligibility.
In order to achieve this perception, it is necessary to employ some means of tricking the human auditory system into perceiving that a sound is coming from the desired positions. A well-known approach for providing the experience of virtual surround sound is the use of binaural recording. In such approaches, the sound is recorded using a dedicated microphone arrangement and is intended for replay using headphones. The recording is made by placing microphones either in the ear canals of a subject or in those of a dummy head, which is a bust that includes pinnae (outer ears). The use of such a dummy head including pinnae provides a spatial impression very similar to the one the person listening to the recordings would have had if present during the recording. However, because each person's pinnae are unique, the filtering they impose on an incoming sound wave, which depends on its direction of incidence, is also unique, and the localization of sources is therefore subject dependent. Indeed, the specific features used to localize sources are learned by each person from early childhood. Therefore, any mismatch between the pinnae used during recording and those of the listener may lead to a degraded perception and erroneous spatial impressions.
By measuring the impulse responses from a sound source at a specific location in three-dimensional space to microphones in the ears of a dummy head or of an individual subject, the so-called Head Related Impulse Responses (HRIRs) can be determined. HRIRs can be used to create a binaural recording simulating multiple sources at various locations. This can be realized by convolving each sound source with the pair of HRIRs that corresponds to the position of the sound source. The frequency-domain representation of an HRIR is referred to as a Head Related Transfer Function (HRTF). The HRTF and the HRIR are thus equivalent descriptions, related by the Fourier transform. When the impulse responses also include a room effect, they are referred to as Binaural Room Impulse Responses (BRIRs). BRIRs consist of an anechoic portion that depends only on the subject's anthropometric attributes (such as head size, ear shape, etc.), followed by a reverberant portion that characterizes the combination of the room and the anthropometric properties.
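The convolution step described above can be illustrated with a minimal Python sketch. The function name `render_binaural` and the two-tap placeholder HRIRs are assumptions made purely for illustration; a real system would use measured HRIR pairs for each source position.

```python
import numpy as np

def render_binaural(sources, hrirs):
    """Mix mono sources into a 2-channel (left, right) binaural signal.

    sources: list of 1-D arrays, one mono signal per sound source
    hrirs:   list of (hrir_left, hrir_right) pairs, one pair per source
    """
    # A full convolution has length len(signal) + len(hrir) - 1.
    length = max(len(s) + max(len(hl), len(hr)) - 1
                 for s, (hl, hr) in zip(sources, hrirs))
    out = np.zeros((2, length))
    for s, (hl, hr) in zip(sources, hrirs):
        left = np.convolve(s, hl)    # source filtered by the left-ear HRIR
        right = np.convolve(s, hr)   # source filtered by the right-ear HRIR
        out[0, :len(left)] += left
        out[1, :len(right)] += right
    return out

# Placeholder HRIRs (NOT measured data): a source to the listener's right
# reaches the right ear first and louder, the left ear later and quieter.
hrir_left = np.array([0.0, 0.0, 0.5])
hrir_right = np.array([1.0])
click = np.array([1.0, 0.0, 0.0, 0.0])

binaural = render_binaural([click], [(hrir_left, hrir_right)])
```

Because convolution is linear, several sources can be rendered independently and summed per ear, which is what the loop does.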
The reverberant portion contains two temporal regions, usually overlapping. The first region contains so-called early reflections, which are isolated reflections of the sound source on walls or obstacles inside the room before reaching the ear-drum (or measurement microphone). As the time lag increases, the number of reflections present in a fixed time interval increases, now also containing higher-order reflections.
The second region in the reverberant portion is the part where these reflections are no longer isolated. This region is called the diffuse or late reverberation tail. The reverberant portion contains cues that give the auditory system information about the distance of the source and the size and acoustical properties of the room. Furthermore, it is subject dependent due to the filtering of the reflections with the HRIRs. The energy of the reverberant portion in relation to that of the anechoic portion largely determines the perceived distance of the sound source. The density of the (early) reflections contributes to the perceived size of the room. The T60 reverberation time is defined as the time it takes for the reflections to drop 60 dB in energy level. The reverberation time gives information on the acoustical properties of the room, namely whether its walls are very reflective (e.g. a bathroom) or whether there is much absorption of sound (e.g. a bedroom with furniture, carpet and curtains), as well as on the volume (size) of the room.
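As an illustration of the T60 definition, the following Python sketch estimates the reverberation time of an impulse response. It assumes one common estimation approach (backward integration of the squared impulse response followed by a line fit to the decay in dB), which is not stated in the text above. The function name `estimate_t60`, the fit range of -5 dB to -35 dB, and the synthetic decaying-noise impulse response are likewise illustrative assumptions.

```python
import numpy as np

def estimate_t60(ir, fs):
    """Estimate T60 (seconds) from an impulse response sampled at fs Hz."""
    energy = np.asarray(ir, dtype=float) ** 2
    # Energy decay curve: energy remaining from each sample to the end.
    edc = np.cumsum(energy[::-1])[::-1]
    edc_db = 10.0 * np.log10(edc / edc[0])
    # Fit a straight line to the -5 dB .. -35 dB part of the decay
    # (assumed range) and extrapolate the slope to a 60 dB drop.
    idx = np.where((edc_db <= -5.0) & (edc_db >= -35.0))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)  # dB per second
    return -60.0 / slope

# Synthetic reverberation tail: noise whose energy falls 60 dB in true_t60 s.
fs = 8000
true_t60 = 0.5
t = np.arange(2 * fs) / fs
rng = np.random.default_rng(0)
ir = rng.standard_normal(t.size) * 10.0 ** (-3.0 * t / true_t60)

estimated_t60 = estimate_t60(ir, fs)  # expected to be close to true_t60
```

The amplitude envelope uses a factor of 10^(-3 t / T60) because a 60 dB drop in energy corresponds to a factor of 10^-3 in amplitude.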
Besides the use of measured impulse responses incorporating a certain acoustic environment, synthetic reverberation algorithms are often employed, because of the ability to modify certain properties of the acoustic simulation, and because of their relatively low computational complexity.
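To give a sense of why synthetic reverberation can be cheap and easy to tune, the sketch below builds a very small reverberator from parallel feedback comb filters, in the style of classic Schroeder designs but without the all-pass stages such designs usually add. The function names and the delay values are illustrative assumptions, not part of any system described here. The feedback gain is derived directly from the desired T60, which shows how a single acoustic property can be modified.

```python
import numpy as np

def comb_gain(delay_samples, fs, t60):
    """Feedback gain so that the comb filter output decays 60 dB in t60 s."""
    return 10.0 ** (-3.0 * delay_samples / (fs * t60))

def feedback_comb(x, delay_samples, gain):
    """Feedback comb filter: y[n] = x[n] + gain * y[n - delay_samples]."""
    y = np.array(x, dtype=float)
    for n in range(delay_samples, len(y)):
        y[n] += gain * y[n - delay_samples]
    return y

def simple_reverb(x, fs, t60, delays_ms=(29.7, 37.1, 41.1, 43.7)):
    """Average of parallel comb filters with different (assumed) delays."""
    out = np.zeros(len(x))
    for d_ms in delays_ms:
        d = int(round(d_ms * 1e-3 * fs))
        out += feedback_comb(x, d, comb_gain(d, fs, t60))
    return out / len(delays_ms)

# Impulse response of the toy reverberator for a 1 s reverberation time.
fs = 8000
impulse = np.zeros(fs)
impulse[0] = 1.0
tail = simple_reverb(impulse, fs, t60=1.0)
```

Each comb filter costs one multiplication and one addition per sample, which is far less than convolving with a measured room response that may be tens of thousands of samples long.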
An example of a system that uses virtual surround techniques is MPEG Surround, which is one of the major advances in multi-channel audio coding and was recently standardized by MPEG (ISO/IEC 23003-1:2007, MPEG Surround).
MPEG Surround is a multi-channel audio coding tool that allows existing mono- or stereo-based coders to be extended to multi-channel. FIG. 1 illustrates a block diagram of a stereo core coder extended with MPEG Surround. First, the MPEG Surround encoder creates a stereo downmix from the multi-channel input signal. The stereo downmix is coded into a bit-stream using a core encoder, e.g. HE-AAC. Next, spatial parameters are estimated from the multi-channel input signal. These parameters are encoded into a spatial bit-stream. The resulting core coder bit-stream and the spatial bit-stream are merged to create the overall MPEG Surround bit-stream. Typically, the spatial bit-stream is contained in the ancillary data portion of the core coder bit-stream. At the decoder side, the core and spatial bit-streams are first separated. The stereo core bit-stream is decoded in order to reproduce the stereo downmix. This downmix, together with the spatial bit-stream, is input to the MPEG Surround decoder. The spatial bit-stream is decoded, resulting in the spatial parameters. The spatial parameters are then used to upmix the stereo downmix in order to obtain the multi-channel output signal, which is an approximation of the original multi-channel input signal.
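The principle of transmitting a downmix plus spatial parameters can be illustrated with a deliberately simplified Python sketch. This is a toy model and not the standardized MPEG Surround algorithm: it reduces two channels to a mono downmix and a single level-difference parameter per frame, it omits the core coder and the bit-stream formatting, and the names `encode`, `decode`, `cld_db` and `FRAME` are assumptions made for illustration.

```python
import numpy as np

FRAME = 256   # assumed frame length in samples
EPS = 1e-12   # avoids division by zero for silent frames

def encode(left, right):
    """Return a mono downmix and one level difference (dB) per frame."""
    n_frames = len(left) // FRAME
    l = np.asarray(left[:n_frames * FRAME], dtype=float).reshape(n_frames, FRAME)
    r = np.asarray(right[:n_frames * FRAME], dtype=float).reshape(n_frames, FRAME)
    downmix = 0.5 * (l + r)
    e_l = np.sum(l ** 2, axis=1) + EPS
    e_r = np.sum(r ** 2, axis=1) + EPS
    cld_db = 10.0 * np.log10(e_l / e_r)   # spatial parameter per frame
    return downmix, cld_db

def decode(downmix, cld_db):
    """Upmix the downmix so each frame regains its level difference."""
    ratio = 10.0 ** (cld_db / 10.0)             # left/right energy ratio
    g_l = np.sqrt(2.0 * ratio / (1.0 + ratio))  # g_l**2 + g_r**2 == 2
    g_r = np.sqrt(2.0 / (1.0 + ratio))
    left = downmix * g_l[:, None]
    right = downmix * g_r[:, None]
    return left.ravel(), right.ravel()

# A source panned towards the left channel.
rng = np.random.default_rng(0)
s = rng.standard_normal(4 * FRAME)
downmix, cld_db = encode(0.8 * s, 0.2 * s)
left_hat, right_hat = decode(downmix, cld_db)
```

The decoded channels are only an approximation of the input: the level relation between the channels is restored per frame, but the waveforms are not reconstructed exactly, which mirrors the statement above that the upmix approximates the original multi-channel signal.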
Since the spatial image of the multi-channel input signal is parameterized, MPEG Surround also allows for decoding of the same multi-channel bit-stream onto rendering devices other than a multichannel speaker setup. An example is virtual reproduction on headphones, which is referred to as the MPEG Surround binaural decoding process. In this mode a realistic surround experience can be provided using regular headphones.
FIG. 2 illustrates a block diagram of the stereo core codec extended with MPEG Surround where the output is decoded to binaural. The encoder process is identical to that of FIG. 1. After decoding the stereo bit-stream, the spatial parameters are combined with the HRTF/HRIR data to produce the so-called binaural output.
Building upon the concept of MPEG Surround, MPEG has standardized ‘Spatial Audio Object Coding’ (SAOC) (ISO/IEC 23003-2:2010, Spatial Audio Object Coding).
From a high-level perspective, SAOC efficiently codes sound objects instead of channels. Whereas in MPEG Surround each speaker channel can be considered to originate from a different mix of sound objects, in SAOC these individual sound objects are, to some extent, available at the decoder for interactive manipulation. Similarly to MPEG Surround, a mono or stereo downmix is also created in SAOC, and the downmix is coded using a standard downmix coder, such as HE-AAC. Object parameters are encoded and embedded in the ancillary data portion of the downmix coded bit-stream. At the decoder side, by manipulating these parameters, the user can control various features of the individual objects, such as position and amplification/attenuation, can apply equalization, and can even apply effects such as distortion and reverb.
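The kind of user control described above can be sketched as a rendering matrix applied to the objects. The Python example below is a simplified illustration and not the SAOC decoding process: it assumes the object signals are directly available, whereas in SAOC they are only available parametrically, and it uses a constant-power panning law as an assumed way to express position. The function name `render_objects` and its parameters are hypothetical.

```python
import numpy as np

def render_objects(objects, gains_db, pans):
    """Render objects to stereo with per-object gain and position.

    objects:  array of shape (n_objects, n_samples)
    gains_db: per-object amplification/attenuation in dB
    pans:     per-object position in [-1, 1] (-1 = left, +1 = right)
    """
    objects = np.asarray(objects, dtype=float)
    gains = 10.0 ** (np.asarray(gains_db, dtype=float) / 20.0)
    theta = (np.asarray(pans, dtype=float) + 1.0) * np.pi / 4.0  # 0 .. pi/2
    # Rendering matrix: rows are output channels (L, R), columns are objects.
    matrix = np.vstack([gains * np.cos(theta), gains * np.sin(theta)])
    return matrix @ objects

# Two objects: one unchanged at the far left, one attenuated in the center.
objects = np.ones((2, 4))
stereo = render_objects(objects, gains_db=[0.0, -20.0], pans=[-1.0, 0.0])
```

Changing an entry of `gains_db` or `pans` changes only the corresponding column of the rendering matrix, which is what makes per-object interaction possible.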
The quality of virtual surround rendering of stereo or multi-channel content can be significantly improved by so-called phantom materialization, as described in Breebaart, J., Schuijers, E. (2008). “Phantom materialization: A novel method to enhance stereo audio reproduction on headphones.” IEEE Transactions on Audio, Speech, and Language Processing 16, 1503-1511.
Instead of constructing a virtual stereo signal by assuming two sound sources originating from the virtual loudspeaker positions, the phantom materialization approach decomposes the sound signal into a direct (directional) signal component and an indirect/decorrelated signal component. The direct component is synthesized by simulating a virtual loudspeaker at the phantom position. The indirect component is synthesized by simulating virtual loudspeakers at the virtual direction(s) of the diffuse sound field. The phantom materialization process has the advantage that it does not impose the limitations of a speaker setup onto the virtual rendering scene.
Virtual spatial sound reproduction has been found to provide very attractive spatial experiences in many scenarios. However, it has also been found that the approach may in some scenarios result in experiences that do not completely correspond to the spatial experience that would result in a real world scenario with actual sound sources at the simulated positions in three dimensional space.
It has been suggested that the spatial perception of virtual audio rendering may be affected by interference in the brain between the positional cues provided by the audio and the positional cues provided by the user's vision.
In daily life, visual cues are (typically subconsciously) combined with audible cues to enhance the spatial perception. One example is that the intelligibility of a person's speech increases when the person's lip movements can also be observed. In another example, it has been found that a person can be tricked by providing a visual cue to support a virtual sound source, e.g. by placing a dummy speaker at a location where a virtual sound source is generated. The visual cue will thus enhance or modify the virtualization. A visual cue can to a certain extent even change the perceived location of a sound source, as in the case of a ventriloquist. Conversely, the human brain has trouble localizing sound sources that do not have a supporting visual cue (for instance in wavefield synthesis), which is actually contradictory to human nature.
Another example is the leakage of external sound sources from the listener's environment that are mixed with the virtual sound sources generated by a headphone-based audio system. Depending on the audio content and user location, the acoustic properties of the physical and virtual environments may differ considerably, resulting in ambiguity with respect to the listening environment. Such mixtures of acoustical environments may cause unnatural and unrealistic sound reproduction.
Many aspects of the interaction between visual cues and virtual spatial sound reproduction are still not well understood.
Hence, an improved audio system would be advantageous and in particular an approach allowing increased flexibility, facilitated implementation, facilitated operation, improved spatial user experience, improved virtual spatial sound generation and/or improved performance would be advantageous.