Multi-channel audio content is increasingly common in consumer electronics, and immersive audio is becoming a standard feature of many multimedia and communication systems. However, immersive audio often requires a reproduction layout with a large number of loudspeakers, for example the 22.2 layout. This is a major constraint for products such as mobile devices (smartphones, tablets, etc.) and for teleconferencing, home cinema, high fidelity (Hi-Fi) and similar applications, which output audio signals through only two loudspeakers or headphones, that is, through a left and a right audio output channel.
Binauralization, also called “virtual surround”, is the binaural presentation of multi-channel audio signals to a listener using headphones, left/right loudspeakers or another transaural apparatus (binaural over loudspeakers). One way to carry out binauralization is to render each loudspeaker and its feeding signal as a virtual source, that is, to filter the feeding signal binaurally so that the listener perceives the real loudspeaker even when using headphones. To this end, the feeding signal is filtered with the binaural room impulse responses (BRIRs) corresponding to the position of the loudspeaker in a given room, wherein the BRIRs are determined and measured at the virtual listener position.
Generally, a room impulse response (RIR) is the response of the room to the excitation by a point source, measured at one point. Typically, to measure the RIRs of a room, the room is excited with a loudspeaker and the response is measured with a microphone at different positions. If the response is measured with microphones mounted in the ears of a dummy head, the resulting two-channel response is called a BRIR, as explained in relation to FIG. 1.
The BRIRs encode the transfer function between the respective loudspeaker and the two ears (left and right) of the listener.
An example of the binaural filtering process is represented in FIG. 1, where HiX represents the impulse response from the loudspeaker fed by the channel i signal to the X ear of the listener (X being L for left or R for right). The capital letter H stands for the frequency domain representation of the impulse response, while the small letter h stands for its time domain representation. As schematically shown in FIG. 1, the listener 100 is at a virtual position in the room, and two loudspeakers 105 (speaker 1) and 110 (speaker 2) at different positions in the room emit audio waves, which are received by the left ear (L) and the right ear (R) of the listener 100. As shown in FIG. 1, there is a pair of impulse responses H for each of the speakers 105 and 110.
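The filtering of FIG. 1 can be illustrated in the time domain as follows. This is a minimal sketch, assuming that all channel signals have the same length and that all BRIRs have the same length; the function names convolve and binauralize are hypothetical and chosen for illustration only. Each channel signal x_i is convolved with h_iL and h_iR, and the results are summed per ear.

```python
def convolve(x, h):
    """Direct-form linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, x_n in enumerate(x):
        for k, h_k in enumerate(h):
            y[n + k] += x_n * h_k
    return y


def binauralize(channels, brirs_left, brirs_right):
    """Return (left, right) ear signals.

    channels[i]    : feeding signal x_i of loudspeaker i
    brirs_left[i]  : time domain impulse response h_iL
    brirs_right[i] : time domain impulse response h_iR
    Assumption: equal-length channels and equal-length BRIRs.
    """
    out_len = len(channels[0]) + len(brirs_left[0]) - 1
    left = [0.0] * out_len
    right = [0.0] * out_len
    for x_i, h_il, h_ir in zip(channels, brirs_left, brirs_right):
        for n, v in enumerate(convolve(x_i, h_il)):
            left[n] += v
        for n, v in enumerate(convolve(x_i, h_ir)):
            right[n] += v
    return left, right


# Two loudspeakers (as in FIG. 1) with short, made-up impulse responses.
left, right = binauralize(
    channels=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    brirs_left=[[1.0, 0.5], [0.25, 0.0]],
    brirs_right=[[0.0, 1.0], [2.0, 0.0]],
)
```

The direct-form convolution is used here only for readability; a practical implementation would typically use FFT-based fast convolution.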
The signal processing involved in binauralization can lead to a high computational complexity, especially for high quality applications. The complexity stems from filtering a multi-channel input signal with the BRIRs. In particular, with BRIRs that can easily exceed tens of thousands of samples, the complexity can become extremely high. Furthermore, multi-channel architectures may comprise a high number of channels, for example 22 channels in the 22.2 speaker layout. (For the 2 Low Frequency Effect (LFE) channels, a different processing is typically used, as these channels do not contribute to the localization of sources.)
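The order of magnitude can be illustrated with a short calculation. This is a sketch under stated assumptions: direct time-domain filtering, a BRIR length of 48000 samples and a sampling rate of 48 kHz are illustrative values that are not taken from the text; only the 22 channels and the two ears are. The function name is hypothetical.

```python
def direct_convolution_macs_per_second(num_channels, brir_length, sample_rate_hz):
    """Multiply-accumulate operations per second when every channel is
    filtered directly in the time domain with a left and a right BRIR."""
    ears = 2
    macs_per_output_sample = num_channels * ears * brir_length
    return macs_per_output_sample * sample_rate_hz


# 22 channels of the 22.2 layout (LFE channels excluded),
# assumed BRIR length of 48000 samples, assumed 48 kHz sampling rate.
macs = direct_convolution_macs_per_second(22, 48000, 48000)
```

Under these assumptions the result is 101,376,000,000, that is, roughly 10^11 multiply-accumulate operations per second. Fast convolution lowers this figure considerably, but the cost still grows with the BRIR length and the number of channels.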
In order to reduce the computational complexity of binauralization applications, an impulse response in a room is usually divided into two parts, namely the direct path and early reflections (D&E) part and the reverberation tail (late part). This division is visualized in the reflectogram plot of an example RIR shown in FIG. 2. A different binauralization strategy is then used for each of the two parts.
The transition point between the D&E part and the late part is called the mixing time. The mixing time can be expressed as an actual time value (e.g. in nanoseconds (ns), milliseconds (ms) or seconds (s)) or as a sample value representing a time point. In general, the term sample time is used here to cover both expressions of the mixing time. The early reflections are a set of discrete reflections whose density increases until the individual reflections can no longer be discriminated or perceived. While the direct sound in the D&E part is a single event that can be easily identified, the early reflections and the late reverberation of an impulse response in a room are more difficult to distinguish and to label, as can be seen in the example RIR amplitude/time diagram of FIG. 3.
The estimation and determination of the mixing time is a rather well studied topic in the prior art, and several solutions have been suggested.
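The relation between the two expressions of the mixing time, and the division of an impulse response at that point, can be sketched as follows. The function names and the example values (80 ms, 48 kHz) are hypothetical and serve only as an illustration.

```python
def mixing_time_to_samples(mixing_time_s, sample_rate_hz):
    """Convert a mixing time in seconds to the nearest sample index."""
    return int(round(mixing_time_s * sample_rate_hz))


def split_rir(rir, mixing_sample):
    """Split an impulse response into the D&E part and the late part."""
    if not 0 <= mixing_sample <= len(rir):
        raise ValueError("mixing_sample must lie within the impulse response")
    return rir[:mixing_sample], rir[mixing_sample:]


# Assumed example: a mixing time of 80 ms at a sampling rate of 48 kHz.
mixing_sample = mixing_time_to_samples(0.08, 48000)
de_part, late_part = split_rir(list(range(10)), 4)
```

The sample at the mixing time itself is assigned to the late part here; this is a convention of the sketch and not something the text prescribes.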
The first group of approaches comprises model-based methods, which assume that some prior knowledge of the properties of the room, such as its volume or geometry, exists. Here, the mixing time is determined based, for example, on a threshold of the density of reflections in the room or on a threshold of the mean free path in the room. The reflection density and the mean free path can be mathematically related to room properties such that the mixing time can be computed in closed form. The limitation of the first group of approaches is that prior knowledge of the room properties is necessary. Typically, the results of these approaches are not very precise, as they are based not on the real room but only on a model of the room. The quality of the results strongly depends on the quality of the model and on how well the real room fits the model.
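Two such closed-form relations from statistical room acoustics can be sketched as follows. The sketch assumes an idealized room with volume V and surface area S, a mean free path of 4V/S, and a reflection density that grows as 4*pi*c^3*t^2/V. The function names, the speed of sound of 343 m/s, the room dimensions and the density threshold are assumptions made for illustration; the text does not specify which threshold values the prior art methods use.

```python
import math

SPEED_OF_SOUND_M_PER_S = 343.0  # assumed value for air at room temperature


def mean_free_path_time(volume_m3, surface_m2, c=SPEED_OF_SOUND_M_PER_S):
    """Time in seconds that sound needs to travel one mean free path 4V/S."""
    return 4.0 * volume_m3 / surface_m2 / c


def mixing_time_from_reflection_density(volume_m3, density_threshold_per_s,
                                        c=SPEED_OF_SOUND_M_PER_S):
    """Time in seconds at which the modeled reflection density
    4*pi*c^3*t^2/V reaches the given threshold (reflections per second)."""
    return math.sqrt(density_threshold_per_s * volume_m3 / (4.0 * math.pi * c ** 3))


# Hypothetical room of 5 m x 4 m x 3 m.
volume = 5.0 * 4.0 * 3.0
surface = 2.0 * (5.0 * 4.0 + 5.0 * 3.0 + 4.0 * 3.0)
t_mfp = mean_free_path_time(volume, surface)
t_density = mixing_time_from_reflection_density(volume, 10000.0)
```

Both results depend only on the model parameters V and S, which illustrates why the precision of such estimates is bounded by how well the model matches the real room.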
The second group of approaches is signal-based and uses a single measured RIR to estimate the mixing time. These methods rely on threshold estimation, setting for example a threshold of (Gaussian) stochasticity, a threshold of memory, a threshold of reflection detectability or a threshold of phase randomness. The mixing time is then fixed at the time (or the sample) at which a given metric falls below or rises above the given threshold. The evaluation of these approaches, however, is problematic because there is no clear definition of the mixing time.
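A threshold on Gaussian stochasticity can be sketched as follows. This is a simplified illustration and not one of the specific prior art methods: in each analysis window, the fraction of samples whose magnitude exceeds the RMS value of the window is computed and normalized by the value expected for Gaussian noise, erfc(1/sqrt(2)), which is about 0.3173. Sparse early reflections give values far below 1, whereas a dense reverberation tail gives values near 1. The function names, the window length, the hop size and the threshold are assumptions, and the decay of the tail within a window is ignored.

```python
import math
import random

# Expected fraction of Gaussian samples lying more than one standard
# deviation away from zero, about 0.3173.
GAUSSIAN_FRACTION = math.erfc(1.0 / math.sqrt(2.0))


def gaussianity(frame):
    """Normalized fraction of samples whose magnitude exceeds the frame RMS."""
    rms = math.sqrt(sum(v * v for v in frame) / len(frame))
    if rms == 0.0:
        return 0.0
    outside = sum(1 for v in frame if abs(v) > rms)
    return outside / len(frame) / GAUSSIAN_FRACTION


def estimate_mixing_sample(rir, window=1024, hop=64, threshold=1.0):
    """Start index of the first window whose gaussianity reaches the
    threshold, or None if no window reaches it."""
    for start in range(0, len(rir) - window + 1, hop):
        if gaussianity(rir[start:start + window]) >= threshold:
            return start
    return None


# Synthetic example: sparse impulses followed by Gaussian noise.
rng = random.Random(0)
sparse_part = [1.0 if n % 128 == 0 else 0.0 for n in range(2048)]
noise_part = [rng.gauss(0.0, 1.0) for _ in range(4096)]
estimate = estimate_mixing_sample(sparse_part + noise_part, threshold=0.8)
```

Because the window already fills with noise before its start index reaches the true transition, the returned index tends to lie somewhat before the transition; the choice of reference point within the window is a design decision of this sketch.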
In order to have a meaningful reference, several prior art studies perform a perceptual analysis of RIRs in order to define a perceptual mixing time in subjective listening tests. Such studies typically exploit multiple RIRs measured in the same room at different positions. In some cases, model-based estimators, signal-based estimators and perceptual estimations are merged using regression methods. Generally, the statistical approaches have limited consistency and deliver non-robust estimates of the mixing time. The statistical methods tend to provide a noisy detection curve, so that applying a threshold to such curves is error-prone: small variations of the curve lead to large variations of the mixing time estimate. Furthermore, down-sampled subband domain representations of the RIRs or the BRIRs, obtained with techniques such as the Quadrature Mirror Filter (QMF), are required for the Moving Picture Experts Group (MPEG) binauralization framework. Signal-based algorithms have not been evaluated in such a context so far. However, considering their limited robustness on full-band RIRs, it is reasonable to assume that their performance will not be adequate in the down-sampled subband domain: shorter analysis windows (the length of the window, typically 1024 samples, divided by the number of subbands, typically 64) may lead to statistical inaccuracy, and changes in the fine structure when passing from the full-band RIR to the down-sampled subband RIR may lead to estimation inaccuracy.
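The effect of the shorter analysis window can be illustrated numerically. This is a sketch under the assumptions that the detection metric is a proportion estimated from the samples of one window and that these samples are statistically independent; neither assumption is taken from the text, and the function names are hypothetical. Only the window length of 1024 samples and the 64 subbands come from the passage above.

```python
import math


def subband_window_length(fullband_window, num_subbands):
    """Window length in samples after down-sampling by the number of subbands."""
    return fullband_window // num_subbands


def relative_standard_error(num_samples, proportion):
    """Relative standard error of a proportion estimated from
    num_samples independent samples (binomial model)."""
    return math.sqrt(proportion * (1.0 - proportion) / num_samples) / proportion


subband_window = subband_window_length(1024, 64)
# The proportion value 0.3 is an arbitrary illustrative choice.
error_ratio = (relative_standard_error(subband_window, 0.3)
               / relative_standard_error(1024, 0.3))
```

With these values the subband window holds 16 samples, and the relative error of the estimated proportion grows by a factor of 8, the square root of 64, irrespective of the proportion chosen.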