Audio scene analysis is important in augmented reality applications. In augmented reality applications additional layers such as additional audio layers or visual layers can be overlaid upon the user's own senses to provide a richer and more information laden environment for the user to explore. One part of audio scene analysis is spatial audio scene estimation and context extraction whereby the environment surrounding the user and the device is analysed in order that the additional layer when overlaid does not distract the user but instead provides a synergistic effect when perceived by the user.
To avoid being distracting, augmented reality audio layers must be aligned to the current context of the user. That is, when an artificial audio source is added to the audio scenery, the content must not sound unnatural and should provide a user experience that is as natural as possible. This is not the case if, for example, the source reverberation of the augmented reality audio layer differs from that of the audio scenery of the environment surrounding the user and device. For example, where the user is in a highly reverberant subway station, the augmented content also needs to have reverberation in order not to sound unnatural. To accomplish this, the augmentation engine requires an accurate estimate of the given audio scenery, including a reliable reverberation estimate.
Audio scene analysis can thus, for example, feature parameter estimation such as estimating the reverberation time of the acoustic environment surrounding the device. Estimating the reverberation time can be a challenge even for acoustic experts, because reliable estimation in real-time applications is difficult, particularly on mobile devices with limited audio capture and computational resources. For example, the estimation of reverberation time is typically computationally heavy, requiring extensive processor power to produce real-time results.
The determination of the reverberation time is a fundamental cue not only in preparing audio scenery for example to augment audio content but also in audio processing and audio capture in real-time communication, for example in teleconferencing.
Audio processing functionality and performance, for example in a handsfree operation and especially for teleconference equipment can be improved when the audio context of the meeting room is known. For example noise suppression and audio beamforming algorithms can be tuned when the room reverberation time is known with sufficient accuracy.
Reverberation estimates have typically been produced using mono audio systems, whereby decaying audio events are detected in a received signal and the reverberation time is calculated from each such event. In some cases the estimator detects an impulse type sound event whose decaying tail reveals the reverberation conditions of the environment. Some estimators can also detect signals which are slowly decaying by nature, in which case the observed decay rate is a combination of both the source signal decay and the environmental reverberation decay. The reverberation estimator typically assumes that the observed decay rate provides an upper bound for the reverberation parameter. In other words, when the decay rate of the actual source signal is not known, the true reverberation time of the given space cannot be higher than the parameter estimated from the observed event. However, finding a decaying signal tail is not straightforward. Especially where the signal is continuous, the reverberation tail or decay may be short and hidden within the short term signal structure and background noise.
Reverberation time estimators typically record a representative audio signal or monitor a given audio image. The received audio content is then analysed either within the device capturing the audio signal or the signal is transmitted to a more computationally complex device to conduct the reverberation time analysis and estimation.
Typically reverberation time (RT) is defined as the time taken by sound to decay 60 decibels (dB) below the initial level. The decay constant τ is related to the reverberation time using the equation RT=6.91 τ.
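As a worked check of the constant 6.91, assume (the text does not state this explicitly) that the amplitude envelope of the decaying sound follows e^(−t/τ). Requiring a 60 dB drop of that envelope then gives:

```latex
20 \log_{10}\!\left(e^{-RT/\tau}\right) = -60
\;\Longrightarrow\;
\frac{RT}{\tau} = 3 \ln 10 \approx 6.91
\;\Longrightarrow\;
RT \approx 6.91\,\tau
```

Under a different convention for τ (for example, one defined on the energy envelope rather than the amplitude envelope) the constant would differ.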
Two approaches have been proposed for estimating the reverberation time of a given space using only the available audio recordings. The first approach assumes that the recorded audio is a function of the original sound source and the room response of the space, including the reverberation. In this case the recorded signal can be written as y(n) = Σ_k g(k) x(n−k) + v(n), where x(n) is the true sound source signal, g(k) is the room model and v(n) is the measurement noise. Since the estimation process has no knowledge of the true sound source, in other words the measurement is not taken on a sound source supplied for test purposes, the method is typically called “blind estimation”.
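The signal model above can be illustrated with a minimal Python sketch. The function name `record`, the `noise_std` and `seed` parameters and the Gaussian noise model are assumptions made for illustration only; the text does not specify how v(n) is distributed.

```python
import random

def record(x, g, noise_std=0.0, seed=0):
    """Return y(n) = sum_k g(k) * x(n - k) + v(n) for n = 0 .. len(x) + len(g) - 2.

    x: source samples, g: room response taps.
    v(n) is drawn as zero-mean Gaussian noise (an assumption, not from the text).
    """
    rng = random.Random(seed)
    y = []
    for n in range(len(x) + len(g) - 1):
        acc = 0.0
        for k, gk in enumerate(g):
            if 0 <= n - k < len(x):      # skip taps that fall outside the source
                acc += gk * x[n - k]
        y.append(acc + rng.gauss(0.0, noise_std))
    return y

# With an impulse as the source and no noise, the recording is the room response itself.
impulse_recording = record([1.0, 0.0, 0.0], [1.0, 0.5, 0.25])
```

In a blind setting only y(n) is available, so neither x(n) nor g(k) can be passed in as they are here; the sketch only shows how the three terms combine.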
To find the reverberation time, the recorded signal is searched for decaying tails. The energy level of the signal is determined over short frames of the audio signal, and the beginning of an audio event is declared when the short term energy level exceeds the average energy level. The frames that follow the beginning of an audio event are then stored in a buffer until the corresponding energy levels drop below the average background level. The audio event is considered to have ended when the frame energy falls below the long term average energy value. The buffered audio signal can then be analysed as the decaying tail of an audio event. The start time (Ts) of the decaying tail is determined by detecting the location after which the signal energy starts to decay or, according to some examples, by using coherence information of the audio signal. The end time (Te) of the event can be determined as the point at which the energy level falls below the background noise level. When the start and end points are available, a method such as that defined in Schroeder (M. R. Schroeder “A new method of measuring a reverberation time”, Journal of The Acoustical Society of America, Vol. 37, 1965) can be applied to calculate the reverberation time.
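The frame-based event detection described above can be sketched as follows. The function names and the non-overlapping framing are hypothetical choices, and the long term average is computed here over the whole recording, which is an assumption; a real-time estimator would more likely use a running average.

```python
def frame_energies(signal, frame_len):
    """Mean energy of consecutive, non-overlapping frames of frame_len samples."""
    return [sum(s * s for s in signal[i:i + frame_len]) / frame_len
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def find_event(energies):
    """Return (start, end) frame indices of the first audio event, or None.

    The event starts at the first frame whose energy exceeds the average
    and ends at the first later frame whose energy falls below it.
    """
    avg = sum(energies) / len(energies)   # long term average (assumed: whole recording)
    start = None
    for i, e in enumerate(energies):
        if start is None:
            if e > avg:
                start = i                 # beginning of the audio event
        elif e < avg:
            return start, i               # energy dropped below the average
    return None if start is None else (start, len(energies))

event = find_event([0.1, 0.1, 5.0, 3.0, 1.0, 0.2, 0.1])
```

The frames between the returned indices would be the buffer that is subsequently analysed as a decaying tail.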
The average of the squared decaying sound pressure at a point in a room excited by filtered white noise is equal to a certain integral over the squared impulse response g^2(t). Hence the decay of the audio event can be calculated as an integral of the squared room response:
$$ d^2(t) = N \int_{t}^{T_e} g^2(t)\, dt $$
The room impulse response can thus be determined by using equipment that plays back band pass random pulses and records the corresponding audio in a given room. In practice, however, the true signal that causes the detected audio event can be considered an impulse, so the recorded signal can be used directly as the room response signal. In the above equation, N is considered to be proportional to the power spectral density of the noise in the measurement, and the integration lower limit takes the values t = Ts, . . . , Te.
The decay rate of the decaying tail of the given audio event can now be determined by fitting a line to the resulting curve d^2(t) within the interval t = Ts, . . . , Te. When the time difference between the start and end points is known together with the decay rate, the decay time τ needed for a 60 dB drop in the signal energy can be determined.
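The backward integration and line fit can be sketched in Python as below. This is an illustration under stated assumptions rather than the method of the cited paper: the function names are hypothetical, the tail is treated directly as the room response (as the text suggests for impulse-like events), the constant N is dropped because it only shifts the curve vertically on a dB scale, and the fit is restricted to the −5 dB to −25 dB portion of the curve, a common practical choice that avoids the bend caused by truncating the integral at Te.

```python
import math

def schroeder_curve(tail):
    """Backward-integrated energy: curve[t] = sum of tail[u]**2 for u >= t."""
    curve, acc = [], 0.0
    for s in reversed(tail):
        acc += s * s
        curve.append(acc)
    curve.reverse()
    return curve

def rt60_from_tail(tail, fs, hi_db=-5.0, lo_db=-25.0):
    """Estimate the 60 dB decay time in seconds from a decaying tail sampled at fs Hz."""
    curve = schroeder_curve(tail)
    pts = []
    for i, c in enumerate(curve):
        if c <= 0.0:
            break                                  # log of zero energy is undefined
        level = 10.0 * math.log10(c / curve[0])    # decay level relative to the start, in dB
        if lo_db <= level <= hi_db:
            pts.append((i / fs, level))
    if len(pts) < 2:
        raise ValueError("not enough points in the fitting range")
    t_mean = sum(t for t, _ in pts) / len(pts)
    l_mean = sum(l for _, l in pts) / len(pts)
    # Least-squares slope of level versus time, in dB per second.
    slope = (sum((t - t_mean) * (l - l_mean) for t, l in pts)
             / sum((t - t_mean) ** 2 for t, _ in pts))
    return -60.0 / slope

# Synthetic tail whose amplitude envelope decays by 60 dB in 0.5 s.
fs = 1000
tau_samples = 0.5 * fs / (3.0 * math.log(10.0))
tail = [math.exp(-n / tau_samples) for n in range(fs)]
estimate = rt60_from_tail(tail, fs)
```

Even in this compact form the estimate needs a full pass over the buffered tail plus a logarithm per sample, which hints at the cost on a constrained device.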
However, as described above, this approach to estimating the decay time (and from this the reverberation time RT) is computationally complex and requires significant processing within the device.
In the second approach, the reverberation time is calculated by applying further model information, whereby the decaying tail of an audio event is modelled as a function of a decay factor, y(n) = a(n)^n x(n), in which y(n) is the recorded audio signal, x(n) is the audio source signal and a(n) is the decay factor defined in the range a(n) = [0 . . . 1). In other words, a(n) approaches unity asymptotically but never reaches it. In such a model the mapping between the decay factor a(n) and the reverberation time can be defined through the decay constant as a(n) = e^(−1/τ(n)), which cannot reach unity for any finite value of τ.
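The mapping between the decay factor and the reverberation time can be illustrated with a short Python sketch. It assumes that τ is expressed in samples (since the model is indexed by the sample number n) and that a sample rate `fs` is available to convert to seconds; both are assumptions, as are the function names. The constant 6.91 is the one given earlier for RT = 6.91 τ.

```python
import math

def decay_factor(tau_samples):
    """a = exp(-1 / tau); stays strictly below 1 for any finite positive tau."""
    return math.exp(-1.0 / tau_samples)

def rt_seconds(a, fs):
    """Reverberation time in seconds for a decay factor a in (0, 1) at sample rate fs."""
    tau_samples = -1.0 / math.log(a)     # invert a = exp(-1 / tau)
    return 6.91 * tau_samples / fs       # RT = 6.91 * tau, converted from samples to seconds

a = decay_factor(1000.0)                 # decay constant of 1000 samples
rt = rt_seconds(a, 8000.0)               # interpreted at an assumed 8 kHz sample rate
```

Because a lies very close to unity for realistic rooms, small errors in the estimated a translate into large errors in the reverberation time, which is one reason the estimate needs a long, clean decaying tail.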
The problem with both methods is that they require significant processing capability, which is not typically available on a mobile device. Furthermore, even at low sampling rates and with critically sampled band pass domain estimation, the requirement in terms of instruction processing to generate a signal estimate is high.
In order to obtain a reliable estimate and detect a suitable audio event containing a proper decaying tail, the analysis needs to be conducted over several seconds, so a significant amount of sampled data has to be stored even before processing occurs. It has been proposed that a collaborative context analysis be used, in which the detected audio component, in other words the recorded audio signal, is provided to a more sophisticated device, such as another mobile device with more computational power or a server providing a corresponding reverberation time estimation service. In such proposals the audio signal is conveyed to the more sophisticated device as part of a communication. However, such a process requires an initial encoding so that the signal can be transmitted and then a subsequent decoding, with associated further processing requirements, even before the analysis is started.
As such, there appear to be significant problems with implementing either of the above reverberation estimation techniques.