The present invention relates to processing signals, in particular audio signals, in the frequency domain.
In many fields of signal processing, filter characteristics are changed at runtime. Frequently, a gradual, smooth transition is necessitated here to prevent switching artifacts (for example, discontinuities in the signal path or, in the case of audio signals, audible click artifacts). This may be performed either by continuously interpolating the filter coefficients or by simultaneously filtering the signal with two filters and subsequently gradually crossfading the filtered signals. For linear filters, both methods provide identical results. This functionality will be referred to as “crossfading” below.
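The equivalence of the two crossfading variants follows from the linearity of FIR filtering. A minimal sketch (our own illustration, not code from the source; signal and filter contents are arbitrary) compares both variants at a fixed crossfade gain:

```python
# Sketch: for linear FIR filtering, interpolating the coefficients is
# equivalent to crossfading the two filtered output signals.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(256)        # input signal (arbitrary test data)
h_old = rng.standard_normal(16)     # outgoing FIR filter
h_new = rng.standard_normal(16)     # incoming FIR filter
g = 0.3                             # crossfade gain at some instant

# Variant 1: interpolate the filter coefficients, then filter once.
y1 = np.convolve(x, (1 - g) * h_old + g * h_new)

# Variant 2: filter with both filters, then crossfade the outputs.
y2 = (1 - g) * np.convolve(x, h_old) + g * np.convolve(x, h_new)

assert np.allclose(y1, y2)
```

By the same linearity argument, the equivalence also holds sample by sample when the gain g varies over time during the transition.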
When filtering by FIR filters, which is also referred to as linear convolution, considerable increases in performance can be achieved by using fast convolution algorithms. These methods operate in the frequency domain on a block-by-block basis. Frequency-domain convolution algorithms, such as Overlap-Add and Overlap-Save (among others, [8], [9]), partition only the input signal, but not the filter, and consequently use large FFTs (Fast Fourier Transforms), resulting in high latencies when filtering. Partitioned convolution algorithms, partitioned either uniformly [10], [11] or non-uniformly [12], [13], [20], also divide the filters (i.e., their impulse responses) into smaller segments. By applying frequency-domain convolution to these partitions and delaying and combining the results accordingly, a good trade-off between the FFT size used, latency and complexity can be achieved.
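To make the block-by-block mode of operation concrete, the following is a minimal Overlap-Save sketch (our own illustration under simplifying assumptions: unpartitioned filter, block length B, FFT size 2B; the function name is ours):

```python
# Minimal Overlap-Save fast convolution: process the input in overlapping
# blocks, multiply spectra, and discard the time-aliased part of each block.
import numpy as np

def overlap_save(x, h, B=64):
    """Linear convolution of x with FIR filter h (len(h) <= B), FFT size 2*B."""
    N = 2 * B                          # FFT size
    assert len(h) <= B
    H = np.fft.rfft(h, N)              # zero-padded filter spectrum (computed once)
    n_out = len(x) + len(h) - 1
    xp = np.concatenate([np.zeros(B), x, np.zeros(N)])  # leading overlap + tail padding
    y = np.zeros(n_out)
    for start in range(0, n_out, B):
        block = xp[start:start + N]    # each block overlaps the previous one by B
        yb = np.fft.irfft(np.fft.rfft(block, N) * H, N)
        valid = yb[B:]                 # first B samples are circularly aliased; drop them
        m = min(B, n_out - start)
        y[start:start + m] = valid[:m]
    return y
```

The filter spectrum H is computed once and reused for every block, which is exactly why frequent filter changes are expensive for these methods.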
However, it is common to all methods of fast convolution that they are difficult to combine with gradual filter crossfading. On the one hand, this is due to the block-by-block mode of operation of these algorithms. On the other hand, interpolating intermediate values between different filters, as they arise during a transition, would result in a considerably increased computing burden, since each interpolated filter set first has to be transformed to a form suitable for the fast convolution algorithm (this usually necessitates segmentation, zero padding and an FFT operation). For “smooth” crossfading, these operations have to be performed quite frequently, thereby considerably reducing the performance advantage of fast convolution.
Solutions described so far may particularly be found in the field of binaural synthesis. In [5], the filter coefficients of the FIR filters are interpolated, followed by a convolution in the time domain (remark: the gradual exchange of filter coefficients in this publication is referred to as “commutation”). [14] describes crossfading between FIR filters by applying two fast convolution operations, followed by crossfading in the time domain. [16] deals with exchanging filter coefficients in non-uniformly partitioned convolution algorithms; both crossfading and exchange strategies for the partitioned impulse response blocks (aiming at gradual crossfading) are considered there.
From an algorithmic point of view (however, for a different application), a method described in [18] for post-smoothing a spectrum obtained by the FFT comes closest to the solution described here. There, applying a special time-domain window (of a cosine type, such as, for example, a Hann or Hamming window) is implemented by a convolution in the frequency domain using a frequency-domain windowing function of only 3 elements. Crossfading or fading signals in or out is not provided for there as an application. In addition, the method described there is based on fixed 3-element frequency-domain windows derived from windows known in DSP, and exhibits no flexibility to adjust the complexity and quality of the approximation to a predetermined window function (and, consequently, neither does the design method for the sparsely occupied window functions). Moreover, [18] considers neither using the Overlap-Save method nor the possibility of leaving certain parts of the time-domain window function unspecified.
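The underlying identity can be illustrated as follows (our own sketch, not code from [18]): multiplying a length-N block by a periodic Hann window in the time domain equals a circular convolution of its DFT with the 3-element frequency-domain window (-0.25, 0.5, -0.25).

```python
# Hann windowing realized as a 3-tap circular convolution in the frequency domain.
import numpy as np

N = 128
x = np.random.default_rng(3).standard_normal(N)
X = np.fft.fft(x)

# Time-domain reference: apply the periodic Hann window directly.
w = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / N)
ref = np.fft.fft(w * x)

# Frequency-domain version: circularly convolve the spectrum with
# the 3-element window [-0.25, 0.5, -0.25].
Y = 0.5 * X - 0.25 * np.roll(X, 1) - 0.25 * np.roll(X, -1)

assert np.allclose(Y, ref)
```

The 3 nonzero coefficients arise because the periodic Hann window consists of a constant plus a single complex-exponential pair, so its DFT occupies only bins 0, 1 and N-1.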
Binaural synthesis allows a realistic reproduction of complex acoustic scenes via headphones and is applied in many fields, such as, for example, immersive communication [1], auditory displays [2], virtual reality [3] or augmented reality [4]. Rendering dynamic acoustic scenes, in which head movements of the listeners are also taken into account, improves the localization quality, realism and plausibility of binaural synthesis considerably, but also increases the computing complexity of rendering. Another commonly applied way of improving localization precision and naturalness is adding spatial reflections and reverberation effects, for example [1], [5], for instance by calculating a number of discrete reflections for each sound object and rendering these as additional sound objects. Again, such techniques increase the complexity of binaural rendering considerably. This emphasizes the importance of efficient signal processing techniques for binaural synthesis.
The general signal flow of a dynamic binaural synthesis system is shown in FIG. 4. The signals of the sound objects are filtered by the head-related transfer functions (HRTFs) of both ears. A summation of these contributions provides the signals of the left and right ears, which are reproduced by headphones. HRTFs describe the sound propagation from the source position to the ear drum and vary in dependence on the relative position, i.e., on azimuth, elevation and, within certain limits, also on distance [6]. Thus, dynamic sound scenes necessitate filtering using temporally varying HRTFs. Generally, two techniques which are mutually related, but separate, are necessitated in order to implement such temporally varying filters: HRTF interpolation and filter crossfading. In this context, interpolation refers to determining HRTFs for a certain source position, which is usually indicated by azimuth and elevation coordinates. Since HRTFs are usually provided in databases of finite spatial resolution, for example [7], this includes selecting a suitable sub-set of HRTFs and interpolating between these filters [3], [6]. Filter crossfading, which in [5] is referred to as “commutation”, allows a smooth transition, distributed over a certain transition time, between these, potentially interpolated, HRTFs. Such gradual transitions are necessitated in order to avoid audible signal discontinuities, such as, for example, click noises. The present document focuses on the crossfading process.
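The filter-and-sum structure of FIG. 4 can be sketched as follows (a minimal time-domain illustration with hypothetical names of our own; it assumes the head-related impulse responses are given per source and ignores interpolation and crossfading):

```python
# Per-source HRIR filtering followed by summation into left/right ear signals.
import numpy as np

def binaural_mix(sources, hrirs_left, hrirs_right):
    """sources, hrirs_left, hrirs_right: matching lists of 1-D arrays."""
    n = max(len(s) + max(len(hl), len(hr)) - 1
            for s, hl, hr in zip(sources, hrirs_left, hrirs_right))
    left = np.zeros(n)
    right = np.zeros(n)
    for s, hl, hr in zip(sources, hrirs_left, hrirs_right):
        yl = np.convolve(s, hl)      # source filtered by left-ear HRIR
        yr = np.convolve(s, hr)      # source filtered by right-ear HRIR
        left[:len(yl)] += yl         # sum contributions per ear
        right[:len(yr)] += yr
    return left, right
```

In a dynamic scene, the HRIRs in this loop change over time, which is precisely where interpolation and crossfading enter.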
Due to the typically large number of sound objects, filtering the source signals with the HRTFs contributes considerably to the complexity of binaural synthesis. A suitable way of decreasing this complexity is applying frequency-domain (FD) convolution techniques, such as the Overlap-Add or Overlap-Save methods [8], [9], or partitioned convolution algorithms, for example [10] to [13]. A common disadvantage of all FD convolution methods is that exchanging filter coefficients or gradually transitioning between filters is more restricted and usually necessitates a higher computing complexity than crossfading between time-domain filters. On the one hand, this may be attributed to the block-based mode of operation of these methods. On the other hand, the requirement of transferring the filters to a frequency-domain representation entails a considerable reduction in performance when filters change frequently. Consequently, a typical solution for filter crossfading comprises two FD convolution processes using different filters and subsequently crossfading the outputs in the time domain.
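This typical solution can be sketched as follows (our own minimal illustration with hypothetical function names; a single-shot FFT convolution stands in for the FD convolution stage, and the crossfade gain ramps linearly over the whole output):

```python
# Two frequency-domain convolutions with different filters, followed by a
# gradual crossfade of the two output signals in the time domain.
import numpy as np

def fft_convolve(x, h):
    """Linear convolution via zero-padded FFTs (stand-in for an FD convolver)."""
    n = len(x) + len(h) - 1
    return np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)

def crossfade_filters(x, h_old, h_new):
    """Filter x with both filters and crossfade the outputs in the time domain."""
    y_old = fft_convolve(x, h_old)
    y_new = fft_convolve(x, h_new)
    g = np.linspace(0.0, 1.0, len(y_old))   # crossfade gain ramp
    return (1 - g) * y_old + g * y_new
```

Note that both FD convolutions run for the full transition, which is the doubled workload this document's approach seeks to avoid.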