Binaural hearing is a field of research that aims to understand the mechanisms allowing human beings to perceive the spatial origin of sounds. Based on the postulate that the morphology of an individual is what allows him to determine the spatial origin of sounds, it is in particular recognized in this field that elements of paramount importance are the position and shape of the ears of an individual. Specifically, the ears act as directional frequency filters on sounds that reach them.
Although the relationships between morphology and audition have been studied for a very long time, over the last twenty-five years a growing interest has been observed among the scientific community in the problem of customization, i.e. of how to take into account individual-specific attributes.
In particular, attention has been given to the customization of HRTFs, mathematical representations of the frequency coloration of the sounds that we perceive. The expression “frequency coloration” is understood to mean variations in audio-signal power spectral density. The spectra of white, pink or even gray noise are examples thereof. Many methods are now known, which may be classified into two broad families: synthetic methods, which aim to calculate or recreate sets of HRTFs; and adaptive methods, which aim to discover, from a given set of HRTFs, possibly at the cost of minor transformations, the transfer function most suited to an individual.
Among synthetic methods, mention may first be made of exact calculations and of probabilistic and statistical approaches.
Developed over more than twenty years, the family of finite-element methods aims to model and then solve the problem, expressed in the form of partial differential equations, of the propagation of sound from its source to the eardrum of the subject. This family in particular contains the following methods: the direct boundary element method (DBEM); the indirect boundary element method (IBEM); the infinite/finite element method (IFEM); and the fast-multipole boundary element method (FM-BEM).
Reputed to offer exact solutions to the addressed problem, these methods nevertheless have several notable drawbacks. Firstly, a 3D mesh of the subject must be generated. Although this is not a problem per se, the higher the frequencies at which the HRTFs are to be calculated, the finer the mesh must be; and as the fineness of the mesh increases (i.e. as the reliability desired for the high-frequency results increases), calculation time also increases and rapidly becomes prohibitive. The expression “high frequencies” is understood to mean frequencies above 4 kHz. Lastly, physically modelling the problem requires, a priori, many approximations. Thus, each surface is attributed a specific impedance (quantifying absorption/reflection effects), the value of which is empirical. Likewise, hair is conventionally modelled by a surface with an impedance different from that of the skin, a model that obviously does not take into account the bulky nature of hair.
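The way mesh size grows with the target frequency can be illustrated with a rough calculation. The sketch below is not taken from any of the cited methods: it assumes a common rule of thumb of about six elements per wavelength, a speed of sound of 343 m/s, equilateral triangular elements, and a hypothetical meshed surface area of 0.15 m². All function names are illustrative.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees C


def max_edge_length(f_max_hz, elements_per_wavelength=6):
    """Largest element edge (in metres) under the assumed rule of thumb."""
    wavelength = SPEED_OF_SOUND / f_max_hz
    return wavelength / elements_per_wavelength


def approx_element_count(surface_area_m2, f_max_hz, elements_per_wavelength=6):
    """Rough number of equilateral triangles needed to tile the surface."""
    edge = max_edge_length(f_max_hz, elements_per_wavelength)
    triangle_area = math.sqrt(3) / 4 * edge ** 2
    return math.ceil(surface_area_m2 / triangle_area)


# Hypothetical surface area of 0.15 m^2 (illustrative only)
for f_max in (4000, 8000, 16000):
    edge_mm = max_edge_length(f_max) * 1000
    print(f_max, "Hz:", round(edge_mm, 2), "mm edge,",
          approx_element_count(0.15, f_max), "elements")
```

Under these assumptions, doubling the highest frequency of interest roughly quadruples the number of surface elements, before even accounting for the cost of solving the resulting system.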
An alternative approach to direct calculation of HRTFs consists in determining the main modes of variation from a representative set of real HRTFs.
This is in particular what Sylvain Busson did in his work (“Individualisation d'Indices Acoustiques pour la Synthèse Binaurale” [Customization of Acoustic Indices for Binaural Synthesis]; PhD thesis, Université de la Méditerranée-Aix-Marseille II, 2006) on artificial neural networks (ANNs). The idea studied in this thesis was that of predicting HRTFs on the basis of measurements of a limited number thereof. This was in particular done by jointly implementing a self-organizing map and an ascending hierarchical classification (AHC), before selecting representative HRTFs. Subsequently, a three-layer multi-layer perceptron (MLP) neural network was constructed, and the representative HRTFs of 44 subjects from the CIPIC database were used as the learning set. Although promising, this work neither found any universal representatives, i.e. representatives common to all individuals, nor presented a psycho-acoustic validation of the results. In addition, it remains necessary to make provision for a way of accessing said representatives.
Statistical methods for synthesizing HRTFs may, as a variant, be based on principal components analysis (PCA).
Kistler and Wightman (“A model of head-related transfer functions based on principal components analysis and minimum-phase reconstruction”; The Journal of the Acoustical Society of America, 91(3):1637-1647, 1992) were the first to suggest decomposing HRTFs using this method. The set of HRTFs is then considered to be a vector subspace of the measurement space. Knowledge of a basis of this subspace then allows any member thereof, i.e. any HRTF, to be determined via a simple linear combination of basis vectors. This is what PCA makes possible by delivering an orthonormal basis of the space spanned by the learning HRTFs. The last step in solving the customization problem then consists in finding the relationship between the morphological parameters of individuals and the reconstruction coefficients associated with the eigenvectors of the basis. To do this, multiple linear regressions are conventionally used.
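The two steps just described, decomposition onto an orthonormal basis followed by regression from morphology to reconstruction coefficients, can be sketched as follows. This is a minimal illustration on synthetic data, not the procedure of Kistler and Wightman: the array sizes, the random "anthropometry" and "HRTF" data, and the name predict_hrtf are all assumptions made for the example. PCA is computed here via a singular value decomposition, and the regression via ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): 40 subjects, 64 frequency bins of
# log-magnitude HRTF for a single direction, 5 morphological parameters.
n_subjects, n_bins, n_params = 40, 64, 5
anthropometry = rng.normal(size=(n_subjects, n_params))
hidden_map = rng.normal(size=(n_params, n_bins))
hrtfs = anthropometry @ hidden_map + 0.01 * rng.normal(size=(n_subjects, n_bins))

# Step 1: PCA of the mean-centred learning HRTFs via SVD.
mean_hrtf = hrtfs.mean(axis=0)
centred = hrtfs - mean_hrtf
_, _, vt = np.linalg.svd(centred, full_matrices=False)
n_components = 5
basis = vt[:n_components]        # orthonormal rows, shape (5, 64)
weights = centred @ basis.T      # reconstruction coefficients per subject

# Step 2: multiple linear regression from morphology to coefficients.
design = np.column_stack([np.ones(n_subjects), anthropometry])
coeffs, *_ = np.linalg.lstsq(design, weights, rcond=None)


def predict_hrtf(params):
    """Estimate an HRTF from morphological parameters alone."""
    predicted_weights = np.concatenate([[1.0], params]) @ coeffs
    return mean_hrtf + predicted_weights @ basis


error = np.max(np.abs(predict_hrtf(anthropometry[0]) - hrtfs[0]))
print("max reconstruction error for subject 0:", error)
```

The synthetic data are linear in the parameters by construction, so the regression fits almost perfectly here; real HRTFs offer no such guarantee, which is precisely the difficulty discussed in this section.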
On the basis of the work of Kistler & Wightman, Xu et al. (Song Xu, Zhizhong Li, and Gavriel Salvendy: “Improved method to individualize head-related transfer function using anthropometric measurements”; Acoustical Science and Technology, 29(6):388-390, 2008) suggested grouping the HRTFs of the various measured individuals depending on specified direction (azimuth, elevation) before performing the PCA (one per group), with the aim of thus reducing estimation errors.
Zhang et al. (M. Zhang, R. A. Kennedy, and T. D. Abhayapala; “Statistical method to identify key anthropometric parameters in hrtf individualization”; in Joint Workshop on Hands-free Speech Communication and Microphone Arrays, 2011) for their part suggested a statistical method for estimating the most relevant anthropometric parameters for implementation of the regression step.
In 2007, Vast Audio Pty Ltd filed a patent (G. Jin, P. Leong, J. Leung, S. Carlile, and A. Van Schaik; “Generation of customized three dimensional sound effects for individuals”, Apr. 24, 2007, U.S. Pat. No. 7,209,564) inspired by these ideas. This patent first describes the creation of an HRTF database and of a database of morphological parameters. Next, mention is made of the use of a method of statistical analysis to decompose the HRTF and parameter spaces into elementary components, in the manner made possible by PCA. Subsequently, using another method of statistical analysis, relationships between the reconstruction coefficients of the morphological parameters and those of the HRTFs are determined.
Each method proposed up to now has generally improved on the results of prior methods, without however producing an outcome that is completely satisfactory from the psycho-acoustic point of view, i.e. under real conditions. In particular, the number and location of the required morphological parameters remain very imprecisely defined. In addition, in the case of simultaneous analysis of morphology and HRTFs, discovery of the relationships between the coefficients of the two spaces is all the more complex if the data are left in raw form.
Another type of synthetic method notable for its innovative character is the reconstruction of HRTFs using a Bayesian approach. It was suggested by Hofman & Van Opstal (Paul M Hofman and A John Van Opstal; “Bayesian reconstruction of sound localization cues from responses to random spectra”, Biological Cybernetics, 86(4):305-316, 2002), who wanted to recreate potential HRTFs on the basis of a probabilistic analysis of the responses of studied subjects to very precise stimuli. More particularly, the idea was to make subjects listen to sounds convolved with filters mimicking the types of variations observable in actual HRTFs, the sounds being emitted by a loudspeaker located directly in front of the subjects. The subjects were asked to direct their gaze toward the direction from which the sound seemed to be coming.
Although innovative, this method has many drawbacks, such as the time required to perform the experiment and the inability to study HRTFs for sounds corresponding to positions outside of the subject's field of gaze, since the subject is required to indicate with his eyes the directions from which the sounds seem to be coming.
Whereas the aforementioned synthetic methods aim to create new sets of HRTFs from scratch (without however ever having observed real examples thereof, contrary to finite-element methods), adaptive methods aim to model actual examples as closely as possible. The underlying idea consists in performing measurements on actual subjects in order to obtain sets of HRTFs that are valid for at least one person. These sets therefore necessarily contain a sufficient number of localization indices to be usable, something that synthetic methods cannot guarantee.
Selective methods make no alterations to the measurements; their common principle is the selection of one set of HRTFs from a plurality according to certain criteria. The latter are most often psycho-acoustic, without however being limited thereto.
With respect to psycho-acoustic criteria, mention will first be made of the work by Shimada et al. (Shoji Shimada, Nobuo Hayashi, and Shinji Hayashi; “A clustering method for sound localization transfer functions”, Journal of the Audio Engineering Society, 42(7/8):577-584, 1994). Starting with a substantial database of HRTFs, said authors grouped similar HRTFs together. To do this, a 16-coefficient cepstral decomposition was performed. The Euclidean distance naturally associated with this 16-dimensional space then allowed the HRTFs to be grouped into clusters (eight in number). Sets of HRTFs were then randomly chosen within the clusters, and subjects were invited to choose the one or more clusters that gave them the best impression of externality and directivity.
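The grouping step described above can be sketched as follows. This is only an illustration under stated assumptions: the clustering algorithm actually used by Shimada et al. is not specified here, so a plain k-means with Euclidean distance is swapped in; the coefficients are taken as the first 16 values of the real cepstrum; and the magnitude responses are random placeholders. The function names are hypothetical.

```python
import numpy as np


def cepstral_coefficients(magnitude, n_coeffs=16):
    """First n_coeffs real-cepstrum values of a one-sided magnitude response."""
    log_magnitude = np.log(np.maximum(magnitude, 1e-12))
    return np.fft.irfft(log_magnitude)[:n_coeffs]


def assign(points, centres):
    """Index of the nearest centre (Euclidean distance) for each point."""
    distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
    return distances.argmin(axis=1)


def kmeans(points, k, n_iter=50, seed=0):
    """Plain k-means, used here as an assumed stand-in clustering method."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(points), size=k, replace=False)
    centres = points[start].astype(float)
    for _ in range(n_iter):
        labels = assign(points, centres)
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centres[j] = members.mean(axis=0)
    return assign(points, centres), centres


# Placeholder data: 100 random magnitude responses with 129 frequency bins.
rng = np.random.default_rng(1)
magnitudes = rng.uniform(0.5, 2.0, size=(100, 129))
features = np.array([cepstral_coefficients(m) for m in magnitudes])
labels, centres = kmeans(features, k=8)
print("cluster sizes:", np.bincount(labels, minlength=8))
```

One set per cluster could then be drawn at random and presented to the listener, as in the protocol described above.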
The reader may also refer to the more recent work by Tame et al. (Robert P Tame, Daniele Barchiesi, and Anssi Klapuri; “Headphone virtualization: Improved localization and externalization of nonindividualized hrtfs by cluster analysis”, in Audio Engineering Society Convention 133; Audio Engineering Society, May 2012) or to the work by Xie et al. (Bosun Xie and Zhaojun Tian; “Improving binaural reproduction of 5.1 channel surround sound using individualized hrtf cluster in the wavelet domain”, in Audio Engineering Society Conference: 55th International Conference: Spatial Audio, Audio Engineering Society, August 2014), who respectively used Gaussians and a wavelet decomposition to group the HRTFs.
Once the cluster has been selected, a further selecting step, in which one specific set is chosen, may be added. Once again, multiple methods have been published. For example, Y. Iwaya (Yukio Iwaya, “Individualization of head-related transfer functions with tournament-style listening test: Listening with other's ears”, Acoustical Science and Technology, 27(6):340-343, 2006) describes a procedure for selecting one set of HRTFs from 32 available sets, this procedure applying a tournament-type principle. An audio path in a horizontal plane is simulated by convolving a pink noise with the sets of HRTFs. A pink noise is a noise whose audio power is constant across frequency bands of equal width on a logarithmic scale (e.g. the same power is emitted in the 40-60 Hz band as in the 4000-6000 Hz band). Thirty-two paths were therefore obtained and placed in competition. In each bout, the subject declared one of two paths to be victorious, this path being the one that most closely resembled the correct path. The set that won the tournament was declared to be the best one for the subject.
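The tournament principle can be sketched as a single-elimination bracket. The code below is an illustration only: the listener is simulated by a hidden random quality score per set, which is an assumption standing in for real listening judgments, and the names run_tournament and simulated_listener are hypothetical.

```python
import random


def run_tournament(candidates, prefer):
    """Single-elimination tournament; prefer(a, b) returns the bout winner."""
    remaining = list(candidates)
    bouts = 0
    while len(remaining) > 1:
        winners = []
        for i in range(0, len(remaining) - 1, 2):
            winners.append(prefer(remaining[i], remaining[i + 1]))
            bouts += 1
        if len(remaining) % 2:
            winners.append(remaining[-1])  # odd one out advances unopposed
        remaining = winners
    return remaining[0], bouts


# Simulated listener (assumption): each HRTF set has a hidden quality score,
# and the listener always prefers the set with the higher score.
rng = random.Random(0)
hidden_quality = {set_id: rng.random() for set_id in range(32)}


def simulated_listener(a, b):
    return a if hidden_quality[a] >= hidden_quality[b] else b


winner, n_bouts = run_tournament(range(32), simulated_listener)
print("winning set:", winner, "after", n_bouts, "bouts")
```

With 32 sets, the bracket needs 31 bouts, compared with 496 for an exhaustive pairwise comparison, which is what makes the tournament format attractive when each bout requires active listening.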
Seeber et al. (Bernhard U Seeber and Hugo Fastl; “Subjective selection of non-individual head-related transfer functions”, July 2003) present another approach to selecting, in two steps, one set among 12. The stated objective is for the selection to be fast, to require no prior training and to deliver a result minimizing the number of inside-the-head localizations. The first step consists in extracting the 5 sets providing the best results in terms of spatial perception in the frontal area. The second step consists in eliminating 4 depending on how well various behaviors (such as movement of an audio source at constant speed, at constant elevation or even at constant distance) are reproduced. About ten minutes is required to carry out the procedure.
Lastly, mention is also made of the approach of Martens (William L Martens; “Rapid psychophysical calibration using bisection scaling for individualized control of source elevation in auditory display”; in Proc. Int. Conf. on Auditory Display, pages 199-206, July 2002), which is referred to as bisection scaling. The idea is to create, using a psycho-acoustic test, a look-up table containing the correspondence between the actual directions associated with a set of HRTFs and the directions perceived by the subject. In practice, for a given azimuth, it is necessary to find the HRTF that best corresponds to the sensation of an elevation of 45°. The elevation extrema (0° and 90°) being assumed to be perceived correctly, a second-order polynomial interpolation is then performed to construct the aforementioned table.
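The interpolation step can be written out explicitly. In the sketch below, which is an illustration rather than Martens' implementation, the three anchor points are (0°, 0°), (e, 45°) and (90°, 90°), where e is the nominal elevation of the HRTF that the subject hears at 45°; the parabola through them is evaluated in Lagrange form. The function name and the example value e = 60° are assumptions.

```python
def perceived_elevation(nominal_deg, nominal_heard_as_45):
    """Quadratic through (0, 0), (nominal_heard_as_45, 45) and (90, 90)."""
    x0, x1, x2 = 0.0, float(nominal_heard_as_45), 90.0
    y0, y1, y2 = 0.0, 45.0, 90.0
    x = float(nominal_deg)
    return (
        y0 * (x - x1) * (x - x2) / ((x0 - x1) * (x0 - x2))
        + y1 * (x - x0) * (x - x2) / ((x1 - x0) * (x1 - x2))
        + y2 * (x - x0) * (x - x1) / ((x2 - x0) * (x2 - x1))
    )


# Hypothetical subject who hears the 60-degree HRTF at 45 degrees.
lookup_table = {
    nominal: round(perceived_elevation(nominal, 60.0), 1)
    for nominal in range(0, 91, 15)
}
print(lookup_table)
```

When the subject hears the 45° HRTF at 45°, the three anchor points are collinear and the table reduces to the identity, as expected.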
Yet other protocols have been proposed by the scientific community but none allow the drawbacks inherent to this type of methodology to be avoided. Specifically, even if the objective is not to find the exact HRTFs of the subject (it would be necessary to implement a synthetic method) but to select or adapt as best as possible an existing set, the quality of the best possible solution nevertheless remains limited by the variability in the sets of HRTFs open to selection. Thus, with a given protocol, the results obtained improve as the size of the database of input data increases. However, increasing the size of the database of input data increases the length of the required experimentation, this being undesirable, in particular as active subject participation is required.
Placing emphasis on the importance of the specific morphology of each individual, Zotkin et al. (D. N. Zotkin, J. Hwang, R. Duraiswami, and L. S. Davis; “Hrtf personalization using anthropometric measurements”, in Applications of Signal Processing to Audio and Acoustics, 2003 IEEE Workshop on, pages 157-160, October 2003) describe the ear by way of seven morphological parameters that are measurable in a profile image of the ear. These parameters allow an inter-individual distance to be defined, which is used to select, in the CIPIC database, the nearest neighbor of a given subject. It will be noted that the HRTFs thus selected are then modified for frequencies lower than 3 kHz. Specifically, at low frequencies (f≤500 Hz), a head-and-torso (HAT) model is used to synthesize the HRTFs. Between 500 Hz and 3 kHz, an affine transformation is carried out in order to pass gradually from the synthetic HRTFs to the selected HRTFs.
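The two operations described above, nearest-neighbor selection and a gradual transition between model and measurement, can be sketched as follows. The details are assumptions made for illustration: the inter-individual distance is taken to be a plain Euclidean distance over the seven parameters, the transition weight is taken to be linear in frequency and applied to magnitudes in dB, and the database entries and function names are hypothetical.

```python
import math


def nearest_subject(database, query):
    """Name of the database entry closest to the query (Euclidean distance)."""
    return min(database, key=lambda name: math.dist(database[name], query))


def blend_weight(freq_hz, low_hz=500.0, high_hz=3000.0):
    """Weight of the selected HRTF: 0 at or below low_hz, 1 at or above high_hz."""
    if freq_hz <= low_hz:
        return 0.0
    if freq_hz >= high_hz:
        return 1.0
    return (freq_hz - low_hz) / (high_hz - low_hz)


def blended_magnitude_db(freq_hz, model_db, selected_db):
    """Gradual transition from the model value to the selected value."""
    w = blend_weight(freq_hz)
    return (1.0 - w) * model_db + w * selected_db


# Hypothetical database: seven ear parameters per subject (arbitrary units).
database = {
    "subject_a": (1.9, 0.8, 1.6, 1.7, 6.2, 2.9, 0.6),
    "subject_b": (2.2, 1.0, 1.9, 1.5, 6.8, 3.1, 0.7),
}
new_subject = (2.1, 0.9, 1.8, 1.5, 6.7, 3.0, 0.7)
print(nearest_subject(database, new_subject))
print(blended_magnitude_db(1750.0, model_db=-2.0, selected_db=4.0))
```

The choice of distance is exactly the kind of open design decision that affects which neighbor is returned; a weighted or normalized distance would generally select a different subject.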
In 2011, the company Arkamys and the CNRS filed a patent (B. F. Katz and D. Schönstein, “Procédé de selection de filtres hrtf perceptivement optimale dans une base de données à partir de paramètres morphologiques” [“Method for selecting perceptually optimal HRTF filters in a database according to morphological parameters”] WO2011128583) relating to a morphology-based selection method. The idea was to build three databases, the first containing the HRTFs of a set of individuals, the second containing a set of morphological parameters of these individuals, and the third containing the listening preferences of these individuals, i.e., for each subject, his ranking of the HRTFs in the first database. Once these databases have been created, a study of the correlations between the second and third databases is carried out in order to sort the morphological parameters in order of importance. A dimensional analysis of the HRTF space (for example a PCA) is carried out in order to obtain a basis in which the HRTFs are representable. The relationships between the K most important morphological parameters and the coordinates of the HRTFs in the aforementioned space are then calculated, establishing a link between morphology and HRTFs. Given a new individual, measuring the K morphological parameters then allows his position in the HRTF space to be determined. The nearest neighbor in the database is sought and forms the result of the personalization.
The problem encountered in the preceding methods using morphological parameters is that of how to define the number and location of these parameters. Specifically, the notion, for example, of the height of an ear is not something that has a natural definition, and measurement thereof will be very dependent on measurer subjectivity as he will, first of all, have to determine whether the ear must be turned and where the “highest” and “lowest” points are located. Moreover, the question arises as to the criteria to use to define the distance used because it is on the latter that the result of the selection depends.
Lastly come adapted-selection methods, the most prominent example of which is undoubtedly frequency scaling, introduced by Middlebrooks (John C Middlebrooks, “Virtual localization improved by scaling nonindividualized external-ear transfer functions in frequency”, The Journal of the Acoustical Society of America, 106(3), 1493-1510, 1999). This operation is based on the idea that the interaction of an audio source of given frequency with a solid depends on the dimensions of the latter. In particular, any homothetic transformation of an object must be accompanied, if the same interaction is still to be observed, by a homothetic transformation of inverse ratio in frequency. Applied to customization, this idea amounts to saying that, if the HRTFs of a reference individual (or even of a dummy head) and the scaling factor between the morphology of this reference and that of a subject for whom customization is required are known, it is possible to improve the localization sensation achieved with the reference HRTFs by applying thereto a scaling of inverse ratio.
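The inverse-ratio relationship can be made concrete with a small sketch. It assumes, for illustration only, that the HRTF is given as a magnitude response in dB on a uniform frequency grid, that size_ratio denotes the subject's dimension divided by the reference dimension, and that values between grid points are obtained by linear interpolation with clamping at the ends. Under these assumptions, a subject larger by size_ratio sees spectral features shifted down in frequency by the same ratio, i.e. H_subject(f) ≈ H_reference(size_ratio × f). The function names are hypothetical.

```python
from bisect import bisect_right


def interp(x, xs, ys):
    """Piecewise-linear interpolation over sorted xs, clamped at both ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x)
    t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
    return ys[i - 1] + t * (ys[i] - ys[i - 1])


def scale_hrtf_in_frequency(freqs_hz, magnitude_db, size_ratio):
    """Resample the reference response at size_ratio * f for every grid f."""
    return [interp(size_ratio * f, freqs_hz, magnitude_db) for f in freqs_hz]


# Toy reference response: flat, with a single 20 dB notch at 8 kHz.
freqs = list(range(0, 16001, 100))
reference_db = [-20.0 if f == 8000 else 0.0 for f in freqs]

# Hypothetical subject whose ears are 25 % larger than the reference.
scaled_db = scale_hrtf_in_frequency(freqs, reference_db, size_ratio=1.25)
notch_hz = freqs[scaled_db.index(min(scaled_db))]
print("notch moved from 8000 Hz to", notch_hz, "Hz")
```

In this toy case the notch moves from 8000 Hz to 6400 Hz, i.e. by the inverse of the size ratio, and a ratio of 1 leaves the response unchanged.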
In parallel to frequency scaling, Maki and Furukawa (Katuhiro Maki and Shigeto Furukawa; “Reducing individual differences in the external-ear transfer functions of the Mongolian gerbil”; The Journal of the Acoustical Society of America, 118(4), 2005) have shown that, given the angle between a reference external ear and a test external ear, a rotation of the coordinate system giving the direction of the HRTFs allows inter-individual differences to be significantly decreased. In other words, this method takes advantage of the fact that a rotation of the external ear of a subject induces an identical rotation in the measured HRTFs.
Although useful, these approaches nevertheless do not, considered in isolation, form complete personalization methods. Such methods must decrease HRTF variability to only 1 or 2 parameters. However, the above approaches may be seen as complementing other methods well.
Despite the many known approaches aiming to personalize binaural sounds, not one has yet clearly stood out from the rest in terms of its effectiveness and simplicity. In addition, each of them may lead to problems such as prohibitive personalization times or unreliable solutions, or indeed both simultaneously.