1. Field of the Invention
The present invention relates to an apparatus and method of producing three-dimensional (3D) sound, and, more specifically, to producing a virtual acoustic environment (VAE) in which multiple independent 3D sound sources and their multiple reflections are synthesized by acoustical transducers such that the listener's perceived virtual sound field approximates the real world experience. The apparatus and method have particular utility in connection with computer gaming, 3D audio, stereo sound enhancement, reproduction of multiple channel sound, virtual cinema sound, and other applications where spatial auditory display of 3D space is desired.
2. Description of Related Information
The ability to localize sounds in three-dimensional space is important to humans in terms of awareness of the environment and social contact with each other. This ability is vital to animals, both as predator and as prey. For humans and most other mammals, three-dimensional hearing ability is based on the fact that they have two ears. Sound emitted from a source that is located away from the median plane between the two ears arrives at each ear at different times and at different intensities. These differences are known as interaural time difference (ITD) and interaural intensity difference (IID). It has long been recognized that the ITD and IID are the primary cues for sound localization. ITD is primarily responsible for providing localization cues for low frequency sound (below 1.0 kHz), as the ITD creates a distinguishable phase difference between the ears at low frequencies. On the other hand, because of head shadowing effects, IID is primarily responsible for providing localization cues for high frequency (above 2.0 kHz) sounds.
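By way of illustration only, the relationship between ITD and the low-frequency limit noted above can be sketched numerically. The sketch assumes Woodworth's spherical-head approximation, a head radius of 8.75 cm, and a speed of sound of 343 m/s; none of these appear in the text above, and the function names are hypothetical. It estimates the ITD for a given azimuth and the frequency at which the resulting interaural phase difference reaches half a cycle and so becomes ambiguous.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s, assumed value for air at room temperature
HEAD_RADIUS = 0.0875     # m, assumed average head radius


def woodworth_itd(azimuth_deg, head_radius=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """Approximate ITD in seconds for a distant source.

    Uses Woodworth's spherical-head formula, ITD = (a / c) * (theta + sin(theta)),
    which is intended for azimuths between 0 and 90 degrees from the median plane.
    """
    theta = math.radians(azimuth_deg)
    return (head_radius / c) * (theta + math.sin(theta))


def phase_ambiguity_frequency(itd):
    """Frequency (Hz) at which the interaural phase difference 2*pi*f*itd reaches pi."""
    return 1.0 / (2.0 * itd)


# A source on the median plane produces no ITD; a source directly to one side
# produces the largest ITD under this model.
itd_front = woodworth_itd(0.0)
itd_side = woodworth_itd(90.0)
limit_hz = phase_ambiguity_frequency(itd_side)

print(f"ITD at 90 degrees: {itd_side * 1e3:.3f} ms")
print(f"Phase cue becomes ambiguous near {limit_hz:.0f} Hz")
```

Under these assumed values the ambiguity frequency for a fully lateral source falls somewhat below 1 kHz, which is consistent with the low-frequency role of ITD described above.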
In addition to interaural time difference (ITD) and interaural intensity difference (IID), head-related transfer functions (HRTFs) are essential to sound localization and sound source positioning in 3D space. HRTFs describe the modification of sound waves by a listener's external ear (the pinna), head, and torso. In other words, incoming sound is “transformed” by an acoustic filter which consists of the pinna, head, and torso. The manner and degree of the modification depend systematically upon the incident angle of the sound source. The frequency characteristics of HRTFs are typically represented by resonance peaks and notches. Systematic changes in the positions of the notches and peaks in the frequency domain with respect to elevation change are believed to provide localization cues.
ITD and IID have long been employed to enhance the spatial aspects of stereo system effects; however, the sound images created are perceived as being within the head, between the two ears, when a headphone set is used. Although the sound source can be lateralized, the lack of filtering by HRTFs causes the perceived sound image to be “internalized,” that is, the sound is perceived without a distance cue. This phenomenon can be experienced by listening to a CD using a headphone set rather than a speaker array. Using HRTFs to filter the audio stream can create a more realistic spatial image, with sharper elevation and distance perception. This allows sound images to be heard through a headphone set as if they originate some distance away with an apparent direction, even if the image is on the median plane, where the ITD and IID diminish. Similar results can be obtained with a pair of loudspeakers when cross-talk between the ears and the two speakers is resolved.
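By way of illustration only, HRTF filtering of an audio stream for headphone playback can be sketched in the time domain as one convolution per ear. The function name `render_binaural` is hypothetical, and the two impulse responses below are crude placeholders (a single delayed, scaled tap per ear), not measured HRTF data; they merely stand in for a source toward the listener's left.

```python
import numpy as np


def render_binaural(mono, hrir_left, hrir_right):
    """Filter a mono signal with a left/right impulse-response pair.

    Returns an array of shape (2, len(mono) + len(hrir) - 1) holding the
    left-ear signal in row 0 and the right-ear signal in row 1.
    """
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right])


# Placeholder impulse responses (assumption: 64 taps, not measured data).
# The right ear receives the sound 27 samples later and at half the amplitude.
hrir_left = np.zeros(64)
hrir_left[3] = 1.0
hrir_right = np.zeros(64)
hrir_right[30] = 0.5

# A unit click as the mono source signal.
mono = np.zeros(100)
mono[0] = 1.0

stereo = render_binaural(mono, hrir_left, hrir_right)
print(stereo.shape)
```

With measured impulse responses in place of the placeholders, the same two convolutions would also impose the direction-dependent spectral peaks and notches described above, which the simple delay and gain used here cannot.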
Commercial 3-D audio systems known in the art use all three localization cues, including HRTF filtering, to render 3-D sound images. These systems demand a computing load directly proportional to the number of sources simulated. To reproduce multiple, independent sound sources, or to faithfully account for reflected sound, a separate HRTF must be computed for each source and each early reflection. The total number of such sources and reflections can be large, making the computation costs prohibitive for a single-DSP solution. To address this problem, systems known in the art either limit the number of sources positioned or use multiple DSPs in parallel to handle multi-source and reflected audio reproduction, with a proportionally increased system cost.
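By way of illustration only, the proportional growth in computing load can be estimated by counting multiply-accumulate (MAC) operations for direct time-domain filtering. The function name and every figure below (filter length, sample rate, number of sources and reflections) are hypothetical values chosen for the example, not data from any particular system.

```python
def hrtf_macs_per_second(sources, reflections_per_source, taps, sample_rate, ears=2):
    """MACs per second when every source and every early reflection gets
    its own pair of HRTF filters, applied by direct convolution."""
    paths = sources * (1 + reflections_per_source)  # direct path plus reflections
    return paths * ears * taps * sample_rate


# One source with no reflections versus a modest scene (assumed values).
single = hrtf_macs_per_second(sources=1, reflections_per_source=0,
                              taps=128, sample_rate=44100)
scene = hrtf_macs_per_second(sources=8, reflections_per_source=6,
                             taps=128, sample_rate=44100)

print(single)
print(scene)
print(scene // single)
```

Under these assumptions the eight-source scene requires 56 times the filtering work of a single source, since each of its 56 sound paths is filtered independently.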
The known art has pursued methods of optimizing HRTF processing. For example, the principal component analysis (PCA) method uses principal components modeled upon the logarithmic amplitude of HRTFs. Research has shown that five principal components, or channels of sound, enable most people to localize the sound waves as well as in a free field. However, because of its non-linear nature, this approach serves only as a new way of analyzing HRTF data (amplitude only); it does not enable faster HRTF filtering for producing 3D audio.
Another optimization method, the spatial feature extraction and regularization (SFER) model, constructs a model HRTF data covariance matrix and applies eigendecomposition to the data covariance matrix to obtain a set of M most significant eigenvectors. According to the Karhunen-Loeve Expansion (KLE) theory, each of the HRTFs can be expressed as a weighted sum of these eigenvectors. This enables the SFER model to establish linearity in the HRTF model, allowing the HRTF processing efficiency issue to be addressed. The SFER model has also been used in the time domain. That is, instead of working on HRTFs that are defined in the frequency domain as transfer functions, the later work applied KLE to head-related impulse responses (HRIRs). HRIRs represent the time domain counterpart of HRTFs. Though, in principle, the latter approach is equivalent to the frequency domain SFER model, working with HRIRs has the additional advantage of avoiding complex-number calculations, which is a very favorable change in DSP code implementation.

A need exists for a simple and economical method that can reliably reproduce 3-D sound without using an exponential array of DSPs.
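By way of illustration only, the time-domain KLE described in the SFER discussion above can be sketched as follows. This is a minimal sketch under stated assumptions, not the SFER implementation itself: the function names are hypothetical, the "HRIRs" are synthetic vectors built from three hidden components rather than measured data, and subtracting the mean response before forming the covariance matrix is an assumption of the example.

```python
import numpy as np


def kle_basis(hrirs, m):
    """Decompose a set of impulse responses into m basis vectors.

    hrirs: array of shape (directions, taps), one response per direction.
    Returns (mean, basis, weights) with shapes (taps,), (m, taps), (directions, m).
    """
    mean = hrirs.mean(axis=0)
    centered = hrirs - mean
    cov = centered.T @ centered / hrirs.shape[0]   # data covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:m]          # indices of the m largest
    basis = eigvecs[:, order].T
    weights = centered @ basis.T                   # projection onto the basis
    return mean, basis, weights


def reconstruct(mean, basis, weights):
    """Rebuild every response as the mean plus a weighted sum of basis vectors."""
    return mean + weights @ basis


# Synthetic stand-in data (assumption): 24 directions, 32 taps, rank 3 plus a mean.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((3, 32))
hrirs = rng.standard_normal((24, 3)) @ hidden + rng.standard_normal(32)

mean, basis, weights = kle_basis(hrirs, m=3)
approx = reconstruct(mean, basis, weights)

# Linearity: filtering with one response equals the same weighted sum of the
# signal filtered by the mean and by each basis vector.
x = rng.standard_normal(256)
direct = np.convolve(x, hrirs[5])
via_basis = np.convolve(x, mean) + sum(
    w * np.convolve(x, b) for w, b in zip(weights[5], basis)
)

print(np.allclose(approx, hrirs), np.allclose(direct, via_basis))
```

Because the synthetic data were constructed from three components, three eigenvectors reproduce them to numerical precision; measured HRIRs would generally be only approximated by a truncated expansion. The final comparison shows the kind of linearity referred to above: the filtering for any direction can be expressed through a fixed set of basis filters and direction-dependent weights. How that property is exploited for efficiency is not specified in the text, so the comparison should be read as one possible illustration.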