Humans with normal hearing can hear and localize sounds arriving from all directions and distances because the soundwaves reaching the left and right ears, one on each side of the head, differ in arrival time, known as Interaural Time Differences (ITDs), and/or in volume, known as Interaural Level Differences (ILDs). The brain interprets these auditory cues to determine the spatial origin of a sound and thereby perceives sound in three dimensions (3D).
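As a rough illustration of the size of these cues, the sketch below estimates an ITD with Woodworth's spherical-head approximation and expresses an ILD in decibels. The spherical-head model, the head radius, the speed of sound, and the function names `woodworth_itd` and `level_difference_db` are assumptions made for illustration only; none of them is taken from the text above.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed: air at roughly 20 degrees C
HEAD_RADIUS_M = 0.0875      # assumed: typical adult head radius


def woodworth_itd(azimuth_deg, head_radius_m=HEAD_RADIUS_M,
                  speed_of_sound=SPEED_OF_SOUND_M_S):
    """Approximate ITD in seconds for a distant source.

    azimuth_deg: 0 is straight ahead, 90 is directly to one side.
    Uses Woodworth's formula ITD = (a / c) * (theta + sin(theta)),
    which treats the head as a rigid sphere of radius a.
    """
    if not 0.0 <= azimuth_deg <= 90.0:
        raise ValueError("azimuth_deg must be between 0 and 90")
    theta = math.radians(azimuth_deg)
    return (head_radius_m / speed_of_sound) * (theta + math.sin(theta))


def level_difference_db(left_rms, right_rms):
    """ILD in decibels from the RMS amplitudes measured at each ear."""
    return 20.0 * math.log10(left_rms / right_rms)
```

Under these assumed values, a source directly to one side gives an ITD of well under a millisecond, which suggests how precisely the timing between the two ears must be preserved on playback.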
Based on this concept, binaural recording of sound (also known as “dummy head recording”) uses two microphones arranged in a way that mimics a pair of normal human left and right ears to generate a sound recording embedded with 3D audio cues, with the intent of creating a 3D audio experience for the listener on playback. The problem, however, lies in the playback or reproduction of the 3D audio recording using commonly available stereo transducers. Even when the recorded left and right audio channel signals are played back separately from the left and right transducers respectively, the soundwaves corresponding to the left audio channel signal cannot be assured to reach only the listener's left ear, and vice versa for the right audio channel signal. This phenomenon is called crosstalk. Because the time delay and/or volume difference information recorded with the original sound cannot be reproduced perfectly at the listener's left and right ears, the listener cannot experience the 3D sound effect. FIG. 1 illustrates this crosstalk phenomenon.
A number of techniques have been proposed to cancel this crosstalk so as to reproduce an uncorrupted 3D audio experience for a listener. Crosstalk Cancellation (XTC) can be achieved when playing back binaural material over speakers (BAL) or headphones (BAH). Most of the BAL techniques effect XTC by manipulating the time domain and/or audio frequency spectrum of the input audio signals, essentially creating an XTC filter. The audio frequency spectrum manipulation can be done by adjusting variables of the XTC filter to match the response of a sound reproduction system, which includes a pair of transducers, the room within which the reproduction is made, the location of the listener in the room, and in some cases even the size and shape of the listener's head. In some implementations, the adjustment is done automatically by first measuring the response of the sound reproduction system and then convolving the inverse of this system response with the input audio signals to the transducers, thereby removing the system response. FIG. 2 provides a simplified illustration of the working of the XTC filter in a sound reproduction system.
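The measure-then-invert approach described above can be sketched as follows. This is a minimal illustration, not the method of any particular system: it assumes the measured response is available as a 2x2 set of impulse responses `h[ear, speaker]`, inverts it independently in each frequency bin, and adds a small Tikhonov regularization term `beta` (an assumption not mentioned in the text) to keep the inverse bounded. The function names `xtc_filters` and `apply_xtc` are hypothetical.

```python
import numpy as np


def xtc_filters(h, n_fft, beta=1e-3):
    """Frequency-domain XTC filters from measured impulse responses.

    h: real array of shape (2, 2, taps), indexed as h[ear, speaker].
    Returns C of shape (n_bins, 2, 2) such that H @ C is close to the
    identity in every bin, i.e. each ear receives only its own channel.
    """
    H = np.fft.rfft(h, n=n_fft, axis=-1)   # (2, 2, n_bins)
    H = np.moveaxis(H, -1, 0)              # (n_bins, 2, 2)
    Hh = np.conj(np.swapaxes(H, -1, -2))   # Hermitian transpose per bin
    # Regularized inverse per bin: (H^H H + beta I)^-1 H^H
    return np.linalg.solve(Hh @ H + beta * np.eye(2), Hh)


def apply_xtc(C, left, right, n_fft):
    """Filter one block of the two input channels with the XTC filters.

    Note: this is circular convolution over a single block of n_fft
    samples, which keeps the sketch short; a real-time system would
    need block-wise processing such as overlap-add.
    """
    X = np.stack([np.fft.rfft(left, n=n_fft),
                  np.fft.rfft(right, n=n_fft)], axis=-1)  # (n_bins, 2)
    Y = np.einsum("fij,fj->fi", C, X)       # speaker feeds per bin
    y = np.fft.irfft(Y, n=n_fft, axis=0)    # (n_fft, 2)
    return y[:, 0], y[:, 1]
```

Because the filters are derived from one measured response, they are only valid for the listener position and room in which that response was measured.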
The biggest challenge with BAL is the influence of the listening room. Reflections, including early reflections, all degrade the level of crosstalk cancellation that an XTC algorithm can achieve in real life. One can try to mitigate reflections either by deadening the room with broadband absorbers or by using speakers with a narrow dispersion pattern (a significant level drop-off off-axis). In many real-life implementations, neither solution is practical. Then there is the problem of a single sweet spot. Even when XTC is combined with listener head-tracking, there is essentially still only a single sweet spot, leaving the listener with no real freedom of movement. Multiple XTC sweet spots are possible using phased-array or beamforming techniques, but the design becomes extremely complex and very costly to implement. Such a system may be able to provide a few sweet spots, but it is not feasible in an environment such as a movie theatre.
The BAH techniques involve convolving a general or individualized Head Related Transfer Function (HRTF) with the audio signal in order to trick the human brain into perceiving sound in 3D. However, the 3D sound experience in BAH is still not as convincing as in BAL. Visual cues are often necessary as an aid to trick the brain into believing that the sound is in true 3D. The effect generated by BAH techniques ultimately lacks the ‘physicality’ of sound that one can experience with BAL. BAH is also extremely difficult to implement because the HRTF is highly individualized.
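The convolution step used by BAH techniques can be sketched as below. The sketch assumes the HRTF is available in its time-domain form, as a pair of head-related impulse responses (HRIRs) of equal length for one source direction; the function name `render_binaural` is hypothetical, and any HRIR values passed to it in an example are toy placeholders rather than measured data.

```python
import numpy as np


def render_binaural(mono, hrir_left, hrir_right):
    """Place a mono signal at the direction encoded by an HRIR pair.

    mono: 1-D array of samples.
    hrir_left, hrir_right: 1-D impulse responses of equal length for
    the left and right ear (assumed to come from a general or an
    individualized HRTF set).
    Returns an array of shape (2, len(mono) + len(hrir_left) - 1)
    holding the left and right headphone signals.
    """
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=0)
```

For example, a toy HRIR pair in which the right-ear response is delayed and attenuated relative to the left-ear response yields an output whose right channel arrives later and quieter, imitating the ITD and ILD of a source on the listener's left.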
FIG. 3 illustrates an exemplary embodiment of a sound reproduction system with an XTC filter. However, one common drawback of these XTC techniques in practice is that, in order to achieve the ideal 3D audio experience, they require either that the listener remain stationary at a single location that is unobstructed from the transducers (the sweet spot), or that the location of the listener be known to or tracked by the system throughout the whole audio playback.