Recently, it has become feasible to store and play back larger amounts of music on portable devices. As a consequence, the use of such devices has become very popular, especially as the musical content can be played back via headphones everywhere. Normally, the content to be played back has been mixed in stereo, i.e., to two independent channels. However, the production has been performed for playback via loudspeakers, using common two-channel stereo equipment. That is, the stereo channels have been mixed in a music studio so as to provide maximum reproduction quality and, as far as possible, the spatial perception of the original auditory scene using two loudspeakers. However, listening to such stereo recordings via headphones leads to in-head localization of the sound, that is, to a strongly disturbing spatial impression. In other words, virtual sound sources, which are meant to be localized somewhere between the two loudspeakers, are localized inside the listener's head due to psychoacoustic properties of the human auditory system. This is the case since no crosstalk and no reflections are perceived, which irritates the auditory system such that the sound sources are localized in the listener's head. The irritation arises because the auditory system is used to these signal properties when content is played back via loudspeakers or, more generally, transmitted via a “real” environment.
Several methods and devices have been proposed to address this problem by processing the left and right channels prior to playback via headphones. However, these approaches, for example the use of head-related transfer functions, are computationally very complex. They try to stimulate the human auditory system to localize the sound sources outside the head when music is played back via headphones by simulating the listening situation of loudspeakers in a room. That is, for example, a cross-talk sound path and the reflections from the room's walls are artificially added to the signal. To achieve a realistic simulation, filtering has to be applied to the left and the right channel to further take into account the properties of the listener's torso, head and pinnae. The more accurate this kind of simulation is, the more computational resources are required. When fairly good-sounding results are to be achieved with reduced complexity, those models are, for example, reduced to cross-talk and, in some cases, to a very small number of wall reflections, which can be implemented by low-order filtering. The influence of the human body itself can also be approximated by low-order filters. However, these filters have to be applied to the direct signal as well as to each of the reflected signals (as described, e.g., in M. R. Schroeder: An Artificial Stereophonic Effect Obtained from Using a Single Signal, 9th annual meeting of the AES, preprint 14, 1957).
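The reduced-complexity variant described above (cross-talk only, approximated by low-order filtering) can be illustrated by a minimal sketch. It is not taken from any particular prior-art device: the function names, the one-pole low-pass standing in for the head-shadow filter, and the default values for delay, gain and filter coefficient are all assumptions chosen for illustration, and wall reflections are omitted.

```python
def one_pole_lowpass(x, a):
    """First-order low-pass: y[n] = (1 - a) * x[n] + a * y[n-1], with 0 <= a < 1."""
    y, prev = [], 0.0
    for sample in x:
        prev = (1.0 - a) * sample + a * prev
        y.append(prev)
    return y


def delay(x, k):
    """Delay x by k samples, padding with zeros and keeping the original length."""
    return ([0.0] * k + list(x))[:len(x)]


def crossfeed(left, right, delay_samples=12, gain=0.5, a=0.6):
    """Add a delayed, attenuated, low-passed copy of each channel to the other.

    All parameter defaults are illustrative assumptions, not tuned values.
    """
    # Simulated cross-talk path from the opposite loudspeaker to each ear.
    cross_to_left = one_pole_lowpass(delay(right, delay_samples), a)
    cross_to_right = one_pole_lowpass(delay(left, delay_samples), a)
    out_left = [l + gain * c for l, c in zip(left, cross_to_left)]
    out_right = [r + gain * c for r, c in zip(right, cross_to_right)]
    return out_left, out_right
```

Even this stripped-down model needs one filter per simulated path, which indicates how the cost grows once reflections and body-related filters are added for each of them.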
Other methods have been proposed to provide a stereophonic listening experience even when only a monophonic signal is provided. One approach is to feed the monophonic input signal to both channels and to create an attenuated and delayed representation of the signal, which is then added to the first channel and subtracted from the second channel.
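The mono-to-stereo approach just described can be sketched as follows. The function name and the default delay and gain are hypothetical and serve only to make the signal flow concrete; the text above does not specify any values.

```python
def pseudo_stereo(mono, delay_samples=3, gain=0.5):
    """Derive a (left, right) pair of sample lists from a mono sample list.

    delay_samples and gain are illustrative assumptions.
    """
    left, right = [], []
    for n, x in enumerate(mono):
        # Attenuated, delayed copy of the input (zero until the delay has elapsed).
        d = gain * mono[n - delay_samples] if n >= delay_samples else 0.0
        left.append(x + d)   # copy added to the first channel
        right.append(x - d)  # copy subtracted from the second channel
    return left, right
```

Because the delayed copy enters the two channels with opposite signs, it cancels in the sum of the channels and appears only in their difference.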
Often, stereo signals are also transformed into a mid-side representation containing a mid-signal (sum-signal) and a side-signal (difference-signal). The sum-signal is formed by summing the right channel and the left channel, and the difference-signal is formed by taking the difference of the left channel and the right channel. In most musical stereo signals, the virtual sound sources of highest relevance are those localized in front of the listener. This is the case since these commonly represent the leading voice or the leading instrument in the recording. As these sound sources are intended to be localized between the loudspeakers of a two-channel setup, these signal components are present in the left channel as well as in the right channel. Therefore, these important signals are mainly represented by the sum-signal (mid-signal) and hardly by the difference-signal (side-signal). Consequently, when attempting to achieve a localization outside the listener's head, such a mid-side representation has to be processed with great care.
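As a small illustration of the mid-side representation, the sketch below forms the sum and difference signals as described and shows that a source present equally in both channels ends up entirely in the mid-signal. The function names are hypothetical, and the factor 1/2 in the inverse follows from the unscaled sum and difference used here; other scaling conventions exist.

```python
def to_mid_side(left, right):
    """Mid = left + right (sum-signal), side = left - right (difference-signal)."""
    mid = [l + r for l, r in zip(left, right)]
    side = [l - r for l, r in zip(left, right)]
    return mid, side


def to_left_right(mid, side):
    """Invert the unscaled sum/difference: left = (M + S) / 2, right = (M - S) / 2."""
    left = [(m + s) / 2 for m, s in zip(mid, side)]
    right = [(m - s) / 2 for m, s in zip(mid, side)]
    return left, right


# A centered source (identical in both channels) has an all-zero side-signal.
center = [0.5, -0.25, 0.125]
mid, side = to_mid_side(center, center)
```

Any processing applied to the mid-signal therefore acts directly on the centered components, such as a leading voice.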
In conventional out-of-head signal processing based on sum and difference signals, the sum signals either remain unprocessed or are individually processed or filtered by specific filters. However, simply filtering the sum signal and the side signal separately and redistributing the signals to the left and right channels increases the out-of-head localization or the perceived spatial width only at the cost of a disadvantageously high computational complexity. Furthermore, adding a filtered sum signal to (or subtracting it from) the difference signal, as performed by a conventional mid-side upmixer, results in a shift of the perceived position of the virtual sound sources within the output signal.
Given the conventional generation of stereo signals and the changed playback habits, there is a need for a concept for generating a stereo signal with enhanced perceptual quality that can be implemented efficiently.