Generally, in multi-channel reproduction and listening, a listener is surrounded by multiple loudspeakers. Various methods exist to capture audio signals for specific setups. One general goal of the reproduction is to reproduce the spatial composition of the originally recorded sound event, i.e. the origins of individual audio sources, such as the location of a trumpet within an orchestra. Several loudspeaker setups are fairly common and can create different spatial impressions. Without using special post-production techniques, the commonly known two-channel stereo setups can only recreate auditory events on a line between the two loudspeakers. This is mainly achieved by so-called “amplitude-panning”, where the amplitude of the signal associated with one audio source is distributed between the two loudspeakers, depending on the position of the audio source with respect to the loudspeakers. This is normally done during recording or subsequent mixing. That is, an audio source coming from the far-left with respect to the listening position will be mainly reproduced by the left loudspeaker, whereas an audio source in front of the listening position will be reproduced with identical amplitude (level) by both loudspeakers. However, sound emanating from other directions cannot be reproduced.
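The amplitude-panning principle described above can be illustrated with a minimal sketch. The text does not name a particular panning law, so the sine/cosine constant-power law used here is an assumption, as are the function names `pan_gains` and `pan` and the convention that a position of -1 means far left and +1 means far right.

```python
import math

def pan_gains(position):
    """Return (left_gain, right_gain) for a source position in [-1, 1].

    -1 is far left, 0 is straight ahead, +1 is far right.
    Assumption: sine/cosine constant-power law (not taken from the text).
    """
    if not -1.0 <= position <= 1.0:
        raise ValueError("position must lie in [-1, 1]")
    # Map the position onto an angle between 0 and pi/2.
    angle = (position + 1.0) * math.pi / 4.0
    return math.cos(angle), math.sin(angle)

def pan(samples, position):
    """Distribute a mono signal between the left and right loudspeaker."""
    left_gain, right_gain = pan_gains(position)
    left = [left_gain * s for s in samples]
    right = [right_gain * s for s in samples]
    return left, right
```

With this law, a far-left source is fed to the left loudspeaker only, while a source straight ahead is fed to both loudspeakers with identical gain, matching the behaviour described in the paragraph above.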
Consequently, by using more loudspeakers that are distributed around the listener, more directions can be covered and a more natural spatial impression can be created. Probably the best-known multi-channel loudspeaker layout is the 5.1 standard (ITU-R BS.775-1), which consists of 5 loudspeakers whose azimuthal angles with respect to the listening position are predetermined to be 0°, ±30° and ±110°. That means that, during recording or mixing, the signal is tailored to that specific loudspeaker configuration, and deviations of a reproduction setup from the standard will result in decreased reproduction quality.
Numerous other systems with varying numbers of loudspeakers located at different directions have also been proposed. Professional and special systems, especially in theaters and sound installations, also include loudspeakers at different heights.
A universal audio reproduction system named DirAC has recently been proposed, which is able to record and reproduce sound for arbitrary loudspeaker setups. The purpose of DirAC is to reproduce the spatial impression of an existing acoustical environment as precisely as possible, using a multi-channel loudspeaker system having an arbitrary geometrical setup. Within the recording environment, the responses of the environment (which may be continuous recorded sound or impulse responses) are measured with an omnidirectional microphone (W) and with a set of microphones that allow the direction of arrival of sound and the diffuseness of sound to be measured. In the following paragraphs and within the application, the term “diffuseness” is to be understood as a measure for the non-directivity of sound. That is, sound arriving at the listening or recording position with equal strength from all directions is maximally diffuse. A common way to quantify diffuseness is to use diffuseness values from the interval [0, 1], wherein a value of 1 describes maximally diffuse sound and a value of 0 describes perfectly directional sound, i.e. sound emanating from one clearly distinguishable direction only. One commonly known method of measuring the direction of arrival of sound is to apply 3 figure-of-eight microphones (XYZ) aligned with the Cartesian coordinate axes. Special microphones, so-called “SoundField microphones”, have been designed, which directly yield all the desired responses. However, as mentioned above, the W, X, Y and Z signals may also be computed from a set of discrete omnidirectional microphones.
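How a direction of arrival and a diffuseness value in [0, 1] might be estimated from W, X, Y and Z signals can be sketched as follows. This is a simplified broadband, time-domain illustration and not the DirAC analysis itself. It assumes a signal convention that is not stated in the text: a plane wave carrying the signal s from the unit direction u yields W = s and (X, Y, Z) = s·u. The function name `analyze_block` is hypothetical.

```python
import math

def analyze_block(w, x, y, z):
    """Estimate (azimuth_deg, elevation_deg, diffuseness) for one signal block.

    Assumed convention: a plane wave s arriving from unit direction u
    gives W = s and (X, Y, Z) = s * u.
    """
    n = len(w)
    if n == 0 or not (len(x) == len(y) == len(z) == n):
        raise ValueError("all four signals must be non-empty and of equal length")
    # Time-averaged intensity-like vector; it points towards the source
    # under the convention assumed above.
    ix = sum(a * b for a, b in zip(w, x)) / n
    iy = sum(a * b for a, b in zip(w, y)) / n
    iz = sum(a * b for a, b in zip(w, z)) / n
    # Time-averaged energy-like quantity.
    energy = sum(0.5 * (a * a + b * b + c * c + d * d)
                 for a, b, c, d in zip(w, x, y, z)) / n
    if energy == 0.0:
        # A silent block carries no directional information.
        return 0.0, 0.0, 1.0
    norm = math.sqrt(ix * ix + iy * iy + iz * iz)
    # 0 for a single plane wave, 1 when the net intensity vanishes.
    diffuseness = min(1.0, max(0.0, 1.0 - norm / energy))
    azimuth = math.degrees(math.atan2(iy, ix))
    elevation = math.degrees(math.atan2(iz, math.hypot(ix, iy)))
    return azimuth, elevation, diffuseness
```

Under these assumptions, a single plane wave yields a diffuseness of 0, whereas two uncorrelated signals of equal power arriving from opposite directions cancel the net intensity and yield a diffuseness of 1.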
Another method, which stores audio formats with an arbitrary number of channels as one or two downmix channels of audio with accompanying directional data, has recently been proposed by Goodwin and Jot. This format can be applied to arbitrary reproduction systems. The directional data, i.e. the data having information about the direction of audio sources, is computed using “Gerzon vectors”, which consist of a velocity vector and an energy vector. The velocity vector is a weighted sum of vectors pointing at loudspeakers from the listening position, wherein each weight is the magnitude of a frequency spectrum at a given time/frequency tile for a loudspeaker. The energy vector is a similarly weighted vector sum. However, the weights are short-time energy estimates of the loudspeaker signals, that is, they describe a somewhat smoothed signal or an integral of the signal energy contained in the signal within finite-length time intervals. These vectors share the disadvantage of not being related to a physical or a perceptual quantity in a well-grounded way. For example, the relative phase of the loudspeakers with respect to each other is not properly taken into account. That means, for example, if a broadband signal is fed into the loudspeakers of a stereophonic setup in front of a listening position with opposite phase, a listener would perceive sound from an ambient direction, and the sound field in the listening position would have sound energy oscillations from side to side (e.g. from the left side to the right side). In such a scenario, the Gerzon vectors would be pointing towards the front direction, which obviously does not represent the physical or the perceptual situation.
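The phase insensitivity of the magnitude- and energy-weighted vectors described above can be made concrete with a small sketch. It is restricted to the horizontal plane, normalizes each vector by the sum of its weights, and uses the squared magnitude of a single time/frequency tile as a stand-in for a short-time energy estimate. These choices, as well as the function name `gerzon_vectors`, are assumptions made for illustration only.

```python
import math

def gerzon_vectors(azimuths_deg, spectrum_bins):
    """Return (velocity_vector, energy_vector) as (x, y) tuples.

    azimuths_deg:  loudspeaker azimuths seen from the listening position,
                   0 = front, positive = left (assumed convention).
    spectrum_bins: one complex spectral value per loudspeaker for the same
                   time/frequency tile.
    """
    if len(azimuths_deg) != len(spectrum_bins):
        raise ValueError("need one spectral value per loudspeaker")
    units = [(math.cos(math.radians(a)), math.sin(math.radians(a)))
             for a in azimuths_deg]
    magnitudes = [abs(b) for b in spectrum_bins]
    energies = [m * m for m in magnitudes]

    def weighted_sum(weights):
        total = sum(weights)
        if total == 0.0:
            return 0.0, 0.0
        vx = sum(w * u[0] for w, u in zip(weights, units)) / total
        vy = sum(w * u[1] for w, u in zip(weights, units)) / total
        return vx, vy

    return weighted_sum(magnitudes), weighted_sum(energies)
```

For a stereo pair at ±30°, the calls `gerzon_vectors([30, -30], [1+0j, 1+0j])` and `gerzon_vectors([30, -30], [1+0j, -1+0j])` return the same front-pointing vectors, because only magnitudes enter the weights. The in-phase and the opposite-phase case therefore cannot be distinguished.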
Naturally, having multiple multi-channel formats or representations in the market, the requirement exists to be able to convert between the different representations, such that the individual representations may be reproduced with setups originally developed for the reconstruction of an alternative multi-channel representation. That is, for example, a transformation between the 5.1 channels and 7.1 or 7.2 channels may be required to use an existing 7.1 or 7.2 channel playback setup for playing back the 5.1 multi-channel representation commonly used on DVD. The great variety of audio formats makes the audio content production difficult, as all formats require specific mixes and storage/transmission formats. Therefore, conversion between different recording formats for playback on different reproduction setups is necessary.
There are a number of methods proposed to convert audio in a specific audio format to another audio format. However, these methods are always tailored to specific multi-channel formats or representations. That is, these are only applicable to the conversion from one specific predetermined multi-channel representation into another specific multi-channel representation.
Generally, a reduction in the number of reproduction channels (so-called “downmix”) is simpler to implement than an increase in the number of reproduction channels (“upmix”). For some standard loudspeaker reproduction setups, recommendations are provided by, for example, the ITU on how to downmix to reproduction setups with a lower number of reproduction channels. In these so-called “ITU” downmix equations, the output signals are derived as simple static linear combinations of input signals. Usually, a reduction of the number of reproduction channels leads to a degradation of the perceived spatial image, i.e. a degraded reproduction quality of a spatial audio signal.
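A static linear downmix of the kind described above can be sketched for the five main channels of a 5.1 signal. The gains of 1/√2 for the center and surround channels are the values commonly quoted for the ITU-R BS.775 downmix to two channels; they should be treated as an assumption here and checked against the recommendation. The LFE channel is omitted, and the function name `downmix_5_to_stereo` is hypothetical.

```python
import math

# Assumed gains; see the hedging note in the lead-in.
CENTER_GAIN = math.sqrt(0.5)
SURROUND_GAIN = math.sqrt(0.5)

def downmix_5_to_stereo(left, right, center, left_surround, right_surround):
    """Return (left_out, right_out) as static linear combinations of the inputs."""
    left_out = [l + CENTER_GAIN * c + SURROUND_GAIN * ls
                for l, c, ls in zip(left, center, left_surround)]
    right_out = [r + CENTER_GAIN * c + SURROUND_GAIN * rs
                 for r, c, rs in zip(right, center, right_surround)]
    return left_out, right_out
```

Because the coefficients are fixed, the rear channels are simply folded into the front pair, which is consistent with the loss of spatial image noted above.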
For a possible benefit from a high number of reproduction channels or reproduction loudspeakers, upmixing techniques for specific types of conversions have been developed. An often investigated problem is how to convert 2-channel stereophonic audio for reproduction with 5-channel surround loudspeaker systems. One approach or implementation to such a 2-to-5 upmix is to use a so-called “matrix” decoder. Such decoders have become common to provide or upmix 5.1 multi-channel sound over stereo transmission infrastructures, especially in the early days of surround sound for movies and home theatres. The basic idea is to reproduce sound components which are in-phase in the stereo signal in the front of the sound image, and to put out-of-phase components into the rear loudspeakers. An alternative 2-to-5 upmixing method proposes to extract the ambient components of the stereo signal and to reproduce those components via the rear loudspeakers of the 5.1 setup. An approach following the same basic ideas on a perceptually more justified basis and using a mathematically more elegant implementation has been recently proposed by C. Faller in “Parametric Multi-channel Audio Coding: Synthesis of Coherence Cues”, IEEE Trans. On Speech and Audio Proc., vol. 14, no. 1, Jan. 2006.
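The basic idea behind matrix decoders described above, namely routing in-phase components to the front and out-of-phase components to the rear, can be sketched with a passive sum/difference matrix. This is a deliberately simplified illustration and does not reproduce any particular commercial decoder or the method of the cited paper. The 1/√2 scaling, the feeding of one difference signal to both rear loudspeakers, and the function name `passive_matrix_upmix` are assumptions.

```python
import math

def passive_matrix_upmix(left_in, right_in):
    """Derive five loudspeaker feeds from a two-channel stereo signal."""
    gain = math.sqrt(0.5)
    # In-phase content adds up in the sum signal (front center).
    center = [gain * (l + r) for l, r in zip(left_in, right_in)]
    # Out-of-phase content adds up in the difference signal (rear).
    surround = [gain * (l - r) for l, r in zip(left_in, right_in)]
    return {
        "left": list(left_in),
        "right": list(right_in),
        "center": center,
        "left_surround": list(surround),
        "right_surround": list(surround),
    }
```

A signal that is identical in both input channels produces silent rear feeds, while a signal with opposite sign in the two channels produces a silent center feed and is routed to the rear loudspeakers.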
The recently published MPEG Surround standard performs an upmix from one or two downmixed and transmitted channels to the final channels used in reproduction or playback, which is usually 5.1. This is implemented either using spatial side information (side information similar to the BCC technique) or without side information, by using the phase relations between the two channels of a stereo downmix (“non-guided mode” or “enhanced matrix mode”).
All methods for format conversion described in the previous paragraphs are specialized to be applied to specific configurations of both the source and the destination audio reproduction format and are thus not universal. That is, a conversion from arbitrary input multi-channel representations to arbitrary output multi-channel representations cannot be performed. That is to say, the prior art transformation techniques are specifically tailored to the number of loudspeakers and their precise positions for the input multi-channel audio representation as well as for the output multi-channel representation.
It is, naturally, desirable to have a concept for multi-channel transformation which is applicable to arbitrary combinations of input and output multi-channel representations.