The present invention relates to techniques for improving the perception of the direction of origin of a reconstructed audio signal. In particular, the present invention proposes an apparatus and a method for the reproduction of recorded audio signals such that a selectable direction of audio sources can be emphasized or over-weighted with respect to audio signals coming from other directions.
Generally, in multi-channel reproduction and listening, a listener is surrounded by multiple loudspeakers. Various methods exist to capture audio signals for specific set-ups. One general goal in reproduction is to recreate the spatial composition of the originally recorded signal, i.e. the origins of the individual audio sources, such as the location of a trumpet within an orchestra. Several loudspeaker set-ups are fairly common and can create different spatial impressions. Without using special post-production techniques, the commonly known two-channel stereo set-ups can only recreate auditory events on a line between the two loudspeakers. This is mainly achieved by so-called “amplitude panning”, where the amplitude of the signal associated with one audio source is distributed between the two loudspeakers, depending on the position of the audio source with respect to the loudspeakers. This is usually done during recording or subsequent mixing. That is, an audio source coming from the far left with respect to the listening position will be reproduced mainly by the left loudspeaker, whereas an audio source in front of the listening position will be reproduced with identical amplitude (level) by both loudspeakers. However, sound emanating from other directions cannot be reproduced.
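The amplitude-panning principle just described can be sketched in a few lines. This is a minimal illustration only; the constant-power panning law and the function name are choices made for this example and are not taken from the text above:

```python
import math

def amplitude_pan(sample, pan):
    """Constant-power amplitude panning between two loudspeakers.

    pan: -1.0 (far left) .. +1.0 (far right).  A pan of 0.0 centres
    the source, feeding both loudspeakers with identical level, as
    described for a frontal audio source.
    """
    # Map pan position [-1, 1] onto a panning angle [0, pi/2].
    angle = (pan + 1.0) * math.pi / 4.0
    left = sample * math.cos(angle)
    right = sample * math.sin(angle)
    return left, right

# A far-left source is routed entirely to the left loudspeaker:
l, r = amplitude_pan(1.0, -1.0)
```

With pan = 0.0, both loudspeakers receive the sample scaled by cos(π/4), i.e. identical levels, matching the frontal case described above.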
Consequently, by using more loudspeakers positioned around the listener, more directions can be covered and a more natural spatial impression can be created. Probably the best-known multi-channel loudspeaker layout is the 5.1 standard (ITU-R 775-1), which consists of five loudspeakers whose azimuthal angles with respect to the listening position are predetermined to be 0°, ±30° and ±110°. That means that during recording or mixing the signal is tailored to that specific loudspeaker configuration, and deviations of a reproduction set-up from the standard will result in decreased reproduction quality.
Numerous other systems with varying numbers of loudspeakers positioned in different directions have also been proposed. Professional and special systems, especially in theaters and sound installations, also include loudspeakers at different heights.
According to the different reproduction set-ups, several different recording methods have been designed and proposed for the previously mentioned loudspeaker systems, in order to record and reproduce the spatial impression in the listening situation as it would have been perceived in the recording environment. A theoretically ideal way of recording spatial sound for a chosen multi-channel loudspeaker system would be to use the same number of microphones as there are loudspeakers. In such a case, the directivity patterns of the microphones should also correspond to the loudspeaker layout, such that sound from any single direction would only be recorded by a small number of microphones (1, 2 or more). Each microphone is associated with a specific loudspeaker. The more loudspeakers are used in reproduction, the narrower the directivity patterns of the microphones have to be. However, narrow directional microphones are rather expensive and typically have a non-flat frequency response, degrading the quality of the recorded sound in an undesirable manner. Furthermore, using several microphones with too broad directivity patterns as input to multi-channel reproduction results in a colored and blurred auditory perception, due to the fact that sound emanating from a single direction would be reproduced with more loudspeakers than necessary, as it would be recorded with microphones associated with different loudspeakers. Generally, currently available microphones are best suited for two-channel recordings and reproductions, that is, they are designed without the goal of reproducing a surrounding spatial impression.
From the point of view of microphone design, several approaches have been discussed to adapt the directivity patterns of microphones to the demands of spatial audio reproduction. Generally, all microphones capture sound differently depending on the direction of arrival of the sound at the microphone. That is, microphones have a different sensitivity depending on the direction of arrival of the recorded sound. In some microphones this effect is minor, as they capture sound almost independently of the direction. These microphones are generally called omnidirectional microphones. In a typical microphone design, a circular diaphragm is attached to a small airtight enclosure. If the diaphragm is not attached to the enclosure and sound reaches it equally from each side, its directional pattern has two lobes. That is, such a microphone captures sound with equal sensitivity from both the front and the back of the diaphragm, however with inverse polarities. Such a microphone does not capture sound coming from directions coincident with the plane of the diaphragm, i.e. perpendicular to the direction of maximum sensitivity. Such a directional pattern is called a dipole, or figure-of-eight.
Omnidirectional microphones may also be modified into directional microphones using a non-airtight enclosure for the microphone. The enclosure is especially constructed such that the sound waves are allowed to propagate through the enclosure and reach the diaphragm, wherein some directions of propagation are favored over others, such that the directional pattern of such a microphone becomes a pattern between omnidirectional and dipole. Those patterns may, for example, have two lobes of different strength. Some commonly known microphones have patterns with only one single lobe. The most important example is the cardioid pattern, where the directional function D can be expressed as D = 1 + cos(θ), θ being the direction of arrival of the sound. The directional function thus quantifies what fraction of the incoming sound amplitude is captured, depending on the direction.
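The cardioid directional function given above can be transcribed directly; the following short sketch simply evaluates D = 1 + cos(θ) for a few directions (the function name is chosen for illustration):

```python
import math

def cardioid_gain(theta):
    """Directional function D = 1 + cos(theta) of a cardioid
    microphone; theta is the direction of arrival in radians,
    measured from the direction of maximum sensitivity."""
    return 1.0 + math.cos(theta)

front = cardioid_gain(0.0)      # on-axis: maximum sensitivity, D = 2
rear = cardioid_gain(math.pi)   # from the back: D = 0, sound rejected
```

The single-lobe character of the cardioid is visible directly: sensitivity falls monotonically from the front (θ = 0) to complete rejection at the rear (θ = π).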
The previously discussed omnidirectional patterns are also called zeroth-order patterns, and the other patterns mentioned previously (dipole and cardioid) are called first-order patterns. None of the previously discussed microphone designs allows arbitrary shaping of the directivity pattern, since the directivity pattern is entirely determined by the mechanical construction.
To partly overcome this problem, some specialized acoustical structures have been designed, which can be used to create narrower directional patterns than those of first-order microphones. For example, when a tube with holes in it is attached to an omnidirectional microphone, a microphone with narrow directional pattern can be created. These microphones are called shotgun or rifle microphones. However, they typically do not have a flat frequency response, that is, the directivity pattern is narrowed at the cost of the quality of the recorded sound. Furthermore, the directivity pattern is predetermined by the geometric construction and, thus, the directivity pattern of a recording performed with such a microphone cannot be controlled after the recording.
Therefore, other methods have been proposed which allow the directivity pattern to be altered, at least partly, after the actual recording. Generally, these rely on the basic idea of recording sound with an array of omnidirectional or directional microphones and applying signal processing afterwards. Various such techniques have recently been proposed. A fairly simple example is to record sound with two omnidirectional microphones placed close to each other and to subtract one signal from the other. This creates a virtual microphone signal having a directional pattern equivalent to a dipole.
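The two-microphone example above amounts to a sample-wise subtraction; a minimal sketch (function and variable names are illustrative) could look as follows:

```python
def virtual_dipole(front_signal, back_signal):
    """Subtract the signals of two closely spaced omnidirectional
    microphones sample by sample; the difference approximates a
    dipole (figure-of-eight) pattern along the microphone axis."""
    return [f - b for f, b in zip(front_signal, back_signal)]

# Sound arriving perpendicular to the microphone axis reaches both
# capsules simultaneously and is cancelled, as a dipole demands:
broadside = [0.2, -0.5, 0.9]
cancelled = virtual_dipole(broadside, broadside)
```

Sound along the axis, by contrast, reaches the two capsules at slightly different times, so the difference signal does not vanish, yielding the two lobes of the figure-of-eight pattern.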
In other, more sophisticated schemes, the microphone signals may also be delayed or filtered before being summed. Using beam forming, a technique also known from wireless LAN, a signal corresponding to a narrow beam is formed by filtering each microphone signal with a specially designed filter and summing the filtered signals (filter-sum beam forming). However, these techniques are blind to the signal itself, that is, they are not aware of the direction of arrival of the sound. Thus, a predetermined directional pattern may be defined, which is independent of the actual presence of a sound source in the predetermined direction. Generally, the estimation of the “direction of arrival” of sound is a task of its own.
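The simplest instance of the filter-sum idea replaces the per-microphone filters by pure delays (delay-and-sum beam forming). The sketch below, with integer sample delays chosen for illustration, shows the principle; a real filter-sum beam former would use designed FIR filters instead:

```python
def delay_and_sum(signals, delays):
    """Delay-and-sum beam forming, the simplest filter-sum variant:
    each microphone signal is delayed by an integer number of samples
    and the delayed signals are averaged.  Sound arriving from the
    steered direction adds coherently; sound from other directions is
    attenuated.  The pattern is fixed by the chosen delays, i.e. it
    is 'blind' to where sound actually comes from."""
    n = len(signals[0])
    out = [0.0] * n
    for sig, delay in zip(signals, delays):
        for i in range(n):
            j = i - delay
            if 0 <= j < n:
                out[i] += sig[j]
    return [v / len(signals) for v in out]
```

For example, if a wavefront reaches the second microphone one sample later than the first, steering delays of (1, 0) realign the two copies so that they sum coherently.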
Generally, numerous different spatial directional characteristics can be formed with the above techniques. However, forming arbitrary spatially selective sensitivity patterns (i.e. forming narrow directional patterns) necessitates a large number of microphones.
An alternative way to create multi-channel recordings is to locate a microphone close to each sound source (e.g. an instrument) to be recorded and recreate the spatial impression by controlling the levels of the close-up microphone signals in the final mix. However, such a system demands a large number of microphones and a lot of user interaction in creating the final down-mix.
A method to overcome the above problem has recently been proposed and is called directional audio coding (DirAC), which may be used with different microphone systems and which is able to record sound for reproduction with arbitrary loudspeaker set-ups. The purpose of DirAC is to reproduce the spatial impression of an existing acoustical environment as precisely as possible, using a multi-channel loudspeaker system having an arbitrary geometrical set-up. Within the recording environment, the responses of the environment (which may be continuously recorded sound or impulse responses) are measured with an omnidirectional microphone (W) and with a set of microphones allowing the direction of arrival of sound and the diffuseness of sound to be measured. In the following paragraphs and within the application, the term “diffuseness” is to be understood as a measure of the non-directivity of sound. That is, sound arriving at the listening or recording position with equal strength from all directions is maximally diffuse. A common way of quantifying diffuseness is to use diffuseness values from the interval [0, 1], wherein a value of 1 describes maximally diffuse sound and a value of 0 describes perfectly directional sound, i.e. sound arriving from one clearly distinguishable direction only. A commonly known method of measuring the direction of arrival of sound is to apply three figure-of-eight microphones (X, Y, Z) aligned with the Cartesian coordinate axes. Special microphones, so-called “SoundField microphones”, have been designed, which directly yield all desired responses. However, as mentioned above, the W, X, Y and Z signals may also be computed from a set of discrete omnidirectional microphones.
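A rough sketch of how a direction and a diffuseness value in [0, 1] might be derived from short blocks of W, X, Y and Z samples is given below. The intensity-vector estimate and in particular the energy normalisation used here are simplifying assumptions made for illustration only and do not reproduce the exact DirAC formulation:

```python
import math

def direction_and_diffuseness(w, x, y, z):
    """Illustrative analysis of one block of B-format samples (equal-
    length lists).  The short-time intensity vector indicates the
    direction of arrival; its length relative to the total energy
    yields a diffuseness value in [0, 1] (0 = one clear direction,
    1 = maximally diffuse).  Normalisation is an assumption."""
    ix = sum(wi * xi for wi, xi in zip(w, x))
    iy = sum(wi * yi for wi, yi in zip(w, y))
    iz = sum(wi * zi for wi, zi in zip(w, z))
    energy = 0.5 * sum(wi * wi + xi * xi + yi * yi + zi * zi
                       for wi, xi, yi, zi in zip(w, x, y, z))
    norm = math.sqrt(ix * ix + iy * iy + iz * iz)
    azimuth = math.atan2(iy, ix)  # horizontal direction of arrival
    diffuseness = 1.0 - norm / energy if energy > 0 else 1.0
    return azimuth, max(0.0, min(1.0, diffuseness))
```

A plane wave arriving from a single direction makes W and the corresponding dipole channel fully coherent, so the diffuseness estimate approaches 0; uncorrelated channels drive it towards 1, matching the interval convention described above.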
In DirAC analysis, a recorded sound signal is divided into frequency channels which correspond to the frequency selectivity of human auditory perception. That is, the signal is, for example, processed by a filter bank or a Fourier transform to divide it into numerous frequency channels having bandwidths adapted to the frequency selectivity of human hearing. Then, the frequency-band signals are analyzed to determine the direction of origin of sound and a diffuseness value for each frequency channel with a predetermined time resolution. This time resolution does not have to be fixed and may, of course, be adapted to the recording environment. In DirAC, one or more audio channels are recorded or transmitted, together with the analyzed direction and diffuseness data.
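The division into frequency channels can be indicated with a plain DFT over one signal block. Note that the uniform frequency bins below are a stand-in chosen for brevity; as stated above, the actual analysis uses bands adapted to the frequency selectivity of human hearing:

```python
import cmath

def to_frequency_channels(signal, n_bands):
    """Divide one block of samples into frequency channels with a
    plain DFT.  Uniform bins are a simplifying assumption; a
    perceptually motivated filter bank would use non-uniform
    bandwidths."""
    n = len(signal)
    bins = []
    for k in range(n_bands):
        acc = sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t in range(n))
        bins.append(acc)
    return bins
```

The direction and diffuseness analysis described above is then carried out per frequency channel and per time block, yielding the frequency-selective parameterization of the sound field.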
In synthesis or decoding, the audio channels finally applied to the loudspeakers can be based on the omnidirectional channel W (recorded with high quality due to the omnidirectional directivity pattern of the microphone used), or the sound for each loudspeaker may be computed as a weighted sum of W, X, Y and Z, thus forming a signal having a specific directional characteristic for each loudspeaker. Corresponding to the encoding, each audio channel is divided into frequency channels, which are optionally further divided into diffuse and non-diffuse streams, depending on the analyzed diffuseness. If the diffuseness has been measured to be high, a diffuse stream may be reproduced using a technique producing a diffuse perception of sound, such as the decorrelation techniques also used in Binaural Cue Coding. Non-diffuse sound is reproduced using a technique aiming to produce a point-like virtual audio source located in the direction indicated by the direction data determined in the DirAC analysis. That is, the spatial reproduction is not tailored to one specific, “ideal” loudspeaker set-up, as in the conventional techniques (e.g. 5.1). This is possible because the origin of sound is determined as direction parameters (i.e. described by a vector) using the knowledge of the directivity patterns of the microphones used in the recording. As already discussed, the origin of sound in three-dimensional space is parameterized in a frequency-selective manner. As such, the directional impression may be reproduced with high quality for arbitrary loudspeaker set-ups, as long as the geometry of the loudspeaker set-up is known. DirAC is therefore not limited to special loudspeaker geometries and generally allows for a more flexible spatial reproduction of sound.
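The weighted-sum idea for one loudspeaker feed can be indicated as follows. For brevity this horizontal-only sketch omits Z and chooses a cardioid as the virtual pattern pointing towards the loudspeaker; both choices are illustrative assumptions, not the only possibility:

```python
import math

def virtual_loudspeaker_signal(w, x, y, speaker_azimuth):
    """Form one loudspeaker feed as a weighted sum of the W, X and Y
    channel samples: a first-order virtual microphone (a cardioid in
    this sketch) pointing towards the loudspeaker at the given
    azimuth.  Z is omitted (horizontal-only illustration)."""
    gx = math.cos(speaker_azimuth)
    gy = math.sin(speaker_azimuth)
    return [0.5 * (wi + gx * xi + gy * yi)
            for wi, xi, yi in zip(w, x, y)]
```

A source located in the loudspeaker's direction is passed at full level, while a source in the opposite direction is cancelled, which is exactly the per-loudspeaker directional characteristic described above.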
Although numerous techniques have been developed to reproduce multi-channel audio recordings and to record appropriate signals for a later multi-channel reproduction, none of the conventional techniques allows an already recorded signal to be influenced such that a direction of origin of audio signals can be emphasized during reproduction, such that, for example, the intelligibility of the signal from one distinct desired direction may be enhanced.