There is an increasing need for new technologies and innovative products in the area of entertainment electronics. An important prerequisite for the success of new multimedia systems is to offer optimal functionalities or capabilities. This is achieved by the employment of digital technologies and, in particular, computer technology. Examples of this are applications offering an enhanced close-to-reality audiovisual impression. In previous audio systems, a substantial disadvantage lies in the quality of the spatial sound reproduction of natural, but also of virtual environments.
Methods of multi-channel loudspeaker reproduction of audio signals have been known and standardized for many years. All usual techniques have the disadvantage that both the positions of the loudspeakers and the position of the listener are already impressed on the transfer format. If the loudspeakers are arranged incorrectly with reference to the listener, the audio quality suffers significantly. Optimal sound is only possible in a small area of the reproduction space, the so-called sweet spot.
A better natural spatial impression as well as greater enclosure or envelope in the audio reproduction may be achieved with the aid of a new technology. The principles of this technology, the so-called wave field synthesis (WFS), have been studied at TU Delft and were first presented in the late 1980s (Berkhout, A. J.; de Vries, D.; Vogel, P.: Acoustic Control by Wave Field Synthesis. JASA 93, 1993).
Due to this method's enormous demands on computer power and transfer rates, wave field synthesis has up to now only rarely been employed in practice. Only the progress in the areas of microprocessor technology and audio encoding permits the employment of this technology in concrete applications today.
The basic idea of WFS is based on the application of Huygens' principle of wave theory: each point reached by a wave is the starting point of an elementary wave propagating in a spherical or circular manner.
Applied to acoustics, any arbitrary shape of an incoming wave front may be replicated by a large number of loudspeakers arranged next to each other (a so-called loudspeaker array). In the simplest case, a single point source to be reproduced and a linear arrangement of the loudspeakers, the audio signal of each loudspeaker has to be fed with a time delay and amplitude scaling so that the radiated sound fields of the individual loudspeakers superimpose correctly. With several sound sources, the contribution to each loudspeaker is calculated separately for each source, and the resulting signals are added. If the sources to be reproduced are in a room with reflecting walls, the reflections also have to be reproduced via the loudspeaker array as additional sources. Thus, the computational expenditure strongly depends on the number of sound sources, the reflection properties of the recording room, and the number of loudspeakers.
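The time delay and amplitude scaling for the simplest case above, a single point source and a linear array, may be sketched as follows. This is only a minimal illustration under simplifying assumptions that are not taken from the text: a speed of sound of 343 m/s, a delay equal to the distance between source and loudspeaker divided by the speed of sound, and a gain inversely proportional to that distance. Actual WFS driving functions contain further terms. The names `delay_and_gain`, `speakers`, and `source` are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def delay_and_gain(source, speaker, c=SPEED_OF_SOUND):
    """Return (delay in seconds, gain) for one loudspeaker.

    Simplified model (an assumption, not a full WFS driving function):
    the delay is the propagation time from the virtual source to the
    loudspeaker, and the gain falls off with 1/distance.
    """
    distance = math.dist(source, speaker)
    delay = distance / c
    gain = 1.0 / max(distance, 1e-3)  # clamp to avoid division by zero
    return delay, gain


# Linear array: nine loudspeakers spaced 0.25 m apart along the x axis.
speakers = [(0.25 * i, 0.0) for i in range(-4, 5)]
# Virtual point source 2 m behind the centre of the array.
source = (0.0, -2.0)

params = [delay_and_gain(source, s) for s in speakers]
for (x, _), (delay, gain) in zip(speakers, params):
    print(f"x={x:+.2f} m  delay={delay * 1000:.3f} ms  gain={gain:.3f}")
```

In this sketch the central loudspeaker, being closest to the virtual source, receives the shortest delay and the largest gain, and the values are symmetric about the centre of the array.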
In particular, the advantage of this technique is that a natural spatial sound impression is possible across a large area of the reproduction space. In contrast to the known techniques, the direction and distance of sound sources are reproduced in a very exact manner. To a limited degree, virtual sound sources may even be positioned between the real loudspeaker array and the listener.
Although wave field synthesis functions well for environments whose properties are known, irregularities occur if a property changes or if the wave field synthesis is executed on the basis of an environment property that does not match the actual property of the environment.
The technique of wave field synthesis, however, may also be advantageously employed to supplement a visual perception with a corresponding spatial audio perception. Previously, in the production in virtual studios, the conveyance of an authentic visual impression of the virtual scene was in the foreground. The acoustic impression matching the image is usually impressed on the audio signal afterwards by manual steps in the so-called postproduction, or it is classified as too expensive and time-intensive in the realization and thus neglected. As a result, a contradiction between the individual sensations usually arises, which leads to the designed space, i.e. the designed scene, being perceived as less authentic.
In the technical publication “Subjective experiments on the effects of combining spatialized audio and 2D video projection in audio-visual systems”, W. de Bruijn and M. Boone, AES convention paper 5582, May 10 to 13, 2002, Munich, subjective experiments on the effects of combining spatial audio and a two-dimensional video projection in audiovisual systems are presented. In particular, it is stressed that two speakers standing at differing distances from a camera and almost behind each other can be better understood by a viewer if the two people are seen and reconstructed as different virtual sound sources with the aid of wave field synthesis. In this case, subjective tests have shown that a listener can better understand and distinguish the two speakers, who are talking at the same time, separately from each other.
In a conference contribution to the 46th international scientific colloquium in Ilmenau from Sep. 24 to 27, 2001, entitled “Automatisierte Anpassung der Akustik an virtuelle Räume”, U. Reiter, F. Melchior, and C. Seidel, an approach to automating tone postproduction processes is presented. To this end, the parameters of a film set that may be used for the visualization, such as room size, texture of the surfaces or camera position, and position of the actors, are checked for their acoustic relevance, whereupon corresponding control data is generated. This control data then influences, in an automated manner, the effect and postproduction processes employed, such as the adaptation of the speaker volume in dependence on the distance to the camera, or of the reverberation time in dependence on room size and wall texture. Here, the aim is to enhance the visual impression of a virtual scene for a heightened perception of reality.
“Hearing with the ears of the camera” is to be enabled, in order to make a scene appear more real. Here, the highest possible correlation between the sound event location in the picture and the hearing event location in the surround field is striven for. This means that sound source positions are supposed to be adapted to the picture. Camera parameters, such as zoom, are also to be included in the tone design, just like the position of two loudspeakers L and R. To this end, the system writes the tracking data of a virtual studio into a file together with an accompanying time code. At the same time, picture, tone, and time code are recorded on a MAZ (magnetic tape recorder). The tracking file, the so-called camdump file, is transferred to a computer, which generates control data for an audio workstation from it and outputs this data via a MIDI interface synchronously with the picture originating from the MAZ. The actual audio processing, such as positioning the sound source in the surround field and inserting early reflections and reverberation, takes place within the audio workstation. The signal is rendered for a 5.1 surround loudspeaker system.
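How control data of the kind described above could be derived from scene parameters may be illustrated by the following sketch. It is not the method of the cited conference contribution. It assumes, purely for illustration, a 1/r law for the speaker volume as a function of the distance to the camera and Sabine's formula for the reverberation time as a function of room size and wall texture. The room dimensions, absorption coefficients, and all function names are hypothetical.

```python
def distance_gain(distance_m, reference_m=1.0):
    """Linear gain for a speaker at `distance_m` from the camera (1/r law).

    Distances below the reference distance are clamped to a gain of 1.
    """
    return reference_m / max(distance_m, reference_m)


def sabine_rt60(volume_m3, surfaces):
    """Reverberation time in seconds according to Sabine's formula.

    `surfaces` is a list of (area in m^2, absorption coefficient) pairs;
    the absorption coefficient stands in for the wall texture.
    """
    absorption = sum(area * alpha for area, alpha in surfaces)
    return 0.161 * volume_m3 / absorption


# Hypothetical virtual room of 10 m x 8 m x 4 m.
room_volume = 10.0 * 8.0 * 4.0
room_surfaces = [
    (2 * 10.0 * 8.0, 0.30),  # floor and ceiling, assumed absorption
    (2 * 10.0 * 4.0, 0.10),  # long walls, assumed absorption
    (2 * 8.0 * 4.0, 0.10),   # short walls, assumed absorption
]

control_data = {
    "gain": distance_gain(4.0),  # speaker 4 m away from the camera
    "rt60_s": sabine_rt60(room_volume, room_surfaces),
}
print(control_data)
```

A larger room or harder wall surfaces (smaller absorption coefficients) lengthen the computed reverberation time, and a speaker further away from the camera is given a lower gain.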
Camera tracking parameters, just like positions of sound sources in the capture setting, may be recorded in real movie sets. Such data may also be generated in virtual studios.
In a virtual studio, an actor or presenter stands alone in a recording room. In particular, he or she stands in front of a blue wall, also referred to as blue box or blue panel. A pattern of blue and light-blue strips is applied onto this blue wall. The special feature of this pattern is that the strips are of different widths, so that a multiplicity of strip combinations results. Due to the unique strip combinations on the blue wall, it is possible in postproduction, when the blue wall is replaced by a virtual background, to determine exactly in which direction the camera is looking. With the aid of this information, the computer may determine the background for the current camera viewing angle. Furthermore, sensors on the camera, which sense and output additional camera parameters, are evaluated. Typical camera parameters sensed by means of sensors are the three degrees of translation x, y, z, the three degrees of rotation, also referred to as roll, tilt, pan, and the focal length or zoom, which is equivalent to the information on the aperture angle of the camera.
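The equivalence of focal length and aperture angle mentioned above may be illustrated with the usual pinhole camera relation, in which the aperture angle equals twice the arctangent of half the sensor width divided by the focal length. The sketch below is an illustration only. The sensor width of 36 mm and the function name `aperture_angle_deg` are assumptions that do not come from the text.

```python
import math


def aperture_angle_deg(focal_length_mm, sensor_width_mm=36.0):
    """Horizontal aperture angle in degrees for a pinhole camera model.

    sensor_width_mm is an assumed value (36 mm); a real studio camera
    would supply its own sensor dimensions.
    """
    half_angle = math.atan(sensor_width_mm / (2.0 * focal_length_mm))
    return math.degrees(2.0 * half_angle)


for f in (18.0, 35.0, 70.0):
    angle = aperture_angle_deg(f)
    print(f"focal length {f:.0f} mm -> aperture angle {angle:.1f} deg")
```

Zooming in, i.e. increasing the focal length, narrows the aperture angle, so that either quantity may be used to describe the current camera setting once the sensor width is known.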
So that the exact position of the camera may also be determined without image recognition and without expensive sensor technology, a tracking system may also be employed, which consists of several infrared cameras determining the position of an infrared sensor mounted on the camera. Thus, the position of the camera is determined as well. With the camera parameters provided by the sensor technology and the strip information evaluated by the image recognition, a real-time computer may now compute the background for the current picture. Thereupon, the blue hue of the blue background is removed from the picture, so that the virtual background is inserted instead of the blue background.
In the majority of cases, a concept is followed in which the aim is to obtain an overall acoustic impression of the visually imaged scenery. This may be well described with the term “full shot”, which originates from image design. This “full shot” sound impression mostly remains constant over all shots in a scene, although the optical angle of view on the things mostly changes strongly. Thus, optical details are highlighted or put into the background by corresponding shots, whereas the sound does not follow. Counter shots in the movie dialog design are also not reenacted by the tone.
Hence, there is the need to acoustically embed the viewer into an audiovisual scene. Here, the screen or image area forms the viewing direction and the angle of view of the viewer. This means that the tone is to track the image such that it matches the scene image. In particular, this becomes even more important for virtual studios, since there is typically no correlation between the tone of, for example, the presentation and the surroundings in which the presenter currently is. In order to get an audiovisual overall impression of the scene, a spatial impression matching the rendered image has to be simulated. A substantial subjective property in such a sound concept is the location of a sound source as perceived, for example, by a viewer of a movie screen.
In the audio field, good spatial sound for a large listener area can be accomplished by the technique of wave field synthesis (WFS). As has been set forth, wave field synthesis is based on the Huygens principle, according to which wave fronts may be shaped and built up by superimposition of elementary waves. According to a mathematically exact, theoretical description, an infinite number of sources at an infinitely small distance from each other would have to be used for the generation of the elementary waves. In practice, however, a finite number of loudspeakers is used at a finite, small distance from each other. Each of these loudspeakers is controlled, according to the WFS principle, with an audio signal from a virtual source having a certain delay and a certain level. Levels and delays are usually different for all loudspeakers.
As has already been set forth, the wave field synthesis system works on the basis of the Huygens principle and reconstructs a given waveform, for example that of a virtual source arranged at a certain distance from a presentation area or from a listener in the presentation area, by a multiplicity of individual waves. The wave field synthesis algorithm thus obtains information on the actual position of an individual loudspeaker of the loudspeaker array and then calculates, for this individual loudspeaker, the component signal that this loudspeaker finally has to radiate. The superimposition of the loudspeaker signal of this one loudspeaker with the loudspeaker signals of the other active loudspeakers then performs a reconstruction such that the listener has the impression that he or she is not “irradiated with sound” by many individual loudspeakers, but only by a single loudspeaker at the position of the virtual source.
For several virtual sources in a wave field synthesis setting, the contribution of each virtual source for each loudspeaker, i.e. the component signal of the first virtual source for the first loudspeaker, of the second virtual source for the first loudspeaker, etc., is calculated to then add the component signals to finally obtain the actual loudspeaker signal. In case of, for example, three virtual sources, the superimposition of the loudspeaker signals of all active loudspeakers at the listener would lead to the listener not having the impression that he or she is irradiated with sound from a large array of loudspeakers, but that the sound he or she is hearing only comes from three sound sources positioned at special positions, which are equal to the virtual sources.
In practice, the component signals are mostly calculated by imparting a delay and a scaling factor, at a certain time instant, on the audio signal associated with a virtual source, depending on the position of the virtual source and the position of the loudspeaker. The result is a delayed and/or scaled audio signal of the virtual source. When only one virtual source is present, this signal immediately represents the loudspeaker signal; otherwise, it contributes to the loudspeaker signal for the loudspeaker considered, after addition with the further component signals for that loudspeaker from the other virtual sources.
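The calculation of component signals and their addition to a loudspeaker signal may be sketched as follows. This is a minimal illustration, assuming delays given as whole numbers of samples and audio signals given as short lists of sample values. The function names and the example values are hypothetical, and a practical system would typically also handle fractional delays and time-varying source positions.

```python
def component_signal(audio, delay_samples, gain, length):
    """Delay `audio` by a whole number of samples and scale it by `gain`.

    Samples shifted beyond `length` are dropped; the rest is zero-padded.
    """
    out = [0.0] * length
    for n, sample in enumerate(audio):
        k = n + delay_samples
        if 0 <= k < length:
            out[k] = gain * sample
    return out


def loudspeaker_signal(components):
    """Add the component signals of all virtual sources sample by sample."""
    return [sum(samples) for samples in zip(*components)]


# Two hypothetical virtual sources contributing to the same loudspeaker.
source_a = [1.0, 0.5]
source_b = [0.2, 0.2]

comp_a = component_signal(source_a, delay_samples=1, gain=0.5, length=5)
comp_b = component_signal(source_b, delay_samples=2, gain=1.0, length=5)

signal = loudspeaker_signal([comp_a, comp_b])
print(signal)
```

With a single virtual source, `loudspeaker_signal([comp_a])` simply returns the component signal itself, which corresponds to the case in which the delayed and scaled audio signal immediately represents the loudspeaker signal.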
Typical wave field synthesis algorithms work independently of how many loudspeakers are present in the loudspeaker array. The theory underlying wave field synthesis states that any arbitrary sound field may be exactly reconstructed by an infinitely high number of individual loudspeakers arranged infinitely close to each other. In practice, however, neither the infinitely high number nor the infinitely close arrangement can be realized. Instead, there is a limited number of loudspeakers, which are additionally arranged at certain given distances from each other. Thus, in real systems, only an approximation is achieved to the actual wave field that would arise if the virtual source were actually present, i.e. were a real source.
Furthermore, there are various scenarios in which the loudspeaker array, when considering a movie theater, is arranged only on the side of the movie screen, for example. In this case, the wave field synthesis module would generate loudspeaker signals for these loudspeakers, and these loudspeaker signals will normally be the same as for the corresponding loudspeakers in a loudspeaker array that extends not only across the side of the movie theater on which the screen is arranged, but also to the left of, to the right of, and behind the audience room. This “360°” loudspeaker array will of course provide a better approximation to an exact wave field than a one-sided array only, for example in front of the viewers. Nevertheless, the loudspeaker signals for the loudspeakers that are in front of the viewers are the same in both cases. This means that a wave field synthesis module typically does not obtain feedback as to how many loudspeakers are present or whether the array is one-sided, multi-sided, or even a 360° array. In other words, a wave field synthesis means calculates a loudspeaker signal for a loudspeaker on the basis of the position of that loudspeaker and independently of which further loudspeakers are present or not present.
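The independence of a loudspeaker signal from the remaining loudspeakers may be illustrated by the following sketch. It assumes, for illustration only, a simplified model in which each loudspeaker is driven with a delay equal to its distance from the virtual source divided by a speed of sound of 343 m/s and with a gain of one over that distance. The array geometries and all names are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value


def drive_parameters(source, array, c=SPEED_OF_SOUND):
    """Map each loudspeaker position to its (delay, gain) pair.

    Each entry depends only on the source and on that loudspeaker's own
    position (simplified model: delay = distance / c, gain = 1 / distance).
    """
    result = {}
    for speaker in array:
        distance = math.dist(source, speaker)
        result[speaker] = (distance / c, 1.0 / max(distance, 1e-3))
    return result


# One-sided array in front of the audience and a surrounding array.
front = [(float(x), 0.0) for x in range(-2, 3)]
left = [(-2.0, float(y)) for y in range(1, 5)]
right = [(2.0, float(y)) for y in range(1, 5)]
back = [(float(x), 5.0) for x in range(-2, 3)]
surround = front + left + right + back

source = (0.5, -3.0)  # virtual source behind the front array
front_only = drive_parameters(source, front)
full = drive_parameters(source, surround)

# The front loudspeakers get identical parameters in both configurations.
same = all(front_only[s] == full[s] for s in front)
print(same)
```

Adding or removing the lateral and rear loudspeakers changes only how well the overall wave field is approximated, not the parameters computed for the front loudspeakers.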
For example, U.S. Pat. No. 7,684,578 describes a wave field synthesis apparatus that reduces artifacts by not supplying all loudspeakers of the loudspeaker array with drive signal components. It shows the determination of relevant loudspeakers and a calculation of drive signal components only for the relevant loudspeakers.
In general, the reduction or elimination of artifacts caused by different effects is very important.