1. Field of the Invention
The present invention relates to audio reproduction techniques and, in particular, to audio reproduction techniques which are suitable for wave-field synthesis modules in order to position sources of sound in a manner tuned to a video reproduction.
2. Description of Prior Art
There is an increasing need for new technologies and innovative products in the area of consumer electronics. It is an important prerequisite for the success of new multimedia systems to offer optimal functionalities or capabilities. This is achieved by the employment of digital technologies and, in particular, computer technology. Examples hereof are the applications offering an enhanced close-to-reality audiovisual impression. In previous audio systems, a substantial disadvantage has been the quality of the spatial sound reproduction of natural, but also of virtual environments.
Methods of multi-channel loudspeaker reproduction of audio signals have been known and standardized for many years. All common techniques have the disadvantage that both the site of the loudspeakers and the position of the listener are already impressed on the transfer format. With incorrect arrangement of the loudspeakers with reference to the listener, audio quality suffers significantly. Optimum sound is only possible in a small area of the reproduction space, the so-called sweet spot.
A more natural spatial impression as well as a more pronounced envelopment in audio reproduction may be achieved with the aid of a new technology. The principles of this technology, the so-called wave-field synthesis (WFS), have been studied at the TU Delft and were first presented in the late 80s (Berkhout, A. J.; de Vries, D.; Vogel, P.: Acoustic control by Wave-field Synthesis. JASA 93, 993).
Due to this method's enormous requirements for computer power and transfer rates, wave-field synthesis has up to now only rarely been employed in practice. Only the progress achieved in the areas of microprocessor technology and audio encoding today permits the employment of this technology in concrete applications. First products in the professional area are expected for next year. In a few years' time, first wave-field synthesis applications for the consumer area are also expected to come on the market.
WFS is based on the application of Huygens' principle of wave theory:
Each point caught by a wave is a starting point of an elementary wave propagating in a spherical or a circular manner.
Applied to acoustics, every arbitrary shape of an incoming wave front may be replicated by a large number of loudspeakers arranged next to one another (a so-called loudspeaker array). In the simplest case, which includes a single point source to be reproduced and a linear arrangement of the loudspeakers, each loudspeaker has to be fed with an audio signal that is time-delayed and amplitude-scaled such that the radiated sound fields of the individual loudspeakers superimpose correctly. With several sources of sound, the contribution of each source to each loudspeaker is calculated separately, and the resulting signals are added. If the sources to be reproduced are in a room with reflecting walls, the reflections also have to be reproduced via the loudspeaker array as additional sources. Thus, the calculation expenditure depends heavily on the number of sources of sound, the reflection properties of the recording room, and the number of loudspeakers.
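The simplest case described above, with point sources and a linear loudspeaker array, may be illustrated by the following minimal sketch. It is not a complete wave-field synthesis driving function: it assumes a delay equal to the source-to-loudspeaker distance divided by the speed of sound and a gain of one over that distance. The names `driving_parameters` and `render`, the value of 343 m/s, and the coordinate convention are assumptions made for illustration only.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at roughly 20 degrees C


def driving_parameters(source_xy, speaker_positions, c=SPEED_OF_SOUND):
    """Return one (delay_seconds, gain) pair per loudspeaker for a point source.

    Simplified model (assumption): delay = distance / c, gain = 1 / distance.
    """
    params = []
    for sx, sy in speaker_positions:
        r = math.hypot(sx - source_xy[0], sy - source_xy[1])
        r = max(r, 1e-6)  # avoid division by zero if source sits on a speaker
        params.append((r / c, 1.0 / r))
    return params


def render(sources, speaker_positions, sample_rate, n_samples):
    """Compute each source's contribution per loudspeaker and add the results.

    `sources` is a list of (signal, (x, y)) tuples, signal being a list of floats.
    """
    outputs = [[0.0] * n_samples for _ in speaker_positions]
    for signal, position in sources:
        per_speaker = driving_parameters(position, speaker_positions)
        for ch, (delay, gain) in enumerate(per_speaker):
            shift = round(delay * sample_rate)  # delay in whole samples
            for n, sample in enumerate(signal):
                m = n + shift
                if m < n_samples:
                    outputs[ch][m] += gain * sample
    return outputs


# Three loudspeakers on a line, one virtual source 2 m behind the array.
speakers = [(-1.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
params = driving_parameters((0.0, -2.0), speakers)
out = render([([1.0], (0.0, -2.0))], speakers, sample_rate=3430, n_samples=64)
```

In this sketch the loudspeaker closest to the virtual source receives the signal earliest and loudest, and reflections could be handled by passing them to `render` as additional sources, which is why the cost grows with the number of sources and loudspeakers.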
In particular, the advantage of this technique is that a natural spatial sound impression across a great area of the reproduction space is possible. In contrast to the known techniques, direction and distance of sources of sound are reproduced in a very exact manner. To a limited degree, virtual sources of sound may even be positioned between the real loudspeaker array and the listener.
Although wave-field synthesis functions well for environments whose conditions are known, irregularities occur if the conditions change or if wave-field synthesis is executed on the basis of an environmental condition which does not match the actual condition of the environment.
The technique of wave-field synthesis, however, may also be advantageously employed to supplement a visual perception by a corresponding spatial audio perception. Previously, in the production in virtual studios, emphasis has been placed on conveying an authentic visual impression of the virtual scene. The acoustic impression matching the image is usually subsequently impressed on the audio signal by manual steps during so-called post-production, or is classified as too expensive and time-consuming in its implementation, and is thus neglected. As a result, the individual sensory perceptions usually contradict one another, which leads to the designed space, i.e. the designed scene, being perceived as less authentic.
In the specialist publication “Subjective experiments on the effects of combining spatialized audio and 2D video projection in audio-visual systems”, W. de Bruijn and M. Boone, AES convention paper 5582, May 10 to 13, 2002, Munich, subjective experiments performed on the effects of combining spatial audio and a two-dimensional video projection in audio-visual systems are presented. It is emphasized, in particular, that two human speakers positioned almost one behind the other and at different distances from a camera, may be better understood by an observer if, by means of wave-field synthesis, the two persons, positioned one behind the other, are interpreted and reconstructed as different virtual sources of sound. In this case, subjective tests have revealed that it is easier for a listener to understand and differentiate between the two speakers, who are speaking at the same time, if they are separate from one another.
In a conference paper presented at the 46th International Scientific Colloquium in Ilmenau from Sep. 24 to 27, 2001, entitled “Automatisierte Anpassung der Akustik an virtuelle Räume” (automated adaptation of the acoustics to virtual rooms), U. Reiter, F. Melchior and C. Seidel, an approach to automating sound post-processing methods is presented. To this end, the parameters of a film set which are required for visualization, such as the size of the room, the texture of the surfaces or the positions of the camera and of the actors, are checked as to their acoustic relevance, whereupon corresponding control data is generated. This data then influences, in an automated manner, the effect and post-processing methods employed for post-production, such as the adjustment of the dependence of the speaker's loudness, or volume, on the distance from the camera, or the reverberation time in dependence on the size of the room and the nature of the walls. Here, the goal is to reinforce the visual impression of a virtual scene for enhanced perception of reality.
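The two dependencies named above, loudness on camera distance and reverberation time on room size and wall properties, may be sketched as follows. The cited paper's actual mapping is not reproduced here. As stand-ins, the sketch assumes the free-field inverse distance law (about 6 dB less level per doubling of distance) and Sabine's reverberation formula; the function names, the reference distance of 1 m, and the absorption coefficient are hypothetical.

```python
import math


def distance_gain_db(distance_m, reference_m=1.0):
    """Level change in dB relative to the reference distance.

    Assumption: free-field inverse distance law, -20*log10(d / d_ref).
    """
    return -20.0 * math.log10(max(distance_m, 1e-6) / reference_m)


def sabine_rt60(volume_m3, surfaces):
    """Reverberation time in seconds after Sabine: RT60 = 0.161 * V / A.

    `surfaces` is a list of (area_m2, absorption_coefficient) pairs, and
    A is the sum of area * coefficient over all surfaces.
    """
    absorption = sum(area * alpha for area, alpha in surfaces)
    if absorption <= 0.0:
        raise ValueError("total absorption must be positive")
    return 0.161 * volume_m3 / absorption


# Hypothetical 5 m x 4 m x 3 m room with uniformly absorbing surfaces.
volume = 5.0 * 4.0 * 3.0
total_area = 2.0 * (5.0 * 4.0 + 5.0 * 3.0 + 4.0 * 3.0)
rt60 = sabine_rt60(volume, [(total_area, 0.2)])
gain_at_2m = distance_gain_db(2.0)
```

Control data of this kind could be derived from the visualization parameters of the set, so that an actor moving away from the camera becomes quieter and a larger or more reflective virtual room reverberates longer.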
The intention is to enable “hearing with the ears of the camera” to make a scene appear more authentic. The aim here is to achieve as high a correlation as possible between the sound event location in the image and the hearing event location in the surround field. This means that sound-source positions are supposed to be constantly adjusted to an image. Camera parameters, such as zoom, are to be integrated into sound production just as much as positions of two loudspeakers L and R. For this purpose, tracking data of a virtual studio is written into a file, the so-called camdump file, along with an associated time code of the system. At the same time, image, sound and time code are recorded on an MAZ (video tape recorder). The camdump file is transmitted to a computer, which generates control data for an audio workstation therefrom and outputs this data via a MIDI interface in synchronicity with the image stemming from the MAZ. The actual audio processing, such as positioning of the source of sound in the surround field and introducing early reflections and reverberation, is performed in the audio workstation. The signal is rendered for a 5.1 surround loudspeaker system.
With real film sets, camera tracking parameters as well as positions of sources of sound in the recording setting may be recorded. Such data may also be generated in virtual studios.
In a virtual studio, an actor or presenter is on his/her own in a recording room. In particular, he/she is standing before a blue wall, which is also referred to as blue box or blue panel. A pattern of blue and light-blue stripes is applied to this blue wall. What is special about this pattern is that the stripes have varying widths, and that therefore, a multitude of stripe combinations result. The unique stripe combinations on the blue wall make it possible, in post-processing, when the blue wall is replaced by a virtual background, to determine which direction the camera is pointed at. With the aid of this information, the computer may determine the background for the current camera's angle of view. In addition, sensors provided on the camera, which detect and output additional camera parameters, are also evaluated. Typical parameters of a camera which are detected by means of sensor technology, are the three degrees of translation, x, y, z, the three degrees of rotation, also referred to as roll, tilt, and pan, and the focal length, or the zoom, which is equivalent to the information about the aperture angle of the camera.
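The camera parameters listed above, and the equivalence between focal length and aperture angle, may be sketched as follows. The sketch assumes a simple pinhole camera model in which the horizontal aperture angle equals 2·atan(w / (2·f)) for sensor width w and focal length f. The class name `CameraParameters`, the field names, and the default sensor width of 36 mm are assumptions made for illustration and are not taken from any particular tracking system.

```python
import math
from dataclasses import dataclass


@dataclass
class CameraParameters:
    """One tracking sample: translation, rotation, and focal length."""
    x: float                # translation in meters
    y: float
    z: float
    roll: float             # rotation in degrees
    tilt: float
    pan: float
    focal_length_mm: float  # zoom setting


def aperture_angle_deg(focal_length_mm, sensor_width_mm=36.0):
    """Horizontal aperture angle of a pinhole camera (assumed model)."""
    if focal_length_mm <= 0.0:
        raise ValueError("focal length must be positive")
    half_angle = math.atan(sensor_width_mm / (2.0 * focal_length_mm))
    return math.degrees(2.0 * half_angle)


# Hypothetical tracking sample: camera 1.5 m high, panned 10 degrees.
sample = CameraParameters(x=0.0, y=1.5, z=-3.0,
                          roll=0.0, tilt=-5.0, pan=10.0,
                          focal_length_mm=18.0)
angle = aperture_angle_deg(sample.focal_length_mm)
```

Under this model, zooming in (a longer focal length) narrows the aperture angle, which is why either quantity can be recorded as the zoom parameter.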
So that the precise position of the camera can be determined even without picture recognition and without costly sensor technology, a tracking system may also be employed which consists of several infrared cameras determining the position of an infrared sensor attached to the camera. Thus, the position of the camera is determined as well. By means of the camera parameters provided by the sensor technology, and by means of the stripe information evaluated by the picture recognition, a real-time computer may now calculate the background for the current picture. Subsequently, the shade of blue exhibited by the blue background is removed from the picture, so that the blue background is replaced by the virtual background.
In most cases, a concept is adhered to which aims at an overall acoustic impression of the scenes visually portrayed. This may be described by the term “total” (i.e. the long shot), which stems from the field of image composition. This “total” sound impression mostly remains constant across all camera positionings in a scene, even though the optical angle of view of the objects varies strongly in most cases.
Thus, optical details may or may not be emphasized, depending on the corresponding camera positionings. Reverse shots used in the creation of cinematic dialogs are not reproduced by the sound.
Therefore, there is a need to acoustically involve the audience in an audio-visual scene. Here, the screen, or image area, forms the viewer's line of vision and angle of view. This means that the sound is to follow the image such that it always matches the image viewed. This is becoming more important, in particular, for virtual studios, since there is typically no correlation between the sound of, e.g., the presentation, or moderation, and the surroundings the presenter finds himself/herself in. To get an overall audio-visual impression of the scene, a spatial impression matching the image rendered must be simulated. An essential subjective characteristic of such a sound concept is, in this context, the location of a source of sound as perceived by a viewer of, e.g., a cinema screen.