1. Field of the Invention
This invention relates to the reproduction of spatialised audio in immersive environments with non-ideal acoustic conditions.
2. Related Art
Immersive environments are expected to be an important component of future communication systems. An immersive environment is one in which the user is given the sensation of being located within an environment depicted by the system, rather than observing it from the exterior as he would with a conventional flat screen such as a television. This “immersion” allows the user to be more fully involved with the subject material. For the visual sense, an immersive environment can be created by arranging that the whole of the user's field of vision is occupied with a visual presentation giving an impression of three dimensionality and allowing the user to perceive complex geometry.
For the immersive effect to be realistic, the user must receive appropriate inputs to all the senses which contribute to the effect. In particular, the use of combined audio and video is an important aspect of most immersive environments: see for example:
ANDERSON, D. and CASEY, M. “Virtual worlds: The sound dimension” IEEE Spectrum 1997, Vol. 34, No 3, pp 46-50
BRAHAM, R. and COMERFORD, R. “Sharing virtual worlds” IEEE Spectrum 1997, Vol. 34, No 3, pp 18-20
WATERS, R. and BARRUS, J. “The rise of shared virtual environments” IEEE Spectrum 1997, Vol. 34, No 3, pp 20-25
Spatialised audio, the use of two or more loudspeakers to generate an audio effect perceived by the listener as emanating from a source spaced from the loudspeakers, is well-known. In its simplest form, stereophonic effects have been used in audio systems for several decades. In this specification the term “virtual” sound source is used to mean the apparent source of a sound, as perceived by the listener, as distinct from the actual sound sources, which are the loudspeakers.
Immersive environments are being researched for use in telepresence, teleconferencing, “flying through” architect's plans, education and medicine. The wide field of vision, combined with spatialised audio, creates a feeling of “being there” which aids the communication process, and the additional sensation of size and depth can provide a powerful collaborative design space.
Several examples of immersive environment are described by D. M. Traill, J. J. Bowskill and P. J. Lawrence in “Interactive Collaborative Media Environments” (British Telecommunications Technology Journal Vol. 15, No. 4 (October 1997), pages 130 to 139). One example of an immersive environment is the BT/ARC VisionDome (described on pages 135 to 136 and FIG. 7 of that article), in which the visual image is presented on a large concave screen with the users inside (see FIGS. 1 and 2). A multi-channel spatialised audio system having eight loudspeakers is used to provide audio immersion. Further description may be found at:
A second example is the “SmartSpace” chair described on pages 134 and 135 (and FIG. 6) of the same article, which combines a wide-angle video screen, a computer terminal and spatialised audio, all arranged to move with the rotation of a swivel chair; this system is currently under development by British Telecommunications plc. Rotation of the chair causes the user's orientation in the environment to change, the visual and audio inputs being modified accordingly. The SmartSpace chair uses transaural processing, as described by COOPER, D. and BAUCK, J. “Prospects for transaural recording”, Journal of the Audio Engineering Society 1989, Vol. 37, No 1/2, pp 3-19, to provide a “sound bubble” around the user, giving him the feeling of complete audio immersion, while the wrap-around screen provides visual immersion.
Where the immersive environment is interactive, images and spatialised sound are generated in real-time (typically as a computer animation), while non-interactive material is often supplied with an ambisonic B-Format sound track, the characteristics of which are to be described later in this specification. Ambisonic coding is a popular choice for immersive audio environments as it is possible to decode any number of channels using only three or four transmission channels. However, ambisonic technology has its limitations when used in telepresence environments, as will be discussed.
Several issues regarding sound localisation in immersive environments will now be considered. FIGS. 1 and 2 show a plan view and side cross section of the VisionDome, with eight loudspeakers (1, 2, 3, 4, 5, 6, 7, 8), the wrap-around screen, and typical user positions marked. Multi-channel ambisonic audio tracks are typically reproduced in rectangular listening rooms. When replayed in a hemispherical dome, spatialisation is impaired by the geometry of the listening environment. Reflections within the hemisphere can destroy the sound-field recombination: although this can sometimes be minimised by treating the wall surfaces with a suitable absorptive material, this may not always be practical. The use of a hard plastic dome as a listening room creates many acoustic problems mainly caused by multiple reflections. The acoustic properties of the dome, if left untreated, cause sounds to seem as if they originate from multiple sources and thus the intended sound spatialisation effect is destroyed. One solution is to cover the inside surface of the dome with an absorbing material which reduces reflections. The material of the video screen itself is sound absorbent, so it assists in the reduction of sound reflections but it also causes considerable high-frequency attenuation to sounds originating from loudspeakers located behind the screen. This high-frequency attenuation is overcome by applying equalisation to the signals fed into the loudspeakers 1, 2, 3, 7, 8 located behind the screen.
Listening environments other than a plastic dome have their own acoustic properties and in most cases reflections will be a cause of error. As with a dome, the application of acoustic tiles will reduce the amount of reflections, thereby increasing the users' ability to accurately localise audio signals.
Most projection screens and video monitors have a flat (or nearly flat) screen. When a pre-recorded B-Format sound track is composed to match a moving video image, it is typically constructed in studios with such flat video screens. To give the correct spatial percept (perceived sound field) the B-Format coding used thus maps the audio to the flat video screen. However, when large multi-user environments, such as the VisionDome, are used, the video is replayed on a concave screen, the video image being suitably modified to appear correct to an observer. The geometry of the audio effect is then no longer consistent with the video, and a non-linear mapping is required to restore the perceptual synchronisation. In the case of interactive material, the B-Format coder locates the virtual source onto the circumference of a unit circle, thus mapping the curvature of the screen.
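As an illustration of placing a virtual source on the circumference of a unit circle, the following Python sketch encodes one mono sample into horizontal first-order B-Format. It assumes the conventional first-order weighting (W scaled by 1/√2, X and Y as the cosine and sine of the source azimuth). The function name `encode_b_format` and the azimuth convention (radians, measured anticlockwise from straight ahead) are illustrative assumptions and are not taken from this specification.

```python
import math

def encode_b_format(sample, azimuth_rad):
    """Encode one mono sample as horizontal first-order B-Format (W, X, Y).

    Assumes the conventional weighting W = S/sqrt(2), X = S*cos(az),
    Y = S*sin(az), which places the source on a unit circle.
    """
    w = sample / math.sqrt(2.0)
    x = sample * math.cos(azimuth_rad)
    y = sample * math.sin(azimuth_rad)
    return w, x, y
```

With a unit-amplitude sample, X and Y always satisfy X² + Y² = 1 whatever the azimuth, which is the sense in which the source lies on a unit circle.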
In environments where a group of listeners is situated in a small area an ambisonic reproduction system is likely to fail to produce the desired auditory spatialisation for most of them. One reason is that the various sound fields generated by the loudspeakers only combine correctly to produce the desired effect of a “virtual” sound source at one position, known as the “sweet-spot”. Only one listener (at most) can be located in the precise sweet-spot. This is because the true sweet-spot, where in-phase and anti-phase signals reconstruct correctly to give the desired signal, is a small area and participants outside the sweet-spot receive an incorrect combination of in-phase and anti-phase signals. Indeed, for a hemispherical screen, the video projector is normally at the geometric centre of the hemisphere, and the ambisonics are generally arranged such that the “sweet spot” is also at the geometric centre of the loudspeaker array, which is arranged to be concentric with the screen. Thus, there can be no-one at the actual “sweet spot” since that location is occupied by the projector.
The effect of moving the sweet-spot to coincide with the position of one of the listeners has been investigated by BURRASTON, HOLLIER and HAWKSFORD (“Limitations of dynamically controlling the listening position in a 3-D ambisonic environment”, Preprint from 102nd AES Convention, March 1997, Audio Engineering Society (Preprint No 4460)). This enables a listener not located in the original sweet-spot to receive the correct combination of ambisonic decoded signals. However, this system is designed only for single users as the sweet-spot can only be moved to one position at a time. The paper discusses the effects of a listener being positioned outside the sweet-spot (as would happen with a group of users in a virtual meeting place) and, based on numerous formal listening tests, concludes that listeners can correctly localise the sound only when they are located on the sweet-spot.
When a sound source is moving, and the listener is in a non-sweet-spot position, interesting effects are noted. Consider an example where the sound moves from front right to front left and the listener is located off-centre and close to the front. The sound initially seems to come from the right loudspeaker, remains there for a while and then moves quickly across the centre to the left loudspeaker: sounds tend to “hang” around the loudspeakers, causing an acoustically hollow centre area or “hole”. For listeners not located at the sweet spot, any virtual sound source will generally seem to be too close to one of the loudspeakers. If it is moving smoothly through space (as perceived by a listener at the sweet spot), users not at the sweet spot will perceive the virtual source staying close to one loudspeaker location, and then suddenly jumping to another loudspeaker.
The simplest method of geometric co-ordinate correction involves warping the geometric positions of the loudspeakers when programming loudspeaker locations into the ambisonic decoder. The decoder is programmed for loudspeaker positions closer to the centre than their actual positions: this results in an effect in which the sound moves quickly at the edges of the screen and slowly around the centre of the screen, resulting in a perceived linear movement of the sound with respect to an image on the screen. This principle can only be applied to ambisonic decoders which are able to decode the B-Format signal to selectable loudspeaker positions, i.e. it cannot be used with decoders designed for fixed loudspeaker positions (such as the eight corners of a cube or four corners of a square).
A non-linear panning strategy has been developed which takes as its input the monophonic sound source, the desired sound location (x,y,z) and the locations of the N loudspeakers in the reproduction system (x,y,z). This system can have any number of separate input sources which can be individually localised to separate points in space. A virtual sound source is panned from one position to another with a non-linear panning characteristic. The non-linear panning corrects the effects described above, in which an audio “hole” is perceived. The perceptual experience is corrected to give a linear audio trajectory from original to final location. The non-linear panning scheme is based on intensity panning and not wavefront reconstruction as in an ambisonic system. Because the warping is based on intensity panning there is no anti-phase signal from the other loudspeakers and hence with a multi-user system all of the listeners will experience correctly spatialised audio. The non-linear warping algorithm is a complete system (i.e. it takes a signal's co-ordinates and positions it in 3-dimensional space), so it can only be used for real-time material and not for warping ambisonic recordings.
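The inputs and the in-phase property of such an intensity-panning scheme can be sketched in Python as follows. The passage above does not state the pan law, so the inverse-distance law, the `sharpness` exponent, the constant-power normalisation and both function names are assumptions chosen purely for illustration. The only point the sketch is meant to show is that every loudspeaker gain is non-negative, so no feed is in anti-phase with the source.

```python
import math

def intensity_pan_gains(source, speakers, sharpness=2.0):
    """Return one non-negative gain per loudspeaker for a virtual source.

    source: (x, y, z) of the desired sound location.
    speakers: list of (x, y, z) loudspeaker locations.
    sharpness: hypothetical exponent that shapes the non-linear pan law.
    """
    eps = 1e-9  # avoids division by zero when the source sits on a loudspeaker
    raw = [1.0 / (math.dist(source, spk) + eps) ** sharpness for spk in speakers]
    norm = math.sqrt(sum(g * g for g in raw))  # constant-power normalisation
    return [g / norm for g in raw]

def pan(sample, source, speakers, sharpness=2.0):
    """Scale a mono sample by each gain; every feed keeps the input's sign."""
    return [g * sample for g in intensity_pan_gains(source, speakers, sharpness)]
```

For example, with four loudspeakers at the corners of a square and a source close to one corner, the nearest loudspeaker receives the largest gain and the remaining gains stay positive.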
According to the present invention, there is provided a method of generating a sound field from an array of loudspeakers, the array defining a listening space wherein the outputs of the loudspeakers combine to give a spatial perception of a virtual sound source, the method comprising the generation, for each loudspeaker in the array, of a respective output component Pn for controlling the output of the respective loudspeaker, the output being derived from data carried in an input signal, the data comprising a sum reference signal W, and directional sound components X, Y, (Z) representing the sound component in different directions as produced by the virtual sound source, wherein the method comprises the steps of recognising, for each loudspeaker, whether the respective component Pn is changing in phase or antiphase to the sum reference signal W, modifying said signal if it is in antiphase, and feeding the resulting modified components to the respective loudspeakers.
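A minimal Python sketch of the recognise-and-modify step is given below for the two-dimensional case. Several details are assumptions made only for illustration: the basic decode Pn = W + X·cos(φn) + Y·sin(φn), the per-sample sign test against W, and the `attenuation` parameter used as the modification. The text above says only that an antiphase component is modified, without fixing how.

```python
import math

def decode_feed(w, x, y, speaker_azimuth_rad):
    """Assumed basic first-order decode: Pn = W + X*cos(phi_n) + Y*sin(phi_n)."""
    return w + x * math.cos(speaker_azimuth_rad) + y * math.sin(speaker_azimuth_rad)

def is_antiphase(p_n, w):
    """A feed is treated as antiphase when its sign opposes that of W."""
    return p_n * w < 0.0

def suppress_antiphase(w, x, y, speaker_azimuths_rad, attenuation=0.0):
    """Decode one B-Format sample and scale down any antiphase feed.

    attenuation is a hypothetical parameter: 0.0 mutes an antiphase feed,
    1.0 leaves it unchanged.
    """
    feeds = []
    for phi in speaker_azimuths_rad:
        p_n = decode_feed(w, x, y, phi)
        if is_antiphase(p_n, w):
            p_n *= attenuation
        feeds.append(p_n)
    return feeds
```

With a source straight ahead (W = 1/√2, X = 1, Y = 0) and loudspeakers at 0°, 90°, 180° and 270°, only the rear loudspeaker would be driven in antiphase under this assumed decode, and it is the only feed that is changed.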
According to a second aspect of the invention, there is provided apparatus for generating a sound field, comprising an array of loudspeakers defining a listening space wherein the outputs of the loudspeakers combine to give a spatial perception of a virtual sound source, means for receiving and processing data carried in an input signal, the data comprising a sum reference signal W, and directional information components X, Y, (Z) indicative of the sound in different directions as produced by the virtual sound source, means for the generation from said data of a respective output component Pn for controlling the output of each loudspeaker in the array, means for recognising, for each loudspeaker, whether the respective component Pn is changing in phase or antiphase to the sum reference signal W, means for modifying said signal if it is in antiphase, and means for feeding the resulting modified components to the respective loudspeakers.
Preferably the directional sound components are each multiplied by a warping factor which is a function of the respective directional sound component, such that a moving virtual sound source following a smooth trajectory as perceived by a listener at any point in the listening field also follows a smooth trajectory as perceived at any other point in the listening field. This ensures that virtual sound sources do not tend to occur in certain regions of the listening field more than others. The warping factor may be a square or higher even-numbered power, or a sinusoidal function, of the directional sound component.
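The even-power form of the warping factor can be sketched in Python as follows. The function name `warp_component` and the assumption that components are normalised to the range -1 to 1 are illustrative. The sinusoidal variant is not shown because its exact form is not given in this passage.

```python
def warp_component(value, power=2):
    """Multiply a directional component by an even power of itself.

    value: a directional component (X, Y or Z), assumed normalised to [-1, 1].
    power: even exponent for the warping factor (2 gives a square-law factor).
    """
    if power < 2 or power % 2 != 0:
        raise ValueError("power must be an even integer of at least 2")
    factor = value ** power  # even power, so the factor is never negative
    return value * factor    # the sign of the component is preserved
```

Because the factor is an even power, it never changes the sign of the component: under the normalisation assumed here, a component of 0.5 becomes 0.125 with a square-law factor, and -0.5 becomes -0.125, while components of magnitude 1 are unchanged.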
The ambisonic B-Format coding and decoding equations for 2-dimensional reproduction systems will now be briefly discussed. This section does not discuss the detailed theory of ambisonics but states the results of other researchers in the field. Ambisonic theory presents a solution to the problem of encoding directional information into an audio signal. The signal is intended to be replayed over an array of at least four loudspeakers (for a pantophonic, i.e. horizontal plane, system) or eight loudspeakers (for a periphonic, i.e. horizontal and vertical plane, system). The signal, termed “B-Format”, consists (for the first order case) of three components for pantophonic systems (W,X,Y) and four components for periphonic systems (W,X,Y,Z). For a detailed analysis of surround sound and ambisonic theory, see:
BAMFORD, J. and VANDERKOOY, J. “Ambisonic sound for us” Preprint from 99th AES Convention October 1995 Audio Engineering Society (Preprint No 4138)
BEGAULT, D. “Challenges to the successful implementation of 3-D sound” Journal of the Audio Engineering Society 1991, Vol. 39, No 11, pp 864-870
BURRASTON et al (referred to above)
GERZON, M. “Optimum reproduction matrices for multi-speaker stereo” Journal of the Audio Engineering Society 1992, Vol. 40, No 7/8, pp 571-589
GERZON, M. “Surround sound psychoacoustics” Wireless World December 1974, Vol. 80, pp 483-485
MALHAM, D. G. “Computer control of ambisonic soundfields” Preprint from 82nd AES Convention March 1987 Audio Engineering Society (Preprint No 2463)
MALHAM, D. G. and CLARKE, J. “Control software for a programmable soundfield controller” Proceedings of the Institute of Acoustics Autumn Conference on Reproduced Sound 8, Windermere 1992, pp 265-272
MALHAM, D. G. and MYATT, A. “3-D Sound spatialisation using ambisonic techniques” Computer Music Journal 1995, Vol. 19 No 4, pp 58-70
POLETTI, M. “The design of encoding functions for stereophonic and polyphonic sound systems” Journal of the Audio Engineering Society 1996, Vol. 44, No 11, pp 948-963
VANDERKOOY, J. and LIPSHITZ, S. “Anomalies of wavefront reconstruction in stereo and surround-sound reproduction” Preprint from 83rd AES Convention October 1987 Audio Engineering Society (Preprint No 2554)
The ambisonic systems herein described are all first order, i.e. m=1, where the number of channels is given by 2m+1 for a 2-dimensional system (3 channels: w,x,y) and (m+1)^2 for a 3-dimensional system (4 channels: w,x,y,z). In this specification only two-dimensional systems will be considered; however, the ideas presented here may readily be scaled for use with a full three-dimensional reproduction system, and the scope of the claims embraces such systems.
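The channel-count formulas can be checked with a short Python sketch. The function name `ambisonic_channels` and its `dimensions` argument are illustrative assumptions.

```python
def ambisonic_channels(order, dimensions=2):
    """Number of B-Format channels for an ambisonic system of the given order."""
    if dimensions == 2:
        return 2 * order + 1       # horizontal-only: W, X, Y at first order
    if dimensions == 3:
        return (order + 1) ** 2    # with height: W, X, Y, Z at first order
    raise ValueError("dimensions must be 2 or 3")
```

For the first-order systems considered here, `ambisonic_channels(1, 2)` gives 3 and `ambisonic_channels(1, 3)` gives 4, matching the channel lists above.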