Different standards have been adopted by the content industry in the context of multichannel sound production, distribution, and playback. The first standards were related to the implementation of monophonic sound systems, based on a single independent audio channel. Subsequent standards evolved to stereo systems, based on two independent audio channels, then to 5.1 and 7.1 systems, based on 6 and 8 independent audio channels respectively. In particular, the so-called 5.1 channel configuration has been adopted by a large portion of cinema theatres, and it has witnessed considerable deployment in the home market. The natural evolution of these standards, achieved by the stepwise addition of audio channels, has led, on the one hand, to consecutive enhancements in the spatial sound perception by the audience, and, on the other hand, to increased creative freedom for content creators.
In an attempt to continue these enhancements both for content creators and content consumers, several proposals have been made to adopt standards based on multichannel layouts with ever more independent audio channels, such as the 10.2 system proposed by THX's founder Tomlinson Holman, and the 22.2 system proposed by Kimio Hamasaki, from the Japanese broadcaster NHK. All such systems are normally referred to as 3D layouts, as they include loudspeakers at different heights, and they are capable of delivering more immersive experiences than present 5.1 or 7.1 systems.
However, all such proposals share a number of drawbacks. They all require complex procedures already at the content production phase, since content has to be produced with the variety of possible reproduction formats in mind. Content production has to cater for the most complex reproduction format as well as for the simpler ones. In content production for layouts with many loudspeakers, the complexity is large, as sound engineers need to constantly make decisions with the whole layout in mind, such as how to route a given audio track to a particular loudspeaker (for example, the top-center-far-left channel). This mental exercise limits their creativity by focusing attention on technical tasks rather than on aesthetic decisions relating to the reproduced sound image.
Loudspeaker installation difficulty is another drawback of all the aforementioned prior art systems. All such multichannel formats require precise placement of every loudspeaker in the reproduction venue, following a given standard, be it a professional cinema or a home environment. This is a complex and time-consuming task requiring the assistance of expert sound technicians. In many cases, correct positioning of all loudspeakers is simply impossible due to specific venue constraints, such as the location of fire sprinklers, columns, low ceilings, air-conditioning ducts, and so forth. This disadvantage in loudspeaker layout is bearable in systems with a low number of channels, like stereo. However, it becomes hard to cope with, and therefore unrealistic, as the number of channels increases.
Certain developments have attempted to solve these problems by implementing audio workflows in which content creation is completely decoupled from content reproduction. Such workflows are based on a new paradigm in which the production and post-production processes are completely independent of the specifics of the reproduction layout. In particular, in such workflows, the output of post-production is a soundtrack, normally on a digital medium, whose generation is based on a variety of sound encoding techniques which do not depend on the number and location of the independent channels in the intended reproduction venues.
Early examples of such encoding techniques are Ambisonics and Vector-Based Amplitude Panning. Other examples of intermediate channel-independent encoding methods are disclosed by Jot and Pulkki. In these latter works, by dividing the audio recording into time-frequency bins and analyzing the cross-correlation among the different channels, a spatial location is assigned to each one of the time-frequency bins. One of the major drawbacks of these prior art methods is that the time-frequency decomposition inevitably produces audible processing artifacts which reduce the quality of the final reproduction. This limits the applicability of these methods in situations where only the highest-quality reproduction is acceptable. The audible processing artifacts are themselves also magnified as the number of channels increases. Hence the possibility of offering high-quality reproduction in 3D environments using a plurality of channels is severely limited.
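To illustrate how one of these channel-independent techniques operates, the following is a minimal sketch of two-dimensional Vector-Based Amplitude Panning for a single loudspeaker pair: the gains for a virtual source direction are obtained by solving a small linear system formed from the loudspeaker direction vectors. The function name and the example angles are illustrative assumptions, not part of any standard API.

```python
# Minimal 2D VBAP sketch (assumed helper, not a standard library function).
import numpy as np

def vbap_gains_2d(source_deg, spk_a_deg, spk_b_deg):
    """Return amplitude gains (g_a, g_b) placing a virtual source
    between two loudspeakers at the given azimuths (degrees)."""
    to_vec = lambda deg: np.array([np.cos(np.radians(deg)),
                                   np.sin(np.radians(deg))])
    p = to_vec(source_deg)                   # desired source direction
    L = np.column_stack([to_vec(spk_a_deg),  # loudspeaker basis matrix
                         to_vec(spk_b_deg)])
    g = np.linalg.solve(L, p)                # solve L @ g = p
    return g / np.linalg.norm(g)             # power-normalize the gains

# A source halfway between loudspeakers at +30° and -30° yields equal gains.
g = vbap_gains_2d(0.0, 30.0, -30.0)
print(np.round(g, 3))  # → [0.707 0.707]
```

Note that the panning gains depend only on the source direction, not on any fixed channel assignment, which is what makes the encoding layout-independent; full 3D VBAP extends the same idea to loudspeaker triplets.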
Many sound sources do not originate from a single point in space; rather, they have some intrinsic spatial extension. For instance, ambient sounds are frequently extended over a large spatial area. Another obvious example is the sound of a large truck, which is perceived as a noise extended over a wide area. However, all methods for channel-independent audio encoding exhibit limitations in the assignment, manipulation, and reproduction of the apparent size of sounds, especially when complex sizes are intended. In particular, apparent sound shapes consisting of multiple disconnected areas are very difficult, if not impossible, to attain with existing audio encoding methods. Examples of such sound shapes consisting of multiple disconnected areas are the urban noise coming from different streets, or lateral reverberation sounds.
It is therefore necessary to provide solutions to the aforementioned drawbacks. In particular, it is desirable to encode sounds in a manner that is completely channel-independent, and therefore reproducible on any arbitrary 3D loudspeaker layout. It is also desirable to accomplish this without generating any audible artifacts. Furthermore, it is desirable to facilitate the creation and manipulation of sounds with complex apparent size, including the possibility of multiple disconnected shapes.