Sound panning, the process of rendering audio indicative of a sound source moving along a trajectory for playback by an array of loudspeakers, is a crucial component of typical audio program rendering. In the general case, the loudspeakers can be positioned arbitrarily. It is therefore desirable for the panning process to account properly for the actual loudspeaker locations, which can span a wide range of positions. Ideally, the panning accounts properly for the positions of the loudspeakers of any loudspeaker array, comprising any number of arbitrarily positioned speakers.
In a typical panning implementation, the source trajectory is defined by a set of time varying positional metadata, typically in three dimensional (3D) space using, for instance, a Cartesian (x,y,z) coordinate system. The loudspeaker positions can be expressed in the same coordinate system. Typically, the coordinate system is normalized to a canonical surface or volume.
Given a set of loudspeaker positions and the desired perceived sound source location(s), a panning process may include a step of determining which subset of loudspeakers (of a complete array of loudspeakers) will be used at each instant during the pan to create the proper perceptual image. The process typically includes a step of computing a set of gains, w_i, with which the speakers of each subset will play back a weighted copy of a source signal, S, such that the i-th speaker of the subset (where the index i runs over the contributing speakers of the subset) is driven by a speaker feed proportional to:
$$L_i = w_i \cdot S, \quad \text{where} \quad \sum_i w_i^p = 1.$$
The gains are amplitude preserving if p=1, or power preserving if p=2.
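The normalization constraint above can be illustrated with a minimal sketch. The function names `normalize_gains` and `speaker_feeds` are hypothetical and are used only for illustration; the sketch assumes non-negative raw gains and treats the source signal S as a list of samples.

```python
def normalize_gains(raw_gains, p=2):
    """Scale non-negative raw gains so that sum(w_i ** p) == 1.

    p=1 gives amplitude-preserving gains, p=2 gives power-preserving gains.
    """
    total = sum(w ** p for w in raw_gains)
    if total <= 0:
        raise ValueError("at least one raw gain must be positive")
    scale = total ** (1.0 / p)
    return [w / scale for w in raw_gains]


def speaker_feeds(gains, source_samples):
    """Return one feed per speaker: L_i = w_i * S (a weighted copy of S)."""
    return [[w * s for s in source_samples] for w in gains]
```

For instance, two equal raw gains normalized with p=2 each become 1/sqrt(2), so the summed power of the two feeds matches that of the source signal.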
Some conventional audio program rendering methods assume that the loudspeakers which will play back the program (e.g., at any instant during a pan) are arranged in a nominally two-dimensional (2D) space relative to a listener (e.g., a listener at the “sweet spot” of the speaker array). Other conventional audio program rendering methods assume that the loudspeakers which will play back the program (e.g., at any instant during a pan) are arranged in a three-dimensional (3D) space relative to a listener (e.g., a listener at the “sweet spot” of the speaker array).
Most conventional approaches to panning (e.g., vector-based amplitude panning or “VBAP”) assume that the array of available loudspeakers is structured with the speakers along a circle (a one-dimensional array of speakers) or at the vertices of a 3D triangular mesh (a 3D mesh whose faces are triangles) which approximates a sphere of possible source directions (e.g., the “Sphere” indicated in FIG. 13, which is fitted to the approximate positions of the six speakers shown in FIG. 13). The locations of the speakers of FIG. 13 are expressed relative to a Cartesian coordinate system, with one of the speakers of FIG. 13 at the origin, “(0,0,0),” of such coordinate system. Alternatively, conventional panning methods may express speaker locations relative to a coordinate system of another type (and the origin of the coordinate system need not coincide with the position of any of the speakers).
Herein, a “mesh” of loudspeakers denotes a collection of vertices, edges and faces which defines the shape of a polyhedral structure (e.g., when the mesh is three-dimensional), or whose periphery defines a polygon (e.g., when the mesh is two-dimensional), where each of the vertices is the location of a different one of the loudspeakers. Each of the faces is a polygon (whose periphery is a subset of the edges of the mesh), and each of the edges extends between two vertices of the mesh.
For example, to implement conventional direction-based 2D sound panning (known as “pair-wise panning”) with a sound playback system comprising a one-dimensional array of five speakers (e.g., those labeled as speakers 1, 2, 3, 4, and 5 in FIG. 1), the speakers may be assumed to be positioned along a circle centered at the location (location “L” in FIG. 1) of the assumed listener. For example, such a system may assume that speakers 1, 2, 3, 4, and 5 of FIG. 1 are positioned so as to be at least substantially equidistant from listener position L. To play back an audio program so that the sound emitted from the speakers is perceived as emitting from an audio source at a source location (relative to the listener) in the plane of the speakers (location “S” of FIG. 1), the two speakers spanning the source location (i.e., the two speakers nearest to the source location, and between which the source location occurs) may be determined, and gains to be applied to the speaker feeds for these two speakers may then be determined to cause the sound emitted from the two speakers to be perceived as emitting from the source location. For example, speakers 1 and 2 of FIG. 1 span the source location S, and a typical conventional method would determine the gains to be applied to the speaker feeds for speakers 1 and 2 to cause the sound emitted from these speakers to be perceived as emitting from source location S. During a pan, as the source location moves (along a trajectory along the circle defined by the assumed speaker locations) relative to the listener, a typical conventional method may determine gains to be applied to the speaker feeds for each of a sequence of pairs of the available speakers.
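The pair selection and gain computation described above can be sketched as follows. This is an illustrative sketch only: it assumes that speaker and source directions are given as azimuth angles in degrees around the listener, and it uses a sine/cosine (constant-power) law between the two spanning speakers, which is one of several possible panning laws. The function name `pairwise_pan` is hypothetical.

```python
import math


def pairwise_pan(speaker_az_deg, source_az_deg):
    """Return one gain per speaker; only the spanning pair is non-zero."""
    n = len(speaker_az_deg)
    if n < 2:
        raise ValueError("need at least two speakers")
    # Visit the speakers in order of increasing azimuth around the circle.
    order = sorted(range(n), key=lambda i: speaker_az_deg[i] % 360.0)
    gains = [0.0] * n
    for k in range(n):
        a, b = order[k], order[(k + 1) % n]
        # Angular width of the arc from speaker a to its neighbour b.
        span = (speaker_az_deg[b] - speaker_az_deg[a]) % 360.0
        # Angular distance from speaker a to the source along that arc.
        offset = (source_az_deg - speaker_az_deg[a]) % 360.0
        if span > 0.0 and offset <= span:
            t = offset / span  # 0 at speaker a, 1 at speaker b
            gains[a] = math.cos(t * math.pi / 2.0)
            gains[b] = math.sin(t * math.pi / 2.0)
            return gains
    raise ValueError("speaker azimuths must be distinct")
```

Calling `pairwise_pan` repeatedly for successive source azimuths yields the sequence of driven speaker pairs for a pan along the circle; the squared gains sum to one at every step.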
For another example, to implement a typical type of conventional direction-based 3D sound panning (known as vector-based amplitude panning or “VBAP”) with a sound playback system comprising seven speakers (e.g., those labeled as speakers 10, 11, 12, 13, 15, 16, and 17 in FIG. 2), the speakers are assumed to be structured as a convex 3D mesh, whose faces are triangles, and enclosing the location (location “L” in FIG. 2) of the assumed listener. For example, the panning method may assume that the speakers 10, 11, 12, 13, 15, 16, and 17 of FIG. 2 are arranged in a mesh of triangles, with three of the speakers at the vertices of each of the triangles as shown in FIG. 2. To play back an audio program so that the sound emitted from the speakers is perceived as emitting from an audio source at a source location (location “S” in FIG. 2) relative to the listener, the triangle which includes the projection (location “S1” in FIG. 2) of the source location on the mesh (i.e., the triangle intersected by the ray from the listener location L to the source location S) may be determined. Then, the gains to be applied to the speaker feeds for the three speakers at the vertices of this triangle may be determined to cause the sound emitted from these three speakers to be perceived as emitting from the source location. For example, speakers 10, 11, and 12 of FIG. 2 are located at the vertices of the triangle which includes the projection (location “S1” in FIG. 2) of source location S on the mesh, and an example of such a method would determine the gains to be applied to the speaker feeds for speakers 10, 11, and 12 to cause the sound emitted from them to be perceived as emitting from source location S.
During a pan, as the source location moves (along a trajectory projected on the mesh) relative to the listener, a typical conventional method may determine gains to be applied to the speaker feeds for each triplet of speakers at the vertices of each triangle, of a sequence of triangles, which includes the current projection of the source location on the mesh.
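The triangle selection and gain computation of such a direction-based 3D method can be sketched as follows. The sketch assumes that all positions are expressed relative to the listener at the origin, that the mesh is supplied as a list of speaker-index triplets, and that the gains are power normalized (p=2). The source direction is written as a weighted sum of the three unit vectors pointing at the speakers of a triangle, and the weights are solved with Cramer's rule; a triangle contains the projection of the source when all three weights are non-negative. The function names are hypothetical.

```python
import math


def _unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)


def _det3(a, b, c):
    """Determinant of the 3x3 matrix with columns a, b, c, i.e. a . (b x c)."""
    cross = (b[1] * c[2] - b[2] * c[1],
             b[2] * c[0] - b[0] * c[2],
             b[0] * c[1] - b[1] * c[0])
    return a[0] * cross[0] + a[1] * cross[1] + a[2] * cross[2]


def vbap_triplet_gains(l1, l2, l3, source):
    """Gains for one speaker triangle, or None if it does not contain the source."""
    l1, l2, l3, p = _unit(l1), _unit(l2), _unit(l3), _unit(source)
    det = _det3(l1, l2, l3)
    if abs(det) < 1e-12:
        return None  # degenerate triangle (speaker directions are coplanar)
    g = (_det3(p, l2, l3) / det, _det3(l1, p, l3) / det, _det3(l1, l2, p) / det)
    if min(g) < -1e-9:
        return None  # source direction falls outside this triangle
    g = [max(0.0, x) for x in g]
    norm = math.sqrt(sum(x * x for x in g))
    return tuple(x / norm for x in g)


def vbap_gains(speakers, triangles, source):
    """Return one gain per speaker; only one triangle's vertices are driven."""
    gains = [0.0] * len(speakers)
    for tri in triangles:
        g = vbap_triplet_gains(*(speakers[i] for i in tri), source)
        if g is not None:
            for idx, gi in zip(tri, g):
                gains[idx] = gi
            return gains
    raise ValueError("no triangle of the mesh contains the source direction")
```

Because only the direction of the source enters the computation, a source passing close to the listener position can change direction quickly, and the selected triangle (and hence the set of driven speakers) can change just as quickly.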
However, conventional directional panning methods are not optimal for implementing many types of sound pans, and do not support speakers which are arbitrarily located inside the listening volume or region. Other conventional panning methods, such as distance-based amplitude panning (DBAP), are position-based, and rely on a direct distance measure between each loudspeaker and the desired source location to compute panning gains. They can support arbitrary speaker arrays and panning trajectories but tend to cause too many speakers to be fired at the same time, which leads to timbral degradation. Conventional VBAP panning methods cannot stably implement pans in which a source moves along any of many common trajectories. For instance, source trajectories (which cross the volume defined by the mesh of speakers) near the “sweet spot” can induce fast direction changes (of the source position relative to the assumed listener position at the sweet spot) and therefore abrupt gain variations. For example, during pans along many typical source trajectories, especially when the mesh comprises elongated speaker triangles, a conventional VBAP method may drive pairs of speakers (i.e., only two speakers at a time) during at least part of the pan's duration, and/or the positions of consecutively driven pairs or triplets of speakers may undergo sudden, large changes during at least part of the pan's duration which are perceivable and distracting to listeners. For example, the driven speakers may comprise a rapid succession of: two speakers separated by a small distance, and then another pair of speakers separated by a much larger distance, and then another pair of speakers separated by a relatively small distance, and so on.
Such unstable panning implementations (implementations which are perceived as being unstable) may be especially common when the pan is along a diagonal source trajectory relative to the listener (e.g., where the source moves both to the left and/or right, and the front and/or back, of the room enclosing the speakers and the listener).
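The position-based (distance-based) alternative mentioned above can be illustrated with a minimal sketch, which also shows why such methods tend to fire many speakers at once. The sketch assumes a simple inverse-distance law; the exponent `rolloff_exp` and the small `blur` term (which avoids division by zero when the source coincides with a speaker) are hypothetical parameters chosen for illustration, as is the function name `dbap_gains`.

```python
import math


def dbap_gains(speakers, source, rolloff_exp=1.0, blur=0.1):
    """Power-normalized gains that fall off with speaker-to-source distance."""
    raw = []
    for spk in speakers:
        # Direct distance between this speaker and the desired source location.
        d = math.sqrt(sum((s - x) ** 2 for s, x in zip(spk, source)) + blur ** 2)
        raw.append(1.0 / d ** rolloff_exp)
    norm = math.sqrt(sum(g * g for g in raw))
    return [g / norm for g in raw]
```

Every speaker receives a non-zero gain for every source position under this law, regardless of how the speakers are arranged, which is consistent with the observation that too many speakers are driven simultaneously.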
Another type of audio rendering is described in PCT International Application No. PCT/US2012/044363, published under International Publication No. WO 2013/006330 A2 on Jan. 10, 2013, and assigned to the assignee of the present application. This type of rendering may assume an array of loudspeakers organized into several two-dimensional planar layers (horizontal layers) at different elevations. The speakers in each horizontal layer are axis-aligned (i.e., each horizontal layer comprises speakers organized into rows and columns, with the columns aligned with some feature of the listening environment, e.g., the columns are parallel to the front-back axis of the environment). For example, speakers 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, and 31 of FIG. 3 (or FIG. 4 or 5) are the speakers of one horizontal layer of an example of such an array. Speakers 20-31 (of FIG. 3, 4, or 5) are organized into five rows (e.g., one row including speakers 20, 21, and 22, and another row including speakers 31 and 23) and five columns (e.g., one column including speakers 29, 30, and 31, and another column including speakers 20 and 28). Speakers 20, 21, and 22 may be positioned along the front wall of a room (e.g., a theater) near the ceiling, and speakers 26, 27, and 28 may be positioned along the room's rear wall (also near the ceiling). A second set of twelve speakers may be positioned in a lower horizontal layer (e.g., near the floor of the room). Thus, in the example of FIGS. 3-5, the entire array of speakers (including each horizontal layer of speakers) defines a rectangular mesh of speakers which encloses the assumed position of a listener (e.g., a listener assumed to be at the speaker array's “sweet spot”).
The entire array of speakers (including each horizontal layer of speakers) also defines a conventional convex 3D mesh of three-speaker (triangular) groups of speakers, which also encloses the assumed position of a listener (e.g., the “sweet spot”), with each face of the mesh being a triangle whose vertices coincide with the positions of three of the speakers. Such a conventional convex 3D mesh made of triangular groups of speakers is of the same type described with reference to FIG. 2.
To image an audio source at a source location outside the speaker array (e.g., outside the mesh of FIGS. 3-5), sometimes referred to as a “far-field” source location, PCT International Application No. PCT/US2012/044363 teaches using a conventional VBAP panning method (or a conventional wave field synthesis method). Such a conventional VBAP method is of the type described with reference to FIG. 2, and assumes that the speakers are organized as a conventional convex 3D mesh made of triangular groups of speakers (of the type described with reference to FIG. 2). To render an audio program (indicative of the source) so that the sound emitted from the speakers is perceived as emitting from the source at the desired far-field source location, the triangular face (triangle) which includes the projection of the source location on the triangular mesh is determined. Then, the gains to be applied to the speaker feeds for the three speakers at the vertices of this triangle are determined to cause the sound emitted from these three speakers to be perceived as emitting from the source location. Such a far-field source can be imaged by the conventional VBAP method as it is panned along a far-field trajectory projected on the 3D triangular mesh. Another alternative is to apply a 2D directional pair-wise panning method (e.g., such as that mentioned with reference to FIG. 1) in each one of the 2D layers and combine the resulting speaker gains as a function of the source elevation (z coordinate).
PCT International Application No. PCT/US2012/044363 also teaches performance of a “dual-balance” panning method to render an audio source at a source location inside the speaker array (e.g., inside the mesh of FIGS. 3-5), sometimes referred to as a “near-field” source location. The dual-balance panning method is a positional panning approach rather than a directional panning approach. It assumes that the speakers are organized in a rectangular array (comprising horizontal layers of speakers) which encloses the assumed position of the listener. However, the dual-balance panning method does not determine the projection of the source location on a rectangular face of this array, followed by determination of gains to be applied to speaker feeds for the speakers at the vertices of such a face to cause the sound emitted from the speakers to be perceived as emitting from the source location.
Rather, the dual-balance panning method determines, for each near-field source location, a set of left-to-right panning gains (i.e., a left-to-right gain for each speaker of one of the horizontal layers of the speaker array) and a set of front-to-back panning gains (i.e., a front-to-back gain for each speaker of the same horizontal layer of the array). The method multiplies the front-to-back panning gain for each speaker of the layer (for each near-field source location) by the left-to-right panning gain for the speaker (for the same near-field source location) to determine (for each near-field source location) a final gain for each speaker of the horizontal layer. To implement a pan of the source by driving the speakers of the horizontal layer, a sequence of final gains is determined for each speaker of the layer, each of the final gains being the product of one of the front-to-back panning gains and a corresponding one of the left-to-right panning gains.
To render an arbitrary horizontal pan through a sequence of near-field source locations using the speakers in one horizontal plane (e.g., a pan indicative of motion of a source location relative to the listener along an arbitrary near-field trajectory projected on the horizontal plane, e.g., the trajectory of source S shown in FIG. 5), the method would typically determine a sequence of left-to-right panning gains (one left-to-right panning gain for each source location) to be applied to the speaker feeds for the speakers in the horizontal plane. For example, left-to-right panning gains for a source position S as shown in FIG. 3 may be computed for two speakers of each row of the speakers (in the horizontal plane of the source position) which includes speakers of two columns (of the speakers in the plane) enclosing the source position (e.g., for speakers 20 and 21 of the first row, speakers 31 and 23 of the second row, speakers 30 and 24 of the third row, speakers 29 and 25 of the fourth row, and speakers 28 and 27 of the back row, with the left-to-right panning gain for speakers 22 and 26 being set to zero). The method would typically also determine a sequence of front-to-back panning gains (one front-to-back panning gain for each source location) to be applied to the speaker feeds for the speakers in the horizontal plane. For example, the front-to-back panning gains for a source position S as shown in FIG. 4 may be computed for two speakers of each of the two rows of the speakers in the plane enclosing the source position (e.g., for speakers 30 and 31 of the left column, and for speakers 23 and 24 of the right column, with the front-to-back panning gain for speakers 20, 21, 22, 25, 26, 27, 28, and 29 being set to zero).
The sequence of gains (“final gains”) to be applied to the speaker feed for each speaker of the horizontal plane (to render the arbitrary horizontal pan) would then be determined by multiplying the front-to-back panning gains for the speaker by the left-to-right panning gains for the speaker (so that each final gain in the sequence of final gains is the product of one of the front-to-back panning gains and a corresponding one of the left-to-right panning gains).
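The product of left-to-right and front-to-back gains described above can be sketched for a single horizontal layer as follows. This is a simplified illustration, not the method of the cited application verbatim: it assumes that each speaker has (x, y) coordinates in the layer, that speakers sharing a y coordinate form a row, and that a sine/cosine (constant-power) law is used within each balance. The function names `_balance` and `dual_balance` are hypothetical.

```python
import math


def _balance(coords, pos):
    """Gains for sorted, distinct coords; only the two enclosing pos are non-zero."""
    gains = {c: 0.0 for c in coords}
    if pos <= coords[0]:
        gains[coords[0]] = 1.0
        return gains
    if pos >= coords[-1]:
        gains[coords[-1]] = 1.0
        return gains
    for lo, hi in zip(coords, coords[1:]):
        if lo <= pos <= hi:
            t = (pos - lo) / (hi - lo)
            gains[lo] = math.cos(t * math.pi / 2.0)
            gains[hi] = math.sin(t * math.pi / 2.0)
            break
    return gains


def dual_balance(speakers, source_xy):
    """speakers: dict id -> (x, y). Returns dict id -> final gain for one layer."""
    sx, sy = source_xy
    rows = sorted({y for _, y in speakers.values()})
    front_back = _balance(rows, sy)  # one front-to-back gain per row
    final = {}
    for row_y in rows:
        xs = sorted(x for x, y in speakers.values() if y == row_y)
        left_right = _balance(xs, sx)  # left-to-right gains within this row
        for sid, (x, y) in speakers.items():
            if y == row_y:
                # Final gain is the product of the two balance gains.
                final[sid] = left_right[x] * front_back[row_y]
    return final
```

With twelve speakers placed around the perimeter of a unit square (three along the front, three along the back, and three more along each side wall), a source located between the second and third rows yields non-zero final gains for only the four side-wall speakers of those two rows, in the same manner as the example of speakers 30, 31, 23, and 24 given above.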
To render an arbitrary pan (along a 3D “near-field” trajectory anywhere within the rectangular array) using the speakers in all horizontal planes of the rectangular mesh (e.g., a pan indicative of motion of a source location relative to a listener along an arbitrary 3D near-field trajectory within the mesh), gains for speaker feeds of the speakers in each horizontal plane of the mesh could be determined by dual-balance panning as described in the previous paragraph, for the projection (on the horizontal plane) of the source trajectory. Then, using the projection (on a vertical plane) of the source trajectory, a sequence of “elevation” weights would be determined for the gains for the speakers of each horizontal plane (e.g., so that the elevation weights are relatively high for a horizontal plane when the trajectory's projection, on the vertical plane, is in or near to the horizontal plane, and the elevation weights are relatively low for a horizontal plane when the trajectory's projection, on the vertical plane, is far from the horizontal plane). The sequence of gains (“final gains”) to be applied to the speaker feed for each speaker of each of the horizontal planes of the rectangular mesh (to render the arbitrary 3D pan) could then be determined by multiplying the gains for the speaker in each layer by the elevation weights.
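The elevation weighting step can be sketched as follows. The sketch assumes that the per-layer gains have already been computed (for example by a dual-balance step), that the layer heights are given in ascending order, and that the weights follow a sine/cosine law between the two layers enclosing the source height, so that layers far from the source receive a weight of zero. Other weighting laws are possible, and the function names are hypothetical.

```python
import math


def elevation_weights(layer_heights, source_z):
    """One weight per horizontal layer (heights sorted in ascending order)."""
    weights = [0.0] * len(layer_heights)
    if source_z <= layer_heights[0]:
        weights[0] = 1.0
        return weights
    if source_z >= layer_heights[-1]:
        weights[-1] = 1.0
        return weights
    for k in range(len(layer_heights) - 1):
        lo, hi = layer_heights[k], layer_heights[k + 1]
        if lo <= source_z <= hi:
            t = (source_z - lo) / (hi - lo)
            weights[k] = math.cos(t * math.pi / 2.0)
            weights[k + 1] = math.sin(t * math.pi / 2.0)
            break
    return weights


def blend_layers(layer_gains, layer_heights, source_z):
    """Multiply each layer's speaker gains by that layer's elevation weight."""
    weights = elevation_weights(layer_heights, source_z)
    return [{sid: g * w for sid, g in gains.items()}
            for gains, w in zip(layer_gains, weights)]
```

Here `layer_gains` is a list with one dictionary of speaker gains per layer, in the same order as `layer_heights`.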
For example, the dual-balance panning method could render an arbitrary pan along a 3D “near-field” trajectory anywhere within a rectangular array of speakers (of the type described with reference to FIGS. 3-5) including a set of “ceiling” speakers (in a top horizontal plane) and at least one set of lower (e.g., wall or floor) speakers (each set of lower speakers positioned in a horizontal plane below the top horizontal plane) in a theater. To pan in a vertical plane parallel to a side wall of the theater, the rendering system could pan through the ceiling speakers (i.e., render sound using a sequence of subsets of only the ceiling speakers) until an inflection point (a specific distance away from the movie screen, toward the rear wall) is reached. Then, a blend of ceiling and lower speakers could be used to continue the pan (so that the source is perceived as dipping downward as it moves to the rear of the theater). The blending between base and ceiling is not driven by a distance to the screen but by the Z coordinate of the source (and the Z coordinate of each 2D layer of speakers).
The described dual-balance panning method assumes a specific arrangement of loudspeakers (speakers arranged in horizontal planes, with the speakers in each horizontal plane arranged in rows and columns). Thus, it is not optimal for implementing sound panning using arbitrary arrays of loudspeakers (e.g., arrays which comprise any number of arbitrarily positioned speakers). Further, the dual-balance panning method does not assume that the speakers are organized as a mesh of polygons, nor does it determine the projection of a source location (e.g., each of a sequence of source locations) on a face of such a mesh, and gains to be applied to the speaker feeds for the speakers at the vertices of such a face to cause the sound emitted from the speakers to be perceived as emitting from the source location. Rather than implementing efficient determination of only a gain for each speaker at a vertex of one polygonal face (of a speaker array organized as a mesh) and driving of only the speakers at the vertices of one such face (at any instant) to image a source at a source location, the dual-balance method determines gains (front-to-back and left-to-right panning gains) for all speakers of at least one horizontal plane of speakers of such an array and drives all speakers for which both the front-to-back and left-to-right panning gains are nonzero (at any instant).
Some embodiments of the present invention are directed to systems and methods that render audio programs that have been encoded by a type of audio coding called audio object coding (or object based coding or “scene description”). They assume that each such audio program (referred to herein as an object based audio program) may be rendered by any of a large number of different arrays of loudspeakers. Each channel of such object based audio program may be an object channel. In audio object coding, audio signals associated with distinct sound sources (audio objects) are input to the encoder as separate audio streams. Examples of audio objects include (but are not limited to) a dialog track, a single musical instrument, and a jet aircraft. Each audio object is associated with spatial parameters, which may include (but are not limited to) source position, source width, and source velocity and/or trajectory. The audio objects and associated parameters are encoded for distribution and storage. Final audio object mixing and rendering may be performed at the receive end of the audio storage and/or distribution chain, as part of audio program playback. The step of audio object mixing and rendering is typically based on knowledge of actual positions of loudspeakers to be employed to reproduce the program.
Typically, during generation of an object based audio program, the content creator may embed the spatial intent of the mix (e.g., the trajectory of each audio object determined by each object channel of the program) by including metadata in the program. The metadata can be indicative of the position or trajectory of each audio object determined by each object channel of the program, and/or at least one of the size, velocity, type (e.g., dialog or music), and another characteristic of each such object.
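The positional metadata of an object channel can be pictured with a small data-structure sketch. The class name `AudioObject`, its field names, and the use of linear interpolation between time-stamped positions are all assumptions made for illustration; no particular program format or metadata syntax is implied.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class AudioObject:
    """Hypothetical container for one object channel's spatial metadata."""
    name: str
    kind: str = "unspecified"   # e.g., "dialog" or "music"
    size: float = 0.0
    # Time-ordered list of (time_seconds, (x, y, z)) entries.
    keyframes: list = field(default_factory=list)

    def position_at(self, t):
        """Linearly interpolate the object position at time t."""
        times = [k[0] for k in self.keyframes]
        i = bisect_right(times, t)
        if i == 0:
            return self.keyframes[0][1]    # before the first keyframe
        if i == len(times):
            return self.keyframes[-1][1]   # at or after the last keyframe
        (t0, p0), (t1, p1) = self.keyframes[i - 1], self.keyframes[i]
        a = (t - t0) / (t1 - t0)
        return tuple(u + a * (v - u) for u, v in zip(p0, p1))
```

A renderer could query `position_at` once per processing block and pass the resulting (x, y, z) position to whichever panning method is in use.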
During rendering of an object based audio program, each object channel can be rendered (“at” a time-varying position having a desired trajectory) by generating speaker feeds indicative of content of the channel and applying the speaker feeds to a set of loudspeakers (where the physical position of each of the loudspeakers may or may not coincide with the desired position at any instant of time). The speaker feeds for a set of loudspeakers may be indicative of content of multiple object channels (or a single object channel). The rendering system typically generates the speaker feeds to match the exact hardware configuration of a specific reproduction system (e.g., the speaker configuration of a home theater system, where the rendering system is also an element of the home theater system).
In the case that an object based audio program indicates a trajectory of an audio object, the rendering system would typically generate speaker feeds for driving an array of loudspeakers to emit sound intended to be perceived (and which typically will be perceived) as emitting from an audio object having said trajectory. For example, the program may indicate that sound from a musical instrument (an object) should pan from left to right, and the rendering system might generate speaker feeds for driving a 5.1 array of loudspeakers to emit sound that will be perceived as panning from the L (left front) speaker of the array to the C (center front) speaker of the array and then the R (right front) speaker of the array.