Herein, the expression “action sound” is used to denote sound indicative of spatially localized action occurring, during an event on a surface (e.g., a sporting event on a field), at a location (sometimes referred to herein as a “point of interest” or “PI”) on the surface. In this context, action occurring “at” a location on the surface denotes action occurring on or above the location. For example, action sound may be sound (e.g., a ball strike) generated on or above a location on a field, during a sporting event on the field, by one or more sporting event participants.
In sports broadcasting, action sound (e.g., ball strikes and other sounds by sporting event participants) is the most sought-after feature, yet often the most difficult to capture, due to the high level of unwanted sounds (e.g., crowd noise) and the unfeasibility of using close-miking.
In accordance with typical practice for capturing sound indicative of the action at a sporting event (e.g., soccer or football game) on a field (sometimes referred to as a pitch), a number of directional microphones (e.g., about twelve directional microphones) are located outside the edges of the field, and their output signals are manually mixed so that the largest gain is applied to the output(s) of the microphone(s) closer to the action (or pointing at it), while less gain is applied to the outputs of the others. This conventional sound capture method produces sub-optimal results and has disadvantages and limitations including the following:
    Stress for the mixing engineer who must follow the action manually (e.g., with fingers on faders of a console);
    Lack of scalability, in the sense that adding more microphones is unfeasible (too complex to mix); and
    Poor action sound quality (adding more microphones at the side or end of a field doesn't improve the quality with which sound can be captured from inner areas of the field).
Other conventional techniques for capturing sound indicative of action at an event (e.g., sporting event) on a field include using microphones (e.g., parabolic microphones or other hyper-directional microphones, or spherical or cylindrical microphone arrays) located outside (e.g., along the side(s) and/or end(s) of) the field; and performing semi-automated mixing of the microphone output signals by controlling the gains applied to the output signals in response to tracking (e.g., by manual specification of) a time-varying point of interest (“PI”) on the field in real time during the event. Typically, a programmed processor system operates (in response to data indicative of the current PI) to determine automatically a mix of the microphone outputs in which the largest gain is applied to the output(s) of the microphone(s) closest to the current PI (or pointing at it) and less gain is applied to the outputs of the others.
An example of such semi-automated mixing of microphone signals is described in the paper “A New Technology for the Assisted Mixing of Sport Events: Application to Live Football Broadcasting,” by Giulio Cengarle, Toni Mateos, Natanael Olaiz, and Pau Arumi, Audio Engineering Society Paper No. 8037, published on May 1, 2010 (the “Cengarle paper”). As described in the Cengarle paper, microphones are positioned around a field. During a sporting event (e.g., a football or soccer game) on the field, a mixing engineer manually specifies a time-varying or time-invariant point of interest (“PI”) on the field (e.g., a sequence of different regions on the field at which action of interest is occurring) using a graphic user interface of a mixing system. The graphic user interface (e.g., implemented using a touch screen) displays a representation of the field, with a representation of the currently selected PI superimposed thereon. By manually controlling the position of the PI representation, the engineer specifies the currently selected PI. The mixing system is programmed to mix the outputs of the microphones (in response to data indicative of the current PI) to generate an audio mix which can be rendered (for playback by a loudspeaker or loudspeaker array) to provide a perception of sound (captured by all or some of the microphones) emitted at the spatial location corresponding to the currently selected PI (or a sequence of spatial locations corresponding to a time-varying selected PI). The mixing determines a gain to be applied to the output of each microphone in accordance with an algorithm whose parameters include: a parameter indicative of the physical distance between the microphone and the current PI; and another parameter indicative of whether only the outputs of microphones nearest to the PI (or the outputs of all of the microphones) should effectively participate in the mix. 
During sound capture, the engineer may for example use the user interface to control (e.g., vary as desired) the location of the PI while the mixing system automatically determines a corresponding mix of outputs of the microphones.
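The kind of distance-dependent gain computation outlined above can be illustrated with a minimal sketch. The gain law below (an inverse power of distance, normalized to unit total power) is an assumption chosen for illustration and is not the algorithm of the Cengarle paper; the names mic_gains, focus, and ref_distance are hypothetical. The focus parameter plays the role of the second parameter mentioned above: a value of zero lets all microphones participate equally, while larger values concentrate the mix on the microphones nearest the PI.

```python
import math

def mic_gains(mic_positions, pi, focus=2.0, ref_distance=10.0):
    """Return one gain per microphone for a point of interest `pi`.

    Assumed (illustrative) gain law: weight = (1 + d / ref_distance) ** -focus,
    where d is the microphone-to-PI distance in meters. The weights are then
    normalized so that the squared gains sum to one.
    """
    weights = []
    for mx, my in mic_positions:
        d = math.hypot(mx - pi[0], my - pi[1])
        weights.append((1.0 + d / ref_distance) ** (-focus))
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

def mix_sample(mic_samples, gains):
    """Mix one sample per microphone into a single output sample."""
    return sum(g * s for g, s in zip(gains, mic_samples))

# Hypothetical layout: four microphones at the corners of a 100 m x 60 m field.
mics = [(0.0, 0.0), (100.0, 0.0), (0.0, 60.0), (100.0, 60.0)]
gains = mic_gains(mics, pi=(10.0, 5.0))
```

In such a sketch the gains would be recomputed whenever the engineer moves the PI, and in practice would be smoothed over time to avoid audible jumps.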
The method described in the Cengarle paper ameliorates one of the above-mentioned disadvantages of conventional sound capture: it reduces the stress on the engineer by automating part of the engineer's job. However, the inventors have recognized that since the Cengarle paper teaches positioning the action sound capturing microphones around the field, the method described therein does not ameliorate the fact that faraway microphones (microphones far from a selected PI) are typically used in an effort to capture action sound produced on the field during an event. Thus, the inventors have recognized that the method may generate a mix which is not indicative of the action sound (desired to be captured) or which is indicative of the action sound only with very low quality. Typical example embodiments disclosed herein address this limitation of the method described in the Cengarle paper, and generate a good-quality mix indicative of action sound while automating most of the signal processing required to generate the mix.
In some example embodiments disclosed herein, an audio mix (indicative of action sound emitted at a location or sequence of locations on a surface) is generated using subsurface microphones, and the mix is delivered (with corresponding metadata indicative of the location(s)) as an object channel of an object based audio program, which can be rendered to provide a perception (e.g., rendered for playback by an array of loudspeakers to provide an immersive perception) of the action sound.
Methods for generating and rendering object based audio programs are known. During generation of such programs, it may be assumed that the loudspeakers to be employed for rendering are located in arbitrary locations in the playback environment (or that the speakers are in a symmetric configuration in a unit circle). It need not be assumed that the speakers are necessarily in a (nominally) horizontal plane or in any other predetermined arrangement known at the time of program generation. Typically, metadata included in the program indicates rendering parameters for rendering at least one object of the program at an apparent spatial location or along a trajectory (in a three dimensional volume), e.g., using a three-dimensional array of speakers. For example, an object channel of the program may have corresponding metadata indicating a three-dimensional trajectory of apparent spatial positions at which the object (indicated by the object channel) is to be rendered. Examples of rendering of object based audio programs are described, for example, in PCT International Application No. PCT/US2001/028783, published under International Publication No. WO 2011/119401 A2 on Sep. 29, 2011, and assigned to Dolby Laboratories Licensing Corporation, the contents of which are incorporated herein in their entirety; and in PCT International Application No. PCT/US2014/031246, published under International Publication No. WO 2014/165326A1 on Oct. 9, 2014, and assigned to Dolby Laboratories Licensing Corporation and Dolby International AB, the contents of which are incorporated herein in their entirety.
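By way of illustration only, trajectory metadata of the kind mentioned above might be modeled as a time-ordered list of three-dimensional positions that a renderer interpolates between. The sketch below assumes linear interpolation and a simple (time, position) keyframe layout; the class name ObjectTrajectory and its fields are hypothetical and do not reflect the metadata format of any of the cited applications.

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass
class ObjectTrajectory:
    """Hypothetical positional metadata for one object channel.

    keyframes: list of (time_s, (x, y, z)) tuples with strictly increasing times.
    """
    keyframes: list

    def position_at(self, t):
        """Return the apparent (x, y, z) position at time t (seconds)."""
        times = [k[0] for k in self.keyframes]
        i = bisect_right(times, t)
        if i == 0:                      # before the first keyframe: hold first
            return self.keyframes[0][1]
        if i == len(times):             # at or after the last keyframe: hold last
            return self.keyframes[-1][1]
        (t0, p0), (t1, p1) = self.keyframes[i - 1], self.keyframes[i]
        a = (t - t0) / (t1 - t0)        # interpolation fraction in [0, 1)
        return tuple(u + a * (v - u) for u, v in zip(p0, p1))

# Example: an object moving from the origin to (1, 0, 2) over one second.
traj = ObjectTrajectory(keyframes=[(0.0, (0.0, 0.0, 0.0)), (1.0, (1.0, 0.0, 2.0))])
midpoint = traj.position_at(0.5)
```

Because the metadata describes positions rather than speaker feeds, the same trajectory could be rendered over whatever loudspeaker arrangement is present at playback.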
Above-cited PCT Application Publication No. WO 2014/165326A1 describes object based audio programs which are rendered so as to provide an immersive, personalizable perception of the program's audio content. The content may be indicative of the atmosphere and/or action (e.g., game action) at and/or commentary on a spectator event (e.g., a soccer or rugby game, or another sporting event). The audio content of the program may be indicative of multiple audio object channels (e.g., indicative of user-selectable objects or object sets, and typically also a default set of objects to be rendered in the absence of object selection by the user) and at least one bed of speaker channels. For example, the object channels may include an object channel (which may be selected, with corresponding metadata, for rendering) indicative of commentary by an announcer, and a pair of object channels (which may be selected, with corresponding metadata, for rendering) indicative of left and right channels of sound produced by a game ball as it is struck by sporting event participants. The bed of speaker channels may be a conventional mix (e.g., a 5.1 channel mix) of speaker channels of a type that might be included in a conventional broadcast program which does not include an object channel.
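The program structure described above (a bed of speaker channels plus selectable object channels, with a default set rendered when the user makes no selection) can be sketched as a simple data structure. This is an illustrative assumption, not the program format of WO 2014/165326A1; the names AudioProgram, ProgramObject, and objects_to_render are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ProgramObject:
    """Hypothetical description of one object channel in a program."""
    name: str
    default_selected: bool = True   # rendered when the user selects nothing

@dataclass
class AudioProgram:
    """Hypothetical program: one bed of speaker channels plus object channels."""
    bed_layout: str                 # e.g., "5.1"
    objects: list = field(default_factory=list)

    def objects_to_render(self, user_selection=None):
        """Return names of objects to render, honoring the user's selection."""
        if user_selection is None:
            return [o.name for o in self.objects if o.default_selected]
        return [o.name for o in self.objects if o.name in user_selection]

# Example program: commentary and a left/right ball-sound pair by default,
# plus an optional object that is rendered only if the user selects it.
program = AudioProgram(
    bed_layout="5.1",
    objects=[
        ProgramObject("commentary"),
        ProgramObject("ball_left"),
        ProgramObject("ball_right"),
        ProgramObject("home_crowd", default_selected=False),
    ],
)
default_mix = program.objects_to_render()
custom_mix = program.objects_to_render({"ball_left", "ball_right", "home_crowd"})
```

In this sketch the bed is always rendered, and the user's selection only changes which object channels are rendered along with it.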