Audio programs may comprise a plurality of audio objects in order to enhance the listening experience. The audio objects may be positioned at time-varying positions within a three-dimensional rendering environment. In particular, the audio objects may be positioned at different heights, and the rendering environment may be configured to render such audio objects at different heights.
The transmission of audio programs which comprise a plurality of audio objects may require a relatively large bandwidth. In order to reduce the bandwidth required for such audio programs, the plurality of audio objects may be downmixed to a limited number of audio channels. By way of example, the plurality of audio objects may be downmixed to two audio channels (e.g. to a stereo downmix signal), to 5+1 audio channels (e.g. to a 5.1 downmix signal) or to 7+1 audio channels (e.g. to a 7.1 downmix signal). Furthermore, metadata may be provided (referred to herein as upmix metadata or joint object coding, JOC, metadata) which provides a parametric description of the audio objects that are contained within the downmix audio signal. In particular, the upmix or JOC metadata may be used by a corresponding upmixer or decoder to derive a reconstruction of the plurality of audio objects from the downmix audio signal.
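The downmix and the parametric reconstruction described above may be illustrated by the following minimal sketch. It assumes a stereo downmix with constant-power panning and a single broadband pair of upmix weights per audio object, obtained by a least-squares fit. The function names (`pan_gains`, `downmix_to_stereo`, `upmix_coefficients`, `reconstruct`), the panning law and the least-squares estimation are illustrative assumptions and are not taken from the present document or from any JOC specification; an actual system may use parameters that vary over time and frequency.

```python
import math


def pan_gains(pan):
    """Constant-power stereo gains (assumed panning law) for pan in [-1, 1]."""
    angle = (pan + 1.0) * math.pi / 4.0
    return math.cos(angle), math.sin(angle)


def downmix_to_stereo(objects, pans):
    """Sum the panned audio objects into a left and a right downmix channel."""
    n = len(objects[0])
    left = [0.0] * n
    right = [0.0] * n
    for samples, pan in zip(objects, pans):
        g_l, g_r = pan_gains(pan)
        for i, x in enumerate(samples):
            left[i] += g_l * x
            right[i] += g_r * x
    return left, right


def dot(a, b):
    """Inner product of two equally long sample sequences."""
    return sum(x * y for x, y in zip(a, b))


def upmix_coefficients(obj, left, right, eps=1e-12):
    """Least-squares weights (c_l, c_r) such that c_l*left + c_r*right ~ obj.

    These weights play the role of (hypothetical) upmix metadata for one object.
    """
    ll, rr, lr = dot(left, left), dot(right, right), dot(left, right)
    ol, o_r = dot(obj, left), dot(obj, right)
    det = ll * rr - lr * lr
    if abs(det) < eps:
        # Downmix channels are (nearly) linearly dependent:
        # fall back to one shared weight on the sum of both channels.
        mono = [l + r for l, r in zip(left, right)]
        c = dot(obj, mono) / (dot(mono, mono) + eps)
        return c, c
    c_l = (ol * rr - o_r * lr) / det
    c_r = (o_r * ll - ol * lr) / det
    return c_l, c_r


def reconstruct(left, right, coeffs):
    """Apply the upmix weights to the downmix to approximate one object."""
    c_l, c_r = coeffs
    return [c_l * l + c_r * r for l, r in zip(left, right)]
```

In this sketch the encoder side would call `downmix_to_stereo` and `upmix_coefficients`, and the decoder side would call `reconstruct` using only the downmix channels and the transmitted weights. With more audio objects than downmix channels, the reconstruction is in general only an approximation of the original objects.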
Within the transmission chain from an encoder (which provides the downmix signal and the JOC metadata) to a decoder (which reconstructs the plurality of audio objects based on the downmix signal and based on the JOC metadata), there may be a need to insert an audio signal (e.g. a system sound of a set-top box) into the bitstream comprising the downmix signal and the JOC metadata. The present document describes methods and systems which enable an efficient and high-quality insertion of one or more audio signals into such a downmix signal.