As a typical method of coding an audio signal, there is known, for example, a method of coding the audio signal by performing frame processing, in which the signal is segmented in time into frames each having a predetermined number of samples. In addition, the audio signal that is coded and transmitted as described above is decoded afterwards, and the decoded audio signal is reproduced by an audio reproduction system such as earphones or speakers, or by a reproduction apparatus.
In recent years, technologies have been developed for enhancing the convenience of a user of a reproduction apparatus by mixing a decoded audio signal with an external audio signal, or by performing rendering so that a decoded audio signal is reproduced from an arbitrary position such as above, below, left, or right of the listener. With such a technology, at a remote conference conducted via a network, for example, a participant at one location can independently adjust the spatial arrangement or volume of the sound of another participant at a different location. Furthermore, music enthusiasts can interactively generate a remix of a music track to enjoy, by controlling the vocal or various instrumental components of a favorite piece in a variety of ways, for example.
As a technology for implementing such applications, there is a parametric audio object coding technology (see PTL 1 and NPL 1, for example). For example, the Moving Picture Experts Group Spatial Audio Object Coding specification (MPEG-SAOC), which has been in the process of standardization in recent years, was developed as described in NPL 1.
Here, there is a coding technology which is similar to SAC and was developed for the purpose of efficiently coding audio object signals with a low calculation amount, based on a parametric multi-channel coding technology (also known as Spatial Audio Coding (SAC)) represented by MPEG Surround disclosed, for example, in NPL 2. With this SAC-like coding technology, statistical correlations between audio signals, such as the phase difference or level ratio between signals, are calculated, quantized, and coded. This allows more efficient coding than a system in which the audio signals are coded independently. The MPEG-SAOC technology disclosed in the above-described NPL 1 is obtained by extending this SAC-like coding technology so that it can be applied to audio object signals.
Assume that the audio space of a reproduction apparatus (parametric audio object decoding apparatus) that uses a parametric audio object coding technology such as the MPEG-SAOC technology is an audio space that enables multi-channel surround reproduction in a 5.1 surround sound system. In this case, in the parametric audio object decoding apparatus, a device called a transcoder converts parameters coded based on statistics between the audio object signals, using audio spatial parameters (HRTF coefficients). This makes it possible to reproduce the audio signal in an audio space arrangement that suits the intention of the listener.
FIG. 1 is a block diagram which shows a configuration of a general parametric audio object coding apparatus 100. The audio object coding apparatus 100 shown in FIG. 1 includes: an object downmixing circuit 101; a T-F conversion circuit 102; an object parameter extracting circuit 103; and a downmix signal coding circuit 104.
The object downmixing circuit 101 is provided with audio object signals and downmixes the provided audio object signals into monaural or stereo downmix signals.
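The downmixing performed by the object downmixing circuit 101 can be illustrated with a minimal sketch. The function name, the per-object gain parameter, and the naive left/right assignment for the stereo case are all illustrative assumptions; an actual encoder's downmix rules are implementation-specific.

```python
import numpy as np

def downmix_objects(objects, gains=None, stereo=False):
    """Downmix K audio object signals (K x T array) into a mono (T,)
    or stereo (2 x T) downmix signal.

    `gains` (per-object downmix gains) is a hypothetical parameter;
    unit gains are used when it is omitted.
    """
    objects = np.asarray(objects, dtype=float)
    k = objects.shape[0]
    if gains is None:
        gains = np.ones(k)
    weighted = objects * np.asarray(gains, dtype=float)[:, None]
    if not stereo:
        return weighted.sum(axis=0)            # mono downmix
    # illustrative stereo downmix: alternate objects left/right
    left = weighted[0::2].sum(axis=0)
    right = weighted[1::2].sum(axis=0)
    return np.stack([left, right])             # 2 x T
```

For two objects, the mono downmix is simply their gain-weighted sum sample by sample.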
The downmix signal coding circuit 104 is provided with the downmix signals resulting from the downmixing performed by the object downmixing circuit 101. The downmix signal coding circuit 104 codes the provided downmix signals to generate a downmix bitstream. Here, in the MPEG-SAOC technology, the MPEG-AAC system is used as the downmix coding system.
The T-F conversion circuit 102 is provided with the audio object signals and decomposes the provided audio object signals into spectrum signals specified by both time and frequency.
The object parameter extracting circuit 103 is provided with the audio object signals decomposed into the spectrum signals by the T-F conversion circuit 102 and calculates object parameters from the provided spectrum signals. Here, in the MPEG-SAOC technology, the object parameters (extended information) include, for example, object level differences (OLD), inter-object cross correlation (IOC), downmix channel level differences (DCLD), object energies (NRG), and so on.
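How an OLD-like parameter is obtained from the spectrum signals can be sketched as follows. This is a minimal sketch, assuming per-object energies in a set of parameter bands for one time segment; the normalization by the strongest object per band follows the general idea of the OLD parameter, while quantization and the exact band definitions are omitted.

```python
import numpy as np

def object_level_differences(spectra):
    """Compute object energies (NRG) and OLD-like parameters.

    `spectra`: K x B array of per-object spectral magnitudes in B
    parameter bands for one time segment. OLDs are the energies
    normalized by the per-band maximum; bands with no energy get 0.
    """
    nrg = np.abs(np.asarray(spectra, dtype=float)) ** 2   # object energies (NRG)
    max_nrg = nrg.max(axis=0)                             # strongest object per band
    with np.errstate(divide="ignore", invalid="ignore"):
        old = np.where(max_nrg > 0, nrg / max_nrg, 0.0)
    return nrg, old
```

The strongest object in each band thus receives an OLD of 1, and weaker objects receive fractions of it.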
A multiplexing circuit 105 is provided with the object parameters calculated by the object parameter extracting circuit 103 and the downmix bitstream generated by the downmix signal coding circuit 104. The multiplexing circuit 105 multiplexes the provided downmix bitstream and object parameters into a single audio bitstream and outputs it.
The audio object coding apparatus 100 is configured as described above.
FIG. 2 is a block diagram which shows a configuration of a typical audio object decoding apparatus 200. The audio object decoding apparatus 200 shown in FIG. 2 includes: an object parameter converting circuit 203; and a parametric multi-channel decoding circuit 206.
FIG. 2 shows a case where the audio object decoding apparatus 200 includes speakers of a 5.1 surround sound system. Accordingly, two decoding circuits are connected to each other in series in the audio object decoding apparatus 200. More specifically, the object parameter converting circuit 203 and the parametric multi-channel decoding circuit 206 are connected to each other in series. In addition, a demultiplexing circuit 201 and a downmix signal decoding circuit 210 are provided in a stage prior to the audio object decoding apparatus 200, as shown in FIG. 2.
The demultiplexing circuit 201 is provided with the object stream, that is, an audio object coded signal, and demultiplexes the provided audio object coded signal to a downmix coded signal and object parameters (extended information). The demultiplexing circuit 201 outputs the downmix coded signal and the object parameters (extended information) to the downmix signal decoding circuit 210 and the object parameter converting circuit 203, respectively.
The downmix signal decoding circuit 210 decodes the provided downmix coded signal to a downmix decoded signal and outputs the decoded signal to the object parameter converting circuit 203.
The object parameter converting circuit 203 includes a downmix signal preprocessing circuit 204 and an object parameter arithmetic circuit 205.
The downmix signal preprocessing circuit 204 generates a new downmix signal based on characteristics of the spatial prediction parameters included in the MPEG Surround coding information. More specifically, the downmix signal preprocessing circuit 204 is provided with the downmix decoded signal outputted from the downmix signal decoding circuit 210 to the object parameter converting circuit 203, and generates a preprocessed downmix signal based on the provided downmix decoded signal. In doing so, the downmix signal preprocessing circuit 204 ultimately generates the preprocessed downmix signal according to arrangement information (rendering information) and to information included in the object parameters, which are included in the demultiplexed audio object signal. Then, the downmix signal preprocessing circuit 204 outputs the generated preprocessed downmix signal to the parametric multi-channel decoding circuit 206.
The object parameter arithmetic circuit 205 converts the object parameters into spatial parameters that correspond to the Spatial Cues of the MPEG Surround system. More specifically, the object parameters (extended information) outputted from the demultiplexing circuit 201 to the object parameter converting circuit 203 are provided to the object parameter arithmetic circuit 205. The object parameter arithmetic circuit 205 converts the provided object parameters into audio spatial parameters and outputs the converted parameters to the parametric multi-channel decoding circuit 206. Here, these audio spatial parameters correspond to the audio spatial parameters of the SAC coding system described above.
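The conversion performed by the object parameter arithmetic circuit 205 can be illustrated with a minimal sketch for a two-channel case: given per-band object energies (NRG) and a listener-chosen 2 x K rendering matrix, the energy reaching each output channel is accumulated and expressed as a channel level difference (CLD) in decibels. The function name, the simple energy model, and the omission of coherence (ICC) handling are illustrative assumptions; the actual mapping to MPEG Surround Spatial Cues is considerably more involved.

```python
import numpy as np

def object_to_spatial_params(nrg, render_gains, eps=1e-12):
    """Convert object energies into a per-band channel level difference.

    `nrg`: K object energies for one parameter band.
    `render_gains`: 2 x K rendering matrix (listener-chosen gains).
    Returns the per-channel energies and the CLD in dB.
    """
    nrg = np.asarray(nrg, dtype=float)
    g = np.asarray(render_gains, dtype=float)
    ch_energy = (g ** 2) @ nrg                 # energy reaching each channel
    cld_db = 10.0 * np.log10((ch_energy[0] + eps) / (ch_energy[1] + eps))
    return ch_energy, cld_db
```

With two equal-energy objects panned hard left and hard right, the channel energies are equal and the CLD is 0 dB, as expected.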
The parametric multi-channel decoding circuit 206 is provided with the preprocessed downmix signal and the audio spatial parameters, and generates audio signals based on the provided preprocessed downmix signal and audio spatial parameters.
The parametric multi-channel decoding circuit 206 includes: a domain converting circuit 207; a multi-channel signal synthesizing circuit 208; and an F-T converting circuit 209.
The domain converting circuit 207 converts the preprocessed downmix signal provided to the parametric multi-channel decoding circuit 206, into a synthesized spatial signal.
The multi-channel signal synthesizing circuit 208 converts the synthesized spatial signal converted by the domain converting circuit 207, into a multi-channel spectrum signal based on the audio spatial parameter provided by the object parameter arithmetic circuit 205.
The F-T converting circuit 209 converts the multi-channel spectrum signal converted by the multi-channel signal synthesizing circuit 208 into a multi-channel temporal-domain audio signal, and outputs the converted audio signal.
The audio object decoding apparatus 200 is configured as described above.
It is to be noted that the audio object coding method described above provides the following two functions. One is a function which realizes high compression efficiency, not by independently coding all of the objects to be transmitted, but by transmitting the downmix signal and small object parameters. The other is a resynthesizing function which allows real-time change of the audio space on the reproduction side, by processing the object parameters in real time based on the rendering information.
In addition, with the audio object coding method described above, the object parameters (extended information) are calculated for each cell segmented by time and frequency (the widths of a cell are called the temporal granularity and the frequency granularity). The time division for calculating the object parameters is adaptively determined according to the transmission granularity of the object parameters. At a low bit rate, it is necessary to code the object parameters more efficiently, in view of the balance between frequency resolution and temporal resolution, than at a high bit rate.
In addition, the frequency resolution used in the audio object coding technology is segmented based on knowledge of human auditory perception characteristics. On the other hand, the temporal resolution used in the audio object coding technology is determined by detecting a significant change in the object parameter information in each frame. As a reference for the temporal segmentation, for example, one temporal segment is provided per frame. When this reference segmentation is applied, the same object parameters are transmitted for the entire time length of the frame.
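A minimal sketch of how detecting a significant change might place an additional temporal segment boundary, assuming per-band energies as the detection input and a hypothetical 6 dB threshold (the text does not prescribe this particular rule, so both the threshold and the single-split policy are illustrative assumptions):

```python
import numpy as np

def split_frame_on_change(band_energies, threshold_db=6.0):
    """Place one extra temporal-segment boundary where the per-band
    energy changes most, if that change exceeds `threshold_db`.

    `band_energies`: T x B energies for the T time slots of one frame.
    Returns the list of segment start indices.
    """
    e = np.asarray(band_energies, dtype=float) + 1e-12
    # slot-to-slot change, summed over bands, in dB
    change = np.abs(10.0 * np.log10(e[1:] / e[:-1])).sum(axis=1)
    t = int(change.argmax()) + 1
    if change.max() >= threshold_db:
        return [0, t]          # two segments: [0, t) and [t, T)
    return [0]                 # one segment spanning the whole frame
```

A frame with a sudden 20 dB energy jump is split at the jump, whereas a stationary frame keeps the single reference segment.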
As described above, in order to obtain high coding efficiency on the side of a coding apparatus for audio object coding, the temporal resolution and the frequency resolution of each of the object parameters are adaptively controlled in many cases. In such adaptive control, the temporal resolution and the frequency resolution are generally changed as needed according to the complexity of the audio information of the downmix signal, the characteristics of each object signal, and the requested bit rate. FIG. 3 shows an example of this.
FIG. 3 shows a relationship between a temporal segment and a subband, a parameter set, and a parameter band. As shown in FIG. 3, a spectrum signal included in one frame is segmented into N temporal segments and k frequency segments.
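The cell structure of FIG. 3 can be sketched as follows, assuming the frame spectrum is given as a T x F array and the segment boundaries as explicit indices; the helper simply averages the spectrum within each of the N x K parameter cells (the function name and the choice of averaging are illustrative):

```python
import numpy as np

def segment_frame(spec, time_borders, band_borders):
    """Average a frame's spectrum into N x K parameter cells.

    `spec`: T x F time-frequency magnitudes for one frame.
    `time_borders` / `band_borders`: start indices of the N temporal
    segments and K parameter bands (as in FIG. 3).
    """
    spec = np.asarray(spec, dtype=float)
    tb = list(time_borders) + [spec.shape[0]]   # append frame end
    fb = list(band_borders) + [spec.shape[1]]   # append spectrum end
    cells = np.empty((len(time_borders), len(band_borders)))
    for i in range(len(time_borders)):
        for j in range(len(band_borders)):
            cells[i, j] = spec[tb[i]:tb[i + 1], fb[j]:fb[j + 1]].mean()
    return cells
```

Each cell then carries one parameter value for its temporal segment and parameter band.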
Meanwhile, with the MPEG-SAOC technology disclosed in the above-described NPL 1, each frame includes a maximum of eight temporal segments according to the specification. In addition, when smaller temporal and frequency segments are applied, the audio quality after coding and the distinction between the sounds of the individual object signals naturally improve; however, the amount of information to be transmitted increases as well, resulting in an increase in the bit rate. As described above, there is a trade-off between the bit rate and the audio quality.
Thus, there is an experimentally derived method of temporal segmentation. To be specific, in order to assign an appropriate bit rate to the object parameters, at most one additional temporal segment is set, so that one frame is segmented into one or two regions. Such a limitation enables an appropriate balance between the audio quality and the bit rate assigned to the object parameters. With zero or one additional segment, for example, the bit rate requested for the object parameters is approximately 3 kbps per object, resulting in an additional overhead of 3 kbps per scene. Thus, it is apparent that, as the number of objects increases, the parametric object coding method becomes more efficient than a conventional general object coding method.
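The figures above imply a simple side-information budget. A minimal sketch, assuming the approximate values quoted (about 3 kbps per object plus about 3 kbps of per-scene overhead; the function name is illustrative):

```python
def parametric_bitrate_kbps(num_objects, per_object_kbps=3.0,
                            scene_overhead_kbps=3.0):
    """Approximate side-information rate for one scene: roughly
    3 kbps per object plus about 3 kbps of per-scene overhead."""
    return num_objects * per_object_kbps + scene_overhead_kbps
```

For a four-object scene this gives 4 x 3 + 3 = 15 kbps, growing only linearly with the number of objects, whereas coding each object independently would multiply a full codec bit rate by the number of objects.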
As described above, it is possible to achieve excellent audio quality with object coding of high bit efficiency, by using the aforementioned temporal segmentation. However, coded audio with sufficient quality cannot always be provided for all essential applications. In view of the above, a residual coding technique is introduced into the parametric coding technology so that the gap between the audio quality of parametric object coding and transparent audio quality is bridged.
In the general residual coding technique, the residual signal relates, in most cases, to a portion other than the main part of the downmix signal. For simplicity here, the residual signal is assumed to be the difference between two downmix signals. In addition, it is assumed that only a low-frequency component of the residual signal is transmitted so as to reduce the bit rate. In such a case, the frequency band of the residual signal is set on the side of the coding apparatus, and the trade-off between the consumed bit rate and the reproduction quality is adjusted.
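Under these simplifying assumptions (the residual is the difference between two downmix signals, and only a low-frequency band of it is transmitted), the residual extraction might be sketched as follows. The FFT brick-wall filter and the function name are illustrative; a real encoder would band-limit the residual inside its analysis filterbank.

```python
import numpy as np

def band_limited_residual(downmix_a, downmix_b, sample_rate,
                          cutoff_hz=2000.0):
    """Residual as the difference between two downmix signals, kept
    only below `cutoff_hz` via a simple FFT brick-wall filter."""
    a = np.asarray(downmix_a, dtype=float)
    b = np.asarray(downmix_b, dtype=float)
    residual = a - b
    spectrum = np.fft.rfft(residual)
    freqs = np.fft.rfftfreq(residual.size, d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0          # discard the high band
    return np.fft.irfft(spectrum, n=residual.size)
```

Identical downmixes yield a zero residual, and only the components of the difference below the cutoff survive the band limiting.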
On the other hand, with the MPEG-SAOC technology, it is only necessary to retain a frequency band of 2 kHz as a useful residual signal, and the audio quality is clearly improved by performing coding at 8 kbps per residual signal. Thus, for an object signal for which high audio quality is required, a bit rate of 3 + 8 = 11 kbps per object is assigned to the object parameters. Accordingly, when an application requires many high-quality objects, the requested bit rate is considered to become extremely high.