The invention relates to the coding and decoding of audio signals, and particularly to the coding and decoding of audio signals using an intensity stereo process and a prediction.
The most advanced audio coding and decoding processes, operating e.g. according to the MPEG Layer 3 standard, can compress the data rate of digital audio signals e.g. by a factor of twelve without markedly lowering their quality.
Apart from a great coding gain in the individual channels, e.g. the left channel L and right channel R, the relative redundancy and irrelevance of the two channels are also utilised in the case of stereo. The known methods which have already been used are the so-called MS stereo process (MS=centre-side) and the intensity stereo process (IS process).
The MS stereo process, which is known in the art, substantially utilises the relative redundancy of the two channels, with a sum of the two channels and a difference between them being calculated, then transmitted as modified channel data for the left and right channel respectively. That is to say, the MS stereo process has a precisely reconstructing action.
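By way of illustration only, the precisely reconstructing action of the MS stereo process may be sketched as follows in Python. The normalisation by 1/2 and the function names `ms_encode` and `ms_decode` are assumptions made for this sketch; the description above states only that a sum and a difference of the two channels are formed. Quantisation is left out.

```python
def ms_encode(left, right):
    """Form the sum (mid) and difference (side) of two channels."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Recover the left and right channels from mid and side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


left = [1.0, 0.5, -0.25, 2.0]
right = [0.5, 0.5, 0.75, -1.0]
mid, side = ms_encode(left, right)
decoded_left, decoded_right = ms_decode(mid, side)
```

Apart from quantisation, the decoded channels equal the original ones, which is what distinguishes the MS stereo process from the intensity stereo process.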
Unlike the MS stereo process, the intensity stereo process chiefly makes use of stereo irrelevance. It should be mentioned in connection with stereo irrelevance that the spatial perception of the human hearing system depends on the frequency of the audio signals perceived. At low frequencies both magnitude information and phase information in the two stereo signals are evaluated by the human hearing system, whereas perception of high frequency components is based mainly on analysis of the energy-time envelopes of both channels. Thus the exact phase information in the signals in both channels is not relevant to spatial perception at high frequencies. This feature of human hearing is utilised to make use of the stereo irrelevance for further data reduction of audio signals by the intensity stereo process.
As the stereo intensity process cannot resolve precise local information at high frequencies, it is possible to transmit a joint energy envelope for both channels instead of two separate stereo channels L, R, from an intensity frequency limit defined in the encoder. In addition to the joint energy envelope roughly quantised direction information is also transmitted as side information.
As a channel is only partially transmitted when intensity stereo coding is used, the saving of bits may be up to 50%. It should be noted however that the IS process does not have a precisely reconstructing action in the decoder.
In the IS process hitherto employed in the MPEG standard, Layer 3, the fact that the IS process is active in a block of stereo-audio spectral values is indicated by a so-called mode_extension_bit, and each block has a mode_extension_bit assigned to it.
A theoretical representation of the known IS process is given in FIG. 1. Stereo-audio spectral values for a channel L 10 and a channel R 12 are totalled at a summation point 14 to obtain an energy envelope I=Li+Ri for the two channels. Li and Ri here represent the stereo-audio spectral values of the respective channels L and R in any scale factor band. As already mentioned, use of the IS process is only permissible above a certain IS frequency limit, in order to avoid inserting coding errors into the stereo-audio spectral values coded. The left and right channels therefore have to be coded separately within a range from 0 Hz to the IS frequency limit. The IS frequency limit as such is determined in a separate algorithm which does not form part of the invention. From this frequency limit upwards the encoder codes the total signal of the left channel 10 and right channel 12, formed at the summation point 14.
Scaling information 16 for channel L and scaling information 18 for channel R are necessary for decoding in addition to the energy envelope, i.e. the total signal of the left and right channels, which may e.g. be transmitted in the coded left channel. Scale factors for the left and right channels are transmitted in the intensity stereo process as implemented e.g. in MPEG Layer 2. However it should be mentioned here that, in the IS process in MPEG Layer 3 for IS-coded stereo-audio spectral values, intensity direction information is transmitted only in the right channel, and the spectral values are decoded again with this information as explained below.
The scaling information 16 and 18 is transmitted as side information in addition to the coded spectral values of channel L and channel R. A decoder delivers decoded audio signal values in a decoded channel L′ 20 and a decoded channel R′ 22, and the scaling information 16 for channel L and 18 for channel R is multiplied by the decoded stereo-audio spectral values for the respective channels in an L multiplier 24 and an R multiplier 26, as a means of decoding the originally coded stereo-audio spectral values.
Before IS coding is applied above a certain IS frequency limit or MS coding below that limit the stereo audio spectral values for each channel are grouped into so-called scale factor bands. The bands are adapted to the perception properties of the hearing system. Each band may be amplified with an additional factor, the so-called scale factor, which is transmitted as side information for the particular channel and which constitutes part of the scaling information 16 and 18 in FIG. 1. These factors are responsible for the formation of an interfering noise which is introduced by quantisation, in such a way that it is "masked" in respect of psycho-acoustic aspects and thus becomes inaudible.
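The grouping into scale factor bands and the amplification of each band by its own factor may be pictured with the following minimal Python sketch. The function name `apply_scale_factors`, the representation of the bands by a list of edge indices and the use of plain linear gains are assumptions of the sketch and are not taken from any standard.

```python
def apply_scale_factors(spectrum, band_edges, scale_factors):
    """Multiply each scale factor band of `spectrum` by its own gain.

    band_edges[i] and band_edges[i + 1] delimit band i (end exclusive).
    """
    if len(band_edges) != len(scale_factors) + 1:
        raise ValueError("need exactly one scale factor per band")
    scaled = list(spectrum)
    for band, gain in enumerate(scale_factors):
        for index in range(band_edges[band], band_edges[band + 1]):
            scaled[index] = spectrum[index] * gain
    return scaled


# Three bands of unequal width: [0, 2), [2, 5) and [5, 8).
spectrum = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
band_edges = [0, 2, 5, 8]
scaled = apply_scale_factors(spectrum, band_edges, [2.0, 1.0, 0.5])
```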
FIG. 2a shows a format of the coded right channel R, used e.g. in an MPEG Layer 3 audio coding process. Any further mention of intensity stereo coding will relate to the MPEG layer 3 standard process. The individual scale factor bands 28, into which the stereo audio spectral values are grouped, are shown diagrammatically in the first line of FIG. 2a. In FIG. 2a these bands are shown equal in width purely for clarity; in practice their widths will not be equal, owing to the psycho-acoustic properties of the hearing system.
The second line of FIG. 2a contains coded stereo audio spectral values sp, which are non-zero below an IS frequency limit 32; the stereo audio spectral values in the right channel above the IS frequency limit are set to zero (zero_part) nsp, as already mentioned (nsp=zero spectrum).
The third line of FIG. 2a contains part of the side information 34 for the right channel. The part of the information 34 shown firstly comprises the scale factors skf for the range below the IS frequency limit 32 and the direction information rinfo 36 for the range above the frequency limit. The direction information is used to ensure rough spatial resolution of the IS coded frequency range in the intensity stereo process. Thus the direction information rinfo 36, also referred to as intensity positions (is_pos), is transmitted in the right channel instead of the scale factors. It should be mentioned again that the scale factors 34 corresponding to the scale factor bands 28 are still present in the right channel below the IS frequency limit. The intensity positions 36 indicate the perceived stereo imaging position (the ratio of left to right) of the signal source within the respective scale factor bands 28. In each band 28 above the IS frequency limit the decoded values of the stereo audio spectral values transmitted are scaled by the MPEG Layer 3 process, with the following scaling factors kL for the left channel and kR for the right one:
kL = is_ratio/(1 + is_ratio)    (1)
and
kR = 1/(1 + is_ratio)    (2)
The equation for is_ratio is as follows:
is_ratio = tan(is_pos·π/12)    (3)
The value is_pos is quantised with 3 bits, only the values from 0 to 6 being valid position values. The left and right channels can be derived from the I signal (I=Li+Ri) in the following two equations:
Li = I·is_ratio/(1 + is_ratio) = I·kL    (4)
Ri = I·1/(1 + is_ratio) = I·kR    (5)
Ri and Li are the intensity stereo decoded stereo audio spectral values. It should be mentioned here that the left channel format is analogous to the right channel format shown in FIG. 2a, except that the combined spectrum I=Li+Ri rather than the zero spectrum is found above the IS frequency limit 32 in the left channel, and ordinary scale factors rather than direction information is_pos are present for the left channel. In the MPEG Layer 3 standard, the transition from non-zero quantised spectral values to zero values in the right channel can implicitly indicate the IS frequency limit to the decoder.
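The decoder-side scaling may be sketched as follows in Python, taking kL as the factor for the left channel and kR as the factor for the right channel, as defined with equations (1) to (3). The function names are hypothetical, and the handling of is_pos = 6, where tan(π/2) is unbounded, as "all energy in the left channel" is an assumption of the sketch.

```python
import math


def intensity_scale_factors(is_pos):
    """Return (k_left, k_right) for an intensity position in 0..6."""
    if not 0 <= is_pos <= 6:
        raise ValueError("is_pos must lie in 0..6")
    if is_pos == 6:
        # Assumed limiting case: is_ratio grows without bound.
        return 1.0, 0.0
    is_ratio = math.tan(is_pos * math.pi / 12.0)
    return is_ratio / (1.0 + is_ratio), 1.0 / (1.0 + is_ratio)


def intensity_decode_band(intensity_values, is_pos):
    """Split the joint values I of one band into left and right values."""
    k_left, k_right = intensity_scale_factors(is_pos)
    left = [value * k_left for value in intensity_values]
    right = [value * k_right for value in intensity_values]
    return left, right


left, right = intensity_decode_band([2.0, -4.0], is_pos=3)
```

Because kL + kR = 1, the two decoded channels always add up to the transmitted joint signal I; only the distribution between left and right depends on is_pos.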
The transmitted channel L is thus calculated in the encoder as the sum of the left and right channels, and the direction information transmitted may be defined by the following equation:
is_pos = nint[arctan(EL/ER)·12/π]    (6)
The nint[x] function represents the "nearest integer" function, EL and ER being the energy in the respective scale factor bands of the left and right channels. This formulation of the encoder/decoder gives an approximate reconstruction of signals in the left and right channels.
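The encoder side of equation (6) may be sketched as follows in Python. The function names are hypothetical. The use of `math.atan2` in place of arctan(EL/ER), so that a band with ER = 0 remains defined, and the final clamping to the valid range 0 to 6 are assumptions of the sketch.

```python
import math


def nint(x):
    """Round to the nearest integer (halves are rounded up)."""
    return math.floor(x + 0.5)


def intensity_encode_band(left, right):
    """Return the joint values I = L + R and is_pos for one band."""
    intensity = [l + r for l, r in zip(left, right)]
    energy_left = sum(l * l for l in left)
    energy_right = sum(r * r for r in right)
    # atan2 equals arctan(EL/ER) for ER > 0 and stays defined for ER == 0.
    angle = math.atan2(energy_left, energy_right)
    is_pos = nint(angle * 12.0 / math.pi)
    return intensity, min(max(is_pos, 0), 6)


intensity, is_pos = intensity_encode_band([1.0, -2.0], [1.0, -2.0])
```

For equal energies in both channels the sketch yields the centre position is_pos = 3; a band with energy only in the left channel yields 6, and one with energy only in the right channel yields 0.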
The intensity stereo process is described in R G v d Waal, R N J Veldhuis: "Subband Coding of Stereophonic Digital Audio Signals", IEEE ICASSP, pages 3601-3604, and in J Herre, K Brandenburg, D Lederer: "Intensity Stereo Coding", 96th AES Convention, Amsterdam 1994, Preprint 3799.
The use of prediction is already known in coding and decoding by means of an NBC encoder. Second order, backward adaptive prediction is used in particular. "Backward adaptive" means that no predictor coefficients need be transmitted, as the predictor in the encoder and in the decoder is fed with the same input signals. The prediction means in the decoder can consequently derive the prediction coefficients itself.
The mode of operation of a predictor is basically to supply an estimated value for the current signal based on the preceding input values. For tonal signals, i.e. signals with a rather narrow spectrum, the prediction error signal, i.e. the difference between the original spectrum and the estimated value, is considerably smaller than the original spectrum, and the prediction error signal can consequently be coded with fewer bits, thus producing further substantial data compression. Thus only this quantised prediction error signal is transmitted. The predictor of the decoder can derive the original signal from it.
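The principle of second order, backward adaptive prediction may be illustrated with the following Python sketch for a single spectral line observed over successive blocks. The normalised LMS update, the step size, the class name `BackwardAdaptivePredictor` and the omission of quantisation are all assumptions made for illustration; the description does not specify the adaptation rule actually used.

```python
import math


class BackwardAdaptivePredictor:
    """Second-order predictor adapted from past reconstructed values only."""

    def __init__(self, step=0.5):
        self.coeffs = [0.0, 0.0]
        self.history = [0.0, 0.0]  # newest reconstructed value first
        self.step = step

    def predict(self):
        return sum(c * h for c, h in zip(self.coeffs, self.history))

    def update(self, reconstructed):
        """Adapt the coefficients (normalised LMS) and shift the history."""
        error = reconstructed - self.predict()
        energy = sum(h * h for h in self.history) + 1e-9
        self.coeffs = [c + self.step * error * h / energy
                       for c, h in zip(self.coeffs, self.history)]
        self.history = [reconstructed, self.history[0]]


def encode(values):
    predictor = BackwardAdaptivePredictor()
    errors = []
    for value in values:
        estimate = predictor.predict()
        error = value - estimate  # only the error is transmitted
        errors.append(error)
        predictor.update(estimate + error)
    return errors


def decode(errors):
    predictor = BackwardAdaptivePredictor()
    values = []
    for error in errors:
        value = predictor.predict() + error
        values.append(value)
        predictor.update(value)
    return values


# A tonal signal: one spectral line over 400 successive blocks.
signal = [math.cos(0.3 * t) for t in range(400)]
errors = encode(signal)
decoded = decode(errors)
signal_energy = sum(v * v for v in signal[200:])
error_energy = sum(e * e for e in errors[200:])
```

Both predictors see only reconstructed values, so the decoder arrives at the same coefficients as the encoder without any coefficients being transmitted, and once the predictor has adapted to the tonal signal the error energy is well below the signal energy.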
If the input signals of the predictor are not tonal, as is the case e.g. with audio signals representing audience applause, the prediction error signal may become stronger than the original signal. Prediction then produces a bit loss, i.e. it leads to an increase in the quantity of data to be coded. For this reason prediction may be switched on and off for each scale factor band individually. Like intensity direction information, information as to whether prediction is used in a scale factor band or not is transmitted as side information.
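A decision of this kind for one scale factor band may be sketched as follows in Python. The criterion used here, a comparison of the energy of the prediction error with the energy of the original values, and the function name `choose_band_payload` are assumptions of the sketch; the description states only that prediction can be switched per band and that the choice is signalled as side information.

```python
def choose_band_payload(band_values, band_estimates):
    """Return (prediction_used, payload) for one scale factor band."""
    errors = [v - e for v, e in zip(band_values, band_estimates)]
    error_energy = sum(e * e for e in errors)
    signal_energy = sum(v * v for v in band_values)
    if error_energy < signal_energy:
        return True, errors          # prediction pays off
    return False, list(band_values)  # prediction would cost bits


tonal = choose_band_payload([1.0, 0.9, 0.8], [0.95, 0.92, 0.79])
noisy = choose_band_payload([0.3, -0.7, 0.1], [-0.6, 0.5, -0.4])
```

The first element of the returned pair corresponds to the side information for the band; the second is what would then be quantised and coded.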
If intensity stereo coding and prediction are to be used simultaneously in coding stereo audio spectral values, the following problem arises. The intensity stereo algorithm takes place before the prediction. This sequence is inevitable, as there would be no sense in producing a joint energy envelope for both channels and intensity direction information from prediction error signals. As already mentioned, the right channel containing the stereo audio spectral values 30 is set to zero in IS coding, as shown in FIG. 2a or 2b. These zero values are the input values of a predictor for the right channel. As a predictor calculates an estimated value from at least one preceding input value, what happens when the processing switches from a scale factor band 28 below the intensity stereo frequency limit 32 to a band above that limit is that the stereo audio spectral values 30 of the right channel above the IS frequency limit (which is not always the same and has to be continuously determined dependent on the audio signal) abruptly become zero. The predictor however will still transmit non-zero estimates for a certain time, and the error signal to be coded will therefore also be non-zero. As a result non-zero audio spectral values would have to be transmitted above the IS frequency limit in the right channel 12, leading to a breach of the conditions for the actual intensity stereo process.
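The effect may be made concrete with a small Python sketch. For brevity it assumes a second-order predictor with fixed coefficients that predicts the tonal signal cos(0.3·t) exactly, rather than an adaptive one; the signal and the variable names are chosen for illustration only.

```python
import math

# Fixed coefficients with x[t] = A1 * x[t-1] + A2 * x[t-2] for cos(0.3 * t).
A1, A2 = 2.0 * math.cos(0.3), -1.0

# Right-channel values of one spectral line: tonal for 10 blocks, then the
# band becomes IS coded and the encoder sets the values to zero.
right = [math.cos(0.3 * t) for t in range(10)] + [0.0] * 4

errors = {}
for t in range(2, len(right)):
    estimate = A1 * right[t - 1] + A2 * right[t - 2]
    errors[t] = right[t] - estimate

for t in sorted(errors):
    print(t, round(errors[t], 6))
```

While the signal is tonal the error is practically zero. In the first two blocks after the values are set to zero the predictor still delivers non-zero estimates from its stored input values, so the error signal is non-zero although the input is zero.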
Possibly because of the said difficulty in encoding and decoding audio signals with simultaneous use of prediction and intensity stereo coding, the two processes have not hitherto been used together, although simultaneous use of prediction and the IS process would assist in further compression of the data to be encoded. If the intensity stereo process is not used at all, to enable prediction to be carried out without any problems, the above-mentioned advantages of intensity stereo coding for data compression are not exploited. Another possibility would be to compress only the stereo audio spectral values below the intensity stereo frequency limit by means of a predictor, and to encode the stereo audio spectral values above the IS frequency limit 32 exclusively by the intensity stereo process. Prediction of the stereo audio spectral values 30 in the left channel above the intensity stereo frequency limit would however allow additional compression of the data to be encoded.
The problem of the invention is to provide a method of coding stereo audio spectral values, a method of decoding coded stereo audio spectral values, an apparatus for coding stereo audio spectral values or an apparatus for decoding coded stereo audio spectral values, wherein increased data compression is possible.
In accordance with a first aspect of the present invention, this problem is solved by a method of coding stereo audio spectral values, to obtain coded stereo audio spectral values, comprising the following steps: grouping the stereo audio spectral values in scale factor bands, with which scale factors are associated; intensity stereo coding the stereo audio spectral values in at least one of the scale factor bands, whereby one channel has intensity stereo coded stereo audio spectral values and another channel has stereo audio spectral values with a value of substantially zero; if the stereo audio spectral values in a scale factor band are intensity stereo coded, intensity stereo decoding the intensity stereo coded stereo audio spectral values of one channel in the scale factor band, to obtain intensity stereo decoded stereo audio spectral values for the other channel; making a first prediction with the intensity stereo decoded stereo audio spectral values of the other channel in the scale factor band, the results of the first prediction not being taken into account when the stereo audio spectral values of the other channel are coded; if the stereo audio spectral values in a scale factor band are not intensity stereo coded, making the first prediction with the stereo audio spectral values of the other channel in the scale factor band, to obtain the coded stereo audio spectral values of the other channel.
In accordance with a second aspect of the present invention, this problem is solved by a method of decoding stereo audio spectral values which are coded partly by the intensity stereo process and partly by means of a first and a second prediction and which have side information, comprising the following steps: ascertaining the presence of intensity stereo coding or of the first or second prediction of the stereo audio spectral values, which are grouped in scale factor bands, for each individual band on the basis of the side information; making a prediction corresponding to the second prediction, with stereo audio spectral values coded by means of the second prediction, in one channel, in order to cancel the second prediction; if there is intensity stereo coding in a scale factor band, carrying out intensity stereo decoding of the intensity stereo coded stereo audio spectral values of the one channel, to form intensity stereo decoded stereo audio spectral values for the other channel; making the prediction corresponding to the first prediction, with the intensity stereo decoded stereo audio spectral values of the other channel, the results of the prediction not being taken into account with decoded stereo audio spectral values of the other channel; if there is no intensity stereo coding in a scale factor band, making the prediction corresponding to the first prediction, in the other channel to form the decoded stereo audio spectral values of the other channel.
In accordance with a third aspect of the present invention, this problem is solved by an apparatus for coding stereo audio spectral values, comprising: a means for grouping the stereo audio spectral values in scale factor bands, with which scale factors are associated; a means for intensity stereo coding the stereo audio spectral values in at least one of the scale factor bands, whereby one channel has intensity stereo coded stereo audio spectral values and the other channel has stereo audio spectral values with a value of substantially zero; an intensity stereo decoder for decoding the intensity stereo coded stereo audio spectral values in a scale factor band; and a first predictor in the other channel, which has first, second and third switches, the first, second and third switches being in a first state when intensity stereo coded stereo audio spectral values are present, and the first, second and third switches being in a second state when no intensity stereo coded stereo audio spectral values are present; wherein the first predictor makes a first prediction with the stereo audio spectral values of the other channel, decoded by the intensity stereo decoder, when the first, second and third switches are in the first state, but wherein the results of the prediction are not taken into account with the coded stereo audio spectral values owing to the position of the first switch; and wherein the predictor makes the first prediction of the stereo audio spectral values in the scale factor band, to obtain the coded stereo audio spectral values of the other channel, when the first, second and third switches are in the second state.
In accordance with a fourth aspect of the present invention, this problem is solved by an apparatus for decoding stereo audio spectral values which are coded at least partly by the intensity stereo process and a first and a second prediction and which have side information, comprising: a first re-predictor for one channel of stereo audio spectral values with an input and an output; a second re-predictor for another channel of stereo audio spectral values with an input and an output; an intensity stereo decoder with an input and an output; a first change-over means in the other channel for connecting the output of the intensity stereo decoder to the input of the second re-predictor when intensity stereo coded stereo audio spectral values are present, and for connecting the input of the second re-predictor to the other channel of stereo audio spectral values when those values are not stereo intensity coded; and a second change-over means in the other channel for connecting the output of the second re-predictor in the other channel to an output for decoded stereo audio spectral values of the other channel, when the stereo audio spectral values are not intensity stereo coded, and for connecting the input of the second predictor in the other channel to the output for the decoded stereo audio spectral values of the other channel, when those values are intensity stereo coded.
The invention is based on the discovery that increased data compression is made possible by joint use of prediction and stereo intensity coding of stereo audio spectral values, and for this purpose, in a further discovery of the invention, prediction has to be deactivated for the right channel if intensity stereo coding for stereo audio spectral values is activated in the corresponding scale factor band. To allow further appropriate adaptation of the prediction however, so that correct prediction values can be delivered in the case of stereo audio spectral values not coded by intensity stereo coding, the right channel predictor, which has a zero spectrum with intensity stereo coded stereo audio spectral values, must also be supplied with the intensity stereo decoded stereo audio spectral values for the right channel. If this is not done the predictor becomes maladjusted, with the result that the data compression gained by prediction drops considerably for a time when IS coding is disconnected.
For the right channel, with coding by the stereo intensity process, the right channel predictor must thus also keep operating to some degree; it is fed with the intensity stereo decoded stereo audio spectral values of the right channel. However the results of the right channel prediction must not be considered for coding the stereo audio spectral values, in order to fulfil the intensity stereo condition that the stereo audio spectral values for a scale factor band above the intensity stereo frequency limit should be set to zero.
Accordingly further adaptation of the prediction must always be possible, i.e. updating of the prediction coefficients must be possible when IS coded scale factor bands with a certain mean frequency alternate with non-IS coded scale factor bands with substantially the same mean frequency. This may e.g. be the case if the IS frequency limit varies from one block to the next, or if a scale factor band above the IS frequency limit is IS coded in one block and not IS coded in a succeeding block as shown in FIG. 2b.
It has also already been mentioned that prediction can be deactivated for individual scale factor bands in the case of highly non-tonal signals. But if, for example, audience applause ends, the signals in that scale factor band may be tonal again, and hence prediction should be reactivated in the block which then has to be coded. Here again prediction must be capable of further adaptation, so that it supplies small prediction error signals, for high data compression, immediately after activation. In this application the terms "activation", "deactivation" or "connection" and "disconnection" of the prediction are accordingly used in the sense that the predictor continues to be fed with input values and makes a prediction to enable it to update its prediction coefficients, but that the results of the prediction are not considered in the encoded signals.
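The behaviour described for the right channel may be sketched as follows in Python for one spectral line over successive blocks. This is a sketch under stated assumptions and not the claimed apparatus: the normalised LMS predictor, all names, the fixed is_pos value and the omission of quantisation are hypothetical, and the joint signal I is assumed to be available to the decoder without loss from the other channel.

```python
import math


class Predictor:
    """Second-order backward-adaptive predictor (normalised LMS, assumed)."""

    def __init__(self, step=0.5):
        self.coeffs = [0.0, 0.0]
        self.history = [0.0, 0.0]
        self.step = step

    def predict(self):
        return sum(c * h for c, h in zip(self.coeffs, self.history))

    def update(self, value):
        error = value - self.predict()
        energy = sum(h * h for h in self.history) + 1e-9
        self.coeffs = [c + self.step * error * h / energy
                       for c, h in zip(self.coeffs, self.history)]
        self.history = [value, self.history[0]]


def k_right(is_pos):
    """Right-channel factor kR = 1/(1 + is_ratio); is_pos 6 is taken as 0."""
    if is_pos == 6:
        return 0.0
    return 1.0 / (1.0 + math.tan(is_pos * math.pi / 12.0))


def encode_right(right, joint, is_coded, is_pos):
    """Return the transmitted right-channel values and the final predictor."""
    predictor, transmitted = Predictor(), []
    for r, i, coded, pos in zip(right, joint, is_coded, is_pos):
        estimate = predictor.predict()
        if coded:
            transmitted.append(0.0)             # IS condition: zero spectrum
            predictor.update(i * k_right(pos))  # adapt on IS decoded value
        else:
            error = r - estimate
            transmitted.append(error)
            predictor.update(estimate + error)
    return transmitted, predictor


def decode_right(transmitted, joint, is_coded, is_pos):
    """Return the decoded right-channel values and the final predictor."""
    predictor, decoded = Predictor(), []
    for value, i, coded, pos in zip(transmitted, joint, is_coded, is_pos):
        estimate = predictor.predict()
        # In IS coded blocks the estimate is computed but not used.
        output = i * k_right(pos) if coded else estimate + value
        decoded.append(output)
        predictor.update(output)
    return decoded, predictor


blocks = range(12)
left = [math.cos(0.3 * t) for t in blocks]
right = [0.5 * math.cos(0.3 * t) for t in blocks]
joint = [l + r for l, r in zip(left, right)]
is_coded = [4 <= t < 8 for t in blocks]  # IS active in blocks 4 to 7
is_pos = [4] * len(left)                 # illustrative fixed position
transmitted, enc_predictor = encode_right(right, joint, is_coded, is_pos)
decoded, dec_predictor = decode_right(transmitted, joint, is_coded, is_pos)
```

In the IS coded blocks the transmitted right-channel values remain zero, while both predictors continue to be updated with the same intensity stereo decoded values; their coefficients therefore stay identical, and prediction can be resumed in block 8 without re-adaptation.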