Modern audio coding methods process time-discrete audio samples to generate a bit stream which is compressed relative to the original audio signal. The stream of time-discrete audio samples is first windowed so as to generate successive blocks of windowed audio samples. Further processing takes place block by block. A block of windowed audio samples is typically converted into a spectral representation by means of an analysis filter bank. The spectral representation comprises adjacent spectral values from the frequency 0 up to the maximum audio frequency, which may e.g. be 16 kHz.
The audio spectral values are grouped into scale factor bands and quantized. The quantization is dimensioned such that the quantization noise it introduces is masked by the audio signal. To this end a psychoacoustic model is used which, on the basis of the audio signal, supplies for each scale factor band an energy value which indicates the energy level up to which the quantization noise is masked, i.e. will not be audible in the decoded audio signal. If the quantization noise introduced by the quantizer exceeds the psychoacoustic masking threshold, the decoded audio signal will contain audible interference. The quantization steps of the quantizer are calculated in accordance with the masking threshold. When the quantization steps have been calculated, the audio spectral values are quantized with these quantization steps to obtain quantized audio spectral values.
For reasons of data efficiency the quantized audio spectral values are subjected to entropy coding, e.g. Huffman coding, to generate a bit stream with code words representing the audio spectral values. Side information is added to the stream of code words using a bit stream multiplexer. This side information contains, inter alia, the scale factors on the basis of which an audio decoder can ascertain the quantization steps which have been used in the encoder.
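The relationship between the masking threshold and the quantization step can be illustrated with a minimal sketch. It assumes a plain uniform quantizer whose noise power per spectral value is approximately step²/12, which is a textbook approximation and not the non-uniform quantizer or iteration loop of any particular codec. The function names and the numerical values are hypothetical.

```python
import math

def step_from_threshold(masked_energy, n_lines):
    """Step size at which the expected noise energy of a band equals
    masked_energy, assuming a noise power of step**2 / 12 per spectral line."""
    return math.sqrt(12.0 * masked_energy / n_lines)

def quantize_band(values, step):
    """Encoder side: map each spectral value to an integer index."""
    return [round(v / step) for v in values]

def dequantize_band(indices, step):
    """Decoder side: reconstruct spectral values from the indices."""
    return [i * step for i in indices]

# One hypothetical scale factor band with four spectral values.
band = [10.3, -7.8, 4.1, 0.6]
masked_energy = 1.2  # hypothetical value supplied by a psychoacoustic model

step = step_from_threshold(masked_energy, len(band))
indices = quantize_band(band, step)
decoded = dequantize_band(indices, step)
noise_energy = sum((a - b) ** 2 for a, b in zip(band, decoded))

print("step:", step)
print("indices:", indices)
print("noise energy:", noise_energy, "masked energy:", masked_energy)
```

The step size plays the role of the information carried by the scale factor: the decoder needs it, and only it, to turn the transmitted indices back into spectral values. Because step²/12 is only an expected value, the noise energy of an individual band can lie above or below the threshold in this sketch.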
Audio decoding entails splitting the bit stream into code words on the one hand and side information on the other using a bit stream demultiplexer. First, the entropy coding is reversed. The entropy-decoded values, i.e. the quantized audio spectral values, are then subjected to an inverse quantization so as to obtain inverse-quantized spectral values. These are then converted from the frequency domain to the time domain using a synthesis filter bank. The decoded audio signal is then present at the output of the synthesis filter bank.
It should be noted that the coding method described here is lossy, since quantization has been performed in the encoder. The decoded audio signal does not correspond exactly to the original audio signal. If encoding and decoding were successful, the subjective impression made on the listener by the decoded audio signal will, however, correspond to that made by the original audio signal, since the quantization noise introduced in the encoder by the quantizer is masked, i.e. it is “hidden” below the psychoacoustic masking threshold.
For reasons of data efficiency the quantization steps should preferably be as large as possible. On the other hand, if the quantization steps are too large, so too will be the quantization noise, which can manifest itself as audible interference in the decoded signal. Modern audio coding methods strive for an optimal compromise between these two requirements.
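The compromise between step size and noise can be made concrete with a rough sketch. It uses the empirical entropy of the quantizer indices as a stand-in for the bit demand of an entropy coder (an assumption made for illustration; a real Huffman coder with fixed code books behaves differently), and Gaussian random numbers as hypothetical spectral values.

```python
import math
import random
from collections import Counter

def entropy_bits(symbols):
    """Empirical entropy in bits per symbol of a list of quantizer indices."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def rate_and_noise(values, step):
    """Return (bits per value, mean squared quantization error) for one step size."""
    indices = [round(v / step) for v in values]
    noise = sum((v - i * step) ** 2 for v, i in zip(values, indices)) / len(values)
    return entropy_bits(indices), noise

rng = random.Random(0)  # fixed seed so the run is repeatable
values = [rng.gauss(0.0, 10.0) for _ in range(1000)]  # hypothetical spectral values

results = {step: rate_and_noise(values, step) for step in (0.5, 2.0, 8.0)}
for step, (bits, noise) in results.items():
    print(f"step {step}: {bits:.2f} bits/value, noise power {noise:.3f}")
```

Under these assumptions a larger step lowers the estimated bit demand and raises the noise power, which is the conflict the encoder has to resolve.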
The psychoacoustic masking threshold of an audio signal section depends on the actual input audio signal. If the audio signal changes with time, so too do the masking properties. For reasons of data efficiency it is preferable that as much quantization noise as possible should be introduced into the audio signal, i.e. the quantization noise should correspond as closely as possible to the psychoacoustic masking threshold. Audio signal sections with good masking properties can then be encoded with relatively few bits, whereas audio signal sections with relatively poor masking properties, e.g. tonal audio signal sections, must be quantized very finely, which means that a large number of bits must be expended in order to encode them. An encoder which tries to introduce just that amount of interference which is permitted by the masking threshold will therefore generate an audio signal of constant quality. Owing to the time-variant nature of the input signal, however, this leads to a variable bit rate at the output of the encoder. Although encoding with constant quality, and thus with a variable bit rate, is attractive as regards both data efficiency and audio quality, this concept is disadvantageous in that it is only suitable for applications which support a variable transmission rate, e.g. the storage of compressed audio signals or their transmission over packet-based networks such as the internet.
However, many applications require an audio encoder with a constant transmission rate. Owing to the time-variant nature of the spectral and temporal properties of an audio signal, this necessarily entails a variable quality. In particular, depending on the bit rate, it may happen that sections of the audio signal which have relatively poor masking properties cannot be quantized finely enough, i.e. are under-encoded, and may contain audible interference in the decoded signal, while easily encodable sections, i.e. audio signal sections with good masking properties, have to be encoded more precisely than necessary, i.e. are over-encoded.
To avoid the disadvantages of over-encoding and under-encoding, a bit banking function is normally employed. The bit bank (also referred to as a bit reservoir) is filled when easily encodable audio sections are encoded. The bits which are not required to encode these easily encodable sections are not simply “wasted” through an unnecessarily fine quantization; instead a coarser quantization is used and the superfluous bits are “parked” in the bit bank.
If, on the other hand, audio sections which are difficult to encode are to be processed, i.e. sections which require a smaller quantization step width than the required constant average data rate would permit, the bit bank is “emptied” so as to achieve a finer quantization than would otherwise be possible at the required data rate, thus ensuring that there is no audible interference in the decoded audio signal in these sections either. The bit banking function thus serves as a buffer which transforms an “inner” audio encoder with a variable bit rate into an “outer” audio encoder with a constant bit rate.
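The bit banking function described above can be sketched as a simple per-frame account. This is a simplified model under stated assumptions: the frame budget, the bank capacity, the per-frame bit demands and the function name `run_bit_bank` are all hypothetical, and the rule that bits exceeding the capacity are spent immediately is one possible policy rather than that of a particular encoder.

```python
def run_bit_bank(demands, frame_budget, capacity):
    """Distribute a constant per-frame budget over frames with varying demand.

    demands      -- bits each frame would need to keep the noise masked
    frame_budget -- bits delivered per frame by the constant-rate channel
    capacity     -- maximum number of bits the bank may hold
    Returns the bits actually spent per frame and the final bank level.
    """
    bank = 0
    spent = []
    for demand in demands:
        available = frame_budget + bank
        used = min(demand, available)   # cannot spend more than is available
        bank = available - used         # park the unused bits
        if bank > capacity:
            used += bank - capacity     # bank is full: spend the excess now
            bank = capacity
        spent.append(used)
    return spent, bank

# Two easy frames fill the bank, two difficult frames draw it down.
demands = [600, 700, 1500, 1400, 800]
spent, bank = run_bit_bank(demands, frame_budget=1000, capacity=1000)
print(spent, bank)
```

In this run the third frame receives its full 1500 bits although the channel delivers only 1000 per frame, while the fourth frame finds the bank nearly empty and obtains 1200 of the 1400 bits it would need. The bits spent plus the bits left in the bank always equal the number of frames times the frame budget, which reflects the constant outer bit rate.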
The distribution of music, e.g. over the internet, is developing into an increasingly important technology. Most music content is compressed to save storage space and to speed up transmission over channels with limited bandwidth. Supervising the use of the musical items distributed in transmission networks, or tracing illegal copies of them, is, however, an ever-increasing problem. While, on the one hand, wide distribution of audio items is desirable, copyrights must nevertheless be respected. In this context watermarking constitutes a useful mechanism for tracing illegal copies or for incorporating copyright information, or intellectual property information quite generally, into the audio items.
Incorporating watermarks into uncompressed multimedia data such as pictures, video, audio etc. is known. Incorporating watermarks into compressed material, however, requires a fast, quality-preserving watermarking method.
The technical publication “Audio Watermarking of MPEG-2-AAC Bit Streams”, Christian Neubauer, Jürgen Herre, 108th AES Convention, Paris 2000, Preprint 5101 first teaches that a spectral representation of an audio signal is generated. A spread and spectrally transformed watermark signal is then added to this. A new bit stream is generated from the sum signal through quantization and Huffman coding. This so-called bit stream watermarking method is characterized by a low degree of computational complexity since it is not necessary to fully decode the bit stream which is to be provided with a watermark. This method is also advantageous in that it provides high audio quality since the quantization noise and the watermark noise can be coordinated with each other if the energy introduced into the audio signal by the watermark lies below the psychoacoustic masking threshold. The method is also characterized by a high degree of robustness, since the watermark cannot be removed from the decoded audio signal by an illegal distributor of the audio signal without detracting from the audio quality.
A disadvantage of the cited method is, however, that the quantization of the watermark-bearing signal may result in the watermark being quantized out or weakened. This is due to the fact that the energy of the watermark signal sometimes lies in the range of the quantization interval. Furthermore, the method provides only limited control over the interference introduced by the watermark, which may result in a loss of audio quality.
A further watermarking method is the embedding of the watermark during the compression of the audio signal. This concept is described in the technical publication “Combined Compression/Watermarking for Audio Signals”, Frank Siebenhaar, Christian Neubauer and Jürgen Herre, 110th AES Convention, 12th to 15th May 2001, Amsterdam, Preprint 5344. An uncompressed audio signal is first presented to a psychoacoustic model to determine the masking threshold. The audio signal is then transformed into the frequency domain. The spread watermark signal, in its spectral representation, is weighted in accordance with the masking threshold in the frequency domain and added to the spectrum of the input audio signal. The parameters for the quantization are determined in accordance with the masking threshold, whereupon the watermark-bearing signal is quantized and encoded. This method too is characterized by a low degree of computational complexity, since combining the embedding of the watermark and the encoding means that certain operations, e.g. the calculation of the masking model and the transformation of the audio signal into a spectral representation, only have to be performed once. The method also normally provides good audio quality since quantization noise and watermark noise can be matched to each other.
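The weighting and addition step of such a combined scheme might look like the following sketch. The cited publication is not reproduced here: the chip sequence drawn from a keyed pseudo-random generator, the parameter `alpha` (fraction of the masked energy granted to the watermark), the band layout and the function name `embed_bit` are all assumptions made for illustration.

```python
import math
import random

def embed_bit(spectrum, bands, masked_energy, bit, key, alpha=0.5):
    """Add one spread watermark bit (+1 or -1) to a block of spectral values.

    bands         -- list of (start, stop) index pairs, one per scale factor band
    masked_energy -- masked noise energy per band from a psychoacoustic model
    alpha         -- hypothetical share of the masked energy used by the watermark
    """
    rng = random.Random(key)  # the key selects the pseudo-random chip sequence
    marked = list(spectrum)
    for (start, stop), threshold in zip(bands, masked_energy):
        # Per-line amplitude such that the watermark energy in this band
        # equals alpha * threshold.
        amplitude = math.sqrt(alpha * threshold / (stop - start))
        for k in range(start, stop):
            chip = rng.choice((-1.0, 1.0))
            marked[k] += bit * chip * amplitude
    return marked

# Hypothetical block with two bands of four spectral lines each.
spectrum = [12.0, -8.0, 4.0, 20.0, 3.0, -1.5, 0.7, 2.2]
bands = [(0, 4), (4, 8)]
masked_energy = [0.8, 0.2]

marked = embed_bit(spectrum, bands, masked_energy, bit=+1, key=1234)
watermark = [m - s for m, s in zip(marked, spectrum)]
band_energy = [sum(w * w for w in watermark[a:b]) for a, b in bands]
print(band_energy)
```

With this weighting the watermark energy in each band stays a fixed fraction below the masked energy, so the same threshold can be used both for shaping the watermark and for choosing the quantization parameters.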
A disadvantage of this method is, as above, that the quantization of the watermark-bearing signal may result in the watermark being quantized out or weakened. This is again due to the fact that the energy of the watermark signal sometimes lies in the range of the quantization interval. Furthermore, the method provides only limited control over the interference introduced by the watermark, which may result in a loss of audio quality.
If the spectral representation of the audio signal is examined, a plurality of audio spectral values can be seen. The spread watermark signal is likewise characterized by a plurality of spectral lines. To prevent the watermark from producing audible interference in the decoded audio signal, however, the height of the watermark spectral lines is considerably less than the height of the audio signal spectral lines. After the watermark spectrum has been added to the audio spectrum, the combined spectrum is only a slight modification of the original spectrum. The subsequent quantization of the combined spectrum will then irretrievably remove the watermark if the quantization step width is greater than the height of the watermark spectral lines which are quantized with this quantization step width. If too many watermark spectral lines are “quantized out” in this way, the watermark detector can no longer extract an unambiguous watermark.
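The effect of “quantizing out” can be reproduced with a few numbers. The sketch below assumes a uniform quantizer and hand-picked hypothetical values; it counts the spectral lines on which the watermark changes the quantizer index, since only those lines still carry watermark information after quantization.

```python
def quantize(values, step):
    """Uniform quantizer: map each spectral value to an integer index."""
    return [round(v / step) for v in values]

def surviving_lines(audio, watermark, step):
    """Count lines whose quantizer index differs with and without the watermark."""
    plain = quantize(audio, step)
    marked = quantize([a + w for a, w in zip(audio, watermark)], step)
    return sum(1 for p, m in zip(plain, marked) if p != m)

audio = [12.0, -8.0, 4.0, 20.0]      # hypothetical audio spectral lines
watermark = [0.3, -0.3, 0.3, -0.3]   # much smaller watermark spectral lines

coarse = surviving_lines(audio, watermark, step=4.0)   # step far above 0.3
fine = surviving_lines(audio, watermark, step=0.25)    # step below 0.3
print("lines carrying the watermark, coarse step:", coarse)
print("lines carrying the watermark, fine step:", fine)
```

With the coarse step every marked value falls into the same quantization interval as the unmarked value, so the quantized spectrum is identical to that of the unmarked signal. In general a line that happens to lie close to a decision boundary may still change its index even for a large step, so the number of surviving lines depends on the signal as well as on the step width.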