In contrast to lossy audio coding techniques (like mp3, AAC etc.), lossless compression algorithms can only exploit redundancies of the original audio signal to reduce the data rate. It is not possible to rely on irrelevancies, as identified by psycho-acoustical models in state-of-the-art lossy audio codecs. Accordingly, the common technical principle of all lossless audio coding schemes is to apply a filter or transform for de-correlation (e.g. a prediction filter or a frequency transform), and then to encode the transformed signal in a lossless manner. The encoded bit stream comprises the parameters of the transform or filter, and the lossless representation of the transformed signal.
See, for example, J. Makhoul, “Linear prediction: A tutorial review”, Proceedings of the IEEE, Vol. 63, pp. 561-580, 1975, T. Painter, A. Spanias, “Perceptual coding of digital audio”, Proceedings of the IEEE, Vol. 88, No. 4, pp. 451-513, 2000, and M. Hans, R. W. Schafer, “Lossless compression of digital audio”, IEEE Signal Processing Magazine, July 2001, pp. 21-32.
The basic principle of lossy based lossless coding is as follows: In the encoding section, a PCM audio input signal SPCM passes through a lossy encoder, and the resulting lossy bit stream is fed both to a lossy decoder within the encoding section and to a lossy decoder in the decoding section. Lossy encoding and decoding thereby serve to decorrelate the signal. The output signal of the encoding section lossy decoder is subtracted from the input signal SPCM, and the resulting difference signal passes through a lossless encoder, whose output is transmitted as an extension bit stream to a lossless decoder in the decoding section. The output signals of the decoding section lossy decoder and lossless decoder are combined so as to regain the original signal SPCM.
This basic principle is disclosed in EP-B-0756386 and U.S. Pat. No. 6,498,811, and is also discussed in P. Craven, M. Gerzon, “Lossless Coding for Audio Discs”, J. Audio Eng. Soc., Vol. 44, No. 9, September 1996, and in J. Koller, Th. Sporer, K. H. Brandenburg, “Robust Coding of High Quality Audio Signals”, AES 103rd Convention, Preprint 4621, August 1997. In more detail, in the lossy encoder the PCM audio input signal SPCM passes through an analysis filter bank and a quantisation of sub-band samples to a coding and bit stream packing, wherein the quantisation is controlled by a perceptual model calculator that receives signal SPCM and corresponding information from the analysis filter bank.
At the decoder side, the encoded lossy bit stream is depacked, the lossy decoder decodes the sub-band samples, and a synthesis filter bank outputs the decoded lossy PCM signal.
Examples of lossy encoding and decoding are described in detail in the standard ISO/IEC 11172-3 (MPEG-1 Audio).
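The lossy based lossless principle described above can be illustrated with a minimal sketch. It assumes a hypothetical stand-in for the lossy codec (coarse uniform quantisation with a step size `step`) rather than a real perceptual coder such as MPEG-1 Audio, and it leaves the difference signal unencoded where a real system would apply a lossless encoder. All function names are illustrative.

```python
def lossy_encode(samples, step=16):
    """Hypothetical lossy encoder: coarse uniform quantisation of integer PCM."""
    return [round(s / step) for s in samples]


def lossy_decode(codes, step=16):
    """Hypothetical lossy decoder matching lossy_encode."""
    return [c * step for c in codes]


def encode(samples, step=16):
    """Encoding section: returns (base layer, difference signal)."""
    base = lossy_encode(samples, step)
    # The encoder runs its own lossy decoder so that it sees exactly
    # what the decoding section will reconstruct from the base layer.
    decoded = lossy_decode(base, step)
    residual = [s - d for s, d in zip(samples, decoded)]
    return base, residual


def decode(base, residual, step=16):
    """Decoding section: lossy decoder output plus difference signal."""
    decoded = lossy_decode(base, step)
    return [d + r for d, r in zip(decoded, residual)]
```

Decoding the base layer alone yields an approximation of the input, while adding the difference signal restores the input exactly. The difference signal has a much smaller value range than the input, which is what makes it cheaper to encode losslessly.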
The two or more different signals or bit streams resulting from the encoding are to be combined so as to form a single output signal. Similar solutions exist, for example, for MPEG Surround, mp3PRO and AAC+. For the two latter examples the additional amount of data (SBR information) to be added to the base layer data stream (AAC or mp3) is small. Therefore this additional information can be packed into a standard-conforming AAC or mp3 bit stream, e.g. as ‘ancillary data’. Although the additional amount of data for the surround information is bigger than that for the SBR information, these data can still be packed into a standard-conforming bit stream in the same way.
Another application using similar techniques is the ID3 tag added to mp3 standard audio streams, as described in http://www.id3.org. The data is added at the beginning or end of the existing mp3 file. A special mechanism is used so that an mp3 decoder does not try to decode this additional information.
However, for lossy based lossless coding the amount of additional information is several times the amount of base layer data. Therefore the additional data cannot be packed completely into the base layer data stream, e.g. as ancillary data. The at least two data streams resulting from the combination of a lossy coding format with a lossless coding extension are the base layer containing the lossy coding information (e.g. a standard coding algorithm) and the enhancement data stream for rebuilding the mathematically lossless original input signal. Furthermore, several intermediate layers are possible, each with its own data stream. However, these data streams are not independent. Every higher layer depends on the lower layers and can only be reasonably decoded in combination with these lower layers.
More generally, data formats use hierarchical layers, with a base layer BL and one or more enhancement layers EL. Data within a layer are often packetised, i.e. organised in packets or frames. While the BL signal alone can be decoded to obtain reproducible multimedia data and comprises all information for a basic decoding, the EL signal comprises additional information that cannot be decoded alone to obtain useful multimedia data. Instead, the EL data are tightly coupled to the BL data and can be used only together with the BL data. Usually the BL and EL data are added to or superposed on each other, either for a common decoding or after their individual decoding. In either case it is necessary to synchronise the EL data to the BL data because otherwise the EL data will not represent useful information.
It is desirable to keep the data rate as low as possible, which requires sophisticated data compression methods. Variable length coding VLC is used for coding data words whose values are not uniformly distributed. Data words that appear more frequently, i.e. with higher probability, are encoded into shorter code words, while data words that appear with lower probability are encoded into longer code words. Thus, the average number of bits in encoded messages is smaller than with a constant code word length. However, high-compression processing using e.g. VLC is more sensitive to bit errors, which may lead to a complete data loss. In particular for VLC, following a loss of synchronisation it is impossible to determine which bits belong to which code word.
A known solution for limiting possible data loss is the insertion of unique synchronisation words that can be recognised with very high probability. However, such synchronisation words increase the data rate, and the more synchronisation words are used, the higher the data rate becomes.
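The effect of variable length coding on the average code word length can be sketched with a Huffman code, which is one well-known VLC construction; the text above does not prescribe a particular one, so its use here is an assumption, as are all function names.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Build a prefix code: frequent symbols receive shorter code words."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # Degenerate case: a single distinct symbol still needs one bit.
        return {next(iter(freq)): "0"}
    # Heap entries: (count, unique tie-breaker, {symbol: partial code word}).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c0, _, t0 = heapq.heappop(heap)
        c1, _, t1 = heapq.heappop(heap)
        # Merge the two least frequent subtrees, prefixing their code words.
        merged = {s: "0" + w for s, w in t0.items()}
        merged.update({s: "1" + w for s, w in t1.items()})
        heapq.heappush(heap, (c0 + c1, next_id, merged))
        next_id += 1
    return heap[0][2]


def vlc_encode(symbols, code):
    """Concatenate code words into one bit string (no separators)."""
    return "".join(code[s] for s in symbols)


def vlc_decode(bits, code):
    """Decode from the first bit; relies on knowing where a code word starts."""
    inverse = {w: s for s, w in code.items()}
    out, word = [], ""
    for b in bits:
        word += b
        if word in inverse:
            out.append(inverse[word])
            word = ""
    return out
```

For 16 data words with the value counts 8, 4, 2 and 2, a constant code word length of 2 bits needs 32 bits, whereas this code needs 28 bits. Because the code words are concatenated without separators, a decoder that starts at the wrong bit position has no way of locating code word boundaries, which is the synchronisation problem described above.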
Another challenge is to search for or seek, as fast as possible, a specific point in time within a running or stored audio program, i.e. to jump directly to a specific frame or sample in a track.
In the following description, ‘seeking’ means searching within an audio bit stream. Seeking is thus a function of the audio decoder that enables a user to skip to a desired position within the encoded signal. Seeking positions are given as a number of samples to skip, as a playback time, or as a percentage of the total duration of a track.
The seeking processing strongly depends on the organisation of the audio format. Most of the established audio formats, like MPEG-1 Layer III or AAC, are streaming formats that are organised in independent frames. The decoder can therefore start decoding at any frame without knowledge of previous frames. For such streaming formats the following two seeking methods can be used.
The first seeking method is based on the condition that each frame has the same length and carries the same number of encoded samples. Then, the seeking position as a percentage of the total playback time is equivalent to the position as a percentage of the total bit stream (file) size. Therefore the decoder converts a desired seeking position into a percentage of the total playback time and starts decoding at the same percentage of the total bit stream length. However, the decoder needs to resynchronise to a bit stream frame located near the seeking position.
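A minimal sketch of this first seeking method follows. It assumes that every frame begins with a fixed synchronisation pattern; the two-byte value `SYNC` and the function name are hypothetical and not taken from any particular format.

```python
SYNC = b"\xff\xfb"  # hypothetical frame synchronisation pattern


def seek_by_fraction(stream: bytes, fraction: float) -> int:
    """Return the byte offset of the first frame at or after the given
    fraction of the stream, or -1 if no frame start follows.

    Only valid if all frames have the same length and carry the same
    number of samples, so that time and byte position are proportional.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie between 0 and 1")
    guess = int(fraction * len(stream))
    # Resynchronise: the guess usually lands inside a frame, so scan
    # forward to the next synchronisation pattern.
    return stream.find(SYNC, guess)
```

The cost of this method is independent of the seeking position. In this simplified form, payload bytes that happen to match the pattern would cause a false synchronisation, which a real decoder would have to guard against, e.g. by validating the frame header that follows.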
A more robust seeking method for frame-based bit streams is to parse frame by frame from the beginning of the stream to the desired position. The number of encoded samples per frame and the length of the frame have to be known, but the frame size and the number of encoded samples per frame can be different for each frame. A drawback of such seeking processing is that the seeking latency depends on the seeking position. The closer the desired seeking position is to the end of the bit stream, the more frames need to be parsed. On architectures with limited processing power the required processing time can cause additional latencies or peaks in the processing load.
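The frame-by-frame method can be sketched as follows, assuming a hypothetical four-byte frame header that carries the frame length in bytes (including the header) and the number of encoded samples, both as big-endian 16-bit values. This header layout and the helper names are illustrative only.

```python
import struct

# Hypothetical header: frame length in bytes (header included), sample count.
HEADER = struct.Struct(">HH")


def make_frame(payload: bytes, n_samples: int) -> bytes:
    """Build one frame with the hypothetical header (for demonstration)."""
    return HEADER.pack(HEADER.size + len(payload), n_samples) + payload


def seek_by_parsing(stream: bytes, target_sample: int):
    """Walk the frame headers from the start of the stream.

    Returns (byte offset of the frame containing target_sample,
             index of the first sample in that frame,
             number of frames skipped).
    """
    offset = 0
    first_sample = 0
    frames_skipped = 0
    while offset + HEADER.size <= len(stream):
        frame_len, n_samples = HEADER.unpack_from(stream, offset)
        if frame_len < HEADER.size:
            raise ValueError("corrupt frame header")
        if target_sample < first_sample + n_samples:
            return offset, first_sample, frames_skipped
        offset += frame_len
        first_sample += n_samples
        frames_skipped += 1
    raise ValueError("target sample lies beyond the end of the stream")
```

The number of skipped frames grows with the target position, which reflects the position-dependent latency noted above.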
In file based formats the size of each frame is unknown and the above-described streaming format frame headers are not used. The decoder can start decoding from the beginning of the file only. Frame Access Tables (FAT), or a cue point table data block representing a frame access table, are used to define designated entry points for seeking within the bit stream. These tables can contain one or more of e.g. block length, interval information in frames, number of table entries, and a pointer table. The cue points define entry points at which decoding can start. Each entry point of the FAT is associated with a designated seeking position, and therefore the decoder can start decoding at each table entry. The seeking accuracy is limited by the number of FAT entries or cue points.
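Seeking with a frame access table can be sketched as a lookup in a sorted list of entry points. The table layout assumed here, a list of `(first_sample, byte_offset)` pairs sorted by `first_sample`, is a simplification chosen for illustration and not the layout of any specific file format.

```python
from bisect import bisect_right


def seek_with_table(cue_points, target_sample):
    """Find the last entry point at or before target_sample.

    cue_points: list of (first_sample, byte_offset), sorted by first_sample.
    Returns (byte offset at which decoding can start,
             number of decoded samples to discard to reach the target).
    """
    starts = [first for first, _ in cue_points]
    i = bisect_right(starts, target_sample) - 1
    if i < 0:
        raise ValueError("target lies before the first entry point")
    first_sample, byte_offset = cue_points[i]
    return byte_offset, target_sample - first_sample
```

The lookup cost does not depend on the seeking position. With few table entries, however, the decoder has to decode and discard many samples between the entry point and the desired position, which is how the number of entries limits the seeking accuracy.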