Scalable coders are shown in EP 0 846 375 B1. In general, scalability is understood as the ability to decode a subset of a bit stream representing a coded data signal, such as, for example, an audio signal or a video signal, into a usable signal. This feature is especially desirable when, for example, a data transmission channel does not offer the full bandwidth required for transferring the complete bit stream. Conversely, it also allows an incomplete decoding on a decoder having a low complexity. In practice, several discrete scalability layers are generally defined.
An example of a scalable coder, as is, for example, defined in subpart 4 (General Audio) of part 3 (Audio) of the MPEG 4 standard (ISO/IEC 14496-3:1999 subpart 4), is shown in FIG. 1. An audio signal s(t) to be coded is fed into the scalable coder on the input side. The scalable coder shown in FIG. 1 comprises a first coder 12 which is an MPEG CELP coder. The second coder 14 is an AAC coder providing a high-quality audio coding and being defined in the MPEG 2 AAC standard (ISO/IEC 13818). The CELP coder 12 provides a first scaling layer via an output line 16 while the AAC coder 14 provides a second scaling layer to a bit stream multiplexer (BitMux) 20 via a second output line 18. On the output side, the bit stream multiplexer then outputs an MPEG 4 LATM bit stream 22 (LATM=Low Overhead MPEG 4 Audio Transport Multiplex). The LATM format is described in section 6.5 of part 3 (Audio) of the first supplement to the MPEG 4 standard (ISO/IEC 14496-3:1999/AMD1:2000).
The scalable audio coder also includes some further elements. First, there are a delay stage 24 in the AAC branch and a delay stage 26 in the CELP branch. By means of the two delay stages an optional delay for the respective branch can be adjusted. A down-sampling stage 28 is downstream of the delay stage 26 of the CELP branch to adapt the sample rate of the input signal s(t) to the sample rate demanded by the CELP coder. An inverse CELP decoder 30 is downstream of the CELP coder 12, the CELP coded/decoded signal being fed to an up-sampling stage 32. The up-sampled signal is then fed to a further delay stage 34, which, in the MPEG 4 standard, is referred to as “Core Coder Delay”.
The CoreCoderDelay stage 34 has the following function. If the delay is set to zero, the first coder 12 and the second coder 14 process exactly the same sample values of the audio input signal in a so-called superframe. A superframe can, for example, consist of three AAC frames which together represent a certain number of sample values no. x to no. y of the audio signal. The superframe further includes, for example, eight CELP blocks which, in the case of CoreCoderDelay = 0, represent the same number of sample values and also the same sample values no. x to no. y.
If, however, a CoreCoderDelay D as a time quantity is set unequal to zero, the three AAC frames nevertheless represent the same sample values no. x to no. y. The eight CELP frames, however, represent sample values no. x−Fs·D to no. y−Fs·D, Fs being the sample frequency of the input signal.
The current time intervals of the input signal in a superframe for the AAC blocks and the CELP blocks can thus either be identical, if CoreCoderDelay D = 0, or, if D is unequal to zero, be shifted relative to one another by CoreCoderDelay. For the subsequent illustrations, however, CoreCoderDelay equaling zero is assumed for reasons of simplicity and without limiting the generality, so that the current time interval of the input signal for the first coder and the current time interval for the second coder are identical. In general, however, the only requirement for a superframe is that the AAC block(s) and the CELP block(s) in a superframe represent the same number of sample values, wherein the sample values themselves do not necessarily have to be identical but can also be shifted relative to one another by CoreCoderDelay.
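By way of illustration, the relationship between the sample ranges covered by the two coders in one superframe may be sketched as follows (the sample numbers and the 48 kHz sample frequency are assumptions for illustration only, not values taken from the standard):

```python
# Illustrative sketch: sample-value ranges covered by the AAC and CELP
# parts of one superframe for a given CoreCoderDelay D (in seconds) and
# sample frequency Fs (in Hz).

def superframe_ranges(x, y, fs, core_coder_delay):
    """Return ((aac_first, aac_last), (celp_first, celp_last)).

    The AAC frames always cover sample values no. x to no. y; the CELP
    frames cover the same range shifted back by Fs * D samples.
    """
    shift = int(fs * core_coder_delay)
    return (x, y), (x - shift, y - shift)

# CoreCoderDelay = 0: both coders see the identical samples.
print(superframe_ranges(0, 3071, 48000, 0.0))    # ((0, 3071), (0, 3071))
# CoreCoderDelay = 10 ms at Fs = 48 kHz: CELP range shifted by 480 samples.
print(superframe_ranges(0, 3071, 48000, 0.010))  # ((0, 3071), (-480, 2591))
```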
It is to be noted that, depending on the configuration, the CELP coder processes a portion of the input signal s(t) faster than the AAC coder 14. In the AAC branch, a block decision stage 26 is downstream of the optional delay stage 24, which, among other things, determines whether short or long windows are to be used for windowing the input signal s(t), wherein short windows are to be selected for strongly transient signals, while long windows are preferred for less transient signals, since, for the latter, the ratio of payload data quantity to side information is better than with short windows.
A fixed delay of, for example, ⅝ of a block is introduced by the block decision stage 26 in the present example. In the art, this is referred to as a look-ahead function. The block decision stage has to look ahead by a certain time in order to be able to determine whether there are transient signals in the future which have to be coded with short windows. Then, both the corresponding signal in the CELP branch and the signal in the AAC branch are fed to means for converting the time representation into a spectral representation, which, in FIG. 1, are referred to as MDCT 36 and 38, respectively (MDCT = Modified Discrete Cosine Transform). The output signals of the MDCT blocks 36, 38 are then fed to a subtracter 40.
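The look-ahead decision may be sketched, in a strongly simplified form, as follows; the energy-ratio criterion, the threshold, and the block length of 1024 samples are assumptions for illustration only, not the decision rule of any particular coder:

```python
# Hedged sketch of a look-ahead block decision: the stage inspects a
# region ahead of the current block and selects short windows when a
# transient (a sudden energy jump) is found there. Only the 5/8-block
# look-ahead length is taken from the text above.

def choose_window(block, lookahead, threshold=8.0):
    """Return 'short' if the look-ahead region is strongly transient."""
    def energy(samples):
        return sum(s * s for s in samples) / max(len(samples), 1)
    e_block = energy(block)
    e_ahead = energy(lookahead)
    # A large energy ratio suggests an attack -> code with short windows.
    return "short" if e_block > 0 and e_ahead > threshold * e_block else "long"

block_len = 1024                    # illustrative block length
lookahead_len = block_len * 5 // 8  # the 5/8-block look-ahead from the text

steady = [0.1] * lookahead_len
attack = [0.1] * (lookahead_len // 2) + [1.0] * (lookahead_len // 2)
print(choose_window([0.1] * block_len, steady))  # long
print(choose_window([0.1] * block_len, attack))  # short
```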
At this point, time-matching sample values have to be present, that is the delay in both branches has to be identical.
The following block 44 establishes whether it is preferable to feed the input signal itself to the AAC coder 14. This is made possible via the bypass branch 42. If it is established, however, that, for example, the difference signal at the output of the subtracter 40 is smaller, as far as its energy is concerned, than the signal output by the MDCT block 38, then not the original signal but the difference signal is coded by the AAC coder 14 in order to finally form the second scaling layer 18. This comparison can be performed band by band, which is indicated by a frequency-selective switching means (FSS) 44. The detailed functions of the individual elements are well known in the art and are, for example, described in the MPEG 4 standard and in further MPEG standards.
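The band-by-band comparison may be sketched as follows; the simple energy criterion and the spectra used are illustrative assumptions, not the exact decision rule of the standard:

```python
# Simplified sketch of the frequency-selective switch (FSS): for each
# spectral band, transmit whichever has the lower energy - the
# difference signal from the subtracter or the original MDCT spectrum.

def fss_select(diff_bands, orig_bands):
    """Per band, pick ('diff', band) or ('orig', band) by energy."""
    def energy(band):
        return sum(c * c for c in band)
    choice = []
    for d, o in zip(diff_bands, orig_bands):
        choice.append(("diff", d) if energy(d) <= energy(o) else ("orig", o))
    return choice

diff = [[0.1, 0.0], [2.0, 2.0]]  # small residual in band 0, large in band 1
orig = [[1.0, 1.0], [0.5, 0.0]]
print([label for label, _ in fss_select(diff, orig)])  # ['diff', 'orig']
```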
An essential feature in the MPEG 4 standard and other coder standards is that the transfer of the compressed data signal is to take place via a channel with a constant bit rate. All high-quality audio codecs operate in a block-based way, that is, they process blocks of audio data (on the order of 480 to 1024 samples) into parts of a compressed bit stream which are also referred to as frames. The bit stream format thus has to be built up in such a way that a decoder without a priori information on where a frame starts is able to recognize the beginning of a frame in order to start outputting the decoded audio signal data with the smallest delay possible. Thus, each header data block or determination data block of a frame begins with a certain synchronization word which can be searched for in a continuous bit stream. Further conventional components in the data stream, apart from the determination data block, are the main data or “payload data” of the individual layers, in which the actual compressed audio data is contained.
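Searching a continuous bit stream for a synchronization word may be sketched as follows; the 12-bit all-ones sync word is the one used by ADTS, and representing the stream as a list of bits is a simplification for illustration:

```python
# Illustrative sketch: locating the start of a frame in a continuous
# bit stream by searching for a synchronization word.

def find_sync(bits, sync_word="111111111111"):
    """Return the bit offset of the first sync word, or -1 if absent."""
    stream = "".join(str(b) for b in bits)
    return stream.find(sync_word)

# A decoder tuning in mid-stream: some unrelated bits, then a frame header.
bits = [0, 1, 0, 1, 1, 0] + [1] * 12 + [0, 0, 1, 0]
print(find_sync(bits))  # 6
```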
FIG. 4 shows a bit stream format having a fixed frame length. In this bit stream format, the headers or determination data blocks are inserted into the bit stream in an equidistant way. The side information and the main data belonging to each header follow directly. The length, i.e. the number of bits, of the main data is the same in each frame. Such a bit stream format, as is shown in FIG. 4, is, for example, used in MPEG layer 2 or MPEG CELP.
FIG. 5 shows another bit stream format having a fixed frame length and a back pointer. In this bit stream format, the header and the side information are arranged in an equidistant way, as is the case in the format shown in FIG. 4. Only in exceptional cases, however, does the beginning of the matching main data follow directly after a header. In most cases, the beginning lies in one of the previous frames. The number of bits by which the beginning of the main data is shifted back in the bit stream is transferred in the side-information variable back pointer. The end of this main data can be in this frame or in one of the previous frames. The length of the main data is thus no longer constant. Thus, the number of bits with which a block is coded can be adapted to the features of the signal while, at the same time, a constant bit rate is maintained. This technique is referred to as a “bit savings bank” or bit reservoir and increases the theoretical delay in the transfer chain. Such a bit stream format is, for example, used in MPEG layer 3 (MP3). The technique of the bit savings bank is also described in the MPEG layer 3 standard.
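The effect of the back pointer may be sketched as follows; counting the offset in bytes from the header position and the numerical values are simplifying assumptions for illustration, not the exact MPEG layer 3 semantics:

```python
# Sketch of locating main data via a back pointer (simplified model of
# the mechanism described above): the side information tells the
# decoder how far before the current header the main data begins.

def main_data_start(header_pos, back_pointer, side_info_len):
    """Byte position where this frame's main data begins.

    back_pointer == 0 means the main data follows the side information
    directly; otherwise it starts back_pointer bytes before the header,
    i.e. inside one of the previous frames.
    """
    if back_pointer == 0:
        return header_pos + side_info_len
    return header_pos - back_pointer

print(main_data_start(1000, 0, 32))    # 1032 (directly after side info)
print(main_data_start(1000, 250, 32))  # 750 (inside a previous frame)
```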
In general, the bit savings bank is a buffer of bits which can be employed to make more bits available for coding a block of time sample values than are actually allowed by the constant output data rate. The technique of the bit savings bank takes into consideration that some blocks of audio sample values can be coded with fewer bits than is preset by the constant transfer rate, so that the bit savings bank fills with these blocks, while other blocks of audio sample values have psychoacoustic features which do not allow such a great compression, so that, for these blocks, the bits available are not sufficient for a low-interference or interference-free coding. The required additional bits are taken from the bit savings bank, so that the bit savings bank empties with such blocks.
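The bookkeeping of such a bit savings bank may be sketched as follows; the per-block budget, the reservoir capacity, and the bit demands are assumed numbers for illustration only:

```python
# Illustrative bit-savings-bank bookkeeping: easy blocks spend fewer
# bits than the constant rate grants and fill the reservoir; hard
# blocks borrow from it, up to its capacity.

def run_reservoir(bits_needed, bits_per_block, capacity):
    """Track the reservoir level over a sequence of blocks."""
    level, levels = 0, []
    for need in bits_needed:
        # Grant the constant per-block budget plus whatever was saved.
        available = bits_per_block + level
        spent = min(need, available)
        level = min(available - spent, capacity)
        levels.append(level)
    return levels

# Constant rate grants 1000 bits/block; the reservoir holds at most 800.
print(run_reservoir([600, 700, 1500, 900], 1000, 800))  # [400, 700, 200, 300]
```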
Such an audio signal, however, could also be transferred in a format having a variable frame length, as is shown in FIG. 6. In the bit stream format “variable frame length”, as is illustrated in FIG. 6, the fixed sequence of the bit stream elements header, side information and main data is kept, as in the “fixed frame length” format. Since the length of the main data is not constant, the bit-savings-bank technique can be used in this case as well, wherein, however, no back pointers are required, as is the case in FIG. 5. An example of a bit stream format, as illustrated in FIG. 6, is the transport format ADTS (Audio Data Transport Stream), as defined in the MPEG 2 AAC standard.
It is to be noted that the previously mentioned coders are not scalable coders but each comprise only a single audio coder.
In MPEG 4, the combination of different coders/decoders into a scalable coder/decoder is provided. It is thus possible and practical to combine a CELP voice coder as the first coder with an AAC coder for the further scaling layer(s) and to pack them into one bit stream. The point of this combination is that it is possible either to decode all the scaling layers and thus obtain the best possible audio quality, or to decode parts thereof, possibly only the first scaling layer, with the correspondingly limited audio quality. A reason for decoding only the lowest scaling layer can be that, due to an insufficient bandwidth of the transfer channel, the decoder has received only the first scaling layer of the bit stream. Thus, in transferring, the parts of the first scaling layer in the bit stream are given preference over the second and further scaling layers, whereby the transfer of the first scaling layer is ensured in the case of capacity bottlenecks in the transfer network, while the second scaling layer may get lost completely or partly.
A further reason may be that a decoder wants to obtain the smallest possible codec delay and thus only decodes the first scaling layer. It is to be noted that the codec delay of a CELP codec in general is significantly smaller than the delay of the AAC codec.
In MPEG 4 version 2, the transport format LATM is standardized, which, among other things, can also transfer scalable data streams.
In the following, reference is made to FIG. 2a. FIG. 2a is a schematic illustration of the sample values of the input signal s(t). The input signal can be divided into different subsequent sections 0, 1, 2 and 3, wherein each section has a certain fixed number of time sample values. Usually, the AAC coder 14 (FIG. 1) processes an entire section 0, 1, 2 or 3 to provide a coded data signal for this section. The CELP coder 12 (FIG. 1), however, conventionally processes a smaller number of time sample values per coding step. Thus, it is exemplarily shown in FIG. 2b that the CELP coder or, put generally, the first coder or coder 1 has a block length which is a fourth of the block length of the second coder. It is to be noted that this division is completely arbitrary. The block length of the first coder could also be half the size or, for that matter, an eleventh of the block length of the second coder. Thus, the first coder produces four blocks (11, 12, 13, 14) from the section of the input signal from which the second coder provides one block of data. In FIG. 2c, a conventional LATM bit stream format is illustrated.
A superframe may have different ratios of the number of AAC frames to the number of CELP frames, as is illustrated in MPEG 4 by means of a table. Thus, a superframe can, for example, comprise one AAC block and 1 to 12 CELP blocks, or 3 AAC blocks and 8 CELP blocks, but, depending on the configuration, also more AAC blocks than CELP blocks. An LATM frame having an LATM determination data block includes one superframe or even several superframes.
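The superframe requirement stated above - that the AAC block(s) and the CELP block(s) represent the same number of input sample values - may be sketched as a simple consistency check; the frame lengths of 1024 and 256 samples are illustrative assumptions based on the 1:4 split of FIG. 2b, not values prescribed by the standard:

```python
# Sketch of the superframe consistency requirement: both coders of one
# superframe must cover the same number of input sample values.

def valid_superframe(n_aac, aac_len, n_celp, celp_len, upsample=1):
    """CELP frame lengths are given at the core sample rate; 'upsample'
    maps them to the input sample rate (cf. up-sampling stage 32)."""
    return n_aac * aac_len == n_celp * celp_len * upsample

# The 1:4 split of FIG. 2b: one AAC block, four CELP blocks of a
# quarter of its length (no sample-rate change assumed here).
print(valid_superframe(1, 1024, 4, 256))  # True
# A mismatched configuration is rejected.
print(valid_superframe(3, 1024, 8, 256))  # False
```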
The production of the LATM frame opened by the header 1 is described by way of example. First, the output data blocks 11, 12, 13, 14 of the CELP coder 12 (FIG. 1) are produced and buffered. In parallel, the output data block of the AAC coder, which, in FIG. 2c, is referred to as “1”, is produced. When the output data block of the AAC coder has been produced, the determination data block (header 1) is written first. Depending on the convention, the first-produced output data block of the first coder, which, in FIG. 2c, is referred to as 11, can be written, that is transferred, directly after the header 1. For the further writing or transferring, respectively, of the bit stream, an equidistant spacing of the output data blocks of the first coder is usually selected, as is illustrated in FIG. 2c (disregarding the small amount of signaling information required). This means that, after the writing or transferring, respectively, of block 11, the second output data block 12 of the first coder, then the third output data block 13 of the first coder and finally the fourth output data block 14 of the first coder are written or transferred, respectively, at equidistant intervals. The output data block 1 of the second coder is inserted into the remaining gaps during transmission. Then an LATM frame has been written completely, that is, transferred completely.
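The transmission order just described may be sketched as follows; the block names follow FIG. 2c, while the even splitting of the AAC payload into gap fillers is a simplifying assumption for illustration, not the exact LATM interleaving:

```python
# Sketch of the LATM frame emission order: the buffered CELP blocks
# 11-14 are written at equidistant positions after the header, with
# pieces of the AAC block filling the gaps in between. Nothing can be
# written before the AAC block, and hence the header, is complete.

def write_latm_frame(celp_blocks, aac_block, header):
    """Emit header first, then CELP blocks interleaved with AAC pieces."""
    frame = [header]
    # Split the AAC payload into as many gap fillers as CELP blocks.
    n = len(celp_blocks)
    chunk = max(len(aac_block) // n, 1)
    aac_pieces = [aac_block[i * chunk:(i + 1) * chunk] for i in range(n - 1)]
    aac_pieces.append(aac_block[(n - 1) * chunk:])  # remainder in last gap
    for celp, aac in zip(celp_blocks, aac_pieces):
        frame.append(celp)
        frame.append(aac)
    return [part for part in frame if part]

frame = write_latm_frame(["11", "12", "13", "14"], "AAAAAAAA", "hdr1")
print(frame)  # ['hdr1', '11', 'AA', '12', 'AA', '13', 'AA', '14', 'AA']
```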
It is a disadvantage of this concept that the transfer of the data stream from the coder to the decoder can start, at the earliest, when all the data which has to be contained in the header is available. Thus, the LATM header 1 can only be written, that is transferred, when the second coder (AAC coder 14 in FIG. 1) has completed its coding of the current section, since the LATM header, among other things, includes length information on the blocks in the superframe. Thus, the output data blocks 11, 12, 13 and 14 of the first coder have to be buffered in the coder for some time until the second coder 14, which is usually slower because it operates with a larger frame length, has produced its output data. Even if a decoder only wishes to decode the first scaling layer, that is blocks 11, 12, 13 and 14, it has to wait until the second coder has finished processing the currently considered section or block of the input signal, although the decoder is not interested in the second scaling layer at all. This is the case since the encoder writes the blocks of the first coder into the bit stream with a delay.
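The order of magnitude of this buffering delay may be illustrated with a rough worked example; all numerical values (a superframe of three 1024-sample AAC frames at 48 kHz, a CELP frame of 160 samples) are assumptions for illustration only:

```python
# Rough illustration of the extra delay caused by buffering: even a
# decoder that only wants the first scaling layer must wait for the
# whole superframe instead of a single CELP frame.

def extra_delay_ms(superframe_samples, fs, celp_frame_samples):
    """Delay added by buffering a superframe, minus the delay a pure
    CELP path would have had anyway (one CELP frame)."""
    return 1000.0 * (superframe_samples - celp_frame_samples) / fs

# Assumed: superframe = 3 AAC frames of 1024 samples at Fs = 48 kHz,
# CELP frame = 160 samples at the (upsampled-equivalent) input rate.
print(round(extra_delay_ms(3 * 1024, 48000, 160), 1))  # 60.7
```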
This feature is especially annoying in real-time operation. When, for example, a telephone conversation between two persons is considered, a CELP voice coder provides a relatively fast, low-delay coding. When only a CELP voice coder is provided at both the sender and the receiver side, a voice communication without undesired delays is possible. If, however, a scalable coder according to FIG. 1 is provided in both the sender and the receiver in order to be able to transfer, for example, voice and music in a high-quality way, the bit stream format shown in FIG. 2c leads to undesirably long delays which render real-time two-way communication almost impossible, or so annoying that such a product would not have the slightest chance on the market.