Scalable encoders are shown in EP 0 846 375 B1. In general, scalability is understood as the possibility of decoding a partial section of a bit stream representing an encoded data signal, e.g. an audio signal or a video signal, into a useful signal. This property is particularly desirable when, for example, a data transmission channel fails to provide the complete bandwidth necessary for transmitting a complete bit stream. On the other hand, incomplete decoding is also possible on a decoder with reduced complexity. Generally, different discrete scalability layers are defined in practice.
An example of a scalable encoder as defined in Subpart 4 (General Audio) of Part 3 (Audio) of the MPEG-4 Standard (ISO/IEC 14496-3:1999 Subpart 4) is shown in FIG. 1. An audio signal s(t) to be encoded is fed into the scalable encoder on the input side. The scalable encoder shown in FIG. 1 contains a first encoder 12, which is an MPEG Celp encoder. The second encoder 14 is an AAC encoder, which provides high-quality audio encoding and is defined in the Standard MPEG-2 AAC (ISO/IEC 13818). The Celp encoder 12 provides a first scaling layer via an output line 16, while the AAC encoder 14 provides a second scaling layer via a second output line 18, to a bit stream multiplexer (BitMux) 20. On the output side the bit stream multiplexer then outputs an MPEG-4-LATM bit stream 22 (LATM=Low-Overhead MPEG-4 Audio Transport Multiplex). The LATM format is described in Section 6.5 of Part 3 (Audio) of the first supplement to the MPEG-4 Standard (ISO/IEC 14496-3:1999/AMD1:2000).
The scalable audio encoder further includes some further elements. First, there exists a delay stage 24 in the AAC branch and a delay stage 26 in the Celp branch. With both delay stages it is possible to set an optional delay for the respective branch. A downsampling stage 28 is downstream of the delay stage 26 of the Celp branch to adjust the sampling rate of the input signal s(t) to the sampling rate requested by the Celp encoder. An inverse Celp decoder 30 is downstream to the Celp encoder 12, wherein the Celp encoded/decoded signal is then supplied to an upsampling stage 32. The upsampled signal is then supplied to a further delay stage 34, which is termed “Core Coder Delay” in the MPEG-4 Standard.
The stage CoreCoderDelay 34 has the following function. If the delay is set to zero, the first encoder 12 and the second encoder 14 process exactly the same samples of the audio input signal in a so-called superframe. A superframe might e.g. consist of three AAC frames, which together represent a certain number of samples No. x to No. y of the audio signal. The superframe further includes e.g. 8 CELP blocks, which represent the same number of samples and also the same samples No. x to No. y if CoreCoderDelay=0.
If, however, a CoreCoderDelay D is set as a time value other than zero, the three blocks of AAC frames nevertheless represent the same samples No. x to No. y. The eight blocks of CELP frames, in contrast, represent the samples No. x − Fs·D to No. y − Fs·D, wherein Fs is the sampling frequency of the input signal.
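The index arithmetic described above may be illustrated by the following minimal sketch. The function name, the example sample indices and the 48 kHz sampling frequency are assumptions chosen purely for illustration; they are not taken from the MPEG-4 standard.

```python
def superframe_sample_ranges(x, y, fs_hz, core_coder_delay_s):
    """Return (aac_range, celp_range) as inclusive sample index pairs.

    Hypothetical helper: x and y are the first and last sample numbers of
    the superframe, fs_hz is the sampling frequency Fs of the input signal,
    core_coder_delay_s is the CoreCoderDelay D in seconds.
    """
    shift = round(fs_hz * core_coder_delay_s)  # Fs * D, expressed in samples
    aac_range = (x, y)                         # AAC frames always cover x..y
    celp_range = (x - shift, y - shift)        # CELP blocks are shifted by Fs * D
    return aac_range, celp_range


# D = 0: both encoders see exactly the same samples.
aac0, celp0 = superframe_sample_ranges(3072, 6143, 48000, 0.0)

# D = 5 ms at 48 kHz: the CELP section is shifted by 240 samples.
aac1, celp1 = superframe_sample_ranges(3072, 6143, 48000, 0.005)
```

In both cases the AAC section and the CELP section contain the same number of samples; only their position relative to each other changes.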
The current time sections of the input signal in a superframe for the AAC blocks and the CELP blocks can thus be either identical, when CoreCoderDelay D=0, or be shifted relative to each other by CoreCoderDelay, when D is not equal to zero. For the following implementations, however, it will be assumed, on the grounds of simplicity and without restriction of generality, that CoreCoderDelay=0, so that the current time section of the input signal for the first encoder and the current time section for the second encoder are identical. In general, however, the only requirement for a superframe is, that the AAC block(s) and the CELP block(s) in a superframe represent the same number of samples, wherein it is not necessary for the samples themselves to be identical to one another, but they may also be shifted relative to each other by CoreCoderDelay.
It should be noted that the Celp encoder, depending on the configuration, may process a section of the input signal s(t) faster than the AAC encoder 14. In the AAC branch, a block decision stage 26 is downstream of the optional delay stage 24. Among other things, it establishes whether short or long windows should be used for windowing the input signal s(t). Short windows must be chosen for strongly transient signals, while long windows are preferred for less transient signals, since the ratio of payload data to side information is better than for short windows.
The block decision stage 26 introduces a fixed delay of e.g. ⅝ of a block in the present example. This is referred to as a look-ahead function in the art. The block decision stage must look ahead a certain time to be able to determine whether transient signals that must be encoded with short windows are coming up. After that, the corresponding signal in the Celp branch as well as the signal in the AAC branch are fed to means for converting the time-domain representation into a spectral representation, designated as MDCT 36 and 38, respectively, in FIG. 1 (MDCT=modified discrete cosine transform). The output signals of the MDCT blocks 36, 38 are then supplied to a subtracter 40.
At this point, samples belonging together regarding time must be present, i.e. the delay must be identical in both branches.
The following block 44 determines whether it is more favorable to supply the input signal itself to the AAC encoder 14. This is enabled via the bypass branch 42. If it is determined, however, that the differential signal at the output of the subtracter 40 is smaller regarding energy than the signal output by the MDCT block 38, then not the original signal but the differential signal is taken to be encoded by the AAC encoder 14 to finally form the second scaling layer 18. This comparison may be performed band by band, which is indicated by frequency-selective switching means (FSS) 44. The exact functions of the individual elements are known in the art and are described for example in the MPEG-4 standard as well as in further MPEG standards.
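The band-by-band decision described above may be sketched as follows. This is only an illustrative model under simplifying assumptions: spectra are given as plain lists of coefficients already grouped into bands, and the function name fss_select is hypothetical. The actual decision rules of an MPEG-4 encoder are not reproduced here.

```python
def fss_select(original_bands, core_bands):
    """Per band, pick the difference signal if its energy is lower than
    that of the original spectrum; otherwise keep the original (bypass).

    original_bands: spectral coefficients of the AAC branch, one list per band
    core_bands:     spectral coefficients of the decoded core (Celp) branch
    Returns (selected_bands, use_diff_flags).
    """
    selected, use_diff = [], []
    for orig, core in zip(original_bands, core_bands):
        diff = [o - c for o, c in zip(orig, core)]   # output of the subtracter
        e_orig = sum(v * v for v in orig)            # energy of the original band
        e_diff = sum(v * v for v in diff)            # energy of the difference band
        take_diff = e_diff < e_orig
        use_diff.append(take_diff)
        selected.append(diff if take_diff else orig)
    return selected, use_diff


original = [[1.0, 2.0], [0.5, -0.5]]
core = [[0.9, 1.8], [-1.0, 1.0]]
selected, flags = fss_select(original, core)
```

In this example the core layer approximates the first band well, so the difference signal is encoded there, whereas in the second band the original signal is passed on via the bypass branch.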
One main feature in the MPEG-4 standard and in other encoder standards, respectively, is that the transmission of the compressed data signal is to be performed with a constant bit rate via a channel. All high-quality audio codecs operate based on blocks, i.e. they process blocks of audio data (on the order of 480 to 1024 samples) into pieces of a compressed bit stream, which are also referred to as frames. The bit stream format must be set up so that a decoder without a priori information about where a frame starts is able to recognize the beginning of a frame in order to start the output of decoded audio signal data with the lowest possible delay. Thus, each header or determining data block of a frame starts with a certain synchronization word which may be searched for in a continuous bit stream. Further common components within the data stream apart from the determining data block are the main data or “payload data” of the individual layers in which the actual compressed audio data is contained.
FIG. 4 shows a bit stream format with a fixed frame length. In this bit stream format the headers or determining data blocks are inserted equidistantly into the bit stream. The side information associated with this header and the main data follow immediately afterwards. The length, i.e. the number of bits, for the main data is the same in each frame. Such a bit stream format as it is shown in FIG. 4 is for example used in the MPEG layer 2 or the MPEG-CELP.
FIG. 5 shows another bit stream format with a fixed frame length and a backpointer. In this bit stream format the header and the side information are arranged equidistantly as in the format illustrated in FIG. 4. The associated main data, however, only exceptionally start directly following a header. In most cases the start is in one of the preceding frames. The number of bits by which the start of the main data is shifted in the bit stream is transferred by the side information variable backpointer. The end of these main data may lie within this frame or within a preceding frame. The length of the main data is therefore no longer constant, so that the number of bits with which a block is encoded may be adjusted to the characteristics of the signal. At the same time, a constant bit rate may nevertheless be achieved. This technology is called “bit savings bank” and increases the theoretical delay within the transmission chain. Such a bit stream format is for example used in the MPEG layer 3 (MP3).
The technology of the bit savings bank is further described in the standard MPEG layer 3.
Generally, the bit savings bank represents a buffer of bits which may be used to provide more bits for encoding a block of time samples than is actually allowed by the constant output data rate. The technology of the bit savings bank takes into account that some blocks of audio samples may be encoded with fewer bits than predetermined by the constant transmission rate, so that the bit savings bank is filled by these blocks, while other blocks of audio samples have psychoacoustic characteristics which do not allow such a high compression, so that for these blocks the available bits would not be enough for a low-interference or interference-free encoding, respectively. The additional bits needed are taken from the bit savings bank, so that the bit savings bank is emptied by such blocks.
Such an audio signal may, however, also be transmitted in a format with a variable frame length, as shown in FIG. 6. With the bit stream format “variable frame length”, as illustrated in FIG. 6, the fixed sequence of the bit stream elements header, side information and main data is maintained, as with the “fixed frame length”. As the length of the main data is not constant, the bit savings bank technology may also be used here; however, no backpointers are needed, unlike in FIG. 5. One example of a bit stream format as illustrated in FIG. 6 is the transport format ADTS (audio data transport stream), as defined in the standard MPEG 2 AAC.
It is to be noted that the above-mentioned encoders are not scalable encoders but include only one single audio encoder.
In MPEG 4 the combination of different encoder/decoders to a scalable encoder/decoder is provided. It is therefore possible and sensible to combine one CELP voice encoder as the first encoder with an AAC encoder for the further scaling layer(s) and pack the same into one bit stream. The purpose of this combination is that the possibility remains open either to decode all scaling layers and therefore reach a best possible audio quality, or parts of the same, maybe even only the first scaling layer, with the correspondingly restricted audio quality. Reasons for only decoding the lowest scaling layer may be that due to a bandwidth of the transmission channel which is too small, the decoder only received the first scaling layer of the bit stream. Because of this the parts of the first scaling layer in the bit stream are favored over the second and the further scaling layers in the transmission, whereby the transmission of the first scaling layer is guaranteed with capacity bottlenecks in the transmission network, while the second scaling layer may be lost completely or in part.
A further reason may be that a decoder wants to achieve the lowest possible codec delay and therefore decodes only the first scaling layer. It is to be noted that the codec delay of a Celp codec is generally significantly smaller than the delay of the AAC codec.
In MPEG 4 version 2 the transport format LATM is standardized, which may among other things also transmit scalable data streams.
In the following, reference is made to FIG. 2a. FIG. 2a is a schematic illustration of the samples of the input signal s(t). The input signal may be divided into different successive sections 0, 1, 2, 3, wherein each section comprises a certain fixed number of time samples. Usually, the AAC encoder 14 (FIG. 1) processes a whole section 0, 1, 2 or 3 in order to provide an encoded data signal for this section. The CELP encoder 12 (FIG. 1), however, usually processes a smaller number of time samples per encoding step. Thus, it is shown as an example in FIG. 2b that the CELP encoder, or generally speaking the first encoder or encoder 1, has a block length which is one fourth of the block length of the second encoder. It is to be noted that this division is completely arbitrary. The block length of the first encoder may also be half as long, or it might also be one eleventh of the block length of the second encoder. Thus, the first encoder will generate four blocks (11, 12, 13, 14) from the section of the input signal from which the second encoder provides one block of data. In FIG. 2c a common LATM bit stream format is shown.
A superframe may comprise various ratios of the number of AAC frames to the number of CELP frames, as illustrated in tabular form in MPEG 4. Thus, a superframe may for example comprise one AAC block and 1 to 12 CELP blocks, or 3 AAC blocks and 8 CELP blocks, but also, for example, more AAC blocks than CELP blocks, depending on the configuration. An LATM frame which comprises an LATM determining data block includes one superframe or several superframes.
The generation of the LATM frame opened by the header 1 is described as an example. First, the output data blocks 11, 12, 13, 14 of the Celp encoder 12 (FIG. 1) are generated and buffered. In parallel, the output data block of the AAC encoder designated with “1” in FIG. 2c is generated. Then, when the output data block of the AAC encoder has been generated, the determining data block (header 1) is written first. Depending on the convention, the output data block of the first encoder which was generated first, designated with 11 in FIG. 2c, may be written, i.e. transmitted, directly following header 1. Usually (in view of the small amount of signaling information then required) an equidistant spacing of the output data blocks of the first encoder is selected for the further writing and/or transmitting of the data stream, as illustrated in FIG. 2c. This means that after writing and/or transmitting block 11, the second output data block 12 of the first encoder, then the third output data block 13 of the first encoder and then the fourth output data block 14 of the first encoder are written and/or transmitted at equidistant intervals. The output data block 1 of the second encoder is filled into the remaining gaps during the transmission. Then, an LATM frame is fully written, i.e. fully transmitted.
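The writing order described above may be sketched, in strongly simplified form, as follows. The function name write_frame and the byte-string payloads are hypothetical placeholders; the real LATM syntax contains further elements that are not modeled here. The sketch assumes CELP blocks of equal length and an AAC block whose length is a multiple of the number of CELP blocks.

```python
def write_frame(header, celp_blocks, aac_block):
    """Write header, then the CELP blocks at equidistant positions,
    filling the gaps between them with pieces of the AAC block."""
    n = len(celp_blocks)
    chunk = -(-len(aac_block) // n)  # ceiling division: AAC bytes per gap
    out = [header]
    for i, celp in enumerate(celp_blocks):
        out.append(celp)                                   # CELP block 11, 12, ...
        out.append(aac_block[i * chunk:(i + 1) * chunk])   # AAC data fills the gap
    return b"".join(out)


# Header "H", four CELP blocks of 2 bytes each, one AAC block of 8 bytes.
frame = write_frame(b"H", [b"c1", b"c2", b"c3", b"c4"], b"AAAAAAAA")
celp_positions = [frame.index(c) for c in (b"c1", b"c2", b"c3", b"c4")]
```

With these inputs the CELP blocks start at byte positions 1, 5, 9 and 13, i.e. at a constant spacing of 4 bytes, and the AAC data occupies the space in between.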
One disadvantage of the bit stream formats illustrated in FIGS. 4 to 6 is the fact that they are only known for simple encoders, not, however, for scalable encoders and in particular not for scalable encoders having a bit savings bank function.
As is known, the bit savings bank is used so that the variable output data rate which a psychoacoustic encoder inherently generates may be adjusted to a constant output data rate. In other words, the number of bits an audio encoder needs depends on the signal characteristics. If the signal is such that it may be quantized in a relatively coarse way, then a relatively low number of bits is needed for encoding this signal. If the signal is, however, such that it needs to be quantized very finely in order not to introduce audible interferences, then a larger number of bits is needed for encoding this signal.
In order to achieve a constant output data rate, a medium amount of bits is determined for one section of a signal to be encoded. If the actually needed amount of bits for encoding a section is smaller than the determined number of bits, then the bits which are not needed may be placed into the bit savings bank. Thus, the bit savings bank is filled. If, however, a section of a signal to be encoded is comprised such that a larger number than the determined number of bits is needed for encoding in order not to introduce audible interferences into the signal, then the additionally needed bits may be taken from the bit savings bank. That way, the bit savings bank is emptied. Thereby it may be guaranteed that a constant output data rate is maintained and at the same time no audible interferences are introduced into the audio signal. A precondition for this is that the bit savings bank is selected to be sufficiently large.
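The filling and emptying of the bit savings bank may be illustrated by the following minimal simulation. The function name, the initial fill level of zero and the example bit demands are assumptions made for illustration only; a real encoder additionally adapts its quantization to the number of bits available.

```python
def simulate_bit_reservoir(demand_bits, mean_bits, max_reservoir):
    """Return (bits used per block, reservoir level after each block).

    demand_bits:   bits each block would need for interference-free encoding
    mean_bits:     determined (average) number of bits per block
    max_reservoir: maximum size of the bit savings bank
    """
    level = 0  # assumption: the bit savings bank starts empty
    used_list, levels = [], []
    for need in demand_bits:
        available = mean_bits + level       # regular budget plus saved bits
        used = min(need, available)         # a block cannot exceed the budget
        # Unused bits are banked; anything above the maximum size is lost.
        level = min(available - used, max_reservoir)
        used_list.append(used)
        levels.append(level)
    return used_list, levels


used, levels = simulate_bit_reservoir(
    [1500, 1800, 2700, 2048], mean_bits=2048, max_reservoir=10240
)
```

The first two blocks need fewer bits than the determined number and fill the bit savings bank; the third block needs more and draws on the saved bits, while the long-term average rate remains bounded by the determined number of bits per block.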
In the standard MPEG AAC (13818-7:1997) a bit savings bank is referred to as “bit reservoir”. The maximum size of the bit savings bank for channels with a constant data rate may be calculated by subtracting the average number of bits per block from the maximum decoder input buffer size. According to the standard MPEG AAC, its value is usually firmly preset to 10,240 bits for a transmission rate of 96 kBit/s for a stereo signal with a sampling rate of 48 kHz. The maximum value of the bit savings bank, i.e. the size of the bit savings bank, is chosen so that even under bad conditions, i.e. even when the signal comprises many sections which may not be encoded with the determined number of bits, no audible interferences need to be introduced into the audio signal in order to maintain the constant output data rate. This is only possible when the bit savings bank is sized sufficiently large so that it is never emptied.
On the decoder side this has the following consequence. Since the decoder has to consider that both the case of a full bit savings bank and the case of an empty bit savings bank may occur in the course of decoding an audio signal, the decoder needs to buffer a number of bits corresponding to the size of the bit savings bank before it starts decoding at all. This guarantees that the decoder does not run out of bits while decoding the audio signal. If a decoder were to decode a signal encoded with the bit savings bank function immediately upon receiving it, the bits for the output would already run out if the first block to be decoded happened to need fewer bits than the determined number for encoding, i.e. if the bit savings bank was filled up by the first block. In other words, the bit savings bank function inevitably leads to a delay within the decoder, wherein this delay corresponds to the size of the bit savings bank.
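The value of 10,240 bits may be reproduced with the following short calculation. Two figures used here are not stated in the text above and are to be understood as assumptions: a block length of 1024 samples and a maximum decoder input buffer size of 6144 bits per channel, as commonly cited for AAC. The function name is hypothetical.

```python
def max_bit_reservoir(bitrate_bps, sample_rate_hz, channels,
                      frame_len=1024, buffer_bits_per_channel=6144):
    """Maximum bit reservoir size = decoder input buffer - mean bits per block.

    frame_len and buffer_bits_per_channel are assumed values (see lead-in).
    """
    mean_bits_per_block = bitrate_bps * frame_len // sample_rate_hz
    decoder_input_buffer = buffer_bits_per_channel * channels
    return decoder_input_buffer - mean_bits_per_block


# 96 kBit/s, stereo, 48 kHz: 2 * 6144 - 2048 = 10240 bits
reservoir = max_bit_reservoir(96_000, 48_000, channels=2)
```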
For the preceding example the size of the bit savings bank is 10,240 bits. This leads to an inherent initial delay due to the bit savings bank of about 0.1 s. The delay gets larger, the larger the maximum size of the bit savings bank is selected and the smaller the transmission rate is selected.
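The stated delay follows directly from dividing the size of the bit savings bank by the transmission rate, as the following small sketch shows. The function name is chosen for illustration only.

```python
def reservoir_delay_seconds(reservoir_bits, bitrate_bps):
    """Initial decoder delay caused by pre-buffering the bit savings bank."""
    return reservoir_bits / bitrate_bps


delay_96k = reservoir_delay_seconds(10_240, 96_000)  # roughly 0.107 s
delay_48k = reservoir_delay_seconds(10_240, 48_000)  # halving the rate doubles the delay
```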
If, for example, real-time transmissions of a telephone call are considered, in which a continuous change of speakers takes place, then a delay of the mentioned size occurs with each change of speaker due to the bit savings bank alone. Such a delay is extraordinarily disturbing for both communication partners and typically leads to one speaker, because he does not immediately hear a reaction from the other speaker, repeating the question, which contributes to further confusion. Therefore, a product designed this way is not suitable for real-time applications and would not have a chance of a breakthrough in the market.