1. Field of the Invention
The present invention relates to audio coding and decoding, and more particularly, to a scalable audio coding/decoding method and apparatus, which represents data for bitrates of various enhancement layers based on a base layer, instead of forming a bitrate within one bitstream. Further, this invention is closely related to ISO/IEC JTC1/SC29/WG11 N1903 (ISO/IEC 14496-3 Subpart 4 Committee Draft).
2. Description of the Related Art
An audio system stores a signal in a recording/storage medium and then reproduces the stored signal upon a user's request.
With the recent development of digital signal processing technology, recording/storage media have progressed from conventional analog types such as the LP or magnetic tape to digital types such as the compact disc or digital audio tape. Compared with the conventional analog method, the digital storage/restoration method avoids deterioration in audio quality and considerably improves it. However, storing and transmitting the resulting large amount of digital data remains a problem.
To reduce the amount of digital data, DPCM (Differential Pulse Code Modulation) and ADPCM (Adaptive Differential Pulse Code Modulation) have been developed for compressing digital audio signals. However, such methods have the disadvantage that their efficiency varies widely with the type of signal. The MPEG (Moving Picture Experts Group) audio technique recently standardized by the ISO (International Organization for Standardization), the DCC (Digital Compact Cassette) manufactured by Philips Corp., the MD (Mini Disc) manufactured by Sony Corp., and the like use a human psychoacoustic model to reduce the quantity of data.
Such methods considerably reduce the quantity of data, irrespective of signal characteristics.
An audio coding apparatus which takes human psychoacoustics into consideration includes a time/frequency mapping portion 100, a psychoacoustic portion 110, a bit allocating portion 120, a quantizing portion 130 for performing quantization according to allocated bits, and a bit packing portion 140, as shown in FIG. 1. Here, the psychoacoustic portion 110 uses human auditory characteristics, particularly the masking phenomenon, to calculate a signal-to-masking ratio (SMR) and a masked threshold, i.e., the signal magnitude below which a signal is imperceptible owing to its interaction with the other signals. Using the masked threshold, the bit allocating portion 120 allocates the limited number of available bits starting from the parts containing the signals most important for audibility, thereby realizing effective data compression.
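The role of the bit allocating portion 120 can be illustrated with a minimal sketch. The greedy rule below is an assumption made for illustration, not the allocation procedure of any particular standard: each band receives one bit at a time, always going to the band whose noise-to-mask ratio (SMR minus the signal-to-noise ratio gained so far) is currently highest. The function name `allocate_bits`, the per-band SMR values, and the rule of thumb of about 6.02 dB of SNR per bit are all hypothetical choices.

```python
def allocate_bits(smr_db, total_bits, db_per_bit=6.02):
    """Greedy allocation (illustrative assumption): repeatedly give one bit
    to the band whose noise-to-mask ratio (NMR) is currently highest."""
    bits = [0] * len(smr_db)
    for _ in range(total_bits):
        # NMR = SMR - SNR; each allocated bit is assumed to add db_per_bit of SNR
        nmr = [s - db_per_bit * b for s, b in zip(smr_db, bits)]
        worst = max(range(len(nmr)), key=nmr.__getitem__)
        bits[worst] += 1
    return bits


# Hypothetical SMR values (dB) for four bands and a budget of six bits
allocation = allocate_bits([20.0, 5.0, -3.0, 12.0], total_bits=6)
```

In this sketch the band with a negative SMR, whose signal already lies below the masked threshold, receives no bits, while the band with the highest SMR receives the most.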
In coding a digital audio signal, the important human auditory characteristics are the masking effect and the critical band feature. The masking effect is a phenomenon in which one signal (sound) is made inaudible by another signal (sound). FIG. 2 illustrates the masking phenomenon. For example, when a conversation is held in a low voice at a railway station and a train passes through the station, the conversation is not audible due to the noise generated by the train. The magnitude of perceptible noise differs depending on whether the noise falls within the range of a critical band or outside it. Noise is more perceptible when it exceeds the range of the critical band than when it stays within it.
To perform coding using human auditory characteristics, the magnitude of noise that can be allocated to a critical band is calculated from these two features, i.e., the masking effect and the critical band. Example applications of the digital audio coding method include digital audio broadcasting (DAB), Internet phone, and audio on demand (AOD).
Most of such coding methods support a fixed bitrate. In other words, a bitstream is constructed with a specific bitrate (128 Kbps, 96 Kbps or 64 Kbps, for example). This construction involves no problem when a transmission channel is dedicated to audio data. Since a dedicated channel fixedly supports a specific bitrate, a bitstream constructed with a specific bitrate for the dedicated channel is transmitted to a reception end without error.
However, if a transmission channel for audio data is unstable, it is difficult to properly interpret the data with a fixed bitrate at the reception end. In other words, depending on the state of the transmission channel, the bitstreams of the entire audio data or only some of the bitstreams may be received at the reception end. If only some of the bitstreams are received at the reception end, it is difficult to restore the audio data corresponding thereto, which considerably deteriorates audio quality.
Generally, in the digital audio coding method, a bitstream contains only the information for one bitrate in its header, and the bitrate is maintained. For example, if header information of one bitstream represents a bitrate of 128 Kbps, the 128 Kbps bitstream is continuously used, which is advantageous in representing the best audio quality at the corresponding bitrate. In other words, the optimal bitstream for audio data, such as 64 Kbps, 48 Kbps or 32 Kbps, is formed for a specific bitrate.
However, such a method is very sensitive to the state of the transmission channel. Thus, if the transmission channel is very unstable, correct data cannot be reproduced. For example, when an audio frame consists of n slots, correct data can be reconstructed if all n slots are transmitted to the reception end within a given time. However, if only n-m slots are transmitted due to an unstable transmission channel, correct data cannot be reconstructed.
Also, referring to FIG. 3, the case in which data supplied from a transmission end is received at several reception ends will be described. If the capacities of the transmission channels of the respective reception ends differ from one another, or the respective reception ends require different bitrates, a transmission end supporting only a fixed bitrate cannot satisfy the requirement. In this case, if the audio bitstream has separate bitrates for various layers, it is possible to cope appropriately with a given environment or a user's request.
To this end, there are three methods for scaling bitrates. In the first method, information for the various layers is arranged sequentially in a bitstream, so the bitstream is simply sliced at the desired bitrate and then transmitted. As shown in FIG. 4, the bitstreams are constructed sequentially from the base layer to the top layer, and the side information and audio data for each layer are recorded in one bitstream. Therefore, if a user requests only the base layer, only the bitstream corresponding to the base layer is transmitted. If the information for the first layer (Layer 1) is requested, only the bitstreams up to the first layer are transmitted. Also, if the information for the top layer is requested, all bitstreams are transmitted.
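The slicing described for the first method can be sketched as follows. This is a minimal illustration under stated assumptions: each layer is represented as an opaque byte string already holding its side information and audio data, and the function name `slice_bitstream` and the sample payloads are hypothetical.

```python
def slice_bitstream(layers, top_layer):
    """Return the bytes from the base layer (index 0) up to and including
    top_layer. Layers are assumed to be stored in order, base layer first."""
    if not 0 <= top_layer < len(layers):
        raise ValueError("unknown layer")
    return b"".join(layers[: top_layer + 1])


# Hypothetical payloads: base layer, Layer 1, and the top layer
layers = [b"BASE", b"L1", b"L2"]
base_only = slice_bitstream(layers, 0)     # user requests only the base layer
up_to_layer1 = slice_bitstream(layers, 1)  # user requests Layer 1
everything = slice_bitstream(layers, 2)    # user requests the top layer
```

Because the layers are stored in order, no decoding or reformatting is needed to serve a lower bitrate; the sender only truncates the stream.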
Second, a device (a converter, for example) reformats the bitstreams between a transmission end and a reception end at a user's request. That is to say, a coding apparatus shown in FIG. 5 forms a bitstream with one bitrate, and then the converter reformats and transmits the bitstream at a lower bitrate at the user's request. At this time, the bitstream formed by the coding apparatus must contain side information so that the converter can form a bitstream of a lower layer.
Third, as shown in FIG. 6, the converter performs reencoding. Reencoding forms the bitstream at the bitrate requested by the user by passing through all the decoding steps that produce PCM data and then all the coding steps, after which the bitstream is transmitted. For example, when a bitstream of 64 Kbps is transmitted over a main transmission channel and the capacity of the transmission channel for a user is 32 Kbps, the converter installed therebetween forms PCM data using a decoder for 64 Kbps, operates an encoder for 32 Kbps to form a 32 Kbps bitstream, and then transmits the data through the transmission channel.
Among the above-described methods, the first method is the most suitable, but it has the disadvantage of lowered performance due to data redundancy in the respective layers. The second method can slightly improve the audio quality compared to the first method. However, the result of formatting the bitstream at a lower bitrate may vary depending on the side information transmitted from the encoder. Also, since the procedure passes through the converter, this method is more disadvantageous than the first method in view of delays and cost. In the third method, since the converter serves as both a decoder and an encoder, the complexity increases, which makes the procedure costly and causes delays due to reformatting. However, since there is no redundancy in the bitstreams input to the converter, the audio quality in the third method is better than that in the first method. Although it is quite difficult to discriminate between the second method (reformatting) and the third method (reencoding), the third method adopts dequantization in forming a lower bitstream.
In a scalable system, since the converter serves simply to connect the user with the transmission end, the complexity of the converter must be reduced. Therefore, since a less complex converter that introduces no delays and costs less is generally used, a method that does not use reencoding has been adopted.
Generally, to form the bitstream as in the first method, as shown in FIG. 7, coding is first performed for a lower layer, decoding is then performed, and the difference between the original signal and the decoded signal is input to an encoder for the next layer to be processed. This method generally adopts at least two coding methods. That is to say, a core codec for generating a base layer is used together with another codec for generating the other layers. However, such a method increases the complexity of the coding system due to the presence of at least two encoders. The complexity of the decoding system also increases with a plurality of decoders. Also, the more layers there are, the more complex coding becomes, because correct time-domain data for a given layer is obtained by summing the time-domain data generated for the respective layers.
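The code-decode-subtract structure described above can be sketched in a few lines. This is only an illustration under stated assumptions: a uniform quantizer with a per-layer step size stands in for each layer's codec, whereas a real system would use full audio codecs. All function names, step sizes, and sample values are hypothetical.

```python
def quantize(samples, step):
    """Stand-in 'encoder': uniform quantization to integer indices."""
    return [round(x / step) for x in samples]


def dequantize(indices, step):
    """Stand-in 'decoder': map indices back to sample values."""
    return [i * step for i in indices]


def encode_layers(signal, steps):
    """Encode each layer, decode it, and pass the residual to the next layer."""
    residual = list(signal)
    layers = []
    for step in steps:
        indices = quantize(residual, step)
        decoded = dequantize(indices, step)
        residual = [r - d for r, d in zip(residual, decoded)]
        layers.append(indices)
    return layers


def decode_layers(layers, steps, n_layers):
    """Sum the time-domain outputs of the first n_layers layers."""
    out = [0.0] * len(layers[0])
    for indices, step in zip(layers[:n_layers], steps[:n_layers]):
        out = [o + d for o, d in zip(out, dequantize(indices, step))]
    return out


signal = [0.9, -0.35, 0.1, 0.62]   # hypothetical time-domain samples
steps = [0.5, 0.125, 0.03125]      # coarse base layer, finer upper layers
layers = encode_layers(signal, steps)
# Maximum reconstruction error when decoding 1, 2, or 3 layers
errors = [
    max(abs(s - y) for s, y in zip(signal, decode_layers(layers, steps, n)))
    for n in (1, 2, 3)
]
```

Each additional layer requires its own decoding pass followed by a summation, which is the source of the decoder complexity noted above. In this sketch the error after each layer is bounded by half of that layer's step size.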