This invention is related, in general, to the art of coding digital signals, and more particularly, to a method and a system of controlling coding modes of multimode coding systems for coding speech and music signals.
In current multimedia applications, audio data streams carry both speech and music signals. Even within a signal type, there are distinct categories of signals. For example, during certain types of speech, the audio signal exhibits a highly periodic signal structure. This type of signal is called a voiced signal. On the other hand, a speech signal may exhibit a random structure. Such a signal lacks periodic structure, or pitch, and is termed an unvoiced signal. At certain points in a speech signal, the signal may show only continuous background noise or silence. Such a signal is termed a silence signal. In addition to the above types of speech signals, there also exist transition regions in a typical speech signal wherein the signal is changing from one type, such as unvoiced, to another, such as voiced. In such a region, the signal typically demonstrates one or more large signal spikes on top of a background signal.
Humans have a finite perceptual capability with respect to audio signals, and errors or noise in signals of different types may be perceived more or less strongly depending upon the base signal type. This is true not only for speech signals but also for other audio signal types such as music signals.
Some current coding technologies enable a coding system to code audio signals with different modes, for example, speech mode for coding speech signals and music mode for coding music signals. In the coding of audio signals, input signals are typically first digitized into signal samples, and the signal samples are grouped into signal frames. Before actual coding of a frame begins, the frame may be analyzed. Thereafter, the frame is encoded into a bit-stream using the appropriate coding mode, wherein a number of coding bits are allocated for coding of the signals in each frame. The coded bit-streams are transmitted, such as via a network, to a remote coding system, which converts the bit-streams back into audio signals. Alternatively, the coded signal may be stored. Whether the signal is to be transmitted or stored, the coding process typically is adapted to attempt to minimize the amount of data used to effectively code the signal, thus minimizing the required transmission bandwidth or storage space.
For the most part, multimode coding systems employ fixed rate coding techniques. Such coding systems are inefficient in that they do not take advantage of the finite human perceptual capability to allocate the usable data capacity. More recently, variable-rate coding strategies have received intensive study and some of these strategies provide gains over the fixed rate methods.
A typical variable-rate coding technique takes advantage of the nature of human aural perception by using a minimum number of bits to code the signal without substantially impacting the perceptual quality of the reconstructed audio signal. In this way, high perceptual quality is achieved while using a minimum number of bits.
Most existing variable-rate coding systems are optimized for short end-to-end delay, such as may be required in many real-time applications. However, there are delay-insensitive applications such as Internet streaming, books on tape, etc. Existing coding mechanisms used for these applications do not take advantage of the longer permissible delay, and as such do not minimize the average coding rate to the greatest extent possible.
The present invention provides a method and a system for use in a multimode coding system for minimizing the amount of data needed to transmit and/or store a coded representation of an audio signal. The coding technique employed to encode an interval of an audio signal is selected according to the characteristics of the current audio frame, as well as the statistical characteristics of a current sequence of audio frames, as well as the status of a bit-stream buffer provided for buffering the encoded bit-stream. A coding delay is effectively utilized to optimally allocate available average transmission or storage capacity for a sequence of frames, so that more capacity is available when needed for signals of higher complexity, while less capacity is utilized to code signal intervals that are perceptually less significant.
In an embodiment of the invention, a set of audio signal classes is defined based on possible intrinsic characteristics of the input audio signals. Each of the classes is then associated with an expected coding rate according to the relative importance of the signals of that class to the perceptual quality of the audio signals. The available coding rates will be based on a required average coding rate that is related to the configuration of the coding system and the environment that the coding system is operated in. Thus, audio signals of a particular class are expected to be coded at a particular coding rate associated with the particular class.
A sequence of input audio signal samples is queued in a look-ahead buffer as a sequence of audio frames, each frame consisting of a number of audio signal samples. Based on statistical characteristics of the audio signals therein, each frame is classified into one of the defined classes. The classified frames are then sequentially encoded by a multimode encoder, with each frame being encoded at a rate that is as close as possible to a target coding rate. The target coding rate is obtained by adjusting the expected coding rate, wherein the amount of the adjustment is determined with respect to the sequence of frames and the status of the bit-stream buffer. In determining the target coding rate, issues of overflow and underflow of the bit-stream buffer are addressed. Those of skill in the art will appreciate the correspondence between bits and bits per second, or “rate.” Accordingly, when the terms “bit(s)” and “rate(s)” are employed herein, those of skill in the art will appreciate that they may easily convert from one to the other by accounting for the time over which the bits are processed or transmitted as the case may be.
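The correspondence between bits and rate described above can be sketched as follows. This is only an illustration: the function names and the 20 ms frame duration used in the usage note are assumptions and do not come from the text.

```python
def bits_to_rate(bits_per_frame: float, frame_duration_s: float) -> float:
    """Convert the bits spent on one frame into a rate in bits per second."""
    if frame_duration_s <= 0:
        raise ValueError("frame duration must be positive")
    return bits_per_frame / frame_duration_s


def rate_to_bits(rate_bps: float, frame_duration_s: float) -> float:
    """Convert a rate in bits per second into the bit budget for one frame."""
    if frame_duration_s <= 0:
        raise ValueError("frame duration must be positive")
    return rate_bps * frame_duration_s
```

With an assumed frame duration of 20 ms, `bits_to_rate(160, 0.02)` corresponds to about 8000 bits per second, and `rate_to_bits(8000, 0.02)` corresponds to about 160 bits per frame.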
In a first example, a sequence of speech signals is received in a time interval and is queued in a look-ahead buffer as a sequence of speech frames. Each speech frame is then classified into one of four predefined classes: voiced frame, unvoiced frame, silence frame, and transition frame. Each class is associated with an expected coding rate. Voiced and transition frames are more complex and are thus associated with relatively high expected coding rates, while silence and unvoiced frames are associated with low expected coding rates.
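The association between the four speech classes and their expected coding rates might be represented as a simple lookup, as in the sketch below. The numeric rates are hypothetical placeholders chosen only to respect the ordering stated above (voiced and transition high, unvoiced and silence low); the names `FrameClass`, `EXPECTED_RATE_BPS`, and `mean_expected_rate` are likewise assumptions.

```python
from enum import Enum


class FrameClass(Enum):
    VOICED = "voiced"
    UNVOICED = "unvoiced"
    SILENCE = "silence"
    TRANSITION = "transition"


# Hypothetical expected rates in bits per second. The text gives only the
# relative ordering of the classes, not concrete values.
EXPECTED_RATE_BPS = {
    FrameClass.VOICED: 12000,
    FrameClass.TRANSITION: 12000,
    FrameClass.UNVOICED: 4000,
    FrameClass.SILENCE: 1000,
}


def mean_expected_rate(look_ahead: list[FrameClass]) -> float:
    """Average expected rate over the classified frames in a look-ahead buffer."""
    if not look_ahead:
        raise ValueError("look-ahead buffer is empty")
    return sum(EXPECTED_RATE_BPS[c] for c in look_ahead) / len(look_ahead)
```

Comparing this average against the required average coding rate gives a rough indication of whether the queued frames, coded at their expected rates, would run above or below the available capacity.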
In determining a target coding rate for a current coding frame, the class distribution over all classified frames in the look-ahead buffer is studied, the current status of the bit-stream buffer is observed and an expected status of the bit-stream buffer after coding all classified frames in the look-ahead buffer at their respective expected coding rates is estimated. Thus the determined target coding rates effectively avoid overflow and underflow of the bit-stream buffer. Given the determined target coding rate, a coding rate is selected from the available rates of the coding system to approximate the target coding rate.
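One way the steps above could fit together is sketched below. The text does not specify the adjustment rule, so the rule used here (steering the projected buffer fullness toward half of the buffer capacity and spreading the correction evenly over the queued frames) is purely an assumption, as are the function names, the parameter names, and the constant drain of the buffer per frame.

```python
def target_bits(expected_bits: list[float],
                buffer_fullness: float,
                buffer_capacity: float,
                drain_bits_per_frame: float) -> float:
    """Target bit budget for the current (first) frame in the look-ahead buffer.

    expected_bits holds the expected bit budget of every classified frame,
    current frame first. All quantities are in bits.
    """
    n = len(expected_bits)
    if n == 0:
        raise ValueError("look-ahead buffer is empty")
    # Estimated bit-stream buffer status after coding all queued frames at
    # their expected rates, assuming a constant drain per frame.
    projected = buffer_fullness + sum(expected_bits) - n * drain_bits_per_frame
    # Assumed rule: aim for a half-full buffer and share the correction
    # equally among the queued frames.
    desired = buffer_capacity / 2
    adjustment = (desired - projected) / n
    return max(0.0, expected_bits[0] + adjustment)


def nearest_available(target: float, available: list[float]) -> float:
    """Pick the available coding rate (or bit budget) closest to the target."""
    if not available:
        raise ValueError("no available rates")
    return min(available, key=lambda r: abs(r - target))
```

For example, with expected budgets `[240, 240, 80, 20]`, a buffer holding 500 of 1000 bits, and a drain of 160 bits per frame, the projected fullness is 440 bits, the per-frame adjustment is +15 bits, and the target of 255 bits is approximated by 240 when the available budgets are `[20, 80, 160, 240]`. When the projection is above the desired level the adjustment is negative, which pulls the target down and guards against overflow; when it is below, the adjustment is positive and guards against underflow.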
In a second example, a sequence of music signals is received by the multimode encoder. Procedures similar to those for estimating a target coding rate for speech signals are employed for music signals. For example, a music signal can be classified as transient music, stationary music, etc. However, the coding of music signals differs from the coding of speech signals in that, for music coding, the available coding rates of the multimode encoder vary continuously. Therefore, the target coding rate itself, rather than some approximation of it, is selected for coding a current music frame.
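The difference between the two examples in the final rate selection can be illustrated as follows. The function names and the idea of bounding the continuous range by a minimum and maximum rate are assumptions made for this sketch; the text states only that the music coding rates vary continuously.

```python
def select_discrete(target_bps: float, available_bps: list[float]) -> float:
    """Speech case: approximate the target with the closest available rate."""
    if not available_bps:
        raise ValueError("no available rates")
    return min(available_bps, key=lambda r: abs(r - target_bps))


def select_continuous(target_bps: float, min_bps: float, max_bps: float) -> float:
    """Music case: use the target itself, limited to the assumed encoder range."""
    if min_bps > max_bps:
        raise ValueError("min_bps must not exceed max_bps")
    return min(max(target_bps, min_bps), max_bps)
```

For a target of 13500 bits per second, the discrete selection over `[8000, 12000, 16000]` yields 12000, whereas the continuous selection over an assumed range of 8000 to 64000 keeps 13500 unchanged.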