Speech and audio coders typically encode signals by a combination of statistical redundancy removal and perceptual irrelevancy removal followed by quantization (encoding) of the remaining normalized parameters. With this combination, the majority of advanced speech and audio encoders today operate at rates of less than 1 or 2 bits/input-sample. However, even with advancements in statistical and irrelevancy removal techniques, the bitrates being considered, by definition, often force many normalized parameters to be coded at rates of less than 1 bit/scalar-parameter. At these rates, it is very difficult to increase the performance of quantizers without increasing complexity. It is also very difficult to control or take advantage of the perceptual effects of quantization and/or irrelevancy removal since the granularity of bit-assignments (resource assignments) and the performance of quantizers are limited, in particular when bits are assigned equally among statistically equivalent parameters.
Much of the compression seen in advanced coder design, including design of audio and speech coders, is due to a combination of the early stages of encoding where redundancy and irrelevancy are efficiently encoded and/or targeted for removal from the signal, and the latter stages of encoding which use efficient techniques to quantize the remaining statistically normalized and perceptually relevant parameters.
At low bit rates, the stages of redundancy and irrelevancy removal must be efficient. There are a number of examples of how the stages of redundancy and irrelevancy removal are made efficient. For example, the stages of redundancy and irrelevancy removal may be made efficient using a Linear Predictive Coefficient (LPC) Model of the gross (short-term) shape of the signal spectrum. This model is a highly compact representation that is used in many designs, e.g. in Code Excited Linear Predictive Coders, Sinusoidal Coders, and other coders like the TWIN-VQ and Transform Predictive Coders. The LPC model itself can be efficiently encoded using various state-of-the-art techniques, e.g. vector quantization and predictive quantization of Line Spectral Pair parameters.
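As an illustration of how compact the LPC representation is, the following is a minimal sketch of the classical Levinson-Durbin recursion, which derives a handful of predictor coefficients from the autocorrelation of a frame. The function names `autocorrelation` and `levinson_durbin` are hypothetical and chosen only for this example; the sketch is not taken from any of the coders named above.

```python
def autocorrelation(x, max_lag):
    """Return the (unnormalized) autocorrelation r[0..max_lag] of the frame x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation values r[0..order].

    Returns (a, err), where the predictor is
        x_hat[n] = a[0]*x[n-1] + a[1]*x[n-2] + ... + a[order-1]*x[n-order]
    and err is the remaining prediction-error energy.
    """
    a = [0.0] * (order + 1)  # a[0] is unused; a[1..order] hold the coefficients
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for stage i.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        # Update the coefficients of the lower-order predictor.
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


if __name__ == "__main__":
    # Autocorrelation of an idealized first-order process with correlation 0.9.
    coeffs, residual = levinson_durbin([1.0, 0.9, 0.81], order=2)
    print(coeffs, residual)
```

In this idealized case the recursion finds that a single coefficient (about 0.9) explains the spectral shape, so the second coefficient is essentially zero. A practical coder would then quantize the coefficients, for example after conversion to Line Spectral Pairs, which this sketch does not attempt.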
Another example of how the stages of redundancy and irrelevancy removal may be made efficient is using compact specifications of the harmonic or pitch structure in the signal. These structures represent redundant structure in the frequency domain or (long-term) redundant structure in the time domain. Common techniques often use a parameter specifying the periodicity of such structures, e.g., the distance between spectral peaks of frequency domain representations or the distance between quasi-stationary time-domain waveforms, using classic parameters such as a pitch delay (time domain) or a “delta-f” (frequency domain).
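A pitch delay of the kind described above can be estimated in several ways. The sketch below uses one common approach, a search for the lag that maximizes the normalized autocorrelation of the frame. The function name `estimate_pitch_delay` and the lag bounds are assumptions made for illustration only.

```python
import math


def estimate_pitch_delay(x, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the highest normalized autocorrelation."""
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        current = x[lag:]             # samples x[n]
        delayed = x[:len(x) - lag]    # samples x[n - lag]
        num = sum(c * d for c, d in zip(current, delayed))
        den = math.sqrt(sum(c * c for c in current) * sum(d * d for d in delayed))
        score = num / den if den > 0.0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


if __name__ == "__main__":
    # A synthetic periodic signal with a period of 40 samples.
    signal = [math.sin(2.0 * math.pi * n / 40.0) for n in range(400)]
    print(estimate_pitch_delay(signal, min_lag=20, max_lag=60))
```

The single integer returned is the compact parameter: it summarizes the long-term redundancy of the whole frame. Real speech is only quasi-periodic, so practical coders typically refine such an estimate further.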
An additional example of how the stages of redundancy and irrelevancy removal may be made efficient is using gain factors to explicitly encode the approximate value of signal energy in different time and/or frequency domain regions. Various techniques for encoding these gains can be used including scalar or vector quantization of gains or parametric techniques such as the use of the LPC model mentioned above. These gains are often then used to normalize the signal in different areas before further encoding.
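The gain-based normalization described above can be sketched as follows, assuming equal-width bands, an RMS gain per band, and uniform scalar quantization of the gain on a decibel scale. The band width, the 1.5 dB step, and all function names are hypothetical choices for this example.

```python
import math


def band_gains(coeffs, band_size):
    """Return the RMS gain of each consecutive band of band_size coefficients."""
    gains = []
    for start in range(0, len(coeffs), band_size):
        band = coeffs[start:start + band_size]
        gains.append(math.sqrt(sum(c * c for c in band) / len(band)))
    return gains


def quantize_gain_db(gain, step_db=1.5):
    """Scalar-quantize a gain on a dB scale; return (index, reconstructed gain)."""
    level_db = 20.0 * math.log10(max(gain, 1e-9))  # floor avoids log of zero
    index = round(level_db / step_db)
    return index, 10.0 ** (index * step_db / 20.0)


def normalize_by_gains(coeffs, band_size, step_db=1.5):
    """Return (gain indices, coefficients divided by the quantized band gains)."""
    indices, normalized = [], []
    for b, gain in enumerate(band_gains(coeffs, band_size)):
        index, gain_hat = quantize_gain_db(gain, step_db)
        indices.append(index)
        band = coeffs[b * band_size:(b + 1) * band_size]
        normalized.extend(c / gain_hat for c in band)
    return indices, normalized


if __name__ == "__main__":
    spectrum = [10.0, -10.0, 10.0, -10.0, 0.1, 0.1, -0.1, -0.1]
    print(normalize_by_gains(spectrum, band_size=4))
```

Dividing by the quantized gain, rather than the exact gain, keeps the encoder and decoder in step: the decoder only ever sees the gain indices. After normalization both bands have roughly unit energy, even though their original levels differ by 40 dB.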
Yet another example of how the stages of redundancy and irrelevancy removal may be made efficient is specifying a target noise/quantization level for different time/frequency regions. The levels are calculated by analyzing the spectral and time characteristics of the input signal. The level can be specified by many techniques, including explicitly through a bit-allocation or a noise-level parameter (such as a quantization step size) known at the encoder and at the decoder, or implicitly through the variable-length quantization of parameters in the encoder. The target levels themselves are often perceptually relevant and form the basis for some of the irrelevancy removal. Often these levels are specified in a gross manner, with a single target level applying to a given region (group of parameters) in time or frequency.
Once these techniques have reached the limit of their capabilities, e.g. in the extreme case where they have completely normalized the signal statistics and created a bit-allocation or noise-level parameter allocation on these normalized parameters, they can no longer be used to further improve the efficiency of encoding.
It should be noted that even with the best of the aforementioned redundancy and irrelevancy removal techniques, the normalized parameters may have variations within them. The presence of variations in subsequences of parameters is well known in some engineering fields. In particular, at higher parameter dimensions, the variations have been noted in fields such as Information Theory. Information Theory notes that subsequences of statistically identical scalars (random variables) can be divided into two groups: one group in which the subsequences conform to a “typical” behavior based on a relevant measure, and another “atypical” group in which the sequences deviate from that “typical” behavior based on the same measure. A precise and complete division of sequences into these two groups is required for the purposes of theoretical analyses in Information Theory.
However, one observation used by Information Theory is that the probability of encountering these latter “atypical” sequences becomes negligible as the subsequences themselves increase in length, i.e. dimension. The result is that the “atypical” subsequences (and their effect and precise handling) are discarded in asymptotic theoretical analyses of Information Theory. In fact, the theoretical analyses use a very inefficient handling of these “atypical” subsequences, the inefficiency of which is irrelevant asymptotically. At lower dimensions, the main issue is whether or not these variations are significant enough to merit more careful handling, or whether they can or should also be ignored.
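The shrinking probability of “atypical” sequences can be checked numerically. The sketch below assumes an independent, identically distributed binary source and uses the standard weak-typicality measure, under which a sequence is “typical” when its per-symbol sample entropy lies within `eps` of the source entropy. The source model, the tolerance, and the function name `atypical_probability` are assumptions made for this illustration.

```python
import math


def atypical_probability(p, n, eps):
    """Probability that a length-n Bernoulli(p) sequence is not weakly typical.

    Requires 0 < p < 1. A sequence with k ones is atypical when its sample
    entropy, -(1/n) * log2 P(sequence), differs from the source entropy by more
    than eps.
    """
    log_p, log_q = math.log2(p), math.log2(1.0 - p)
    entropy = -(p * log_p + (1.0 - p) * log_q)
    total = 0.0
    for k in range(n + 1):
        sample_entropy = -(k * log_p + (n - k) * log_q) / n
        if abs(sample_entropy - entropy) > eps:
            # Binomial probability of k ones, computed in the log domain
            # to avoid overflow and underflow for large n.
            log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                       + k * math.log(p) + (n - k) * math.log(1.0 - p))
            total += math.exp(log_pmf)
    return total


if __name__ == "__main__":
    for length in (10, 100, 1000):
        print(length, atypical_probability(0.2, length, eps=0.1))
```

Under these assumptions, most sequences of length 10 are atypical, while at length 1000 the atypical probability is very small. This is the low-dimension regime in which the variations may be too frequent to ignore.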
Local variations in signal statistics have been implicitly (indirectly) handled previously using higher dimensional vector quantizers, e.g. a quantizer whose dimension can be as large as the entire length of the sequences being considered. Therefore, while the codewords in a high-dimensional quantizer may, or may not, reflect some of the local average variations within the sequence, there is no explicit consideration of these variations. There are many approaches to using higher dimensional vector quantizers. The most basic is the straightforward (brute-force) approach of generating a quantizer whose codebook consists of high-dimensional vectors. This is the most complex of the approaches but the one with the best performance in terms of rate-distortion tradeoffs.
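The brute-force approach amounts to an exhaustive nearest-codeword search over a stored codebook. A minimal sketch follows, assuming a squared-error distortion measure; the function names and the toy codebook are hypothetical.

```python
import math


def vq_encode(target, codebook):
    """Return the index of the codeword closest to target in squared error."""
    def squared_error(codeword):
        return sum((t - c) ** 2 for t, c in zip(target, codeword))
    return min(range(len(codebook)), key=lambda i: squared_error(codebook[i]))


def bits_per_parameter(codebook):
    """Rate of a fixed-length index, in bits per scalar parameter."""
    dimension = len(codebook[0])
    return math.log2(len(codebook)) / dimension


if __name__ == "__main__":
    # Toy two-dimensional codebook; real codebooks are trained on data.
    codebook = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
    print(vq_encode([0.9, -1.2], codebook), bits_per_parameter(codebook))
```

The cost of this approach is visible in the search loop: at a fixed rate of R bits per parameter, a codebook for dimension N holds 2^(R·N) codewords, so both storage and search time grow exponentially with dimension.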
There are also other, less complex approaches that can be used to approximate the straightforward high-dimensional quantizer approach. One approach is to further model the signal (e.g. using an assumed marginal probability density function) and to then do the quantization using a parameterized high-dimensional quantizer. A parameterized quantizer does not necessarily need a stored codebook since it assumes a trivial signal statistic (such as a uniform distribution). An example of a parameterization is a Trellis structure. Such structures also allow for easy searching during encoding. There is also a multitude of other techniques known as structured quantizers.
There are also methods to more directly handle variations within a target vector of interest. There are numerous methods that are used to examine a target vector and produce criteria on how the vector should be encoded. For example, an MPEG-type coder takes a vector of MDCT coefficients, analyzes the input signal, and produces fidelity criteria for different groups of MDCT coefficients. Generally, a group of coefficients spans a certain support area in time and frequency. Coders like the transform predictive coder and basic transform coders use information about the signal energy in a given subband to infer a bit-allocation for that band.
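One simple way to infer a bit-allocation from subband energies is a greedy rule: give each successive bit to the band whose estimated quantization noise is currently largest, on the assumption that one extra bit per parameter lowers the noise energy by a factor of four (about 6 dB). The sketch below illustrates that textbook rule only. It is not the allocation procedure of any particular coder named above, and the function name `allocate_bits` is hypothetical.

```python
def allocate_bits(band_energies, total_bits):
    """Greedily distribute total_bits across bands according to their energies.

    Returns a list with the number of bits assigned to each band.
    """
    bits = [0] * len(band_energies)
    noise = [float(e) for e in band_energies]  # noise estimate with zero bits
    for _ in range(total_bits):
        # Pick the band with the largest remaining noise (first one on ties).
        worst = max(range(len(noise)), key=noise.__getitem__)
        bits[worst] += 1
        noise[worst] /= 4.0  # assumed 6 dB noise reduction per added bit
    return bits


if __name__ == "__main__":
    print(allocate_bits([16.0, 4.0, 1.0], total_bits=3))
```

With energies of 16, 4 and 1 and a budget of three bits, the rule assigns two bits, one bit and zero bits respectively. The granularity problem at low rates is visible here: the weakest band receives nothing at all.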
In fact, the creation of criteria is the basis for most speech and audio coding schemes that adapt to the signal. The criteria's creation is the function of earlier stages of the coding algorithm dealing with redundancy removal and irrelevancy removal. These stages produce fidelity criteria for each target sequence “x” of parameters. A single target “x” could represent a single subband or scale-factor band in coders. In general, there are many such “x” in a given frame of speech or audio, each “x” having its own fidelity criteria. These fidelity criteria themselves can be functions of the gross statistical and irrelevancy variations noted by earlier schemes.
Statistical variations within a sequence of normalized vectors can be exploited by using variable-length quantization, e.g. Huffman codes. The codeword assigned to each target vector during quantization is represented by a variable-length code. The code used tends to be longer for codewords that are used less frequently, and shorter for codewords that are used more frequently. Essentially, the situation can be that “typical” codewords are represented more efficiently and “atypical” codewords less efficiently. On average the number of bits used to describe codewords is less than if a fixed-length code (a fixed number of bits) is used to represent codeword indices.
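The saving from a variable-length code can be illustrated with a small Huffman construction over codeword indices. The sketch below computes only the code lengths, which is enough to compare the average rate with a fixed-length code. The usage counts and function names are hypothetical.

```python
import heapq


def huffman_code_lengths(counts):
    """Return {symbol: Huffman code length in bits} for the given usage counts."""
    if len(counts) == 1:
        return {symbol: 1 for symbol in counts}
    # Heap entries are (weight, tie-breaker, symbols in this subtree).
    heap = [(weight, i, [symbol]) for i, (symbol, weight) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = {symbol: 0 for symbol in counts}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, group1 = heapq.heappop(heap)
        w2, _, group2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol beneath them.
        for symbol in group1 + group2:
            lengths[symbol] += 1
        heapq.heappush(heap, (w1 + w2, next_id, group1 + group2))
        next_id += 1
    return lengths


def average_bits(counts, lengths):
    """Average code length in bits per encoded codeword index."""
    total = sum(counts.values())
    return sum(counts[s] * lengths[s] for s in counts) / total


if __name__ == "__main__":
    usage = {"a": 8, "b": 4, "c": 2, "d": 2}  # how often each codeword is chosen
    lengths = huffman_code_lengths(usage)
    print(lengths, average_bits(usage, lengths))
```

For these counts the frequently used codeword "a" receives a 1-bit code and the rare codewords "c" and "d" receive 3-bit codes, giving an average of 1.75 bits per index compared with 2 bits for a fixed-length code over four codewords.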
Finally, in recent work, there is discussion about the balance between specifying only the values within a sequence of variables, with no information on the order (location) in which they occur, and specifying only the order, with no information on the values. In more recent work, the idea of specifying only “partial information” on the order is also alluded to. The work does show that ignoring either type of information can have benefits, provided it can be justified that either the order or the values of the variables are not important. In work on speech and audio coders, both the order and the values are important, though it could be that different values have different levels of importance. This is not addressed in the referenced work. For more information, see L. Varshney and V. K. Goyal, “Ordered and Disordered Source Coding”, Information Theory and Applications Workshop, Feb. 6-10, 2006 and L. Varshney and V. K. Goyal, “Toward a Source Coding Theory for Sets”, Data Compression Conference, March 2005.