Speech and audio coders typically encode signals using combinations of statistical redundancy removal, perceptual irrelevancy removal, and efficient quantization techniques. With this combination, the majority of advanced speech and audio encoders today operate at rates of less than 1 or 2 bits/input-sample. This often means that many parameters are quantized on average at very low rates below 1 to 2 bits/parameter. At such low rates, there can be challenges in particular in the quantization and irrelevancy removal steps.
The quantization step refers to the process of converting parameters that represent the speech or audio into one or more finite sequences of bits. A parameter can be quantized individually, in which case, for purposes herein, it is represented by a sequence of bits that contains no information on other parameters. If a parameter is represented by “s” bits, then there are at most 2^s alternatives one could consider for the representation. Such alternatives may be compiled in what is known as a “codebook”. For single parameter quantization, the entries of the codebook are scalars that represent the different alternatives for representing the original parameter.
Parameters can also be quantized jointly whereby a sequence of bits refers to a group of two or more parameters. In such a case, codebook entries are multi-dimensional entries, with each being a representation of multiple parameters. One realization of this process is a “Vector Quantizer”. Joint quantization often leads to more efficient quantization, though often there can be complexity penalties since now the number of bits “s” is larger given it is the sum of bits over all parameters.
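By way of illustration only, joint quantization with a codebook can be sketched as a nearest-entry search. The function name vq_encode and the four-entry codebook below are hypothetical and are not taken from any particular coder; a 2-bit index selects one of 2^2 two-dimensional entries.

```python
from typing import List, Sequence, Tuple


def vq_encode(x: Sequence[float], codebook: List[Sequence[float]]) -> Tuple[int, float]:
    """Return (index, squared error) of the codebook entry closest to x."""
    best_index, best_err = 0, float("inf")
    for i, entry in enumerate(codebook):
        # Squared Euclidean distance between the input vector and this entry.
        err = sum((a - b) ** 2 for a, b in zip(x, entry))
        if err < best_err:
            best_index, best_err = i, err
    return best_index, best_err


# Hypothetical 2-bit codebook: 2^2 = 4 entries, each representing a pair of parameters.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

index, err = vq_encode((0.9, 0.2), CODEBOOK)
print(index, err)  # the index is what would be sent, using 2 bits for both parameters
```

The search visits every entry, which is the source of the complexity penalty noted above: the number of entries grows as 2^s with the total number of bits “s”.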
The bits generated by quantization are sent to the decoder and are used to recover an approximation to the original speech/audio parameter(s). When the approximation to this parameter differs from the original parameter, the difference can be considered as noise added to the original parameter. This noise is the quantization noise referred to herein.
For audio and speech, such quantization noise may be perceived on playback as a distortion in the signal. This is because the decoded signal is in general different from the original signal because the quantized parameters are different from the original parameters.
Note, the signal parameters that are actually quantized can take many forms. Some of the most popular parameters used are frequency-domain samples/coefficients, e.g., as obtained by either a frequency-domain transform like a Modified Discrete Cosine Transform (MDCT) or filter-bank, and/or time-domain samples/coefficients. In such cases, the noise is perceived as distortion effects in different time and/or frequency regions.
The process of irrelevancy removal refers to the process whereby the noise is given a desired characteristic so that it is either imperceptible or only minimally perceptible on playback. For example, the noise may be at a low enough level that the human auditory system is not able to notice it during playback.
Note, in one realization of part of such an irrelevancy removal process, one can ignore some parameters entirely in the quantization process. This is the case in which zero bits are sent for the parameter(s). At the decoder, such a parameter is either ignored in the decoding process or set to some known fixed or random value. In all cases, there is quantization noise introduced into this parameter by ignoring such a parameter.
Irrelevancy removal can also be the process of directing and sending a sufficient approximation to the original parameter, i.e. deciding on and sending the correct number of bits, so that the noise is at a given desired level and thus the desired perceptual effect is achieved during playback.
The process of redundancy removal refers to the process of creating a parameter representation that allows for an efficient quantization of the signal. For example, the representation may facilitate an efficient distribution of bits to different parameters. Some representations, for instance, concentrate the original signal energy into as few parameters as possible. Representations such as the MDCT have such a property when applied to many audio and speech signals. This allows bit resources to be concentrated into a few parameters, with other less important parameters receiving fewer or no bits.
This MDCT representation (and similar types of frequency domain representations) also has an added benefit because it represents the frequency content in the audio signal. Perceptual distortion as a function of frequency content is a subject studied in great detail. Therefore, such representations are also useful for irrelevancy removal.
In designing a good audio/speech coder, there are strong inter-dependencies in the relative effectiveness of the quantization, redundancy removal and irrelevancy removal processes. For example, in selecting a quantization option (if there are many to choose from) one may try to predict what type or level of noise the quantization process may generate. For example the expected (average) noise each quantization option will introduce could be used to predict the potential perceptual effect each of the options may have. This can lead to a process whereby coding (quantization) decisions/options are selected up-front, before the quantization step, in a signal adaptive manner based on average expectations.
Decisions generally can be made up-front if one expects the quantization process to have a good or generally “well behaved” predictable outcome. For example, a designer may know ahead of time that the encoder has enough bits to quantize the signal sufficiently well so that the quantized signal will have, or often have, a very low, if not imperceptible, amount of quantization noise. Such a well-behaved scenario may be, for example, the situation of quantizing a signal at a sufficiently high bit rate. It may be a scenario where the audio signal is such that it can be represented with a small number of parameters. In such cases, the processes of quantization, redundancy removal and irrelevancy removal can work semi-independently knowing that each is able to reach their respective desired outcomes.
For example, in such a scenario, the irrelevancy removal process may direct the quantization process using a pre-calculated perceptually relevant “noise threshold”. Some audio coders calculate, before the parameter quantization step, a “perceptual noise threshold” (set of upper-bound values) that quantization noise must adhere to for each parameter, e.g. each MDCT coefficient must not have noise exceeding its respective threshold. This threshold (often a vector of values) specifies for each parameter the desired limit on the quantization noise for the parameter. Knowing ahead of time that such a threshold is often achievable makes such an approach feasible.
One refinement to this process involves minor modifications to this threshold if by chance the encoding does not successfully attain the threshold for any parameter. Take for example the case where a group of parameters has to achieve a noise threshold (upper-bound) of “Delta”, and the coder only has “b” bits to do so. One such process is illustrated in FIG. 1A. If one uses a uniform scalar quantizer with step size “Delta,” the quantization step assigns for each parameter an integer that specifies how many “Delta” steps it takes to give a good approximation of the value. For example, if a parameter has value −1.33, and Delta is 0.50, one could specify that it will take negative three “Delta” steps to approximate the signal. Here the representation of the original parameter is −1.50, and the noise level is the absolute value of the difference between −1.50 and −1.33, i.e. 0.17, which is less than Delta.
In the example mentioned above, the numerical index to which the original parameter is mapped is −3. This number is then mapped to a sequence of bits. One can map indices to a fixed number of bits, e.g. 3 bits would be sufficient to represent 8 unique integer values such as −3, −2, −1, 0, 1, 2, 3, 4. Alternatively, a variable number of bits could be used, exploiting the fact that some integer values are used more frequently, e.g. as done in Huffman coding, where each variable bit representation can be uniquely parsed from the stream. Such techniques are widely known by those skilled in the art of audio coding and are in fact used frequently in audio coder designs.
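The worked example above (value −1.33, Delta 0.50, index −3) can be reproduced with a minimal sketch of a uniform scalar quantizer. The function names and the offset-binary mapping for the fixed 3-bit representation are assumptions made for illustration, not a description of any specific coder.

```python
import math


def quantize_uniform(value: float, delta: float) -> int:
    """Index of the nearest multiple of delta (number of Delta steps)."""
    return math.floor(value / delta + 0.5)


def dequantize_uniform(index: int, delta: float) -> float:
    """Reconstruction corresponding to an index."""
    return index * delta


def fixed_length_bits(index: int, lowest: int = -3, bits: int = 3) -> str:
    """Map an index in [lowest, lowest + 2^bits - 1] to a fixed-length bit string.

    The offset-binary mapping is an assumed convention for this sketch.
    """
    offset = index - lowest
    if not 0 <= offset < 2 ** bits:
        raise ValueError("index outside the range representable with the given bits")
    return format(offset, "0{}b".format(bits))


delta = 0.50
index = quantize_uniform(-1.33, delta)           # number of Delta steps
reconstruction = dequantize_uniform(index, delta)
noise = abs(reconstruction - (-1.33))            # quantization noise magnitude

print(index, reconstruction, round(noise, 2), fixed_length_bits(index))
```

With nearest-step rounding as sketched here, the noise magnitude never exceeds half of Delta, which is consistent with the 0.17 figure in the example.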
However, the main issue is that the number of bits needed to ensure the noise on each parameter is less than “Delta” is often not known until all the parameters are coded. The number of bits used can be variable if variable length coding techniques such as Huffman coding are used. It can be that at the end of quantization with respect to “Delta” the number of bits exceeds the maximum “b” the encoder has for the process.
To solve this problem, at times one can make a slight modification to the threshold (e.g., increase the acceptable noise level by a factor), and re-code. Referring to FIG. 1A, an audio coder may test different levels “Delta”, in particular an increasing sequence of Delta values, to find a value that achieves an acceptable total number of bits “n(1)+n(2)+ . . . +n(N)”. In general, a larger “Delta” requires fewer total bits. This classic iterative process is often termed a “rate-loop” in some audio coder designs. It makes sense only if such slight modifications to the original threshold also result in a meaningful new (easier to attain) perceptual threshold.
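A minimal sketch of such a rate-loop follows. The variable-length cost model in index_cost_bits (one bit for zero, otherwise magnitude plus two bits) is a hypothetical stand-in for a real entropy code such as a Huffman table, and the growth factor of 1.25 is likewise an assumption chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def index_cost_bits(index: int) -> int:
    """Hypothetical variable-length cost n(j): small magnitudes are cheaper."""
    return 1 if index == 0 else abs(index) + 2


def rate_loop(params: Sequence[float], delta: float, budget_bits: int,
              growth: float = 1.25, max_iters: int = 64) -> Tuple[float, List[int], int]:
    """Increase Delta until n(1)+n(2)+...+n(N) fits within the bit budget b."""
    for _ in range(max_iters):
        indices = [math.floor(x / delta + 0.5) for x in params]
        total = sum(index_cost_bits(i) for i in indices)
        if total <= budget_bits:
            return delta, indices, total
        delta *= growth  # relax the noise threshold and re-code
    raise ValueError("bit budget not met within max_iters")


delta_final, indices, total = rate_loop([-1.33, 0.4, 2.1, 0.05], delta=0.5, budget_bits=10)
print(delta_final, indices, total)
```

Note that the loop only knows the total bit count after coding all parameters at a given Delta, which is the uncertainty described above.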
However, as mentioned, such processes may be attractive only when the coding steps, in particular quantization, are well-behaved. At very low bit-rates, accurately predicting the exact joint behavior of the three processes ahead of time, in particular the joint behavior of the irrelevancy removal and quantization steps, may be difficult. One reason for this is the potentially very high levels (and randomness) of the noise introduced by the quantization process at low rates. If, indeed, the actual quantization noise introduced is both very random and at a high level for a given quantization option, an accurate assessment of the true perceptual effect of a quantization option may not be possible until after quantization. In particular, the perceptual assessment has to be done considering noise that varies in level from parameter to parameter above the threshold. In fact, in such cases, simple modifications to an original target perceptual threshold, such as increasing “Delta”, may not make sense. Specifically, there may be no single target perceptual threshold or set of thresholds that one could easily pre-determine to be relevant to the final quantization outcome. This means that some classical approaches of selecting options a priori based on expectations (average behavior) and predictions may not be efficient. The dependence and complications of perception are discussed in more detail below.
As mentioned above, the processes of statistical redundancy removal, irrelevancy removal and quantization are quite inter-dependent. It should be mentioned that it is not necessarily easy to fix this issue by simply improving the redundancy removal step. For example, if the redundancy removal step is very efficient it often means that most of the signal representation is distilled into a few parameters. For example, most of the energy of the original “N” speech/audio signal parameters is now concentrated mainly into “T” new signal parameters by this step (where T is much less than N). When this happens, it helps the quantization and irrelevancy removal steps, but at low rates, often one cannot quantize all the new “T” parameters to a very high fidelity. While one can consider multiple redundancy removal options, in the end the joint operation of irrelevancy removal and quantization is very important at low rates.
Perceptual principles guide the irrelevancy removal step and thus quantization. With such principles, a prediction as to how noise will be perceived for each parameter, or jointly across many parameters, may be made. One realization of such a process is the “absolute perceptual threshold”, which is very relevant to the approach mentioned previously. In this case, at low noise levels, one may simply have to calculate a threshold that reflects decisions as to whether or not the human auditory system can perceive noise above/below such selected level(s). These levels are signal adaptive. In such a case, the perceptual threshold specifies a set of quantization noise levels for parameters below which noise is not perceived, or is perceived at a very low acceptable level. Since the level for each parameter represents the point of making a binary decision, it greatly simplifies the computation. Quantization is simplified since it only has to ensure the levels are not violated, or violated only infrequently, to result in a desirable encoding of the speech or audio signal. However, doing calculations to generate such an “absolute perceptual threshold”, even for such assumed low targeted noise levels, can already be very computationally intensive.
Calculating the perceptual effect for higher levels of noise, noise that will violate strongly the “absolute perceptual threshold” for one or more parameters, is more complex since not only does one have to make a determination if the noise is perceived, but also how and/or to what level it is perceived. This situation is the situation of “Supra-Threshold” noise, i.e. noise above the threshold of perception. In this case, the exact levels of noise achieved for each parameter are important beyond simply their relation to the absolute threshold. Also, supra-threshold noise on one parameter often interacts perceptually with noise from a different parameter, in particular if the noise they introduce is sufficiently close in time and/or frequency. Thus one cannot often determine accurately the perceptual effect of Supra-Threshold noise until after quantization. It implies that when operating in the “Supra-Threshold” region parameters cannot be independently quantized, e.g. quantized in a manner such as testing each relative to its own “threshold”.
With a coder in which quantization noise conforms to an “absolute perceptual threshold,” the coder can calculate a perceptual threshold or target set of levels in the irrelevancy removal step before the quantization process. The threshold is then used as a target for the quantization process without knowing ahead of time what the quantization process will achieve. This is a realization of what is known as an “Open Loop” process. Thus, this process has the advantage that some decisions are made up-front (given the mathematical complexity) and never revisited, or are only modified in simplistic ways such as raising a threshold. For purposes herein, this is referred to as an “Open Loop Perceptual Process” to distinguish it from other processes that can also be Open Loop.
However at low bit-rates, as mentioned before, it can be difficult or impossible to accurately predict ahead of the quantization process the exact joint performance of the irrelevancy removal and quantization steps. The “Open Loop Perceptual” process is less attractive in this scenario. This is because the noise is now perceptible, i.e. supra-threshold as mentioned previously, and the quantization process can behave in very random ways, and good quantization by nature has to be a joint encoding of parameters. In this case, the exact level, or an accurate estimate of the level, of the quantization noise often needs to be known before a perceptual determination of performance can be made. The difficulty is compounded by the inherently high levels and variability of the noise introduced by the quantization process at low bit-rates. Given this, any prior estimate of the introduced noise may be of little use since the estimate may often be inaccurate.
Note that if estimates of expected levels are not possible, one could also use the worst-case value, which can lead to over-conservative decisions and further inefficiencies.
To solve this problem, a “Closed Loop” process is used. In this case, multiple assumptions are made and/or multiple quantization options are performed, and each is assessed perceptually after the quantization step, when it is known what quantization noise results from each option.
In this case, in a “Closed Loop Perceptual Process,” one could test all quantization options, calculating the exact noise each option produces, and then select the one with the best perceptual advantage. Some coders do just that. For example, one could use a number of different heuristics to modify an underlying perceptual threshold and/or use a number of different quantization representations and hope that one produces a combination where the quantization step achieves the target threshold.
In fact, at the extreme, for a given number of bits “b” allocated to a group of parameters, there are potentially up to 2^b threshold and/or quantization options one could consider, each possibly with a very random and un-predictable noise pattern, and thus perceptual effect, for a given signal. However, for computational complexity reasons, testing all quantization options and their actual perceptual effects is often not practical.
For example, quantizing 40 parameters at 1 bit/parameter means there can be up to 2^40 options. Consider that audio coders are often quantizing many thousands of parameters a second, and for each option, in the extreme, a perceptual assessment may have to be done on all groups since all have high “Supra-Threshold” noise levels.
Because of these reasons, a “Closed Loop Perceptual Process” design by nature cannot be an exhaustive search on 2^b independent alternatives.
One way to use a Closed Loop process is to greatly simplify the complex supra-threshold model. One way to do this is to replace the supra-threshold model by simple approximate criteria. One such criterion often used is a signal adaptive weighted mean square error (WMSE) distortion criterion. This is what is done in many speech coding designs, e.g. the Algebraic Code Excited Linear Prediction (ACELP) designs used in ITU-T Rec. G.729 and other ITU-T and ETSI standards. With simplified MSE-like criteria, coders can use classic MSE-based procedures for searching classical vector quantization codebooks. Such codebooks, like “Algebraic Structured” codebooks, or “Tree”, “Product” or “Multi-Stage” vector quantizers, are designed to be able to search 2^b alternatives efficiently by discarding a large fraction of the 2^b alternatives in the search process.
In this case, however, many vector quantization structures often do not make very explicit links to how noise may be allocated to different parameters. It is often a blind design relying on the WMSE criterion to help sort out the possibilities. So while the complexity of the search process can be reduced by structure in the codebook design, effectively a non-trivial fraction of the 2^b alternatives have to be tested. For example, in a two-stage codebook design with b/2 bits at each stage, one still has to consider on the order of 2^(b/2)+2^(b/2) alternatives. That is, without explicit control of noise within the codebook design, to ensure efficient quantization, one needs to ensure sufficient numbers of alternatives are considered and searched. This necessitates the use of a simplified perceptual criterion, such as Mean Square Error based measures, to enable this search, and much work in the field is spent on coming up with designs that do a search efficiently yet still perform well, even with a WMSE criterion. Designs that perform well with more accurate and complex criteria often are not, and cannot be, considered.
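To give a sense of scale, the counts mentioned above can be evaluated for b = 40, the figure used in the earlier example of 40 parameters at 1 bit/parameter. The sketch assumes a sequential two-stage search in which each stage is searched once; the function names are hypothetical.

```python
def exhaustive_alternatives(b: int) -> int:
    """Number of alternatives in an unstructured b-bit codebook."""
    return 2 ** b


def two_stage_alternatives(b: int) -> int:
    """Alternatives visited when the b bits are split across two stages
    and each stage is searched once (assumed sequential search)."""
    half = b // 2
    return 2 ** half + 2 ** (b - half)


b = 40
print(exhaustive_alternatives(b))  # 1099511627776
print(two_stage_alternatives(b))   # 2097152
```

Even the reduced count is on the order of millions of candidates per group of parameters, which is why a simple criterion such as WMSE is typically needed to evaluate each candidate.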
It should also be noted that when coders use a weighted mean square error (WMSE) measure the measure implicitly assumes that the actual noise, in the end of the search, is distributed as the weighting directs, with areas weighted more heavily hopefully directed to having less noise. However, in practice, the exact level of the noise for different parameters may or may not follow the general trend that is hoped for by the weighting, in particular at low rates.
See the example in FIG. 1B. The weighted measure, and the design of the codebook for such a measure, simplifies and hides the precise effect of individual noise levels through the use of a summation (within the MSE criteria) that expects the noise to approximately behave as desired.
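The way a summed criterion hides individual noise levels can be seen in a small numerical sketch. The weights and values below are invented for illustration and do not correspond to FIG. 1B; the normalization by the number of parameters is also an assumption.

```python
from typing import Sequence


def wmse(original: Sequence[float], quantized: Sequence[float],
         weights: Sequence[float]) -> float:
    """Weighted mean square error over a group of parameters."""
    n = len(original)
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, original, quantized)) / n


original = [0.0, 0.0]
weights = [1.0, 1.0]

# Option A spreads the noise over both parameters.
option_a = [0.3, 0.4]
# Option B places all of the noise on the first parameter.
option_b = [0.5, 0.0]

print(wmse(original, option_a, weights))
print(wmse(original, option_b, weights))
```

Both options score essentially the same WMSE, although their per-parameter noise levels are quite different, so the criterion alone cannot distinguish between them.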
The number of search possibilities has been reduced in at least one prior art implementation which will be discussed later. In contrast, the codebook structure in ACELP and other classic vector quantizer designs cannot be used with complex perceptual criteria even though its structure allows for searches that effectively reduce the number of alternatives to less than 2^b. By nature, the search only works efficiently when coupled directly with MSE-like criteria. An example is the ACELP-based search mechanism used in ITU-T Rec. G.729, whereby 40 residual time samples are jointly quantized with a signal adaptive WMSE criterion.
It is also important to re-iterate that most “rate loop” searches within audio coders deal with the issue of bitrate, and only weakly with optimizing perceptual performance, since an “absolute perceptual threshold” is necessarily modified by simple means in the rate loop. Here the rate-loop does have a “Closed Loop” element, but by nature the search is more about rate-distortion optimization than carefully optimizing the resulting supra-threshold perceptual effects of the now perceptible quantization noise. Such effects can only be accurately predicted after the exact noise levels are known and are not simply assessed by checking noise levels against thresholds.
In short, both classical approaches above in speech and audio coding can have:
    a) inherent inefficiencies as they simplify the distortion metric, e.g., using a WMSE though true perception is more complex, and/or
    b) overly conservative constraints limiting options, e.g., imposing a maximum uniform level within a scale-factor band, and/or
    c) overly conservative assumptions on the noise levels, e.g., using the maximum level rather than the actual or “closer to actual” mean level, and/or
    d) errors between their intended and actual noise allocations, e.g.,
        a. errors are not distributed with the shapes/characteristics one may assume by using a WMSE criterion,
        b. errors may in fact vary so much that expected or predicted levels may have very little use, and/or
    e) very little explicit control of the noise level assigned to individual parameters when jointly coding multiple parameters by vector quantization or structured codebook representations.
This can happen especially when operating at low bit rates. As a result, there are inefficiencies when coders attempt to link perceptual performance with predictions, or use simplistic assumptions when directing quantization.
Recently, a class of new quantization options has emerged, termed partial-order quantization schemes, which have the property of being able to purposefully create non-trivial patterns of bit allocations (and thus estimated noise allocations) across a vector of parameters.
For a “b”-bit quantization scheme, a proto-type pattern “P” is used to generate 2^c << 2^b possible patterns, all related by a limited permutation of the proto-type pattern, much like a permutation code, though in this case one is permuting bit-assignments, not elements of codewords as in the classic “Permutation Codes”. For example, a pattern “P”
P=p(1),p(2), . . . ,p(N)
has elements “p(j)” that each define how a particular parameter from the “N” total parameters is to be quantized. One may often consider only a subset of such permutations, e.g. maybe just two such permutations as:
p(2),p(1),p(3),p(4),p(5), . . . ,p(N) and p(3),p(1),p(2),p(4),p(5), . . . ,p(N)
One motivation for limitation of the permutations (partial ordering) comes from the fact that often p(j)=p(i) for some i and j, thus making some permutations equivalent. For example, in the above, if p(1)=p(2)=p(3), then the two above patterns are the same and would not be distinguished as different permutations.
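The equivalence of permutations when elements repeat can be checked with a short sketch. The five-element pattern below is hypothetical; it is chosen so that p(1)=p(2)=p(3), matching the case described above.

```python
from itertools import permutations
from typing import Set, Tuple


def distinct_patterns(pattern: Tuple[int, ...]) -> Set[Tuple[int, ...]]:
    """All distinguishable orderings of a pattern with repeated elements."""
    return set(permutations(pattern))


# Hypothetical bit-allocation pattern with p(1) = p(2) = p(3).
P = (2, 2, 2, 1, 0)

# The two permutations listed above: p(2),p(1),p(3),... and p(3),p(1),p(2),...
swap_a = (P[1], P[0]) + P[2:]
swap_b = (P[2], P[0], P[1]) + P[3:]

print(swap_a == swap_b == P)      # the two permutations coincide with P itself
print(len(distinct_patterns(P)))  # 20 distinguishable patterns rather than 5! = 120
```

The count of 20 equals 5!/3!, since reordering the three equal elements never produces a new pattern.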
More generally, one can limit the permutations for other reasons, e.g. to permutations that, for instance, concentrate (or spread) higher values p(j) in the new pattern (permutation). In the case that “p(j)” is a bit-allocation, it has been shown, in fact, that at low bitrates using such non-trivial patterns can be more efficient than other quantization techniques that create equal patterns of bit allocations (where p(i)=p(j) for all i,j).
Such equal patterns of bit allocations can be equivalent to equal patterns of estimated noise allocation. For example, if the p(i)'s are noise allocations, then p(i)=p(j)=“Delta” is an assignment that creates a target similar to that in FIG. 1. In all cases the number of unique permutations considered, 2^c, is less than (often much less than) N!
If the patterns are bit-allocations, and the quantization process of each parameter is constrained to use the given number of allocated bits for a parameter, then the total number of bits used by the allocation is known ahead of time, i.e. the pattern uses p(1)+p(2)+ . . . +p(N) bits. Therefore, there is no uncertainty in the number of “Deltas” used, and thus bits spent, as in the process in FIG. 1A.
The procedure also has simplifications in searching for good permutations. One way to implement the quantization procedure is not to permute the bit (or noise) allocation but to permute a target vector X=x(1), x(2), . . . , x(N), keeping the quantization pattern P=p(1), p(2), . . . , p(N) fixed. The term “partial order” arises from the fact that it is often good to permute the x(j)'s by partially ordering the x(j)'s in terms of energy or perceptual relevance.
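One possible reading of this procedure is sketched below, under stated assumptions: the pattern P is taken to be non-increasing, parameters are ranked by energy x(j)^2, and within each run of equal allocations the original order is kept, since reordering there changes nothing. The function name partial_order_indices is hypothetical and the sketch is not asserted to be the scheme's actual definition.

```python
from itertools import groupby
from typing import List, Sequence


def partial_order_indices(x: Sequence[float], pattern: Sequence[int]) -> List[int]:
    """Permutation of the indices of x so that position j is quantized with pattern[j].

    Assumes pattern is non-increasing. Only the assignment of parameters to
    groups of equal allocation matters, hence a partial (not total) order.
    """
    # Rank indices from highest to lowest energy (stable for ties).
    ranked = sorted(range(len(x)), key=lambda i: -x[i] ** 2)
    order: List[int] = []
    pos = 0
    for _, group in groupby(pattern):
        size = len(list(group))
        # Inside a group every position has the same allocation, so keep
        # the members in their original index order.
        order.extend(sorted(ranked[pos:pos + size]))
        pos += size
    return order


x = [0.1, -2.0, 0.5, 1.5, -0.2, 0.05]
pattern = [3, 3, 1, 1, 0, 0]  # hypothetical fixed bit allocation P

order = partial_order_indices(x, pattern)
permuted = [x[i] for i in order]
print(order)     # which original parameter lands at each position of P
print(permuted)  # permuted target vector, quantized with the fixed pattern
```

In this sketch only the grouping needs to be signalled to the decoder, which is why far fewer than N! orderings have to be distinguished.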
It has also been shown that if one considers multiple proto-type patterns, e.g. g=2^d patterns P(1), P(2), . . . , P(g), with pattern P(k) itself generating 2^(c(k)) patterns related by a partial order (limited permutation), performance can be further improved. For example,
Pattern 1: P(1)=p(1,1), p(2,1), . . . , p(N,1)
Pattern 2: P(2)=p(1,2), p(2,2), . . . , p(N,2)
. . .
Pattern g: P(g)=p(1,g), p(2,g), . . . , p(N,g),
where p(i,j) (like the p(i) in the previous example) is a value specifying how to quantize a parameter. To ensure that “b” bits are spent on quantization,
d+c(k)+p(1,k)+p(2,k)+ . . . +p(N,k)=b for all patterns k=1,2, . . . ,g
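The budget constraint can be checked numerically with a small sketch. The patterns and values below are hypothetical; the sketch reads “d” as the bits that select one of the g=2^d proto-type patterns and c(k) as the bits that select one of the permutations of pattern k, which is an interpretation of the equation rather than a statement from the source.

```python
from typing import Sequence


def total_bits(d: int, c_k: int, pattern: Sequence[int]) -> int:
    """Evaluate d + c(k) + p(1,k) + p(2,k) + ... + p(N,k)."""
    return d + c_k + sum(pattern)


b = 16  # hypothetical total budget for the group of N = 8 parameters
d = 1   # g = 2^d = 2 proto-type patterns

# Hypothetical (c(k), P(k)) pairs.
patterns = [
    (3, [3, 3, 2, 2, 1, 1, 0, 0]),
    (2, [4, 3, 2, 2, 1, 1, 0, 0]),
]

for k, (c_k, p_k) in enumerate(patterns, start=1):
    print(k, total_bits(d, c_k, p_k), total_bits(d, c_k, p_k) == b)
```

A pattern that spends fewer bits on signalling its permutation (smaller c(k)) can spend more on the parameters themselves, as in the second pattern here.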
Furthermore, for a given pattern P(k), one can identify with little computation (or very little beyond an absolute perceptual threshold calculation) which of the 2^(c(k)) patterns has the best perceptual advantage.