The present invention relates in general to image or video processing. More specifically, the present invention relates to the processing and/or coding of digital images using transforms.
Bandwidth and Compression
Digital image processing is the process of analyzing an image expressed in digital form to determine something about the image or to create a processed version of the input image. An image can be defined as an N-dimensional array of pixel values where each pixel represents a spatial sampling point associated with intensity or color value distributions. Typical examples are the 2-D still images encountered in photographs or as individual frames of a video sequence, or the 3-D images produced in a wide range of applications from medical imaging to surface digitization to holograms. The size of an N-dimensional image is characterized by the amount of spatial sampling (resolution) of the image as well as the number of possible color values (color depth).
For example, in the case of N=2 with a width of 720 pixels, a height of 480 pixels, and a color depth of 16 bits (meaning 2^16 possible color values), the resultant image size is 720×480×16 bits, or 5,529,600 bits, or approx. 700,000 bytes. This is the size of an individual video frame (two fields) in the common NTSC video format used on television sets throughout North America. In the very same NTSC video format the size of the data is further magnified by the display rate of 30 frames per second. This amounts to over 165 million bits of bandwidth, or approx. 20 million bytes, for every second of raw video.
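The arithmetic above can be checked with a few lines of Python. This is only a restatement of the figures quoted in the text; the variable names are illustrative.

```python
# Raw frame size and bandwidth for the 720x480, 16-bit, 30 frames/s example.
width, height = 720, 480        # pixels
color_depth_bits = 16           # bits per pixel
frames_per_second = 30

bits_per_frame = width * height * color_depth_bits
bytes_per_frame = bits_per_frame // 8
bits_per_second = bits_per_frame * frames_per_second
bytes_per_second = bits_per_second // 8

print(bits_per_frame)    # 5529600   (about 5.5 million bits)
print(bytes_per_frame)   # 691200    (approx. 700,000 bytes)
print(bits_per_second)   # 165888000 (over 165 million bits)
print(bytes_per_second)  # 20736000  (approx. 20 million bytes)
```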
Many of the problems in image and video processing relate to methods for the efficient handling of such large data sets. As the transmission and storage of raw digital images, and especially video sequences, at such enormous rates is infeasible with conventional computer equipment and/or network connections available to the mass consumer, there is a significant need for efficient compression of said images or video sequences.
Lossless Compression
In the case of still images this can be accomplished through a variety of techniques. If the application requires lossless compression, i.e. no reduction of quality as a result of the compression method, suitable options include the Lempel-Ziv dictionary based schemes (e.g. LZ77, LZW), lossless JPEG or JPEG 2000, entropy encoders such as arithmetic encoding, or various hybrid approaches such as the commercially available PKZIP and GZIP. All lossless methods work on the premise of removing unnecessary redundancy in the data set while allowing for perfect reconstruction on the decoder side. All lossless techniques, however, suffer from the same two severe deficiencies: (a) the compression ratios are generally relatively small for most still images and (b) when used alone the performance gain is greatly affected by the nature of the input data set, thus making it intractable to guarantee a constant output rate, which may be required for transmission over a given channel.
Lossy Compression
Lossy compression methods are a suitable alternative to the aforementioned lossless still image compression methods. All lossy techniques will affect the image quality in some manner. At high enough bit rates it is arguable whether the effects on quality will be perceptually meaningful. However, in order to achieve significant compression gains it becomes necessary to reduce the image quality. The problem then is how to efficiently compress the image in such a way that the required loss in image quality is acceptable in the sense of not too much perceptual degradation. Examples of lossy still image compression include, but are not limited to, pixel sub-sampling, color depth quantization, fractals, vector quantization, and transform coding. Only the last two categories, especially transform coding, have demonstrated high enough compression gains at high enough quality over a wide range of image types to be commercially viable in applications requiring still image compression.
Transform Coders
General Description
From here on we will focus our discussion on the use of transform coders as pertains to image compression, though such techniques are also commonly used in solutions to many other image processing related problems. Examples of two of the most popular lossy still image transform encoders are the publicly available JPEG and JPEG2000 compression standards. As mentioned previously these two methods can also be run in a lossless mode. A transform coder generally consists of two parts: (a) a decomposition of the multidimensional signal onto a specified set of basis functions and (b) the process of quantization followed by some manner of lossless encoding. We will primarily focus on a discussion of the first part, i.e. the transform decomposition, in section entitled ‘Transform Coding’. However, it should be noted that the second part, i.e. quantization plus lossless encoder (or in some cases bit-plane encoders), is required in order for the technique to be considered ‘lossy’.
Video Residue Encoders
Lossy transform coders are also directly applicable to video compression. As noted at the beginning of the introduction the bandwidth requirements imposed by raw video at high frame rates are particularly daunting. However, treating a given video sequence as a set of independent still image frames, and thus applying lossy still image compression techniques to each frame on an individual basis, is inherently inferior to modern video codecs. The distinction lies in that all modern state-of-the-art video codecs seek to exploit the existence of a large amount of temporal redundancy of information between successive frames in a video sequence.
This is typically done through some form of motion estimation and compensation. In this way a predictive model of the next raw input frame is inexpensively coded. This model frame is then directly subtracted from the target raw frame on the encoder side. The resultant difference image is referred to as the residue frame. This residue frame tends to be much less smooth than the original image frame. At the same time the residue frame tends to have less overall information (i.e. entropy) than the original image frame. The goal then is to compress the residue frame as efficiently as possible such that when the decoder reconstructs the residues (or differences) and adds them back to the model frame, the resultant image will then approximate the original raw input in terms of perceptual quality.
Examples of such compression methods include the popular family of MPEG (MPEG-1, MPEG-2, and MPEG-4) and the H.26x (H.261, H.263) standards. In fact the residue methods of these codecs are highly related to the JPEG and JPEG2000 still image compression methods with the additional caveat that specific modifications are made in order to make the compression of the highly variable residues more efficient.
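The subtract-then-add-back relationship between the target frame, the model frame, and the residue frame can be sketched as follows. The pixel values and the function names `residue` and `reconstruct` are hypothetical, a single row of pixels stands in for a frame, and no motion estimation or compression of the residue is attempted here.

```python
def residue(target, model):
    """Encoder side: difference between the raw target frame and the model frame."""
    return [t - m for t, m in zip(target, model)]

def reconstruct(model, res):
    """Decoder side: add the (here uncompressed) residues back onto the model frame."""
    return [m + r for m, r in zip(model, res)]

# Hypothetical pixel rows: the model frame is a close prediction of the target.
target = [52, 55, 61, 66, 70, 61, 64, 73]
model = [50, 56, 60, 66, 71, 60, 64, 72]

res = residue(target, model)
print(res)                      # [2, -1, 1, 0, -1, 1, 0, 1]
print(reconstruct(model, res))  # identical to target when the residue is sent exactly
```

In this toy case the residue values span a much smaller range than the original pixels, which is what makes them cheaper to code.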
As previously mentioned, a transform coder generally consists of two parts: (a) a decomposition of the multidimensional signal onto a specified set of basis functions and (b) the process of quantization followed by some manner of lossless encoding. We will now discuss the first of these two steps.
Decomposition and Basis Functions
FIG. 1A shows a continuous 1-D signal. FIG. 1B shows the same 1-D signal discretized at 60 sample points. Naively one could send the amplitudes at each of the 60 sample points. However, this would mean inefficiently compressing and transmitting 60 symbols, which may have a large dynamic range of possible values. One could quantize the symbols directly and then send the resultant data but this would have a very poor tradeoff in terms of quality vs. bit rate. In other words, as seen in FIG. 1C, we could send an imprecise representation of the amplitudes, which would require fewer bits but would result in poor reconstruction. One could also try to predict each successive value based on a localized prediction of its prior neighbors. If the function values do not vary much or follow a simple rule based on their predecessors then the differences between the real values and the predicted values (i.e. errors) can be represented more efficiently than the set of original values itself. Techniques based on such concepts include DPCM and predictive coding in general.
However, in the case of FIG. 1A there is in fact a far more efficient approach. This involves a decomposition or projection onto a set of specified basis functions. Here decomposition really means that the signal can be represented as a linear combination or weighted sum of a set of functions called basis functions. The multiplicative factors in the weighted sum are called the coefficients of the decomposition and represent the relative amount of projection onto a given basis function. The coefficients may have any value ranging from −∞ to +∞. The smaller the absolute value of a given coefficient the less important the corresponding basis function was to the overall decomposition or sum. Note in the case of continuous signals the decomposition is often an infinite sum, but for discrete signals with N sample points this sum can have at most N non-zero terms.
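As a minimal sketch of the predictive idea, assume the simplest possible predictor: each sample is predicted to equal its immediate predecessor, with the first sample predicted from zero. The names `dpcm_encode` and `dpcm_decode` and the sample values are illustrative only; practical DPCM schemes also quantize the errors.

```python
def dpcm_encode(samples):
    """Replace each sample by its error relative to the previous sample."""
    previous = 0
    errors = []
    for s in samples:
        errors.append(s - previous)
        previous = s
    return errors

def dpcm_decode(errors):
    """Rebuild the samples by accumulating the transmitted errors."""
    previous = 0
    samples = []
    for e in errors:
        previous += e
        samples.append(previous)
    return samples

samples = [100, 102, 103, 103, 101, 98]
errors = dpcm_encode(samples)
print(errors)               # [100, 2, 1, 0, -2, -3]
print(dpcm_decode(errors))  # [100, 102, 103, 103, 101, 98]
```

After the first symbol, the errors occupy a far smaller dynamic range than the samples themselves because the function varies slowly.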
For a given signal not all possible sets of basis functions are equally good or efficient. Here efficiency is measured by the number of non-zero coefficients or, in more specific terms, by their inherent entropy. The more unique or varied the coefficients the more information or bits that must be transmitted. In many cases an efficient set of basis functions is chosen so as to satisfy certain properties such as periodicity or orthogonality; though this is not always necessary.
DCT Basis Functions
In FIG. 1A, the original function exhibits certain periodic properties. We would then like to choose a set of basis functions that will result in a set of coefficients with far fewer than N non-zero values. A good choice here is the set of basis functions that define the discrete cosine transform, i.e. DCT. The functions themselves are a set of cosines with periods given according to the generating equation in FIG. 1D, where L denotes the period and the quantum number n distinguishes one member of the family of basis functions from another. The resultant decomposition onto the set of DCT basis functions shows that there are only three non-zero coefficients. In fact the three coefficients are respectively 100, 30, and 70 for the three basis functions displayed in FIG. 1E. In the simple example of FIG. 1B the entire discrete function over the entire 60 sample points can be exactly represented by only three values. This constitutes a significant reduction in the amount of information from the original 60 values. Moreover, provided the decoder knows to use the same set of cosine basis functions it can receive and decode the three symbols and then form the required summation thus perfectly reconstructing the function.
In the case of FIG. 1F, we have slightly altered the function depicted in FIG. 1B. Now the three previous coefficients of the decomposition cannot exactly represent the function by themselves. FIG. 1G shows the resultant reconstruction error using only these three coefficients and the associated basis functions. To ensure a perfect reconstruction (i.e. no loss of quality) it is necessary to consider and thus transmit a larger number of coefficients. However, in this case the resultant error or inaccuracy is small, therefore it may be satisfactory to still send only the three non-zero coefficients of FIG. 1D depending on how much error can be tolerated for the given application paradigm. In other words depending on the application the additional coefficients needed to exactly reconstruct the signal in FIG. 1F may not be significant and thus it may be acceptable for the decoder to reconstruct the approximated signal shown in FIG. 1H using a small number of transmitted bits.
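The three-coefficient example can be imitated numerically. Since the generating equation of FIG. 1D and the particular basis functions of FIG. 1E are not reproduced here, the sketch below assumes the common DCT-II family cos(πn(k+½)/N) and assumes that the three active members are n = 1, 2, 3; only the weights 100, 30, and 70 and the 60 sample points come from the text.

```python
import math

N = 60  # number of sample points, as in FIG. 1B

def basis(n, k):
    """Assumed DCT-II style basis function number n evaluated at sample k."""
    return math.cos(math.pi * n * (k + 0.5) / N)

def analyze(signal):
    """Project the signal onto every basis function (forward decomposition)."""
    coeffs = []
    for n in range(N):
        norm = N if n == 0 else N / 2  # squared length of basis function n
        coeffs.append(sum(signal[k] * basis(n, k) for k in range(N)) / norm)
    return coeffs

def synthesize(coeffs):
    """Form the weighted sum of basis functions (decoder side)."""
    return [sum(c * basis(n, k) for n, c in enumerate(coeffs)) for k in range(N)]

# Hypothetical choice of which basis functions carry the weights 100, 30, 70.
weights = {1: 100.0, 2: 30.0, 3: 70.0}
signal = [sum(w * basis(n, k) for n, w in weights.items()) for k in range(N)]

coeffs = analyze(signal)
significant = {n: c for n, c in enumerate(coeffs) if abs(c) > 1e-6}
max_error = max(abs(a - b) for a, b in zip(signal, synthesize(coeffs)))

print(sorted(significant))  # [1, 2, 3]: only three coefficients are non-zero
```

Up to floating-point rounding, the three recovered coefficients equal 100, 30, and 70, and the weighted sum reproduces all 60 samples.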
In FIG. 2A we face a more difficult challenge for the DCT basis functions. Now there is a very sharp transition or edge in the domain. FIG. 2B shows the discretized version of the continuous signal exhibited in FIG. 2A. FIG. 2C shows the reconstruction results based on maintaining a small number of coefficients. Now the residual error is very high. FIG. 2D depicts a case where many coefficients are used and the associated residual error is very small. It can be shown that the amount of error of the reconstruction is inversely proportional to the number of coefficients that are preserved and therefore must be transmitted to the decoder. Thus the DCT basis functions are not very efficient in this case. Note that the DCT is the primary transform of choice in the JPEG and MPEG 1-2-4 families of standards, though MPEG-4 allows for other transforms. In general the DCT does not perform well near sharp edges.
Other Families of Basis Functions
Fortunately, more efficient decompositions for this case do exist. For FIGS. 2A-B, a better choice would consist of a different family of basis functions known as the Haar functions (see FIG. 3). Conversely, the set of Haar basis functions would perform very poorly for the sinusoidal signal shown in FIGS. 1A-B.
There are a multitude of transforms with associated basis functions used in image and video processing. These include but are not limited to the aforementioned DCT, the Haar, the discrete Fourier transform (DFT), the Karhunen-Loeve transform (KLT), the Lapped orthogonal transform (LOT), and the discrete wavelet transform (DWT). Each of these transforms has its advantages and disadvantages.
In general, especially in higher dimensional images (i.e. ≧2), it is intractable to adaptively determine an optimal basis set of functions for a given image. The work of Coifman et al. on adaptive wavelet packets [Coifman I] has demonstrated small nominal gains when applied to a wide range of image or video data. In order to be robust and at the same time efficient, it is in general better to use a set of basis functions with fundamental interpolatory properties. A good choice is often those sets of basis functions that are generated via higher dimensional analogs of polynomial interpolators of relatively low order (i.e. linear, quadratic, cubic, etc). An example of a basis function set construction based on this technique can be seen in the work of W. Sweldens [Sweldens I]. The construction of robust and efficient basis functions for transform coding naturally leads to a discussion of multi-scale transforms or multi-resolution analysis.
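To illustrate why Haar functions suit a sharp edge, the sketch below applies an unnormalized Haar decomposition (pairwise averages and half-differences) to a hypothetical 16-sample step signal. The signal values, the normalization, and the function name `haar_forward` are assumptions made for illustration and are not taken from FIGS. 2A-B or FIG. 3.

```python
def haar_forward(signal):
    """Unnormalized Haar decomposition; len(signal) must be a power of two.

    Returns [overall average] followed by detail coefficients, coarsest first.
    """
    averages = list(signal)
    details = []
    while len(averages) > 1:
        pairs = [(averages[i], averages[i + 1]) for i in range(0, len(averages), 2)]
        level_details = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        details = level_details + details  # coarser details go in front
    return averages + details

# A single sharp edge: six samples at 10 followed by ten samples at 50.
step = [10.0] * 6 + [50.0] * 10
coeffs = haar_forward(step)
nonzero = [c for c in coeffs if c != 0]

print(nonzero)                        # [35.0, -15.0, -10.0, -20.0]
print(len(nonzero), "of", len(coeffs))  # 4 of 16
```

Only the overall average and the one detail coefficient per level whose support straddles the edge are non-zero, so the count grows with the number of levels rather than with the number of samples.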
Multi-Scale Transforms
Basics
Examples of multi-scale transforms can be found almost everywhere in the field of image and video processing. Their applications include spectral analysis, image denoising, feature extraction, and, of course, image / video compression. JPEG2000, the Laplacian pyramid of Burt & Adelson [Burt and Adelson I], traditional convolution wavelet sub-band decomposition, and the lifting implementation of [Sweldens I] are all examples of multi-scale transforms. Many variations of multi-scale transforms differ in regards to how the transform coefficients are quantized and then encoded. Such variations include SPIHT by Said and Pearlman [SPIHT I], EZW (see [Shapiro I]), trellis coding (see [Marcellin I]), etc.
All multi-scale transforms operate on one guiding principle. Namely, that the efficient representation of a given multi-dimensional signal is characterized by looking at the data via a decomposition across different scales. Here a scale refers to a characteristic length scale or frequency. Coarse scales refer to smooth broad transitions in a function. The very fine scales denote the often sharp, local fluctuations that occur at or near the fundamental pixel scale of the signal.
FIG. 4A illustrates an example of different scale information for a given 1-D signal. Note that the function is actually well characterized as a smoothly varying coarse scale function f1(x) (see FIG. 4B) plus one other function depicted in FIG. 4C, f2(x). The function f2(x) contains the majority of the fine scale information. Note that f2(x) tends to oscillate or change on a very short spatial scale; whereas f1(x) changes slowly on a much longer spatial scale. The communications analogy is that of a carrier signal (i.e. coarse scale modulating signal) and the associated transmission band (i.e. high frequency or fine scale signal). In fact by referring to FIGS. 4A-C one can see that the complete high frequency details are well characterized by f2(x) and the low frequency or average properties of the signal are exhibited by f1(x). In fact few signals are as cleanly characterized into specific scales as the function depicted in FIG. 4A.
In the following sections we will describe a mathematical operator known as a filter. Here the basic definition of a filter is a function of coefficients which when applied as a convolution operation to a signal will result in a series of multiplications and additions involving the values of the input signal and which will result in yet another signal. Usually the sum of the filter coefficients is either one when computing averages or zero when computing differences.
Construction of Coarser Scale Representations (1-D)
For an arbitrary multi-dimensional signal the construction of multiple scales is generally achieved through a successive application of localized averaging and sub-sampling. FIGS. 5A-D show this process for a more complicated 1-D signal. The original data itself in fact corresponds to the very finest scale herein labeled scale 1 as seen in FIG. 5A. Then an ‘averaging’ filter is applied across the domain and sub-sampled at a subset of the points. In FIG. 5B an averaging filter of (0.25, 0.5, 0.25) was first convolved (i.e. weighted average) across the original signal. But this produced a resultant signal that is still sampled at 20 points. Now we sub-sample the resultant function at every other point thus obtaining the signal in FIG. 5C with only 10 sample points. This is now the next coarser band or scale, i.e. scale 2. This process is often called an ‘update’.
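A single 'update' step of this kind can be sketched as below. The (0.25, 0.5, 0.25) filter and the every-other-point rule come from the text; the mirror-symmetric treatment of the two end points and the names `mirrored`, `smooth`, and `update` are assumptions, and the 20-sample ramp is a stand-in for the signal of FIG. 5A.

```python
def mirrored(signal, i):
    """Read signal[i], reflecting indices that fall outside the domain."""
    n = len(signal)
    if i < 0:
        return signal[-i]
    if i >= n:
        return signal[2 * (n - 1) - i]
    return signal[i]

def smooth(signal, taps=(0.25, 0.5, 0.25)):
    """Convolve the averaging filter across the whole signal (same length out)."""
    half = len(taps) // 2
    return [
        sum(t * mirrored(signal, i + j - half) for j, t in enumerate(taps))
        for i in range(len(signal))
    ]

def update(signal):
    """Average, then keep every other point: one step to the next coarser scale."""
    return smooth(signal)[::2]

scale1 = [float(v) for v in range(20)]  # hypothetical 20-point signal (a ramp)
scale2 = update(scale1)

print(len(scale1), len(scale2))  # 20 10
print(scale2)  # [0.5, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0]
```

Because the filter coefficients sum to one, a constant signal is left unchanged by the averaging; the ramp is altered only at the end point, where the mirrored neighbour is used.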
The process of averaging and sub-sampling, or ‘updating’, can be performed again on the function in FIG. 5C using the same averaging filter and the same sub-sampling rule to obtain the next coarser band, scale 3, as depicted in FIG. 5D. In principle this procedure can be repeated until only one sample point is left thereby representing the coarsest scale and thus the overall average of the entire original signal shown in FIG. 5A. In practice, however, the number of distinct scales is chosen ahead of time by the multi-scale transform coder. The totality of the multiple scales can be viewed as a multi-resolution pyramid where each scale corresponds to one level of the pyramid.
Construction of Coarser Scale Representations (2-D)
FIGS. 6A-E show a similar process in 2-D. The original pixel data, or finest scale, is denoted in FIG. 6A. Here the averaging filter at each scale is depicted in FIG. 6B as well as an example sub-sampling rule. In this case the sub-sampling rule is referred to as a quincunx lattice in 2-D and once again preserves half the points at each step. FIGS. 6C-D show successive steps in building the multi-resolution pyramid for a square domain via application of the filter and sub-sampling logic depicted in FIG. 6B. At each step of the process the numbers at each pixel refer to the functional value of the pyramid at a given scale. Note that the scale depicted in FIG. 6D contains almost one quarter of the sample points in the original 2-D function shown in FIG. 6A because each application of the quincunx sub-sampling reduces the number of points by a factor of two. Another popular 2-D form of sub-sampling is the standard quarter sub-sampling displayed in FIG. 6E. In order to handle boundary effects for the convolution at the edge of the pictured rectangular domain, it is assumed that the data at each scale can be extended via a mirror symmetric extension appropriate to the dimensionality of the signal across the boundary in question. The motivation and the efficacy of this will be discussed in more detail in the background section entitled “Multi-scale transforms and image boundaries”.
Other Variations
The procedure can be generalized to much more sophisticated averaging filters. One such example is the 1-D averaging filter of the 9×7 Daubechies filter often used in JPEG2000 for still image compression. In this case the filter is applied as a separable convolution with one pass in the horizontal direction followed by another in the vertical direction. Note for each 1-D pass the sub-sampling rule is once again the selection of every other pixel in either a row (horizontal) or in a column (vertical). After both directional passes this reduces to the quarter sub-sampling denoted in FIG. 6E. Moreover, after the two 1-D passes (as shown in FIG. 6F) are completed, the effective averaging filter becomes that depicted in FIG. 6G with a very large support or domain. Note in FIG. 6G not all of the 81 coefficients are shown because the blank locations have amplitude values which are less than approximately 10^−4 and as such are insignificant for the purposes of the figure. Such a large filter can be particularly sensitive when dealing with very sharp edges or very spiky data such as that encountered during the residue transform coding of video codecs.
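The growth in support under a separable application can be seen from the fact that a horizontal pass followed by a vertical pass with the same 1-D filter is equivalent to one 2-D filter whose entries are the products of the 1-D taps. The sketch below uses the short (0.25, 0.5, 0.25) filter rather than the actual JPEG2000 coefficients, which are not reproduced here; the function name is illustrative.

```python
def effective_2d_filter(taps):
    """Outer product of a 1-D filter with itself: the equivalent 2-D filter."""
    return [[v * h for h in taps] for v in taps]

kernel = effective_2d_filter((0.25, 0.5, 0.25))
for row in kernel:
    print(row)
# [0.0625, 0.125, 0.0625]
# [0.125, 0.25, 0.125]
# [0.0625, 0.125, 0.0625]

total = sum(sum(row) for row in kernel)  # still sums to one

# Any 9-tap 1-D filter yields a 9x9 effective filter, i.e. 81 coefficients.
nine_tap_entries = sum(len(row) for row in effective_2d_filter([0.0] * 9))
print(nine_tap_entries)  # 81
```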
In general, the nature of the averaging filters as well as the sub-sampling logic used at each successively coarser scale can be freely chosen. However, in practice, they are selected in such a way that certain properties of the transform are obeyed (i.e. symmetry, perfect reconstruction in the limit of no quantization, compactness, etc.). Though this imposes a set of constraints (see [Daub I] and [Sweldens I]), for the purposes of this invention the nature of these constraints is unimportant. It is also possible to forego any averaging whatsoever, thereby reducing the multi-scale transform to a hierarchical sub-sampling tree such as in Binary Tree Predictive Coding (BTPC).
Prediction of the Next Finer Scale
The second critical element of a multi-scale transform is the concept of a ‘prediction’ filter. This filter usually exhibits some form of interpolatory properties in order to predict, to some level of accuracy, a finer scale from the parent scale just above. Consider FIG. 7A. The displayed function is identical to that depicted as the resultant scale 3 function in FIG. 6D. If for example a nearest neighbor filter as shown in FIG. 7B is convolved with the function in FIG. 7A then we have a characterization or prediction at exactly half of the next finer scale points, i.e. the points denoted by dashed circles in FIG. 7C. The half of the points determined in this fashion is called the ‘alternate’ or ‘child’ grid. The remaining half at this scale is called the ‘peer’ grid, i.e. the points denoted by solid circles in FIG. 7C. For the sake of completeness the set of all points in FIG. 7A at the initial coarser scale are termed the ‘parent’ grid.
If at the next finer scale peer grid we simply propagate the parent grid values directly down one scale then we have filled in an estimate for the entire function at the next finer scale. Taken as a whole, in this example, FIG. 7C shows the final predicted result for this scale. The associated error with respect to the original scale 2 function depicted in FIG. 6C is shown in FIG. 7D. In practice one can select from any number of prediction filters in order to estimate a finer scale from a coarser one.
If one were to continue the process based on the reconstructed result shown in FIG. 7C by applying the prediction filter displayed in FIG. 7E, the reconstructed result would be as shown in FIG. 7F. The associated error with respect to the original scale 1 function depicted in FIG. 6A is shown in FIG. 7G.
Note in the above example the prediction of the alternate and peer grids was done separately. Let us focus on the peer grid estimation. Instead of directly propagating down the scale 3 values to the scale 2 peer grid as in FIG. 7C, the peer grid prediction can be accomplished through a form of reverse averaging called ‘inverse updating’. In this case either the inverse update is a function of more than one scale 3 parent grid point or is also a function of the predicted child values estimated on the alternate grid, i.e. the squares in FIG. 7C. Because of this distinction the process of estimating the child grid is often termed ‘prediction’ and the process of estimating the peer grid is termed ‘inverse update’. In the same vein the original process of creating coarser scales via averaging is often called ‘update’.
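A 1-D analogue of this predict-and-compare step is sketched below. It is not the 2-D filter of FIG. 7B: as an assumption, peer points copy their parent directly and each child point is predicted as the average of its two neighbouring parents (the last child reuses its single neighbour). The parent grid is obtained here by plain sub-sampling, and all values and names are hypothetical.

```python
def predict_finer(parent, fine_length):
    """Estimate the next finer scale from the parent grid."""
    predicted = []
    for i in range(fine_length):
        left = parent[i // 2]
        if i % 2 == 0:
            predicted.append(float(left))        # peer grid: propagate parent value
        else:
            right = parent[min(i // 2 + 1, len(parent) - 1)]
            predicted.append((left + right) / 2)  # child grid: interpolate
    return predicted

fine = [2.0, 3.0, 6.0, 7.0, 8.0, 6.0, 4.0, 5.0]  # hypothetical finer-scale values
parent = fine[::2]                               # coarser scale, sub-sampling only

predicted = predict_finer(parent, len(fine))
errors = [f - p for f, p in zip(fine, predicted)]

print(predicted)  # [2.0, 4.0, 6.0, 7.0, 8.0, 6.0, 4.0, 4.0]
print(errors)     # [0.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```

The errors are zero on the peer grid by construction in this sketch, and small on the child grid wherever the signal is locally close to linear.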
Multi-Resolution Pyramids
Laplacian Pyramid
The above principles of coarser scale construction and finer scale prediction are useful in a variety of image and video processing applications other than compression, i.e. denoising, image enhancement, signal analysis, and pattern recognition. However, in the case of image or video compression the two principles are combined with quantization in terms of a forward and an inverse transform. For the sake of clarity and brevity, a discussion based on the Laplacian pyramid paradigm of Burt and Adelson [Burt and Adelson I] will now be presented. Other strategies, including the traditional wavelet sub-band filters based on either convolution [Daub I] or lifting implementations [Sweldens I], differ mostly in their use of matched transform pairs for the update and the predict functions. In fact the lifting formulation shows how any generalized wavelet filter bank can be reduced to a series of combinations of two (or more) update and predict functions in a multi-scale scheme.
Forward Transform
In the forward transform a pyramidal decomposition is constructed where each level of the pyramid corresponds to a successively smoother representation or coarser scale of the image (see FIG. 8 for a generalized 2-D depiction). The method itself involves the same logic of averaging plus sub-sampling already described as part of the update process. As previously mentioned the selection of an appropriate update filter can be widely varying. Usually certain desired properties in terms of support size, response to noise, the degree of smoothness, and the amenability to inversion all play a role in the selected form of the update filter. The resultant scale after one step of averaging and sub-sampling can be referred to as a ‘low-pass’ version of the image.
FIG. 8 depicts the averaging process repeated N−1 times, thus constructing a pyramid of N levels. The bottom level of the pyramid or finest scale (scale 1) is the original image (or residue in the case of video) data. The top level represents the coarsest scale. In FIG. 8, where the level-by-level sub-sampling is the quarter sub-sampled lattice as described in FIG. 6F, the top level will represent points which are the effective weighted average over an M×M domain of sample points. Note that at higher and higher scales the number of sample points is reduced as a result of the sub-sampling procedure. It should also be noted that in some applications there might be no averaging whatsoever. Then the process of constructing the forward transform pyramid is reduced to that of a hierarchical sub-sampling such as in Binary Tree Predictive Coding (BTPC).
Inverse Transform
The stage is now set for the inverse transform. For any codec employing a multi-scale transform, the decoder side must start from an initial set of transmitted data received from the encoder. In the multi-scale paradigm this is the coarsest scale of averages, i.e. scale M or the top level of the pyramid constructed upon completion of the forward transform. If there are a sufficient number of levels in the pyramid the top-level will generally contain a relatively small number of sample points.
If the encoder-decoder pair does not perform quantization (i.e. lossless compression) then an exact representation of the top-level averages must be sent. However, if quantization is present then the top-level averages will be transmitted with reduced precision and hence fewer bits. For the moment we will focus on the no quantization scenario.
The next step in the inverse transform involves the predict functionality described in the previous section entitled “Prediction of the Next Finer Scale”. In this way an estimation of the next finer scale, scale M−1, in the pyramid is calculated. The differences between the actual values at scale M−1 and the estimated values obtained via application of a set of predict filters to the parent scale, scale M, are in fact the error residuals. In the case of lossless compression, the encoder must send the exact representation of the error differences to the decoder. Then the decoder, which had started with the same parent scale data as the encoder, and after applying the same prediction logic as the encoder, will add the received error corrections back onto the estimated surface for scale M−1. If there has been no quantization the resultant function will be the original scale M−1 function constructed on the way up in the forward transform.
Similar logic is then applied to the formation of the remaining lower or finer levels of the pyramid. The process ends once the corrections for the bottom-most level of the pyramid, i.e. the original pixel data, are received and then added back onto the final predicted surface. Note that as previously mentioned, in a generalized version the predict function may in fact be split up into a predict step involving the alternate or child grid and an inverse update step involving the peer grid.
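The forward and inverse transforms without quantization can be sketched end to end in 1-D. The update filter (1/4, 1/2, 1/4) with mirror extension, the interpolating predict filter, and all function names are assumptions chosen for brevity; exact rational arithmetic is used so that the round trip is exactly lossless.

```python
from fractions import Fraction

def update(signal):
    """Average with (1/4, 1/2, 1/4) using mirror extension, keep every other point."""
    n = len(signal)
    def at(i):
        if i < 0:
            return signal[-i]
        if i >= n:
            return signal[2 * (n - 1) - i]
        return signal[i]
    return [(at(i - 1) + 2 * at(i) + at(i + 1)) / 4 for i in range(0, n, 2)]

def predict(parent, fine_length):
    """Estimate the next finer scale: copy parents, interpolate between them."""
    out = []
    for i in range(fine_length):
        left = parent[i // 2]
        right = parent[min(i // 2 + 1, len(parent) - 1)]
        out.append(left if i % 2 == 0 else (left + right) / 2)
    return out

def forward(signal, levels):
    """Build the pyramid, then return the top level and per-level error residuals."""
    scales = [[Fraction(v) for v in signal]]
    for _ in range(levels - 1):
        scales.append(update(scales[-1]))
    residuals = [
        [f - p for f, p in zip(fine, predict(coarse, len(fine)))]
        for fine, coarse in zip(scales[:-1], scales[1:])
    ]
    return scales[-1], residuals  # residuals[0] corrects the finest scale

def inverse(top, residuals):
    """Start from the top level; predict each finer scale and add its corrections."""
    current = top
    for res in reversed(residuals):
        current = [p + r for p, r in zip(predict(current, len(res)), res)]
    return current

signal = [3, 5, 4, 8, 9, 7, 6, 2]          # hypothetical scale 1 data
top, residuals = forward(signal, levels=3)
restored = inverse(top, residuals)
print(restored == signal)  # True
```

Note that this pyramid stores a residual for every point at every level, so without quantization and entropy coding it holds more values than the original signal; the benefit comes from the residuals being small and cheap to code.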
Inverse Transform and Reconstruction in the Presence of Quantization
In the presence of quantization the process is slightly more complicated. Remember that if high compression ratios are desired then having to send the exact representation of the error differences at each level will be very costly in terms of bits. To avoid this it is necessary to quantize the data in such a way that reasonable quality is achieved on the decoder side. FIGS. 9A-9B depict an example of quantization. In FIG. 9A an example of a set of quantization intervals and their representative values is depicted. In FIG. 9A, for all the error differences, E, if their value lies between −Q<E<+Q the quantized result will be zero. For all E such that +Q<E<+2Q the quantized result would be +3/2 Q and so on. The result of applying the quantization function described in FIG. 9A to a set of 2-D sampled input data (as seen in the top portion of FIG. 9B) where Q=5 is also displayed in FIG. 9B at the bottom of the page.
With quantization the decoder will now receive a quantized approximation of the top-level averages which we will denote as scale Q(M). The error residuals between the real scale M values and Q(M) are deemed acceptable by the encoder for a given bit rate limitation. Now the decoder applies the aforementioned prediction machinery based on using Q(M) as the parent scale. This results in an estimated surface for scale M−1 which we will denote as P(Q(M)). The difference between the original M−1 and P(Q(M)) must now be quantized and sent to the decoder. After receiving the appropriate quantized error data and adding back to the corresponding predicted surface the decoder obtains an approximation of scale M−1 which can now be called Q(M−1). This procedure is repeated multiple times until a quantized approximation of scale 1 is achieved. The resultant approximation of scale 1 is in fact an approximation of the original input data and is thus the data that the decoder will ultimately display or represent. If the encoder-decoder pair is efficient at the prescribed bit rate the resultant reconstruction will exhibit a tolerable amount of perceptual error.
Many of the differences present in modern multi-scale transforms involve different approaches to the problem of optimal quantization in order to obtain the best possible reconstruction for a given bit rate. In addition, many conventional sub-band encoders will also separate each level of the pyramid into multiple sub-bands through an application of low-pass (i.e. averaging) and high-pass (i.e. differencing or predict) filters. Then the corresponding inverse transform with quantization involves separate logic for the reconstruction of a given sub-band at each finer scale of the multi-resolution pyramid. However, the basic framework of the forward and inverse transform is much the same as described above.
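The quantization rule described for FIG. 9A can be written as a small function. The text specifies only the zero interval and the first positive interval; the sketch assumes the pattern continues uniformly ("and so on"), that negative values are treated symmetrically, and that a value lying exactly on an interval boundary is assigned to the outer interval. The function name is illustrative.

```python
import math

def quantize(e, q):
    """Dead-zone quantizer: |E| < Q maps to 0, otherwise to the interval midpoint."""
    k = math.floor(abs(e) / q)  # index of the interval [kQ, (k+1)Q) containing |E|
    if k == 0:
        return 0.0
    return math.copysign((k + 0.5) * q, e)

Q = 5
for e in (3, -4.9, 7, 12, -12):
    print(e, "->", quantize(e, Q))
# 3 -> 0.0
# -4.9 -> 0.0
# 7 -> 7.5
# 12 -> 12.5
# -12 -> -12.5
```

With this rule the quantization error is below Q inside the zero interval and at most Q/2 elsewhere.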
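The key point of this procedure, namely that the encoder predicts from the same quantized parent scale that the decoder will hold, can be sketched in 1-D as follows. Everything specific here is an assumption made for illustration: the (1/4, 1/2, 1/4) update filter with mirror extension, the interpolating predict filter, uniform rounding of the top level, a dead-zone quantizer for the residuals, and the sample data.

```python
import math

def update(signal):
    """Average with (1/4, 1/2, 1/4) using mirror extension, keep every other point."""
    n = len(signal)
    def at(i):
        if i < 0:
            return signal[-i]
        if i >= n:
            return signal[2 * (n - 1) - i]
        return signal[i]
    return [(at(i - 1) + 2 * at(i) + at(i + 1)) / 4 for i in range(0, n, 2)]

def predict(parent, fine_length):
    """Estimate the next finer scale: copy parents, interpolate between them."""
    out = []
    for i in range(fine_length):
        left = parent[i // 2]
        right = parent[min(i // 2 + 1, len(parent) - 1)]
        out.append(left if i % 2 == 0 else (left + right) / 2)
    return out

def quantize(e, q):
    """Dead-zone quantizer for error residuals."""
    k = math.floor(abs(e) / q)
    return 0.0 if k == 0 else math.copysign((k + 0.5) * q, e)

def encode(signal, levels, q):
    scales = [[float(v) for v in signal]]
    for _ in range(levels - 1):
        scales.append(update(scales[-1]))
    q_top = [q * round(v / q) for v in scales[-1]]  # Q(M): quantized top level
    recon = q_top
    q_residuals = []
    for fine in reversed(scales[:-1]):              # scale M-1 down to scale 1
        predicted = predict(recon, len(fine))       # P(Q(parent))
        q_res = [quantize(f - p, q) for f, p in zip(fine, predicted)]
        recon = [p + r for p, r in zip(predicted, q_res)]  # what the decoder will hold
        q_residuals.append(q_res)
    return q_top, q_residuals                       # coarsest residuals first

def decode(q_top, q_residuals):
    recon = q_top
    for q_res in q_residuals:
        recon = [p + r for p, r in zip(predict(recon, len(q_res)), q_res)]
    return recon

signal = [30, 52, 47, 81, 90, 73, 64, 20]
Q = 5.0
decoded = decode(*encode(signal, levels=3, q=Q))
max_error = max(abs(s - d) for s, d in zip(signal, decoded))
print(max_error < Q)  # True
```

Because each prediction is formed from already-quantized data, quantization errors do not accumulate from level to level in this sketch; the final error at scale 1 is bounded by the error of the last quantization step alone.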
Multi-Scale Transforms and Image Boundaries
Rectangular Domains
In all practical situations any multi-dimensional image will have a finite extent or domain. In the 2-D case this means that the image has a finite width and height and hence a finite area. In most applications this domain will be rectangular in nature. As seen in FIG. 10A the 2-D image only specifies values for the pixels located between (0, N) in the x-direction and (0, M) in the y-direction. As all multi-scale transforms involve the application of either update or predict filters during the forward and inverse transforms, the codec must be mindful of the image boundaries. In fact this is also the case even when the image is broken up into rectangular sub-domains or ‘blocks’, provided data lying across a block boundary is considered independent of the data inside the block.
FIG. 10B shows one of the problems inherent in applying a filter operation, i.e. convolution, of any form near a rectangular boundary. In this example the support of the filter is 5×5 pixels. As such, for pixels located on the border there will be corresponding positions in the filter (i.e. the ‘over-hang’) that have no source in the original image for the purposes of the multiplication and subsequent addition operations which are involved in the application of a filter to an image. In many applications involving image or video compression, the standard procedure is to extend or pad the domain at locations where the filter support lies outside the image domain. The padding is accomplished by either filling in zeros or by replacing with a low-pass version of the interior data. With either choice, however, the reconstructed signal will often exhibit undesirable high frequency artifacts near the boundary and the transform will lose efficiency near the border.
Another method is to apply a mirror image reflection (or ‘symmetric’) boundary condition. The procedure is outlined in 1-D in FIG. 10C. When the ‘missing’ image data for filter locations lying outside the block are replaced in such a manner then the multi-scale transform is guaranteed to be precisely invertible [Sweldens I] and the efficiency of the transform is maintained. The outlined procedure can be extended to 2-D and higher provided the boundary is rectangular. Similarly one can also define other meaningful extensions such as periodic extensions.
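A minimal 1-D sketch of the mirror image extension follows. FIG. 10C is not reproduced here, so the choice of reflection convention is an assumption: this version reflects about the edge sample without repeating it (index −1 maps to index 1). The helper names `mirror_index` and `filter_with_mirror` are hypothetical.

```python
def mirror_index(i, n):
    """Map any integer index onto 0..n-1 by reflection about the end samples.

    Assumption: the edge sample itself is not duplicated, so for n = 5 the
    indices -2, -1, 0, 1 map to 2, 1, 0, 1.
    """
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i = i % period                  # Python's % is non-negative for period > 0
    return i if i < n else period - i

def filter_with_mirror(signal, taps):
    """Apply an odd-length filter, filling the over-hang by reflection."""
    half = len(taps) // 2
    n = len(signal)
    out = []
    for center in range(n):
        acc = 0.0
        for k, tap in enumerate(taps):
            acc += tap * signal[mirror_index(center + k - half, n)]
        out.append(acc)
    return out

indices = [mirror_index(i, 5) for i in range(-2, 7)]
smoothed = filter_with_mirror([3, 3, 3, 3, 3, 3], [0.2] * 5)
print(indices, smoothed)
```

Every filter position, including those at the border, now has a source value taken from inside the domain, so no zeros or extrapolated values are introduced.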
Arbitrary Shaped Domains
The present invention relates to the efficient application of multi-scale transforms to arbitrary shaped domains of an N-dimensional image. The above procedure of padding or extension is suitable only for rectangular domains. For instance, the approach of using a 2-D symmetric extension is not feasible for arbitrary shapes as in such cases a true 2-D symmetric extension cannot even be defined. In FIG. 11, an example of generalized non-rectangular domains in 2-D is shown. Such shaped domains are encountered whenever an image processor segments an image frame and in fact MPEG-4 supports arbitrarily shaped video object layers. In principle the entire domain of the signal itself may be arbitrarily shaped or on the other hand the signal domain may be partitioned into a collection of arbitrarily shaped regions.
The techniques suggested by the MPEG-4 standards committee to code a signal on an arbitrary shaped domain include: difference predictive coding (DPCM) of vertices on a polygonal mesh, shape-adaptive DCT (SA-DCT), and separable wavelet transform with symmetric or periodic boundary conditions, zero padding, or low-pass extrapolation. We will now describe each technique in detail.
Coding of Vertices of 2-D Polygonal Meshes
One scheme that has been proposed for coding functions on arbitrary shaped domains is coding for polygonal meshes (see [Berg I]). The domain is tessellated into a grid of regular polygons (for example triangles). The function is assumed to be well represented by its values at the polygonal vertices (termed nodes). These values are then differentially coded. Typically, the function values are linearly interpolated within the polygon. MPEG-4, for instance, supports coding of triangular 2-D meshes. The size of the polygons determines the accuracy of the coding. Large polygons produce few nodes and thus the coding is bit-efficient. The function is however very poorly approximated within large flat regions. If the polygons are small, the function is well approximated, but the large number of nodes results in very large bit costs for transmission.
Shape Adaptive Discrete Cosine Transform (SA-DCT)
Another way that has been proposed to code functions on arbitrary shaped domains is the so-called Shape Adaptive DCT (see [Sikora I]). The domain is partitioned into fixed size blocks. Some blocks will be in the interior and some blocks will contain boundaries. The interior blocks are coded using standard DCT techniques. For the blocks at the boundaries, first a 1-D DCT is applied to the rows. The rows are of differing lengths since each one can contain an arbitrary number of interior pixels. The transformed rows are then re-ordered from longest to shortest and then a 1-D DCT is applied to each of the columns. The partial matrix is then coded using standard entropy techniques.
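The boundary-block steps just described can be sketched as follows. This sketch follows the order of operations given in the text (row transforms, re-ordering by length, then column transforms) and makes several assumptions of its own: an orthonormal DCT-II is used, the interior pixels of each row are gathered to the left before transforming, and the names `dct_1d` and `sa_dct` are hypothetical. It is not taken from [Sikora I] or from the MPEG-4 specification.

```python
import math

def dct_1d(x):
    """Orthonormal DCT-II of a list of any length."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * sum(x[j] * math.cos(math.pi * (j + 0.5) * k / n)
                               for j in range(n)))
    return out

def sa_dct(block, mask):
    # 1. Transform the interior pixels of each row; row lengths differ.
    rows = []
    for values, flags in zip(block, mask):
        inside = [v for v, f in zip(values, flags) if f]
        if inside:
            rows.append(dct_1d(inside))
    # 2. Re-order the transformed rows from longest to shortest.
    rows.sort(key=len, reverse=True)
    # 3. Transform each column over the rows that reach that column.
    result = [[None] * len(r) for r in rows]
    width = len(rows[0]) if rows else 0
    for c in range(width):
        column = [r[c] for r in rows if len(r) > c]   # always a prefix of rows
        for i, value in enumerate(dct_1d(column)):
            result[i][c] = value
    return result

block = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
mask = [[0, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0], [0, 0, 1, 0]]
coeffs = sa_dct(block, mask)
print([len(r) for r in coeffs])
```

The partial matrix holds exactly as many coefficients as there are interior pixels. A decoder would also need the shape mask to undo the re-ordering, which this sketch does not address.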
The advantage of the standard DCT approach comes from the recognition that the lower frequencies of the transformed matrix carry the visually significant information, and that accuracy in the high frequency coefficients can be sacrificed with no significant effect. In the SA-DCT, the columns of the re-ordered matrix contain both low (for the longer vectors) and high (for the shorter ones) frequency information. Thus, the transformed matrix does not have clearly identified low frequency and high frequency components. This significantly impacts the performance of the SA-DCT. Even though it is an allowed mode within the MPEG-4 standard, to date no commercial implementation of MPEG-4 includes the SA-DCT.
Separable 1-D Wavelet Coding with Padding
Yet another technique that has been proposed for coding functions on arbitrary domains is padding for the discrete wavelet transform or DWT (see [Kaup I] and [Li I]). As in the previous discussion, the image is broken up into square blocks of some fixed size. The blocks that are in the interior are coded using standard methods. The blocks that contain boundaries are handled in a special way. Each row in the block is padded with values to make a row of fixed length, then standard DWT (or DCT) techniques are used to code the block. Upon decoding, the extra pixels are simply discarded. There are several choices for padding the row: symmetric extension, periodic extension, padding with zeros, and low pass extrapolation.
In all cases, this technique suffers from several problems. Since all the points in the block are coded, a jagged boundary will result in significantly more pixels being coded than there really are in the domain of interest, which significantly impacts the efficiency of the coding. Furthermore, the padded function might or might not have the same properties as the original function, leading to a reconstruction that is actually quite poor for the function on the domain of interest.
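The first of these problems can be made concrete with a small count. The sketch below assumes a hypothetical 8×8 triangular domain and 4×4 blocks, and assumes that every block touching the domain is padded and coded in full. Neither the domain nor the block size comes from the cited references.

```python
def padded_pixel_count(mask, block):
    """Pixels coded when every block that touches the domain is coded in full."""
    height, width = len(mask), len(mask[0])
    coded = 0
    for top in range(0, height, block):
        for left in range(0, width, block):
            touches = any(mask[y][x]
                          for y in range(top, min(top + block, height))
                          for x in range(left, min(left + block, width)))
            if touches:
                coded += block * block
    return coded

# Hypothetical domain: the lower-left triangle of an 8x8 image.
mask = [[x <= y for x in range(8)] for y in range(8)]
inside = sum(sum(row) for row in mask)      # pixels actually in the domain
coded = padded_pixel_count(mask, 4)         # pixels coded after padding
print(inside, coded)
```

Here 48 pixels are coded to represent a domain of 36 pixels, an overhead of one third. A more jagged boundary or a larger block size would make the ratio worse.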
Impact of Internal Boundaries or Features
Another vexing problem for multi-scale transforms relates to the presence of sharp internal features or transitions within the domain of interest. Implicit in all multi-scale transforms is the premise that smoother representations of the signal, i.e. coarser scales, are useful in the prediction of the finer scale details or features. This is in general not the case at a very sharp internal edge boundary or feature. FIGS. 12A-D show several examples of such features: a trough or valley, a sharp edge transition or ‘cliff’, a local maximum, and an irregular surface.
Unless the quantization interval is very small and hence expensive, the reconstructed surface will be very erroneous in the neighborhood of these kinds of features. Of course one could finely quantize the error differences and code enough data in order to better approximate the input signal, but if the image domain contains many such sharp internal features this could become very costly in terms of bits. Even if the averaging and prediction filters are made more sophisticated such sharp internal transitions will still remain troublesome and cause the codec to become inefficient. Interestingly enough it is often the preservation of existing sharp transitions or edges in natural images that most greatly impacts the perceptual quality of the reconstructed signal.
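A small 1-D example shows why such features are costly. The assumptions are ours: the coarse scale is formed by pairwise averaging, the prediction repeats each coarse value twice, and the ‘cliff’ signal is hypothetical.

```python
def average_pairs(values):
    return [(values[i] + values[i + 1]) / 2 for i in range(0, len(values), 2)]

def predict(parent):
    # Hypothetical predictor: repeat each coarse value for its two children.
    return [v for v in parent for _ in range(2)]

def residuals(signal):
    predicted = predict(average_pairs(signal))
    return [s - p for s, p in zip(signal, predicted)]

cliff = [0, 0, 0, 100, 100, 100, 100, 100]
cliff_residuals = residuals(cliff)
print(cliff_residuals)   # [0.0, 0.0, -50.0, 50.0, 0.0, 0.0, 0.0, 0.0]
```

The residuals vanish wherever the signal is flat and are as large as half the height of the cliff at the transition. With a coarse quantization interval these two values would be reproduced poorly, while with a fine one they would dominate the bit cost.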
In some cases others such as W. Sweldens [Sweldens II] have considered formulations where the prediction filters are adaptively altered as the central point approaches a sharp edge transition at a given scale. FIG. 13 displays the basic concept behind this method in that the support and hence order of the prediction filter tends to shrink as the edge transition is approached. Here order refers to the degree of the polynomial predict filter where order one is linear, two is quadratic, three is cubic, and so on. Of course this technique is only applicable when an accurate and robust edge detection method is available. Moreover, in practice this technique achieves relatively small nominal gains.
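The idea of shrinking the filter support near an edge can be sketched as below. This is an illustration of the concept only and is not the formulation of [Sweldens II]. It assumes that the edge position is already known, that an order p predictor (p odd) uses the p+1 coarse samples centered on the gap being predicted, and that order zero means copying a single neighbor. The names `stencil` and `choose_order` are hypothetical.

```python
def stencil(center, order):
    """Coarse indices used by an odd-order predictor for the fine sample
    lying between coarse samples `center` and `center + 1`."""
    half = (order + 1) // 2
    return range(center - half + 1, center + half + 1)

def choose_order(center, edge, n, max_order=3):
    """Shrink the order until the stencil stays inside 0..n-1 and does not
    straddle the edge, which sits between coarse samples `edge` and `edge + 1`."""
    for order in range(max_order, 0, -2):
        s = stencil(center, order)
        inside = s[0] >= 0 and s[-1] < n
        straddles = s[0] <= edge < s[-1]
        if inside and not straddles:
            return order
    return 0    # fall back to copying the nearest sample on the same side

orders = [choose_order(c, edge=5, n=10) for c in range(9)]
print(orders)
```

In this example the order drops from cubic to linear to zero as the predicted position approaches the edge between coarse samples 5 and 6, and it also drops to linear at the two ends of the domain where a cubic stencil would not fit.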