In 3D computer graphics, surface detail on objects is commonly added through the use of image-based textures, as first introduced in 1975 by Ed Catmull (“Computer Display of Curved Surfaces”, Proc. IEEE Comp. Graphics, Pattern Recognition and Data Structures. May 1975). For example, an image of a sheet of wood may be applied to a set of polygons representing a 3D model of a chair to give the 3D rendering of that object the appearance that it is made of timber.
In a complex 3D scene, many such ‘textures’ may be required, which can cause two related problems. The first is simply the cost of storing these textures in memory. Consumer 3D systems, in particular, only have a relatively small amount of memory available for the storage of textures and this can rapidly become filled. This is especially aggravated by the use of so-called ‘true colour’ textures, which typically have 32 bits per texture pixel: eight bits for each of the Red, Green, Blue, and Alpha (translucency) components.
The second, and often more critical problem, is that of bandwidth. During the rendering of the 3D scene, a considerable amount of texture data must be accessed. In a real-time system, this can soon become a significant performance bottleneck.
One approach to alleviate both of these problems is to use a form of image compression. Numerous such systems, for example the JPEG standard, are in use in image processing and transmission, but very few are suitable for real-time 3D computer graphics. The main problem with the majority of these compression schemes is that they do not permit direct ‘random access’ of pixel data within the compressed format, and fast arbitrary access is a requirement of the texturing process. This random access is often not possible because the per-pixel storage rate varies throughout the image, and so the schemes that have been proposed and/or implemented to compress texture data are usually restricted to fixed rate encoding.
Another requirement of any system that compresses texture is that the decompression process must be fast and inexpensive to implement in hardware. This usually eliminates ‘transform coding’ systems such as the discrete cosine transform, DCT, used in JPEG.
The most frequently used system is one based on colour palettes, which was originally used as a method of reducing the memory and bandwidth costs of video framebuffers. (see: “A Random-Access Video Frame Buffer”, Kajiya, Sutherland, Cheadle, Proceedings IEEE Comp. Graphics, Pattern Recognition, & Data Structures. May 1975). In such schemes, each pixel is represented by a small number of bits, typically 4 or 8, which stores an index into a table of colours, or palette, with 16 or 256 entries respectively. Numerous methods for reducing an original “true colour” image to this palettised format exist, with perhaps the best known being P. Heckbert's “Color Image Quantisation for Frame Buffer Display” (Computer Graphics Vol. 16, No. 3, July 1982, 297–307).
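As a minimal sketch of the palette scheme just described, the following decodes a 4 bpp palettised texture. The function names and the nibble order (high nibble first) are assumptions made for illustration; they are not taken from the cited work.

```python
def unpack_4bpp(data, width):
    """Split each byte into two 4-bit palette indices.

    Assumption: the high nibble holds the left-hand texel.
    """
    indices = []
    for byte in data:
        indices.append(byte >> 4)
        indices.append(byte & 0x0F)
    # Regroup the flat index list into rows of `width` texels.
    return [indices[i:i + width] for i in range(0, len(indices), width)]


def decode_palettised(data, width, palette):
    """Replace every index by its palette colour (the second, dependent read)."""
    return [[palette[i] for i in row] for row in unpack_4bpp(data, width)]
```

Each texel costs one read for the index and a second, dependent read of the 16-entry table.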
There are several drawbacks with the palette system. The first is that there is a level of indirection involved in decoding each texture pixel, also called a texel. A texel's index must first be read from memory and then the corresponding entry in the colour table must be accessed. Bringing the colour table ‘on chip’ or using dedicated RAM can ‘hide’ the time delay of the double read, but this too incurs additional penalties other than just the gate cost. For example, each time a new texture is accessed, the dedicated palette table must be reloaded. Alternatively a global palette could be used for all textures, but this would severely reduce the quality of the compressed images. Finally, the storage and bandwidth savings are not outstanding for 8 bits per pixel (or bpp) textures, and the quality of 4 bpp is generally poor.
Further cost complications arise due to texture filtering. Usually when a texture is applied to a surface, a weighted filter is applied to several sampled texels to avoid the texture appearing “blocky”. Bilinear filtering is one commonly used method which also forms the basis for some more sophisticated techniques. For example, FIG. 1 illustrates a triangular surface ‘1’ to which a texture ‘2’ is being applied using bilinear filtering. The location of each screen pixel covered by the triangle, such as ‘3’, is mapped back into the coordinate system of the texture, and the four texels surrounding this position are identified, ‘4’. A weighted average of these four texels is then used to compute the applied texture value.
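The weighted average of the four surrounding texels can be sketched as follows. This is an illustrative single-channel version; the convention that texel centres lie at integer coordinates and that out-of-range reads are clamped to the texture edge are assumptions, not details from the text.

```python
import math


def bilinear_sample(texture, u, v):
    """Bilinearly filter a single-channel texture at position (u, v).

    Assumptions: texel centres sit at integer coordinates and reads
    outside the texture are clamped to the nearest edge texel.
    """
    height, width = len(texture), len(texture[0])
    x0, y0 = math.floor(u), math.floor(v)
    fx, fy = u - x0, v - y0  # fractional position inside the 2x2 footprint

    def texel(x, y):
        return texture[min(max(y, 0), height - 1)][min(max(x, 0), width - 1)]

    top = (1 - fx) * texel(x0, y0) + fx * texel(x0 + 1, y0)
    bottom = (1 - fx) * texel(x0, y0 + 1) + fx * texel(x0 + 1, y0 + 1)
    return (1 - fy) * top + fy * bottom
```

All four `texel` reads are independent, which is why hardware prefers to fetch the 2×2 set in parallel.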
In a real-time system it is preferable for the texel fetching system to be able to supply all the texels of this 2×2 block in parallel in order to allow the texturing process to operate at maximum speed. In a texturing system incorporating colour palette textures, these steps would involve fetching the indices for all four texels and then finding each texel's corresponding colour in the table. It can be appreciated that, unless it is multi-ported, the palette/colour table RAM could become a bottleneck. This represents an additional cost. (The problem of addressing multiple indices in parallel is identical to that of using non-compressed textures and so has been ‘solved’ in various ways in the art and need not be considered here.)
Palettised textures are a form of vector quantisation compression, or VQ, and more complex forms have been used in 3D Computer Graphics. Beers et al (“Rendering from Compressed Textures”, Computer Graphics, Proc. SIGGRAPH. August 1996, pp 373–378) simulated a hardware renderer that used VQ textures. This offered storage costs from around 2 bpp down to the equivalent of ½ bpp by replacing each 2×2 or 4×4 block of pixels with a single index into a large look-up table. A simpler VQ system, offering 2, 1, and ½ bpp compressed texture data rates, was implemented in hardware in the SEGA DREAMCAST™ games console. Interestingly, the two higher compression ratios in this system were achieved by combining two levels of vector quantisation.
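A minimal sketch of block-based VQ decoding, in which one index selects a whole block of texels from a codebook, is given below. The 8-bit index size used in the rate arithmetic is an assumption chosen because it reproduces the 2 bpp and ½ bpp figures quoted above; the function names are hypothetical.

```python
def vq_decode_texel(indices, codebook, x, y, block=2):
    """Fetch texel (x, y) from a VQ texture.

    `indices` holds one codebook index per block; each codebook entry is a
    flat, row-major list of block*block texels.
    """
    code = codebook[indices[y // block][x // block]]  # first memory access
    return code[(y % block) * block + (x % block)]    # second memory access


def vq_bits_per_pixel(index_bits, block):
    """Storage rate of the index map alone (the codebook is extra)."""
    return index_bits / (block * block)
```

With an assumed 8-bit index, `vq_bits_per_pixel(8, 2)` gives 2 bpp and `vq_bits_per_pixel(8, 4)` gives ½ bpp.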
Although these forms of VQ offer high levels of compression at reasonable quality, they still suffer from needing two memory accesses. Furthermore, the size of the look-up table is much greater than that of the palettised textures and so any internal storage or caching of the look-up table is more expensive. The filtering costs also become greater due to wider data structures.
There are also a number of alternative compression methods based on Block Truncation Coding, or BTC, as presented by Delp and Mitchell (“Image Compression Using Block Truncation Coding”, IEEE Trans. Commun. Vol. COM-27, September 1979). In BTC, a monochrome image is subdivided into non-overlapping rectangular blocks, say, 4×4 pixels in size, and each block is then processed independently. Two representative values, say of 8 bits each, are chosen per block and each pixel within the block is quantised to either of these two values. The storage cost for each block in the example is therefore 16 × 1 bit plus 16 bits for the two representatives, giving an overall rate of 32 bits per 16 pixels, or 2 bpp. Because the blocks are independent, the compression and decompression algorithms are simplified; however, this independence can lead to artefacts across block boundaries.
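The block structure described above can be sketched as follows. Note that this sketch swaps in a simplified rule for choosing the two representatives (the mean of the pixels on each side of the block mean); it is an assumption made for brevity and is not claimed to be the selection rule of Delp and Mitchell.

```python
def btc_encode(block):
    """Encode one block of 8-bit grey pixels as (low, high, bit mask).

    Simplified rule (an assumption): threshold at the block mean and use
    the average of each side as that side's representative.
    """
    mean = sum(block) / len(block)
    low_pixels = [p for p in block if p <= mean]   # never empty
    high_pixels = [p for p in block if p > mean]   # empty for a flat block
    low = round(sum(low_pixels) / len(low_pixels))
    high = round(sum(high_pixels) / len(high_pixels)) if high_pixels else low
    bits = [1 if p > mean else 0 for p in block]
    return low, high, bits


def btc_decode(low, high, bits):
    """Each 1-bit selector picks one of the two representatives."""
    return [high if b else low for b in bits]


def btc_bits_per_pixel(pixels=16, representative_bits=8):
    """One selector bit per pixel plus two representatives per block."""
    return (pixels * 1 + 2 * representative_bits) / pixels
```

For a 4×4 block with two 8-bit representatives, `btc_bits_per_pixel()` evaluates to 2.0, matching the rate given in the text.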
Campbell et al, (“Two bit/pixel full color encoding”. SIGGRAPH '86 Conference Proceedings, Computer Graphics, Vol. 20, No. 4, August 1986, pages 215–223) introduced Color Cell Compression, CCC, which extended BTC to encode colour images at 2 bpp. Unfortunately this required an external palette and the example images also show some evidence of colour banding. Despite these shortcomings, Knittel et al (“Hardware for Superior Texture Performance”, Proceedings of the 10th Eurographics Workshop on Graphics Hardware, 33–39, 1995) suggested using these image compression schemes in a texturing system.
In U.S. Pat. No. 5,956,431, Iourcha et al also adapt the BTC method to encode colour. In Iourcha's system, often referred to as S3TC or DXTC, each block stores two representative colours, typically at 16 bits each. Each pixel in the block is encoded typically using two bits and so can refer to four different values. Two of these values reference the two representative colours while the other two reference two colours that are derived directly from the two representatives. Usually the two derived colours are linear blends of the main representative colours, although sometimes one of the other values is chosen to indicate a fully transparent pixel. As with BTC, each block is completely independent of every other block.
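The following sketch decodes one block in the style described, covering only the opaque case with two derived colours. The RGB565 layout, the 1/3 and 2/3 blend weights and the index order are assumptions for illustration; the transparent variant is omitted.

```python
def rgb565_to_rgb888(colour):
    """Expand a 16-bit 5:6:5 colour to 8 bits per channel by bit replication."""
    r = (colour >> 11) & 0x1F
    g = (colour >> 5) & 0x3F
    b = colour & 0x1F
    return ((r << 3) | (r >> 2), (g << 2) | (g >> 4), (b << 3) | (b >> 2))


def block_colours(c0, c1):
    """Two stored representatives plus two colours derived from them."""
    a, b = rgb565_to_rgb888(c0), rgb565_to_rgb888(c1)

    def blend(weight_b):
        # Assumed weights: thirds of the way between the representatives.
        return tuple((x * (3 - weight_b) + y * weight_b) // 3
                     for x, y in zip(a, b))

    return [a, b, blend(1), blend(2)]


def decode_block(c0, c1, indices):
    """`indices` holds sixteen 2-bit selectors, one per pixel of the block."""
    colours = block_colours(c0, c1)
    return [colours[i] for i in indices]
```

A block therefore costs 2 × 16 bits of colour plus 16 × 2 bits of selectors, i.e. 64 bits for 16 pixels, or 4 bpp.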
The quality of the S3TC system is generally higher than that given by CCC and avoids the need for a colour palette, but these advantages are achieved at the price of approximately doubling the storage costs to 4 bpp. Furthermore, because this compression method is limited to only four colours per block, certain textures have been known to display banding. Also, as with BTC, there may be some artefacts at block boundaries.
If we consider bilinearly filtering a texture compressed with the S3TC system, we see that although in many cases the 2×2 set of texels required for the weighted filter could be fetched from a single 4×4 pixel block, there exist situations where more blocks are needed. The worst case situation, as shown in FIG. 2, arises when each pixel of the 2×2 set, ‘10’, belongs to a different 4×4 block, ‘11’. A real-time system that could texture one bilinearly filtered screen pixel per clock would therefore have to be able to access and decode four blocks in parallel.
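To make the block-straddling cases concrete, the sketch below counts how many 4×4 blocks a 2×2 bilinear footprint touches, given the coordinates of its top-left texel. The function names are hypothetical.

```python
from collections import Counter


def blocks_touched(x, y, block_w=4, block_h=4):
    """Set of block coordinates covered by texels (x..x+1, y..y+1)."""
    return {((x + dx) // block_w, (y + dy) // block_h)
            for dx in (0, 1) for dy in (0, 1)}


def footprint_histogram(block_w=4, block_h=4):
    """Count footprints by number of blocks touched, over one block period."""
    return Counter(len(blocks_touched(x, y, block_w, block_h))
                   for y in range(block_h) for x in range(block_w))
```

Of the 16 possible footprint positions within a block period, 9 fall inside one block, 6 straddle two blocks and 1 straddles four, so hardware must be provisioned for the four-block case even though it is the rarest.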
Another scheme that mixes a block-based system with a palette-like approach has been presented by Ivanov & Kuzmin (“Color Distribution—A New Approach to Texture Compression”, Eurographics 2000). Here each block stores at least one base colour but a local palette is implied by allowing access to a certain set of the neighbouring blocks' base colours. In an example system, the local palette for a particular block may have access to the base colours from an additional 3 neighbouring blocks; for example, the choice might be to use the base colour from the block to the right, the one below, and the one diagonally below and to the right. Each texel in a block would thus be represented by a two bit index accessing one of the four available base colour choices. This system would need a cache of base colours in order for it to be efficient since it would still be expensive to repeatedly access the neighbouring colours. Note also that the worst-case situation for bilinear filtering with this scheme can involve access to any of nine different blocks. This is shown in FIG. 3, where to decode pixel 20, access to one of the base colours of blocks 21, 22, 23, and 24 is required while for pixel 25, access to one of 24, 26, 27, and 28 is needed.
Another texture compression scheme, called FXT1, was published in 1999 by 3dfx Interactive, Inc. This used 8×4 blocks, each of which could be compressed in four different ways. One such block mode (CC_MIXED) was similar to S3TC, while another (CC_CHROMA) stored a local 4 colour palette which could be indexed directly.
Although not directly applicable to texture compression, multi-resolution image analysis and wavelet techniques (e.g. as described in “Wavelets for Computer Graphics. Theory and Applications”, Stollnitz, DeRose, & Salesin. ISBN 1-55860-375-1) have been applied to image compression with some success. These techniques make use of the fact that a low-resolution version of an image, which is subsequently scaled up, is frequently a good approximation of the original image.
FIG. 4a illustrates this process with a grey-scale image (due to the restrictions of monochromatic printing). The source image, ‘40’, is filtered down by a factor of four in both the x and y dimensions to produce a low-resolution version, ‘41’, that has 16 times less data. In this example a linear wavelet has been applied twice in both the x and y directions and the difference signals discarded. This has then been bilinearly scaled up to produce the low frequency image, ‘42’. The difference between this and the original signal is shown (amplified for illustrative purposes) in ‘43’.
The difference signal needed to reconstruct the original image from the scaled version often requires very few bits per pixel. In fact frequently a lot of the data can be thrown away to give a lossy compression system. This works well for natural images, but graphics, e.g., line drawings or text, often has a much greater amount of information in the delta signal and so the technique may not produce a good compression rate for this class of images. (This is analogous to the ringing artefacts frequently seen around text in a JPEG-compressed diagram).
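The idea of a scaled-up low-resolution signal plus a small difference signal can be sketched in one dimension. This sketch substitutes a simple box average for the linear wavelet mentioned above and places the low-resolution samples at group centres; both choices are assumptions made to keep the example short.

```python
def downfilter(signal, factor=4):
    """Box-average each run of `factor` samples (assumed filter)."""
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, len(signal), factor)]


def upscale(low, factor=4):
    """Linearly interpolate back to full resolution, clamping at the ends."""
    n = len(low)
    out = []
    for i in range(n * factor):
        # Position of full-resolution sample i in low-resolution units,
        # with each low-resolution sample sitting at its group's centre.
        pos = min(max((i + 0.5) / factor - 0.5, 0.0), n - 1)
        k = min(int(pos), max(n - 2, 0))
        t = pos - k
        out.append((1 - t) * low[k] + t * low[min(k + 1, n - 1)])
    return out


def delta_signal(signal, factor=4):
    """Difference between the original and its low frequency approximation."""
    approx = upscale(downfilter(signal, factor), factor)
    return [s - a for s, a in zip(signal, approx)]
```

For a smooth ramp the delta is zero away from the ends, whereas a sharp edge (as in text or line art) leaves large delta values near the edge.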
The present invention aims to provide better quality compression per bit of compressed data than that of S3TC. It uses a fixed rate of encoding or compression with a reasonably simple decompression algorithm. Unlike the CCC or VQ systems, the present invention does not require a secondary data structure such as a colour look-up table.
Drawing from work in the related image processing field of wavelets, it has previously been noted that a down filtered and subsequently up-scaled image can be a reasonable approximation of the original. Such a signal will be referred to as a low frequency signal. The inventor has appreciated that such signals can be efficiently constructed by sharing data associated with neighbouring groups or blocks of pixels.
The difference between the low frequency signal and the original image can be computed and will be referred to as the difference data or delta signal. Furthermore, the inventor has appreciated that for a great many textures, the delta signal, i.e. the difference between the low frequency signal and the original image, is locally relatively monochromatic. For example the pixels in one local region of the delta image might be predominantly blue-yellow (i.e. complementary colours).
By using the delta signal and the low frequency signal, two new low frequency signals, A and B, may be constructed so that each pixel of the original image can be closely approximated by a per pixel linear blend of the A and B signals. Returning to FIG. 4a, image ‘50’ represents the image produced by the low frequency A signal, which in this example is an approximate ‘lower’ bound on the image, while ‘51’ represents the image produced by the low frequency B signal and gives an approximate ‘upper’ bound. Note that in this example, both the A and B signals contain at least 16 times less information than the original image. From A and B, two sets of reduced size data, A′ and B′ respectively, may be generated, and it is these sets of reduced size data, A′ and B′, that are stored as the compressed data.
To perform the linear blend, a modulation signal is also required. This is illustrated by image 52 in FIG. 4b. Each pixel, $\alpha_{y,x}$, in this image represents a fractional value, $0 \le \alpha_{y,x} \le 1$, and is chosen so that
$(1-\alpha_{y,x}) \cdot A_{y,x} + \alpha_{y,x} \cdot B_{y,x}$ approximately equals the corresponding pixel in the original image.
Decompression of the texture proceeds by identifying the pixel to be decompressed and obtaining its modulation value, ‘53’. The data required to generate the low frequency A and B signals at that pixel are also obtained and used to produce the corresponding pixels in A, ‘54’, and B, ‘55’. These are blended, ‘56’, to produce the decompressed pixel, ‘57’. Applying this process to all the pixels decompresses the entire image, ‘58’.
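The decompression steps above can be sketched for a single-channel texture as follows. This is an illustrative reading only: bilinear up-scaling, representatives placed at block centres, clamping at the texture edges and a modulation value already expanded to a fraction are all assumptions, and the function names are hypothetical.

```python
def upscale_at(low, x, y, factor=4):
    """Bilinearly interpolate the low-resolution grid at full-res texel (x, y)."""
    height, width = len(low), len(low[0])
    # Assumed alignment: each representative sits at the centre of its block.
    px = min(max((x + 0.5) / factor - 0.5, 0.0), width - 1)
    py = min(max((y + 0.5) / factor - 0.5, 0.0), height - 1)
    x0 = min(int(px), max(width - 2, 0))
    y0 = min(int(py), max(height - 2, 0))
    fx, fy = px - x0, py - y0
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    top = (1 - fx) * low[y0][x0] + fx * low[y0][x1]
    bottom = (1 - fx) * low[y1][x0] + fx * low[y1][x1]
    return (1 - fy) * top + fy * bottom


def decompress_texel(low_a, low_b, modulation, x, y, factor=4):
    """Blend the up-scaled A and B signals using the per-texel modulation."""
    a = upscale_at(low_a, x, y, factor)   # pixel of signal A
    b = upscale_at(low_b, x, y, factor)   # pixel of signal B
    alpha = modulation[y][x]              # fractional value in [0, 1]
    return (1 - alpha) * a + alpha * b
```

Because A and B vary smoothly across block boundaries, neighbouring blocks share representatives and the reconstruction has no built-in discontinuity at the block edges.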
Note that the A and B signals, prior to up scaling, can be at relatively low resolution. For example, resolutions of four to eight times lower than the original texture will usually produce good quality images. Furthermore, very little per-pixel modulation data, typically one or two bits per pixel, is required. This is partly due to the fact that the human eye tends to mask noise in regions with large changes in luminance (according to Delp and Mitchell).
The present invention maintains the advantages of simple addressing that come from BTC and its variants, yet avoids many of the discontinuity problems that frequently occur at the block boundaries in that scheme. The invention is also optimised to decompress, in parallel, the four pixels required for a bilinear filtering operation.
The A & B signals can be produced by numerous functions, however tensor product surfaces, such as bilinear or bicubic functions, are suggested as a means of up scaling. This choice represents a trade-off of quality versus implementation cost.
The modulation data will typically be stored in an array of ‘storage’ blocks, with each ‘storage’ block containing information to produce N×M modulation values, where N and M typically correspond to the scaling factors used for the low frequency signals. The low resolution information needed to reconstruct the A & B signals can either be stored in a separate array, or alternatively one A&B low resolution pair could be kept inside each of the ‘storage’ blocks. The former system is advantageous if the up-scaling function is one such as bicubic which requires numerous data control points to perform the interpolation. The latter method is useful for simpler up-scaling functions such as bilinear which typically require few data control points.
If a system uses the former ‘separate data storage’ system and also maintains a cache of recently fetched modulation and A&B values, it is beneficial for bandwidth reasons to arrange the behaviour of the up-scaling function so that the A & B values needed to decompress a particular section of texels are ‘out of phase’ with the required block modulation data. That is, as a run of texels is accessed, the cache will generally be alternately fetching modulation data and then A&B values, which should even out the requests to external memory. This is illustrated in FIG. 4c. In this example bilinear upscaling of the A & B data has been used. It shows a set of texels, ‘70’, belonging to the image, of which texel ‘71’ is to be decompressed. The ‘storage’ block containing the modulation value associated with texel ‘71’ contains the modulation values for 4×4 adjacent texels, ‘72’. Because bilinear upscaling has been used, (up to) four pairs of A & B representative values, ‘73a’, ‘73b’, ‘73c’, and ‘73d’, are needed to generate the low frequency A & B signals for texel ‘71’. The adjacent texel, ‘74’, uses the same set of modulation values, ‘72’, but needs a slightly different set of A&B values, i.e. ‘73c’, ‘73d’, ‘73e’, and ‘73f’. It is assumed that previous values used for texel ‘71’ will still be residing in a cache. It should be appreciated that if the representative values for A & B were instead aligned on the corners of block ‘72’, then, although the average rate of data access would be the same, the peak data rate would have to be higher as both new modulation and A&B values would be fetched simultaneously. A second reason for preferring the out-of-phase arrangement is that it is natural to align blocks of modulation information with integer multiples of the block sizes. If the A&B values were also so aligned, the representatives would be spread unevenly at the edges of the texture, with a preference for the left and top edges.
The invention could also be extended to use a greater number of signals. For example, a third channel, C, could be introduced along with an additional per pixel modulation value. Barycentric coordinates could then be used to blend the signals.
The invention allows expansion of four neighbouring elements in parallel by, for example, bilinear expansion or using tensor product surfaces. It recursively identifies quadrants and uses simple 90 degree rotations to map each of the four quadrant cases down to a single case. The calculations may be carried out at low precision and expanded up to full precision.
The invention provides a means of encoding n-dimensional source data arranged in an m-dimensional array. The apparatus requires a means of generating P data signals, R, where P is two or greater, and where each of the signals also consists of n-dimensional data arranged in an m-dimensional array. Each of the R signals has fewer representative data elements than the original source data. Also provided is a means of obtaining or generating modulation information, M. The modulation data is also arranged in an m-dimensional array with the same number of elements as the original source data. Each datum in M consists of (P−1) scalar values. A means of storing the M and R data is provided. A means of interpolating each of the P individual R data signals to a one to one correspondence with that of the original source data and a means of applying M to the interpolated R data are provided. For each datum of the original source image, the corresponding M datum, consisting of (P−1) scalar values, is used to interpolate via barycentric co-ordinates the corresponding P data items of the interpolated R signals. A means is also provided for optimising the stored R and M data such that, for the application of the M data to the interpolated R data, the result is near to the corresponding original datum.
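One way to read the barycentric interpolation of P signals from (P−1) scalars is sketched below. The convention that the implied weight (one minus the sum of the stored scalars) applies to the first signal is an assumption, chosen so that P = 2 reduces to a linear blend of the form (1−α)·A + α·B; the function name is hypothetical.

```python
def barycentric_blend(values, weights):
    """Blend P n-dimensional values using P-1 stored scalar weights.

    `values` are the P interpolated R data items for one datum (tuples of
    n components); `weights` is the corresponding M datum.
    Assumption: the first value receives the implied weight 1 - sum(weights).
    """
    if len(weights) != len(values) - 1:
        raise ValueError("expected exactly P-1 weights for P values")
    full = [1.0 - sum(weights)] + list(weights)  # weights now sum to one
    components = len(values[0])
    return tuple(sum(w * v[c] for w, v in zip(full, values))
                 for c in range(components))
```

With P = 2 and n = 3 this is the per-pixel colour blend of two signals; with P = 3 it blends three signals A, B and C from two stored scalars.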
The invention in its various aspects is defined in the independent claims below, to which reference should now be made. Advantageous features are set forth in the appendant claims.
Further adaptations and applications of embodiments of the invention will be described later.