As semiconductor technology continues to inch closer to practical limits on increases in clock speed, architects are increasingly focusing on parallelism in processor architectures to obtain performance improvements. At the chip level, multiple processor cores are often disposed on the same chip, functioning in much the same manner as separate processor chips, or to some extent, as completely separate computers. In addition, even within cores, parallelism is employed through the use of multiple execution units that are specialized to handle certain types of operations. Pipelining is also employed in many instances so that certain operations that may take multiple clock cycles to perform are broken up into stages, enabling other operations to be started prior to completion of earlier operations. Multithreading is also employed to enable multiple instruction streams to be processed in parallel, enabling more overall work to be performed in any given clock cycle.
In addition, as processor architectures improve in terms of raw performance, other considerations, such as the communication costs of storing and retrieving data, become significant factors in overall performance. Data is typically organized within a memory address space that represents the addressable range of memory addresses that can be accessed by a processor. Both the instructions forming a computer program and the data operated upon by those instructions are often stored in a memory system and retrieved as necessary by a processor when executing the computer program. In order to balance cost, performance, and storage capacity, multi-level memory architectures have been developed.
Often, a computer relies on a relatively large, slow and inexpensive mass storage system such as a hard disk drive or other external storage device, an intermediate main memory that uses dynamic random access memory devices (DRAMs) or other volatile memory storage devices, and one or more high speed, limited capacity cache memories, or caches, implemented with static random access memory devices (SRAMs) or the like (e.g., L1, L2, and L3 caches). In some instances, instructions and data are stored in separate instruction and data cache memories to permit instructions and data to be accessed in parallel. One or more memory controllers are then used to swap the information from segments of memory addresses, often known as “cache lines”, between the various memory levels to attempt to maximize the frequency with which requested memory addresses are found in the fastest cache memory accessible by the microprocessor. Whenever a memory access request attempts to access a memory address that is not cached in a cache memory, a “cache miss” occurs. As a result of a cache miss, the cache line for a memory address typically must be retrieved from a relatively slow, lower level memory, often with a significant performance hit.
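By way of illustration only, the relationship between memory addresses, cache lines, and cache misses can be sketched in Python as follows. The direct-mapped organization, the 64-byte line size, the four-line capacity, and the function name `count_cache_misses` are all assumptions chosen for brevity and are not drawn from any particular processor design.

```python
def count_cache_misses(addresses, line_size=64, num_lines=4):
    """Count misses for a sequence of byte addresses.

    Models a hypothetical direct-mapped cache: each cache line number
    maps to exactly one slot, and a new line evicts whatever is there.
    """
    slots = [None] * num_lines  # line number currently held in each slot
    misses = 0
    for address in addresses:
        line_number = address // line_size  # which cache line holds this byte
        slot = line_number % num_lines      # the only slot that line may occupy
        if slots[slot] != line_number:
            # Cache miss: the line must be fetched from a lower memory level.
            misses += 1
            slots[slot] = line_number
    return misses


if __name__ == "__main__":
    # Addresses 0 and 8 share a line; 256 maps to the same slot as 0.
    print(count_cache_misses([0, 8, 64, 0, 256]))
```

In this sketch, accesses that fall within an already cached line are hits, while an access to a line that maps to an occupied slot evicts the earlier line, so a later return to the evicted line would miss again.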
In some designs, prefetching is used to reduce the impact of cache misses, and may be used both with instructions and the data processed by instructions. From the standpoint of instructions, prefetching typically refers to transferring instructions into a processor, or into a cache or other memory storage element disposed within or accessible by the processor, prior to an attempt to issue the instruction by the instruction issue logic for the processor. Likewise, for non-instruction data, prefetching typically refers to transferring data into a processor, or into a cache or other memory storage element disposed within or accessible by the processor, prior to a request for the data being generated via execution of an instruction within the processor.
For program instructions, for example, instruction prefetching is often used to attempt to initiate the retrieval of instructions into an instruction cache or buffer before the instructions are needed by a processor. Branch prediction may also be used, for example, to predict whether certain conditional branches will be taken, with prefetching used to initiate the retrieval of the instructions that are expected to follow the conditional branches if those branches are or are not taken.
Prefetching may also be used for other types of data, and often takes advantage of the fact that data requests are often somewhat regular in nature for a particular set of program instructions. Stride-based prefetching, for example, takes advantage of the fact that data is often retrieved in a pattern having a relatively constant offset, referred to as a “stride”, between successive accesses. Stride-based prefetching typically determines a difference between the memory addresses of consecutive accesses, and then prefetches additional data located one or more multiples of that difference from the prior accesses.
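A minimal sketch of the stride calculation described above is given below in Python. The function name `stride_prefetch_addresses`, the `degree` parameter, and the requirement that the two most recent differences match before any prefetch is issued are illustrative assumptions; actual prefetchers vary in how they confirm a stride.

```python
def stride_prefetch_addresses(accesses, degree=2):
    """Return the addresses a simple stride prefetcher might request.

    accesses: observed memory addresses, oldest first.
    degree:   number of multiples of the stride to prefetch (assumed knob).
    """
    if len(accesses) < 3:
        return []  # not enough history to confirm a stride
    last_stride = accesses[-1] - accesses[-2]
    prev_stride = accesses[-2] - accesses[-3]
    # Assumption: prefetch only when the two latest differences agree
    # and are non-zero, i.e. the access pattern looks regular.
    if last_stride == 0 or last_stride != prev_stride:
        return []
    # Prefetch one or more multiples of the stride past the last access.
    return [accesses[-1] + k * last_stride for k in range(1, degree + 1)]


if __name__ == "__main__":
    history = [0x1000, 0x1040, 0x1080]  # constant stride of 0x40 bytes
    print([hex(a) for a in stride_prefetch_addresses(history)])
```

With a history whose differences are not constant, the sketch issues no prefetches, which reflects the dependence of stride-based prefetching on regular access patterns.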
Caching and prefetching often provide substantial performance gains in data-intensive algorithms that rely heavily on retrieved data. One area that is particularly data-intensive is rasterization, and in particular texture processing performed in a rasterization process utilized in various image processing applications. Rasterization is a process in 3D graphics where three dimensional geometry that has been projected onto a screen is “filled in” with pixels of the appropriate color and intensity. A texture mapping algorithm is typically incorporated into a rasterization process to paint a texture onto geometric objects placed into a scene.
In order to paint a texture onto a placed object in a scene, the pixels in each primitive making up the object are typically transformed from 3D scene or world coordinates (e.g., x, y and z) to 2D coordinates relative to a procedural or bitmapped texture (e.g., u and v). The fundamental elements in a texture are referred to as texels (or texture pixels), and being the fundamental element of a texture, each texel is associated with a single color. Due to differences in orientation and distance of the surfaces of placed geometric primitives relative to the viewer, a pixel in an image buffer will rarely correspond to a single texel in a texture. As a result, texture filtering is typically performed to determine a color to be assigned to a pixel based upon the colors of multiple texels in proximity to the texture mapped position of the pixel.
A number of texture filtering algorithms may be used to determine a color for a pixel, including simple interpolation, bilinear filtering, trilinear filtering, and anisotropic filtering, among others. With many texture filtering algorithms, weights are calculated for a number of texels adjacent to the texture mapped position of a pixel, the weights are used to scale the colors of the adjacent texels, and a color for the pixel is assigned by summing the scaled colors of the adjacent texels. The color is then either stored at the pixel location in a frame buffer, or used to update a color that is already stored at the pixel location.
Bilinear filtering, for example, uses the coordinates of a texture sample to perform a weighted average of the four adjacent texels, each weighted according to how close the sample coordinates are to the center of that texel. Bilinear filtering can often reduce the blockiness of closer details, but does little to reduce the noise commonly found in distant details.
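The weighted average performed by bilinear filtering may be sketched as follows in Python. The sketch assumes, purely for illustration, that the texture is a list of rows of color tuples, that texel centers lie at integer coordinates, and that sample positions outside the texture are clamped to its edge; the name `bilinear_sample` is hypothetical.

```python
import math


def bilinear_sample(texture, x, y):
    """Blend the four texels nearest to (x, y), in texel units."""
    height = len(texture)
    width = len(texture[0])
    # Assumption: clamp the sample position to the texture bounds.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0 = int(math.floor(x))
    y0 = int(math.floor(y))
    x1 = min(x0 + 1, width - 1)
    y1 = min(y0 + 1, height - 1)
    fx = x - x0  # horizontal distance past the left texel center
    fy = y - y0  # vertical distance past the top texel center
    # Each weight grows as the sample nears that texel's center;
    # the four weights always sum to one.
    weighted = [
        ((1 - fx) * (1 - fy), texture[y0][x0]),
        (fx * (1 - fy), texture[y0][x1]),
        ((1 - fx) * fy, texture[y1][x0]),
        (fx * fy, texture[y1][x1]),
    ]
    channels = len(texture[0][0])
    return tuple(sum(w * color[i] for w, color in weighted)
                 for i in range(channels))


if __name__ == "__main__":
    black, white = (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)
    checker = [[black, white], [white, black]]
    print(bilinear_sample(checker, 0.5, 0.5))  # midway between all four texels
```

A sample located exactly on a texel center receives that texel's color unchanged, while a sample midway between four texels receives an equal blend of all four.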
Trilinear filtering involves using MIP mapping, which uses a set of prefiltered texture images that are scaled to successively lower resolutions. The algorithm uses texture samples from the high resolution textures for portions of the geometry near the camera, and low resolution textures for the portions distant from the camera. MIP mapping often reduces nearby pixelation and distant noise; however, detail in the distance is often lost and needlessly blurred. The blurriness is due to the texture samples being taken from a MIP level of the texture that has been pre-scaled to a low resolution in both the x and y dimensions uniformly, such that resolution is lost in the direction perpendicular to the direction in which the texture is most compressed.
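The uniform pre-scaling of MIP levels may be illustrated with the following Python sketch, which builds a chain of successively halved images. The sketch assumes a square, power-of-two, single-channel texture and a simple 2×2 box filter; the name `build_mip_chain` and the choice of filter are assumptions, and other prefilters may be used.

```python
def build_mip_chain(level0):
    """Return [level0, level1, ...] down to a 1x1 image.

    Assumes level0 is a square list of rows whose side is a power of two.
    Each coarser texel is the average of a 2x2 block of the finer level,
    so resolution is halved in x and y alike.
    """
    chain = [level0]
    current = level0
    while len(current) > 1:
        half = len(current) // 2
        current = [
            [
                (current[2 * y][2 * x] + current[2 * y][2 * x + 1]
                 + current[2 * y + 1][2 * x] + current[2 * y + 1][2 * x + 1]) / 4.0
                for x in range(half)
            ]
            for y in range(half)
        ]
        chain.append(current)
    return chain


if __name__ == "__main__":
    # A 4x4 checkerboard of 0.0 and 1.0 values.
    checker = [[float((x + y) % 2) for x in range(4)] for y in range(4)]
    for level in build_mip_chain(checker):
        print(len(level), level)
```

In this example the checkerboard pattern collapses to a uniform gray at the first reduced level, which shows how detail is discarded equally in both dimensions regardless of the direction in which the texture is actually compressed on screen.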
Anisotropic filtering involves taking multiple samples along a “line of anisotropy” which runs in the direction that the texture is most compressed. Each of these samples may be bilinear or trilinear filtered, and the results are then averaged together. This algorithm allows the compression to occur in only one direction. By doing so, less blurring often occurs in more distant features.
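A minimal Python sketch of sampling along a line of anisotropy is given below. The even spacing of the samples, the representation of the line as a center point plus a vector (`dx`, `dy`) in texel units, and the names `anisotropic_sample` and `sample_fn` are assumptions made for illustration; `sample_fn` stands in for whatever bilinear or trilinear lookup is applied to each individual sample.

```python
def anisotropic_sample(sample_fn, cx, cy, dx, dy, num_samples):
    """Average num_samples lookups spread along a line of anisotropy.

    sample_fn:   callable (x, y) -> color tuple, e.g. a bilinear lookup.
    (cx, cy):    center of the line in texel units.
    (dx, dy):    vector spanning the full length of the line.
    """
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    total = None
    for i in range(num_samples):
        # Place each sample at the middle of one of num_samples equal
        # segments, giving offsets in the range (-0.5, 0.5) along the line.
        t = (i + 0.5) / num_samples - 0.5
        color = sample_fn(cx + t * dx, cy + t * dy)
        if total is None:
            total = tuple(color)
        else:
            total = tuple(a + b for a, b in zip(total, color))
    # Equal-weight average of all samples.
    return tuple(channel / num_samples for channel in total)


if __name__ == "__main__":
    # Stand-in lookup that simply returns its coordinates as a "color".
    def coordinates(x, y):
        return (x, y)

    print(anisotropic_sample(coordinates, 10.0, 5.0, 8.0, 0.0, 4))
```

Each call to `sample_fn` in this sketch corresponds to a separate texture lookup, so the number of lookups, and hence the amount of texture data touched, grows in proportion to the number of samples taken.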
However, it has been found that the performance of anisotropic filtering is greatly dependent upon the number of samples taken along the line of anisotropy. Larger numbers of samples greatly improve image quality, but also greatly increase the processing overhead of the algorithm. In addition, high quality anisotropic filtering often introduces substantial memory bandwidth limitations due to the need to retrieve the texture data necessary to perform the required calculations.
For example, assuming textures are stored in memory uncompressed and each texel uses 16 bytes of RGB color and alpha channel information, at a setting of 16 texel samples per line of anisotropy, the memory traffic for each pixel of a rasterized polygon using anisotropic filtering would involve loading 256 bytes. Assuming, for example, a resolution of 1024×768 pixels and animation running at 30 frames per second, the bandwidth needed for a full screen of anisotropic filtered textures would be approximately 6.0 gigabytes per second (256 bytes × 786,432 pixels × 30 frames per second, or about 5.6 GiB per second).
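The arithmetic behind this estimate can be reproduced with the short Python sketch below. It assumes, as the example does, that one 16-byte texel is loaded per sample and that no loaded texel is reused between pixels; the function name `texture_bandwidth_bytes_per_second` is hypothetical.

```python
def texture_bandwidth_bytes_per_second(bytes_per_texel, samples_per_pixel,
                                       width, height, frames_per_second):
    """Bytes of texture data loaded per second, assuming no cache reuse."""
    bytes_per_pixel = bytes_per_texel * samples_per_pixel
    pixels_per_frame = width * height
    return bytes_per_pixel * pixels_per_frame * frames_per_second


if __name__ == "__main__":
    total = texture_bandwidth_bytes_per_second(16, 16, 1024, 768, 30)
    print(total)                       # bytes per second
    print(round(total / 10**9, 2))     # decimal gigabytes per second
    print(round(total / 2**30, 3))     # binary gibibytes per second
```

Under these assumptions the product is 6,039,797,760 bytes per second, which is about 6.0 gigabytes per second in decimal units or 5.625 GiB per second in binary units.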
Given the high bandwidth required, it is crucial for performance reasons to ensure that as much as possible of the required texture data is cached in a processor, as otherwise the time required to retrieve the texture data into the cache would introduce substantial delays in an anisotropic filtering algorithm. However, particularly with larger textures and/or with long lines of anisotropy, even when texture compression is used, the texture data will often span numerous cache lines, so the required texture data is often not cached when it is needed by an anisotropic filtering algorithm, leading to unacceptable performance degradation.
Therefore, a need exists in the art for a manner of improving the performance of anisotropic filtering algorithms, particularly from the standpoint of minimizing the performance penalties associated with having to retrieve texture data in association with such algorithms.