The present invention relates to methods and apparatus for controlling a cache memory and, more particularly, to methods and apparatus for pre-fetching data into the cache memory.
In recent years, there has been an insatiable desire for faster computer processing data throughputs because cutting-edge computer applications are becoming more and more complex, and are placing ever increasing demands on microprocessing systems. Graphics applications are among those that place the highest demands on a microprocessing system because they require such vast numbers of data accesses, data computations, and data manipulations in relatively short periods of time to achieve desirable visual results. Conventional microprocessors have very rapid cycle times (i.e., the unit of time in which a microprocessor is capable of manipulating data), such as two nanoseconds, although the time required to access data stored in main memory may be considerably higher than the cycle time of the microprocessor. For example, the access time required to obtain a byte of data from a main memory implemented using dynamic random access memory (DRAM) technology is on the order of about 60 nanoseconds.
In order to ameliorate the bottleneck imposed by the relatively long access time to DRAM memory, those skilled in the art have utilized cache memories. Cache memory is significantly faster than DRAM memory, and augments the function of data storage provided by the main memory. For example, an L2 cache memory may be coupled externally to the microprocessor or an L1 cache memory may be coupled internally with the microprocessor, which memories are significantly faster than a main memory implemented utilizing DRAM technology. An L2 cache memory may be implemented utilizing, for example, static random access memory (SRAM) technology, which is approximately two to three times faster than DRAM technology. An L1 cache memory is even faster than an L2 cache memory.
Due to the relatively high cost of cache memories, they are typically much smaller than main memory. Consequently, conventional algorithms have been employed to determine what data should be stored in the cache memory. These conventional algorithms may be based on, for example, the theoretical concept of xe2x80x9clocality of reference,xe2x80x9d which takes advantage of the fact that relatively small portions of a very large executable program are used at any particular point in time. Thus, in accordance with the concept of locality of reference, only the small portions of the overall executable program are stored in cache memory at any particular point in time.
The particularities of the known algorithms for taking advantage of locality of reference, or any other concept, for controlling the storage of data in a cache memory are too numerous to present in this description. Suffice it to say, however, that any given algorithm may not be suitable in all applications as the data processing goals of various applications may differ significantly.
In graphics applications, for example, strong data storage localities may exist in main memory, which localities are a function of the coordinates of pixel data within main memory. More particularly, the main memory utilized in a graphics application may be conceptually organized into a two-dimensional plane, where the information for each pixel in an image is stored at a particular X, Y coordinate. The main memory utilized in a graphics application may be further organized into a plurality of rectangular blocks, where each block contains the information for a plurality of pixels. The addressing of the pixels within a particular block may be highly linear, i.e., the access to the pixel information may be highly sequential within a given rectangular block.
Thus, conventional algorithms for controlling a cache memory in a graphics application typically dictate that a given cache line stores the pixel information for an entire rectangular block. When a cache hit occurs (i.e., when a memory access request for a particular byte of data can be satisfied by accessing the cache memory), the high linearity of the pixel information within the given rectangular block is likely to result in a cache hit for the next data access request. On the other hand, when a cache miss occurs (i.e., the memory access request for a particular byte of data cannot be satisfied by accessing the cache memory), all of the pixel information within the rectangular block in which the desired byte of data is located is read from main memory and stored in the cache memory. Unfortunately, these conventional algorithms do not adequately exploit the characteristics associated with the linearity and strong localities of memories used in graphics applications. Indeed, such algorithms result in an undesirable number of cache misses and, therefore, result in less efficient memory control, lower processing throughput, and inferior graphics imaging.
Accordingly, there are needs in the art for new methods and apparatus for controlling an overall memory, including a cache memory, that exploit the linearity and strong localities exhibited in memories used in graphics applications in order to improve memory efficiency, increase processing throughput, and improve the quality of graphics images.
In accordance with at least one aspect of the present invention, an apparatus for controlling a memory includes: a main memory partitioned into a plurality of two-dimensional blocks, each block including a plurality of data units containing one or more bytes of data, and each given block being partitioned into a plurality of zones, each zone containing one or more of the data units and at least some of the zones being associated with a respective block adjacent to the given block; an address decoder operable to receive addresses to the bytes of data in the main memory, and to decode the addresses to determine in which zones the respective data bytes are located; and a pre-fetch address generator operable to generate a request to pre-fetch the blocks that are associated with respective zones into a cache memory when the address decoder receives addresses to bytes of data located in the respective zones.
By way of example, each of the blocks may include at least one of: a left boundary common with a left adjacent block, a right boundary common with a right adjacent block, an upper boundary common with an upper adjacent block, and a lower boundary common with a lower adjacent block. Further, each of the blocks may include at least one of: a left zone associated with the left adjacent block and including at least one data unit disposed along the left boundary, a right zone associated with the right adjacent block and including at least one data unit disposed along the right boundary, an upper zone associated with the upper adjacent block and including at least one data unit disposed along the upper boundary, and a lower zone associated with the lower adjacent block and including at least one data unit disposed along the lower boundary.
Utilizing this partitioning: the left zone may be defined at least in part by a left dividing line extending from the upper boundary to the lower boundary, the left zone including at least some of the data units located between the left boundary and the left dividing line; the right zone may be defined at least in part by a right dividing line extending from the upper boundary to the lower boundary, the right zone including at least some of the data units located between the right boundary and the right dividing line; the upper zone may be defined at least in part by an upper dividing line extending from the left boundary to the right boundary, the upper zone including at least some of the data units located between the upper boundary and the upper dividing line; and the lower zone may be defined at least in part by a lower dividing line extending from the left boundary to the right boundary, the lower zone including at least some of the data units located between the lower boundary and the lower dividing line.
Preferably, the left, right, upper, and lower dividing lines correspond to certain bits of the addresses, and the address decoder is operable to decode the addresses to determine in which zones the respective data bytes are located based on the certain bits.
By way of example, each block may include a central zone containing the one or more data units located between the left, right, upper, and lower dividing lines; and the central zone is preferably not associated with any adjacent block. Preferably, each block further includes at least one of: an upper left zone containing the data units located between the left boundary, the upper boundary, the upper dividing line, and the left dividing line (and the upper left zone is not associated with any adjacent block); a lower left zone containing the data units located between the left boundary, the lower boundary, the lower dividing line, and the left dividing line (and the lower left zone is not associated with any adjacent block); an upper right zone containing the data units located between the right boundary, the upper boundary, the upper dividing line, and the right dividing line (and the upper right zone is not associated with any adjacent block); and a lower right zone containing the data units located between the right boundary, the lower boundary, the lower dividing line, and the right dividing line (and the lower right zone is not associated with any adjacent block).
In accordance with at least one further aspect of the present invention, a method includes: partitioning a main memory into a plurality of two-dimensional blocks, and partitioning each block into a plurality of data units containing one or more bytes of data; partitioning each block of the main memory into a plurality of zones, each zone containing one or more of the data units; associating at least some of the respective zones of each block with others of the blocks; and pre-fetching at least a portion of one of the other blocks into a cache memory when any one of the data units within any of the associated zones of that block is addressed.
Preferably, each of the at least some zones of each block is associated with one adjacent block. Further, at least one zone of each block is preferably disposed along a boundary between the given block and an associated, adjacent block.
Other aspects, features, advantages, etc. will become apparent to one skilled in the art when the description of the invention herein is taken in conjunction with the accompanying drawings.