The present invention relates in general to graphics systems, and in particular to methods and apparatus for prefetching cache lines in a graphics system.
The sophistication of the market for computer and video graphics and games has exploded over the last few years. The time when simple games such as "Pong" were marketable products is far in the past. Today's gamers and computer users expect realistic three dimensional (3-D) images, whether the images are of a football game, race track, or new home's interior. Accordingly, this appetite has focused designers' efforts on improving the quality of the images produced by graphics systems in computers and video games.
Increasing the realism of video requires a higher screen resolution as well as displaying items as 3-D contoured objects, rather than simple two dimensional (2-D) pictures. These 3-D objects can be separated into 3-D shapes covered by a 2-D or 3-D texture.
A monitor's maximum resolution is set by the number of pixels on its screen. In color monitors, each pixel is made up of a red, green and blue "dot" in close proximity to one another. By varying the intensity of the "dots", the color and brightness of the pixel can be changed. The more pixels on a screen, the more realistic an image will appear. For example, if a typical tire on a race car is represented on the screen by one pixel, that pixel will be black. A single black speck on a screen would not make for a very impressive tire. But if the tire is represented by many pixels, then details such as shape, hub caps, and lug nuts can be seen, and the image is more convincing. To add more realism, a texture, for example tire tread, can be added. Where the rubber meets the road, an asphalt texture may be used.
These textures are stored in memory, and are retrieved as required by the graphics system. They may be two dimensional or three dimensional. Two dimensional textures are two dimensional images, and the dimensional coordinates are typically labeled either s and t, or u and v. In systems using a conventional bilinear filter, four pieces of texture information, referred to as texels, are used to determine the texel value, which is the texture information for one pixel. 16 bits is a common size for each texel. Alternatively, texels may be 4, 8, 32, or any other integral number of bits in size. Three dimensional textures are sets of two dimensional textures, and the coordinates are usually labeled s, t, and r. Trilinear filtering is common in systems supporting three dimensional textures, and uses 8 texels to determine the texture information for one pixel.
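As a rough illustration of how four texels might be combined into one texel value, the following Python sketch performs a conventional bilinear blend. The function name, the texel-unit coordinate convention, the list-of-rows texture layout, and the clamping at the texture border are assumptions made for this example and are not taken from the description above.

```python
import math

def bilinear_sample(texture, s, t):
    """Blend the four texels nearest to (s, t); coordinates are in texel units.

    `texture` is a list of rows of numeric texel values (hypothetical layout).
    """
    height, width = len(texture), len(texture[0])
    s0, t0 = math.floor(s), math.floor(t)
    fs, ft = s - s0, t - t0  # fractional position between neighboring texels

    def clamp(value, upper):
        # Assumed edge behavior: reuse the border texel outside the texture.
        return min(max(value, 0), upper)

    left, right = clamp(s0, width - 1), clamp(s0 + 1, width - 1)
    top, bottom = clamp(t0, height - 1), clamp(t0 + 1, height - 1)

    # Interpolate along s on both rows, then along t between the two rows.
    upper_row = texture[top][left] * (1 - fs) + texture[top][right] * fs
    lower_row = texture[bottom][left] * (1 - fs) + texture[bottom][right] * fs
    return upper_row * (1 - ft) + lower_row * ft
```

For a 2 x 2 texture with values 0, 10, 20, and 30, sampling at the center (0.5, 0.5) weights all four texels equally and yields 15.0.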
But this means that a huge amount of information is needed to supply the textures for a video image. For example, a conventional monitor screen having a 1280 × 1024 pixel resolution with a 75 Hz refresh rate requires about 100M pixels per second (1280 × 1024 × 75 is approximately 98.3M). Since four 16 bit texels are used for each pixel, such a system operates at 6,400M bits per second, or 800M bytes per second.
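The bandwidth figures above can be checked with a few lines of Python. This is only a worked calculation; the function name and default parameters are chosen for this example.

```python
def texel_bandwidth(pixels_per_second, texels_per_pixel=4, bits_per_texel=16):
    """Return (bits per second, bytes per second) of texel traffic."""
    bits = pixels_per_second * texels_per_pixel * bits_per_texel
    return bits, bits // 8

# Exact pixel rate for a 1280 x 1024 screen refreshed at 75 Hz.
exact_rate = 1280 * 1024 * 75             # 98,304,000 pixels per second

rounded = texel_bandwidth(100_000_000)    # (6_400_000_000, 800_000_000)
exact = texel_bandwidth(exact_rate)       # (6_291_456_000, 786_432_000)
```

Using the rounded rate of 100M pixels per second reproduces the 6,400M bits per second and 800M bytes per second stated in the text; the exact pixel rate gives slightly lower values.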
This texel information is stored in memory for fast access by the graphics controller. Preferably it would all be stored in memory on the same chip as the other graphics system elements, using fast circuitry, such as static random access memory (SRAM). But SRAMs are large, and have high operating currents, so the die area and power costs are prohibitive.
A conventional solution to the problem of making a fast but cost effective memory is to use an architecture type known as a memory hierarchy. The concept behind memory hierarchy is to use a smaller amount of SRAM, preferably on-chip, and have a larger memory off-chip using less expensive circuitry, such as dynamic random access memory (DRAM). This way, some data needed quickly by the graphics controller is readily available in the on-chip fast SRAM, while the bulk of the data waits in the DRAM. If the controller needs data that is not available in the SRAM, it can pull the data from the DRAM and overwrite existing data in the SRAM. In this system, the SRAM is known as the cache, and the DRAM is the main memory. Memory hierarchy systems using cache may be used for storing texels in graphics systems.
FIG. 1 is a block diagram illustrating one such conventional system. Central processing unit (CPU) 100 can access data directly from cache memory 110. If the required data is not present, a copy is moved from the main memory 120 to the cache memory 110. Extra capacity, and storage when the system is powered down, are provided by an input/output device such as a disk 130. Each element in the memory hierarchy from left to right has a slower access time, but has a lower per bit storage cost. In this way a system may be optimized for both access time and cost.
The CPU 100 uses the data in the cache memory 110 by making requests for data to cache 110 and reading data from the same. If the CPU 100 requests data not present in cache 110, a cache miss is said to have occurred. In this case, the cache will retrieve data from the main memory 120, store it, and provide it to the CPU 100. Similarly, if the main memory 120 does not contain the required data, the main memory 120 will retrieve data from the disk 130. If CPU 100 requests data which is present in cache 110, a cache hit is said to have occurred, and the data does not need to be retrieved from the main memory 120.
Data may be found in the main memory and stored in cache according to its frame address. A frame address may be divided into three portions: the tag, index, and offset. Generally, the tag is the higher order bits of the frame address, the offset is the lower order bits, and the index is between them. The index determines the location of a data block in cache; the location is referred to as a cache line. The offset identifies the location of a texel in a cache line. The tag specifies which data block in memory provided the data in the cache line. The tag is generally stored in a table, such that the tag for the data block stored in each cache line may be read.
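The division of a frame address into tag, index, and offset can be sketched as follows. The field widths are parameters chosen for illustration; the description above does not specify them, and the function name is hypothetical.

```python
def split_address(address, index_bits, offset_bits):
    """Split an address into (tag, index, offset).

    The offset occupies the low-order bits, the index the bits above it,
    and the tag the remaining high-order bits.
    """
    offset = address & ((1 << offset_bits) - 1)
    index = (address >> offset_bits) & ((1 << index_bits) - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, offset

# A 12 bit address with a 4 bit index and a 4 bit offset.
tag, index, offset = split_address(0b1011_0110_1101, index_bits=4, offset_bits=4)
# tag == 0b1011, index == 0b0110, offset == 0b1101
```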
A required texel's address is used in finding that texel in cache. The index is used to identify which cache line may be holding the required texel. The tags of these cache lines are compared against the tag of the required texel. If there is a match, the required texel can be found in the matching cache line at the offset. If there is no match, the data block with the matching tag is retrieved from memory and placed in cache.
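The lookup just described may be sketched for the simplest case, in which each index selects exactly one cache line. The class name, the use of a flat Python list to stand in for main memory, and the miss counter are assumptions made for this example.

```python
class DirectMappedCache:
    """Minimal model: one cache line per index, one stored tag per line."""

    def __init__(self, memory, index_bits, offset_bits):
        self.memory = memory              # flat list of texels (stand-in for DRAM)
        self.index_bits = index_bits
        self.offset_bits = offset_bits
        self.tags = [None] * (1 << index_bits)
        self.lines = [None] * (1 << index_bits)
        self.misses = 0

    def read(self, address):
        offset = address & ((1 << self.offset_bits) - 1)
        index = (address >> self.offset_bits) & ((1 << self.index_bits) - 1)
        tag = address >> (self.offset_bits + self.index_bits)
        if self.tags[index] != tag:
            # Cache miss: fetch the whole data block and overwrite the line.
            self.misses += 1
            base = address - offset
            self.lines[index] = self.memory[base:base + (1 << self.offset_bits)]
            self.tags[index] = tag
        return self.lines[index][offset]

cache = DirectMappedCache(list(range(64)), index_bits=2, offset_bits=2)
values = [cache.read(a) for a in (5, 6, 21, 5)]
```

In this usage, addresses 5 and 6 share a data block, so the second read is a hit. Address 21 has the same index as address 5 but a different tag, so it evicts that block, and the final read of address 5 misses again.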
There are two methods by which data blocks in the DRAM are written into cache. These are referred to as direct and associative. In the direct mapped method, the index determines the location in cache where a data block may be placed. Each data block in the main memory has one cache line where it may be placed. That is, each cache line is uniquely identified by the index portion of the frame address. The tag identifies the frame address of the data block stored in a cache line. The direct method has the benefit of simplicity because once a block's main memory address is known, the location where it may be placed in cache is also known.
The associative method comes in two varieties. In the fully associative method, a data block from memory can be placed in any cache line. In a fully associative cache there is no index signal. This has the advantage of being very flexible, but requires complex circuitry to locate each data block. For example, when attempting to access a texel in cache, the tag for that texel is compared against the tags for every cache line in the cache. In the direct method, since a texel can be placed in only one cache line, only one tag is compared.
A compromise between the direct and fully associative methods is n-way associativity. For example, in 2-way associativity, a data block may be written into one of two locations in cache. In n-way associativity, there is the advantage that a block in the main memory may be written into more than one location in cache. Furthermore, not all cache line tags need to be compared when looking for a texel; rather, n tags are checked.
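An n-way associative lookup can be sketched in the same style. The text does not say which line is replaced when all n locations for an index are occupied; this example assumes a least recently used policy. The class name and the list-based main memory are likewise hypothetical.

```python
class SetAssociativeCache:
    """Minimal n-way model: each index selects a set of up to `ways` lines."""

    def __init__(self, memory, index_bits, offset_bits, ways):
        self.memory = memory
        self.index_bits = index_bits
        self.offset_bits = offset_bits
        self.ways = ways
        # Each set is a list of (tag, block) pairs, least recently used first.
        self.sets = [[] for _ in range(1 << index_bits)]
        self.misses = 0

    def read(self, address):
        offset = address & ((1 << self.offset_bits) - 1)
        index = (address >> self.offset_bits) & ((1 << self.index_bits) - 1)
        tag = address >> (self.offset_bits + self.index_bits)
        lines = self.sets[index]
        for position, (line_tag, block) in enumerate(lines):
            if line_tag == tag:                     # at most n tags are compared
                lines.append(lines.pop(position))   # mark as most recently used
                return block[offset]
        self.misses += 1
        base = address - offset
        block = self.memory[base:base + (1 << self.offset_bits)]
        if len(lines) == self.ways:
            lines.pop(0)                            # assumed policy: evict the LRU line
        lines.append((tag, block))
        return block[offset]

cache2 = SetAssociativeCache(list(range(64)), index_bits=2, offset_bits=2, ways=2)
results = [cache2.read(a) for a in (5, 21, 5, 37, 5)]
```

Addresses 5, 21, and 37 all map to the same index. With two ways, the blocks for 5 and 21 coexist, so the second read of address 5 hits, which a direct mapped cache of the same geometry could not achieve.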
An inherent drawback to this memory hierarchy scheme becomes apparent when it is contemplated for use in a graphics system as described above. If the CPU requests data from the cache and a cache miss occurs, the cache requests and receives data from the main memory for presentation to the CPU. Unfortunately, the main memory is much slower than the cache memory and the CPU, thus every cache miss leaves the CPU idle for many CPU clock cycles. This is referred to as cache latency.
But in graphics systems, such as those consistent with embodiments of the present invention, texels are required at the tremendous speeds calculated above. The CPU cannot wait for the cache to retrieve data. This would result in "jumpy" or jittery graphic images being displayed. Rather, another solution which eliminates this cache miss latency must be found.
The present invention provides methods and circuitry for addressing the cache miss latency problem by using, in one exemplary embodiment, a first-in first-out (FIFO) apparatus to decouple the cache addressing circuits from the cache itself. The index and offset portions of the addresses are input to the FIFO. The FIFO holds the index and offset for a period of time dependent on the number of entries present in the FIFO. If a fetch from the main memory is required, the fetch can occur as the index and offset progress through the FIFO. A condition may arise under which identical index signals associated with different tags are in the FIFO at the same time. To avoid a potentially improper overwriting of needed data when this overlapping index condition occurs, the present invention uses extra cache lines. The extra cache lines are not addressable by the index signals. Rather, according to a specific embodiment, one level of indirection is used. That is, index signals are translated by a read table to one of a number of cache line addresses. This number of cache line addresses is less than the total number of cache lines in the cache. The extra cache lines are addressable by a write table that directs the transfer of data from the main memory to the cache. When transferred data is needed, the appropriate cache line address in the read table is swapped for the appropriate cache line address in the write table, and the updated read table is used.
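The read table and write table indirection may be easier to follow with a small software model. The following Python sketch is illustrative only and is not the circuitry of any embodiment: the class and method names, the initial assignment of cache line addresses to the two tables, and the way a write table entry is selected (a caller-supplied slot number) are all assumptions. The FIFO itself is not modeled.

```python
class IndirectedCache:
    """Model of index-to-line indirection with extra, non-indexed cache lines."""

    def __init__(self, num_indexes, num_extra):
        # Total cache lines exceed the number addressable through the read table.
        self.lines = [None] * (num_indexes + num_extra)
        self.read_table = list(range(num_indexes))        # index -> cache line address
        self.write_table = list(range(num_indexes, num_indexes + num_extra))

    def fill(self, slot, block):
        """Transfer from main memory lands in the line named by the write table."""
        self.lines[self.write_table[slot]] = block

    def swap(self, index, slot):
        """Expose the filled line at `index`; the old line becomes an extra line."""
        self.read_table[index], self.write_table[slot] = (
            self.write_table[slot], self.read_table[index])

    def read(self, index, offset):
        return self.lines[self.read_table[index]][offset]

cache3 = IndirectedCache(num_indexes=4, num_extra=2)
cache3.fill(0, ["old0", "old1"])
cache3.swap(1, 0)                      # index 1 now reads the "old" block
cache3.fill(0, ["new0", "new1"])       # prefetch for the same index, different tag
before = cache3.read(1, 0)             # still "old0": nothing was overwritten
cache3.swap(1, 0)
after = cache3.read(1, 0)              # "new0" once the tables are swapped
```

The point of the model is that the second fill does not disturb the line still being read through index 1; the new block becomes visible only when the two table entries are exchanged.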
Accordingly, in one embodiment, the present invention provides a cache memory apparatus including a cache memory having a first number of cache lines, each cache line addressable by a cache line address; a first plurality of storage elements coupled to a first address bus; and a second plurality of storage elements coupled to the first plurality of storage elements. The first plurality of storage elements holds a second number of cache line addresses, and the second plurality of storage elements holds a third number of cache line addresses.
In another embodiment, the present invention provides a method of reading data from a cache line. The method comprises providing an address comprising an index; providing a fetch status, capable of having a value; and translating the index to a first cache line address. If the fetch status has a first value, data is read from a cache line identified by the first cache line address, otherwise the first cache line address is replaced with a second cache line address, and data is read from a cache line identified by the second cache line address.
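The read method of this embodiment can be outlined in a few lines. In this sketch the two fetch status values, the function signature, and the way the second cache line address is supplied are hypothetical; the text states only that the fetch status can take a value and that the first cache line address is replaced when it does not have the first value.

```python
NO_FETCH, FETCHED = 0, 1   # hypothetical encodings of the fetch status

def read_from_cache(lines, read_table, index, offset, fetch_status,
                    second_line_address=None):
    """Translate `index` to a cache line address and read one entry."""
    first_line_address = read_table[index]
    if fetch_status == NO_FETCH:
        return lines[first_line_address][offset]
    # Otherwise replace the translation before reading, so later reads
    # through the same index also use the second cache line address.
    read_table[index] = second_line_address
    return lines[second_line_address][offset]

lines = [["a", "b"], ["c", "d"], ["e", "f"]]
read_table = [0, 1]
first = read_from_cache(lines, read_table, 1, 0, NO_FETCH)
second = read_from_cache(lines, read_table, 1, 1, FETCHED, second_line_address=2)
```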
In yet another embodiment, the present invention provides a cache system including a read queue, capable of queuing a plurality of index signals; a cache having a third number of cache lines; a first table comprising a first number of storage elements, wherein each storage element contains a cache line address; and a second table comprising a second number of storage elements, wherein each storage element contains a cache line address. The system also has a synchronizer, coupled between the first table and the second table; a read handler, coupled between the first table and the cache; and a write handler, coupled between the synchronizer and the cache.