1. The Field of the Invention
This invention relates generally to cache memory. More specifically, the invention relates to a new method and apparatus for controlling a frame buffer cache memory so as to increase throughput by hiding cache misses and to reduce latency for cache hits.
2. The State of the Art
One of the traditional bottlenecks of 3D graphics rendering hardware is the rate at which pixels can be rendered into a frame buffer. Modern computer systems are tasked with rendering increasingly detailed three dimensional environments at frame rates which attempt to portray fluid motion on a display. Unfortunately, it is a challenge to deliver such performance at desktop computer prices.
The challenges to rendering richly textured three dimensional environments on a computer display are detailed in Deering, Michael F., Schapp, Stephen A., Lavelle, Michael G., FBRAM: A New Form of Memory Optimized for 3D Graphics, Computer Graphics Proceedings, Annual Conference Series, 1994, published by Siggraph. The article explains that the performance of hidden surface elimination algorithms has been limited by the pixel fill rate of 2D projections of 3D primitives.
When trying to increase fill rates and rendering rates, designers have generally been forced to make a tradeoff between latency and throughput. Essentially, latency has been sacrificed to achieve greater throughput. If high throughput is desired, cache misses are hidden by pipelining accesses to cache memory (hereinafter referred to simply as cache). The number of states in the pipeline is equal to the worst case time required to load a slot in the cache. This effectively delays each cache access to the point that, even in the case of a miss in a system having two levels of cache, the pipeline does not have to halt, because the cache can always be loaded by the time the access is actually performed.
Regarding the two levels of cache mentioned above, two levels of cache are implemented when controlling the cache of a frame buffer. The first level comprises an SRAM pixel buffer. The second level comprises sense amps on the banks of DRAM. This is also explained in the Deering et al. article. The present invention is directed mainly to improving cache performance at the first level. However, the result is an improvement in both single-level and multi-level cache.
As explained previously, the consequence of implementing cache pipelining to increase throughput is added latency on cache hits. In other words, if an access is required that happens to be a hit, the access is delayed by the entire built-in pipeline delay, even though the data is immediately accessible in the cache. The delay also occurs even when there are no valid accesses ahead of it in the pipeline. An "access" is defined as an attempt to retrieve data from or send data to the cache or, if the attempt is not a hit, the DRAM.
This degree of latency could not always be tolerated. The alternative was to allow hit accesses to be executed without delay. However, when a miss occurred, processing had to stop until the cache was loaded. Thus it is easy to recognize the delays in throughput. This type of frame buffer cache controller is implemented in many systems today.
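The tradeoff between the two prior art approaches described above can be illustrated with a toy cost model. The following is only a sketch under simplifying assumptions that are not taken from any actual controller: a hit costs one state, a miss costs a fixed `miss_penalty` states, and the function names are hypothetical.

```python
def fixed_pipeline_cost(accesses, miss_penalty):
    """Fixed-depth pipeline: depth equals the worst case load time.

    accesses is a list of booleans (True = hit). Every access, hit or
    miss, sees the full pipeline delay as its latency.
    """
    latencies = [miss_penalty for _ in accesses]
    # Once the pipeline is full, one access completes per state.
    total_states = miss_penalty + len(accesses) - 1 if accesses else 0
    return latencies, total_states


def stall_on_miss_cost(accesses, miss_penalty):
    """No pipeline: hits execute without delay, misses halt processing."""
    latencies = [1 if hit else miss_penalty for hit in accesses]
    # Nothing overlaps, so the total is the sum of the latencies.
    return latencies, sum(latencies)


trace = [False, False, True, False]  # miss, miss, hit, miss
print(fixed_pipeline_cost(trace, 4))
print(stall_on_miss_cost(trace, 4))
```

In this model the pipelined controller finishes the trace in fewer states but charges the single hit the full pipeline delay, while the stalling controller serves the hit in one state but takes longer overall.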
In the frame buffer of a graphics system, maximum throughput is generally the most important consideration. For example, 3D rendering produces a more or less constant stream of pixels to be written. 3D fill rate performance on a cached graphics system is directly proportional to the percentage of cache accesses that can be hidden using pipelining.
In contrast, memory mapped accesses from a host computer are not continuous, but are usually separated by one or more dead states. Because these accesses occur via the global system PCI or AGP bus, it is important that they tie up the bus for the least amount of time possible. Therefore, this situation requires minimal latency.
For example, in a series of single pixel read operations, the transfer of data is held up until valid data from the frame buffer is ready. If the read is a hit, it is undesirable that this time would include the same latency as if it were a miss at both levels of the cache (as would occur if accesses were pipelined for maximum throughput).
Prior art cache controllers also teach reading each block that is to be manipulated in cache from DRAM, and always writing each block back to DRAM, to ensure that the latest data is always available from DRAM. This approach ignores, for example, situations in which the data in cache is read-only and does not need to be written back to DRAM; writing it back anyway causes excessive cache traffic.
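The cost of unconditional write-back can be seen in a small counting sketch. It assumes a hypothetical one-slot cache and a per-block "dirty" flag that is set when the block is written; neither detail comes from the text above, and the flag is only one possible way to decide which blocks need to be written back.

```python
def dram_writes(trace, always_write_back):
    """Count DRAM write-backs for a one-slot cache.

    trace is a list of (block, is_write) accesses. With
    always_write_back=True every evicted block is written to DRAM;
    otherwise only blocks that were modified while resident are.
    """
    writes = 0
    resident = None   # block currently held in the single slot
    dirty = False     # has the resident block been written to?
    for block, is_write in trace:
        if block != resident:
            if resident is not None and (always_write_back or dirty):
                writes += 1
            resident = block
            dirty = False
        dirty = dirty or is_write
    # Flush whatever is left in the slot at the end of the trace.
    if resident is not None and (always_write_back or dirty):
        writes += 1
    return writes


trace = [(0, False), (1, True), (2, False), (1, False)]
print(dram_writes(trace, always_write_back=True))
print(dram_writes(trace, always_write_back=False))
```

For this trace only one block is ever modified, so the selective policy issues a single write-back where the unconditional policy issues one for every block loaded.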
Therefore, it would be an advantage over the state of the art to provide a multi-level cache controller which is able to automatically adjust and provide either high throughput or reduced latency, depending upon the circumstances. Another improvement would be to reduce overall cache traffic to thereby free up the system bus.
It is an object of the present invention to provide a method and apparatus for balancing cache throughput and latency in accordance with the type of accesses being made of cache.
It is also an object to provide a method and apparatus for increasing cache throughput for generally continuous accesses of cache.
It is a further object to provide a method and apparatus for providing reduced cache latency for generally intermittent cache accesses.
It is still another object to provide a method and apparatus for implementing an expandable and collapsible cache pipeline for balancing cache throughput and latency.
It is an additional object to provide a method and apparatus for implementing an expandable and collapsible cache pipeline which is able to adjust to different DRAM speeds and cache controller clock rates.
It is another object to provide a method and apparatus for reducing cache traffic to thereby free up the system or graphics bus.
It is still a further object to provide a method and apparatus for separating a read cache from a write cache, to thereby reduce bus traffic.
It is another object to provide a method and apparatus for reducing bus traffic by only reading those blocks into cache memory which need to be read, and only writing those blocks back to cache memory that must be written back to maintain currency of data.
It is also an object to provide apparatus that enables accesses to cache to be executed at the earliest possible state that will result in a valid access.
It is another object to provide a method and apparatus for enabling effective parallel processing of DRAM accesses and cache accesses.
The presently preferred embodiment of the invention is a method and apparatus for providing an expandable cache pipeline in the form of a first in, first out (FIFO) memory for interfacing with separate read and write caches of a frame buffer, for example, wherein selective reading from DRAM (or other memory) and writing to DRAM (or other memory) reduces bus traffic, thereby increasing throughput. Throughput is also increased (or latency reduced) by providing an expandable cache pipeline.
In a first aspect of the invention, an interlock unit adjusts delays in pixel read/write paths of a graphics display system by allowing or preventing accesses to cache until the earliest possible state that will result in a valid access.
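The interlock behavior can be pictured with a minimal sketch. It assumes, purely for illustration, that accesses are released in order, that at most one access is released per state, and that the state at which each cache slot becomes valid is known in advance; the function name and these assumptions are hypothetical and are not drawn from the embodiment.

```python
def interlock_issue_states(accesses):
    """Return the state at which each access is released to the cache.

    accesses is an ordered list of (arrival_state, slot_ready_state)
    pairs. An access is held until its slot is valid and the previous
    access has been released, and no longer than that.
    """
    issued = []
    previous = -1
    for arrival, ready in accesses:
        state = max(arrival, ready, previous + 1)
        issued.append(state)
        previous = state
    return issued


# A hit, a miss whose slot is ready at state 5, another hit, a late hit.
print(interlock_issue_states([(0, 0), (1, 5), (2, 0), (10, 10)]))
```

Under these assumptions a hit with nothing ahead of it is released in the state it arrives, while an access behind a miss waits only as long as the miss requires, rather than for a fixed worst case delay.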
In a second aspect of the invention, the cache pipeline expands when there is a continuous stream of pixels being received faster than the frame buffer can accept. The pipeline collapses, causing the FIFO memory to empty, when the frame buffer is able to accept the pixels faster than they are being supplied.
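The expanding and collapsing behavior can be illustrated by tracking FIFO occupancy over a sequence of states. The sketch below is a simplified model with hypothetical inputs: `supplied[i]` is the number of pixels arriving in state `i`, and `accepted[i]` is the number the frame buffer is able to take in that state.

```python
from collections import deque


def simulate_fifo(supplied, accepted):
    """Return the FIFO occupancy after each state."""
    fifo = deque()
    depth = []
    for n_in, n_out in zip(supplied, accepted):
        # Incoming pixels are queued at the tail.
        for _ in range(n_in):
            fifo.append("pixel")
        # The frame buffer drains from the head, up to what is queued.
        for _ in range(min(n_out, len(fifo))):
            fifo.popleft()
        depth.append(len(fifo))
    return depth


supplied = [1, 1, 1, 1, 0, 0, 0]
accepted = [0, 0, 1, 1, 1, 1, 1]
print(simulate_fifo(supplied, accepted))
```

In this model the occupancy grows while pixels arrive faster than they are accepted and falls back to zero once the frame buffer catches up, at which point a newly arriving pixel would see no queued accesses ahead of it.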
In a further aspect of the invention, a reduction in cache traffic is achieved by providing separate read and write caches.
These and other objects, features, advantages and alternative aspects of the present invention will become apparent to one skilled in the art from a consideration of the following detailed description taken in combination with the accompanying drawings.