1. Field of the Invention
The invention relates generally to a unified or shared cache and more specifically to a dynamically configurable replacement technique that reduces domination of the cache by a particular functional unit or application (e.g. caching instructions or data) by limiting the eviction ability of that functional unit or application to selected cache regions based on its over- and/or under-utilization of the cache.
2. Description of Related Art
The following background information is provided to aid in the understanding of the application of the present invention and is not meant to limit the invention to the specific examples set forth herein. Displaying 3D graphics is typically characterized by a pipelined process having tessellation, geometry and rendering stages. The tessellation stage is responsible for decomposing an object into geometric primitives (e.g. polygons) for simplified processing while the geometry stage is responsible for transforming (e.g. translating, rotating and projecting) the tessellated object. The rendering stage rasterizes the polygons into pixels and applies visual effects such as, but not limited to, texture mapping, MIP mapping, Z buffering, depth cueing, anti-aliasing and fogging.
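By way of illustration only, the transforming work attributed to the geometry stage can be sketched as below. The function names, the choice of a rotation about the z axis, and the simple perspective divide are assumptions made for the example and are not taken from any particular prior art system.

```python
import math

def rotate_z(point, angle):
    """Rotate a 3D point about the z axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def translate(point, offset):
    """Translate a 3D point by a 3D offset."""
    return tuple(a + b for a, b in zip(point, offset))

def project(point, focal_length=1.0):
    """Perspective-project a 3D point onto a 2D image plane (assumes z != 0)."""
    x, y, z = point
    return (focal_length * x / z, focal_length * y / z)

def geometry_stage(vertices, angle, offset, focal_length=1.0):
    """Hypothetical geometry stage: rotate, translate, then project each vertex."""
    return [
        project(translate(rotate_z(v, angle), offset), focal_length)
        for v in vertices
    ]

# One tessellated primitive (a triangle), rotated a quarter turn and
# pushed two units away from the viewer before projection.
triangle = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0)]
screen_points = geometry_stage(triangle, math.pi / 2, (0.0, 0.0, 2.0))
```

Each vertex requires a handful of floating-point multiplications and a division, which is consistent with this stage being described as vector and matrix processing.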
The entire 3D graphics pipeline can be embodied in software running on a general purpose CPU core (i.e. integer and floating-point units), albeit unacceptably slowly. To accelerate performance, the stages of the graphics pipeline are typically shared between the CPU and a dedicated hardware graphics controller (a.k.a. graphics accelerator). The floating-point unit of the CPU typically handles the vector and matrix processing of the tessellation and geometry stages while the graphics controller generally handles the pixel processing of the rendering stage.
Reference is now made to FIG. 1 that depicts a first prior art system of handling 3D graphics display in a computer. Vertex information stored on disk drive 100 is read over a local bus (e.g. the PCI bus) under control of chipset 102 into system memory 104. The vertex information is then read from system memory 104 under control of chipset 102 into the L2 cache 108 and L1 cache 105 of CPU 106. The CPU 106 performs geometry/lighting operations on the vertex information before caching the results along with texture coordinates back into the L1 cache 105, the L2 cache 108 and ultimately back to system memory 104. A direct memory access (DMA) is performed to transfer the geometry/lighting results, texture coordinates and texture maps stored in system memory 104 over the PCI bus into local graphics memory 112 of the graphics controller 110 for use in rendering a frame on the display 114. In addition to storing textures for use with the graphics controller 110, local graphics memory 112 also holds the frame buffer, the z-buffer and commands for the graphics controller 110.
A drawback with this approach is inefficient use of memory resources since redundant copies of texture maps are maintained in both system memory 104 and the local graphics memory 112. Another drawback with this approach is that the local graphics memory 112 is dedicated to the graphics controller 110, is more expensive than generalized system memory and is not available for general-purpose use by the CPU 106. Yet another drawback with this approach is the attendant bus contention and relatively low bandwidth associated with the shared PCI bus. Efforts have been made to ameliorate these limitations by designating a "swap area" in local graphics memory 112 (sometimes misdescriptively referred to as an off chip L2 cache) so that textures can be prefetched into local graphics memory 112 from system memory 104 before they are needed by the graphics controller 110 and swapped with less recently used textures residing in the texture cache of the graphics controller 110. The local graphics memory swap area merely holds textures local to the graphics card (to avoid bus transfers) and does not truly back the texture cache as would a second level in a multi-level texture cache. This approach leads to the problem, among others, of deciding how to divide the local graphics memory 112 into texture storage and swap area. Still yet another drawback with this approach is that the single level texture cache in prior art graphics controllers consumes large amounts of die area since the texture cache must be multi-ported and be of sufficient size to avoid performance issues.
Reference is now made to FIG. 2 that depicts an improved but not entirely satisfactory prior art system of handling 3D graphics display in a computer. The processor 120, such as the Pentium II™ processor from Intel Corporation of Santa Clara, Calif., comprises a CPU 106 coupled to an integrated L2 cache 108 over a so-called "backside" bus 126 that operates independently from the host or so-called "front-side" bus 128. The system depicted in FIG. 2 additionally differs from that in FIG. 1 in that the graphics controller 110 is coupled over a dedicated and faster AGP bus 130 through chipset 102 to system memory 104. The dedicated and faster AGP bus 130 permits the graphics controller 110 to directly use texture maps in system memory 104 during the rendering stage rather than first pre-fetching the textures to local graphics memory 112.
Although sourcing texture maps directly out of system memory 104 mitigates local graphics memory constraints, some amount of local graphics memory 112 is still required for screen refresh, Z-buffering and front and back buffering since the AGP bus 130 cannot support such bandwidth requirements. Consequently, the system of FIG. 2 suffers from the same drawbacks as the system of FIG. 1, albeit to a lesser degree. Moreover, there is no way for the graphics controller 110 to directly access the L2 cache 108 that is encapsulated within the processor 120 and connected to the CPU 106 over the backside bus 126.
From the foregoing it can be seen that memory components, bus protocols and die size are the ultimate bottlenecks for presenting 3D graphics. Accordingly, there is a need for a highly integrated multimedia processor having tightly coupled central processing and graphical functional units that share a relatively large cache to avoid slow system memory access and the requirement to maintain separate and redundant local graphics memory. Moreover, there is a need to avoid polluting the shared cache, that is, storing so much graphics data in the shared cache that a significant amount of non-graphics data needed by the central processing unit is evicted and the performance of the central processing unit is adversely affected.
To overcome the limitations of the prior art described above, and to overcome other limitations that will become apparent upon reading and understanding the present specification, the present invention discloses a dynamically configurable cache replacement technique in a shared or unified cache. The technique reduces domination of the cache by a particular functional unit or application (such as unified instruction/data caching) by limiting the eviction ability of that functional unit or application to selected cache regions based on its over- and/or under-utilization of the cache. A specific application of the present invention includes a highly integrated multimedia processor employing a tightly coupled shared cache between central processing and graphics units wherein the eviction ability of the graphics unit is limited to selected cache regions when the graphics unit over-utilizes the cache. Dynamic configurability can take the form of a programmable register that enables any one of a plurality of replacement modes based on captured statistics such as measurement of cache misses and/or hits by a particular functional unit or application.
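By way of illustration only, and not as a description of any disclosed embodiment, the replacement behavior summarized above can be modeled in software as a single set of a set-associative cache. The class name, the least-recently-used victim choice, the way indices reserved for the graphics unit, and the miss-count threshold are all assumptions made for the sketch; the `mode` attribute merely stands in for the programmable register.

```python
from collections import Counter

CPU, GFX = "cpu", "gfx"
MODE_UNRESTRICTED, MODE_RESTRICTED = 0, 1

class SharedCacheSet:
    """Hypothetical model of one set of a shared cache."""

    def __init__(self, num_ways=4, gfx_ways=(3,)):
        self.tags = [None] * num_ways      # tag held in each way
        self.last_used = [0] * num_ways    # timestamp for LRU (assumed policy)
        self.gfx_ways = tuple(gfx_ways)    # ways graphics may evict when restricted
        self.mode = MODE_UNRESTRICTED      # stands in for the programmable register
        self.misses = Counter()            # captured per-unit miss statistics
        self.clock = 0

    def access(self, unit, tag):
        """Return True on a hit. Any unit may hit in any way."""
        self.clock += 1
        if tag in self.tags:
            self.last_used[self.tags.index(tag)] = self.clock
            return True
        self.misses[unit] += 1
        # Only the eviction ability is limited, and only for the graphics unit.
        if unit == GFX and self.mode == MODE_RESTRICTED:
            candidates = self.gfx_ways
        else:
            candidates = range(len(self.tags))
        victim = min(candidates, key=lambda way: self.last_used[way])
        self.tags[victim] = tag
        self.last_used[victim] = self.clock
        return False

    def update_mode(self, gfx_miss_threshold):
        """Select a replacement mode from the captured statistics, then reset them."""
        over_utilized = self.misses[GFX] > gfx_miss_threshold
        self.mode = MODE_RESTRICTED if over_utilized else MODE_UNRESTRICTED
        self.misses.clear()

def cpu_hits_after_gfx_stream(mode):
    """Load three CPU lines, stream eight graphics lines, count surviving CPU lines."""
    cache_set = SharedCacheSet()
    cache_set.mode = mode
    for tag in ("c0", "c1", "c2"):
        cache_set.access(CPU, tag)
    for i in range(8):
        cache_set.access(GFX, f"g{i}")
    return sum(cache_set.access(CPU, tag) for tag in ("c0", "c1", "c2"))

unrestricted_hits = cpu_hits_after_gfx_stream(MODE_UNRESTRICTED)
restricted_hits = cpu_hits_after_gfx_stream(MODE_RESTRICTED)
```

In this model the graphics unit can still hit on lines placed by the central processing unit, since only victim selection is constrained. Comparing `unrestricted_hits` with `restricted_hits` shows how many of the central processing unit's lines survive a burst of streaming graphics accesses under each mode.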
A feature of the present invention is providing the graphics unit access to data generated by the central processing unit before the data is written back or written through to system memory without significantly polluting the shared cache.
Another feature of the present invention is reduction of the system memory bandwidth required by the central processing and graphics units.
Another feature of the present invention is pushing data transfer bottlenecks needed for 3D graphics display into system memory such that system performance will scale as more advanced memories become available.
These and various other objects, features, and advantages of novelty which characterize the invention are pointed out with particularity in the claims annexed hereto and forming a part hereof. However, for a better understanding of the invention, its advantages, and the objects obtained by its use, reference should be made to the drawings which form a further part hereof, and to the accompanying descriptive matter, in which there is illustrated and described a specific example of a dynamic replacement technique in a shared cache in accordance with the principles of the present invention.