As semiconductor technology continues to inch closer to practical limitations in terms of increases in clock speed, architects are increasingly focusing on parallelism in processor architectures to obtain performance improvements. At the chip level, multiple processor cores are often disposed on the same chip, functioning in much the same manner as separate processor chips, or to some extent, as completely separate computers. In addition, even within cores, parallelism is employed through the use of multiple execution units that are specialized to handle certain types of operations. Pipelining is also employed in many instances so that certain operations that may take multiple clock cycles to perform are broken up into stages, enabling other operations to be started prior to completion of earlier operations. Multithreading is also employed to enable multiple instruction streams to be processed in parallel, enabling more overall work to be performed in any given clock cycle.
In addition, as processor architectures improve in terms of raw performance, other considerations, such as the communication costs of storing and retrieving data, become significant factors in overall performance. Data is typically organized within a memory address space that represents the addressable range of memory addresses that can be accessed by a processor. Both the instructions forming a computer program and the data operated upon by those instructions are often stored in a memory system and retrieved as necessary by a processor when executing the computer program. In order to balance cost, performance, and storage capacity, multi-level memory architectures have been developed.
Often, a computer relies on a relatively large, slow and inexpensive mass storage system such as a hard disk drive or other external storage device, an intermediate main memory that uses dynamic random access memory devices (DRAM's) or other volatile memory storage devices, and one or more high speed, limited capacity cache memories, or caches, implemented with static random access memory devices (SRAM's) or the like (e.g., L1, L2, L3, etc. caches). In some instances, instructions and data are stored in separate instruction and data cache memories to permit instructions and data to be accessed in parallel. One or more memory controllers are then used to swap the information from segments of memory addresses, often known as “cache lines”, between the various memory levels to attempt to maximize the frequency that requested memory addresses are stored in the fastest cache memory accessible by the microprocessor. Whenever a memory access request attempts to access a memory address that is not cached in a cache memory, a “cache miss” occurs. As a result of a cache miss, the cache line for a memory address typically must be retrieved from a relatively slow, lower level memory, often with a significant performance hit.
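By way of illustration only, the relationship between memory addresses, cache lines, and cache misses described above may be sketched with a toy direct-mapped cache model. The class name `DirectMappedCache`, the four-line capacity, and the 64-byte line size are assumptions chosen for the example and do not correspond to any particular hardware design; the model tracks only which cache line occupies each slot, not the cached data itself.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that tracks tags only (hypothetical model)."""

    def __init__(self, num_lines=4, line_size=64):
        self.num_lines = num_lines          # assumed number of cache lines
        self.line_size = line_size          # assumed bytes per cache line
        self.tags = [None] * num_lines      # one tag per slot, None = empty
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Return True on a cache hit, False on a cache miss."""
        line_number = address // self.line_size   # which cache line holds the address
        index = line_number % self.num_lines      # slot the line maps to
        tag = line_number // self.num_lines       # distinguishes lines sharing a slot
        if self.tags[index] == tag:
            self.hits += 1
            return True
        # Cache miss: the line would be retrieved from a lower level memory
        # and replaces whatever previously occupied the slot.
        self.tags[index] = tag
        self.misses += 1
        return False


cache = DirectMappedCache()
results = [cache.access(a) for a in (0, 8, 64, 256, 0)]
print(results, cache.hits, cache.misses)
```

In this sketch, addresses 0 and 8 fall within the same cache line, so the second access hits. Address 256 maps to the same slot as address 0 and displaces it, so the final access to address 0 misses again.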
In order to minimize cache misses, it is desirable to maintain in each cache data that is long lived and frequently used, as the more the data is accessed while in the cache, the greater the performance benefit obtained as a result of loading the data into the cache. While in some designs a performance penalty exists for initially loading data into a cache, in most designs the data is loaded into a cache in parallel with retrieving the data from a lower level memory, so there is little or no additional performance penalty beyond the penalty of retrieving the data from the lower level memory.
It has been found, however, that for certain types of data, loading the data into the cache offers little or no performance benefit, and in fact, may degrade performance by limiting the amount of space in a cache that is used for other data. As one example, in image processing applications, vertex data describing geometric objects to be placed in a scene is often stored in structures along with attributes associated with the vertices. This data may be used by high performance execution units in a processor, e.g., single instruction multiple data (SIMD) or vector execution units, to generate and place primitives in a two dimensional representation of a scene.
The vertex structures can be relatively large in size due to the vectorized nature of the data, and in conventional vertex processor implementations, the vertex structures are loaded into a register file in a vector execution unit during processing of a scene by the vertex processor. In many conventional designs, the retrieval of vertex structures into a register file is accompanied by caching of these structures in one or more levels of caches in the vertex processor. For example, in one conventional design, a vertex processor includes a relatively large, shared L2 cache and separate smaller, faster L1 data and instruction caches. Retrieval of vertex structures results in the vertex structures being cached in both the L1 data and L2 caches, as well as being stored in a register file.
However, the vertex positions can be different from frame-to-frame, and as such, many vertex structures are used only on one frame, and may only be accessed a limited number of times within that one frame. In this regard, this type of data is referred to herein as single use data. Furthermore, as noted above, the vertex structures can be relatively large, and in many cases vertex processing only requires access to vertex position data from the vertex structures, with the remainder of the data in the vertex structures going unused. The combination of these factors often results in low L1 data cache hit rates on vertex position data. Moreover, if only the vertex position is used for most computations, large portions of the L1 data cache, and thus memory bandwidth, are not utilized efficiently. In addition, other data that is frequently used, e.g., local variables or program stacks, may be routinely cast out of the L1 data cache as new vertex structures are loaded into the cache.
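The casting out of frequently used data by single use data may be illustrated, under simplifying assumptions, with a small simulation. The sketch below models an L1 data cache as a fully associative cache with least recently used replacement, which is an assumption made for brevity rather than a characteristic of any conventional design. The function name `hot_hit_rate`, its parameters, and the cache and workload sizes are all hypothetical. Each simulated frame touches a small set of frequently used cache lines and then streams through vertex cache lines that are never reused.

```python
from collections import OrderedDict


def hot_hit_rate(l1_lines, hot_lines, vertex_lines_per_frame, frames,
                 cache_vertices=True):
    """Return the hit rate on frequently used lines in a toy LRU cache."""
    cache = OrderedDict()   # keys are cache lines, ordered oldest to newest
    hot_hits = 0
    hot_accesses = 0
    next_vertex = 0         # every vertex line is new, i.e., single use

    def touch(key):
        hit = key in cache
        if hit:
            cache.move_to_end(key)          # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > l1_lines:
                cache.popitem(last=False)   # cast out least recently used line
        return hit

    for _ in range(frames):
        for h in range(hot_lines):          # e.g., local variables, stack
            hot_accesses += 1
            hot_hits += touch(("hot", h))
        for _ in range(vertex_lines_per_frame):
            if cache_vertices:
                touch(("vertex", next_vertex))
            next_vertex += 1

    return hot_hits / hot_accesses


print(hot_hit_rate(8, 4, 16, 10, cache_vertices=True))
print(hot_hit_rate(8, 4, 16, 10, cache_vertices=False))
```

With these assumed sizes, the streamed vertex lines exceed the capacity of the modeled cache in every frame, so the frequently used lines are cast out before they can be reused. When the vertex lines are kept out of the modeled cache, the frequently used lines miss only on their first access.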
In some conventional caching architectures, some data that is retrieved from a lower level memory is not stored in a cache. In some architectures, for example, retrieved data may bypass every cache in the hierarchy (e.g., an L1 and an L2 cache), and be stored directly in a destination such as a register, buffer or register file. In many instances, however, bypassing all caches in a hierarchy may not offer optimal performance, because in the event that any of the data is needed again, a high cost retrieval from the lower level memory is once again required. In other architectures, retrieved data may bypass a lower level cache (e.g., an L2 cache) in favor of storage in a higher level cache (e.g., an L1 cache). For data such as vertex structures, as described above, it is to a significant extent the relatively large size of the vertex structures as compared to the L1 cache that causes the low hit rate, so it has been found that caching single use data of this nature in the L1 cache, rather than the L2 cache, causes a greater bottleneck in performance.
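The tradeoffs among these conventional approaches may be sketched with a toy two-level model that counts retrievals from the lower level memory. The names `LRU` and `memory_fetches`, the `fill` parameter, and the cache sizes are assumptions made for illustration only. Both levels are modeled as fully associative caches with least recently used replacement, and each element of the trace is treated as one cache line.

```python
from collections import OrderedDict


class LRU:
    """Toy fully associative cache with LRU replacement (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def lookup(self, key):
        if key in self.lines:
            self.lines.move_to_end(key)     # mark as most recently used
            return True
        return False

    def insert(self, key):
        self.lines[key] = True
        self.lines.move_to_end(key)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # cast out least recently used


def memory_fetches(trace, fill, l1_size=2, l2_size=8):
    """Count lower level memory retrievals for a trace of cache lines.

    fill is the set of cache levels written on a miss:
    {"l1", "l2"} caches in both, {"l1"} bypasses the L2, set() bypasses all.
    """
    l1, l2 = LRU(l1_size), LRU(l2_size)
    fetches = 0
    for line in trace:
        if l1.lookup(line):
            continue
        if l2.lookup(line):
            if "l1" in fill:
                l1.insert(line)
            continue
        fetches += 1                        # retrieved from lower level memory
        if "l1" in fill:
            l1.insert(line)
        if "l2" in fill:
            l2.insert(line)
    return fetches


trace = [0, 1, 2, 3, 0, 1, 2, 3]   # four lines, each needed a second time
for fill in ({"l1", "l2"}, {"l1"}, set()):
    print(sorted(fill), memory_fetches(trace, fill))
```

In this sketch the four cache lines do not fit in the two-line L1 model, so filling only the L1 cache yields no fewer lower level retrievals than bypassing every cache, whereas the larger L2 model retains the lines for their second use.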
Therefore, a need continues to exist in the art for a manner of improving memory access performance in a multi-level memory architecture to maximize the performance of retrieving single use data.