1. Field of the Invention
This invention relates to computing systems, and more particularly, to data allocation in a shared cache memory of a computing system.
2. Description of the Relevant Art
Modern microprocessors may include one or more processor cores, or processors, wherein each processor is capable of executing instructions of a software application. These processors are typically pipelined, wherein the processors include one or more data processing stages connected in series with storage elements (e.g. registers and arrays) placed between the stages. Although the pipeline may be divided into any number of stages at which portions of instruction processing are performed, instruction processing generally comprises fetching the instruction, decoding the instruction, executing the instruction, and storing the execution results in the destination identified by the instruction.
Ideally, every clock cycle produces useful execution of an instruction for each stage of a pipeline. However, a stall in a pipeline may cause no useful work to be performed during that particular pipeline stage. One example of a stall, which typically is a multi-cycle stall, is a data-cache or an instruction-cache miss. A method to reduce performance loss is to reduce the occurrence of multi-cycle stalls.
A microprocessor may be coupled to one or more levels of a cache hierarchy in order to reduce the latency of the microprocessor's request of data in memory for a read or a write operation. Generally, a cache may store one or more blocks, each of which is a copy of data stored at a corresponding address in the system memory. As used herein, a “block” is a set of bytes stored in contiguous memory locations, which are treated as a unit for coherency purposes. As used herein, the terms “cache block”, “block”, “cache line”, and “line” are interchangeable. In some embodiments, a block may also be the unit of allocation and deallocation in a cache. The number of bytes in a block may be varied according to design choice, and may be of any size. As an example, 32 byte and 64 byte blocks are often used.
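As a minimal illustration of how a byte address relates to a block, the sketch below computes the block-aligned base address and the byte offset within the block. The 64-byte block size, the power-of-two sizing, and the function name split_address are assumptions made for illustration only and are not required by the description above.

```python
BLOCK_SIZE = 64  # assumed block size in bytes (32 is equally common)


def split_address(addr, block_size=BLOCK_SIZE):
    """Return (block_base, offset) for a byte address.

    Assumes block_size is a power of two, so the low bits of the
    address can be masked off to obtain the offset within the block.
    """
    offset = addr & (block_size - 1)   # position of the byte inside its block
    block_base = addr - offset         # address of the first byte of the block
    return block_base, offset


base, off = split_address(0x1234)
print(hex(base), off)
```

Every address that yields the same block base belongs to the same block and is therefore handled as one unit for coherency purposes.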
Conflict, or collision, misses occur within a set-associative or a direct-mapped cache when too many blocks map to the same set, and, therefore, one or more blocks must be discarded within that set. A particular cache may have two or more sources that use the cache for data storage. For example, one or more processor cores, one or more graphics processing units (GPUs), one or more input/output (I/O) devices, or other sources may access a particular cache. These sources may not equally share the storage resources of the cache. The system performance degradation experienced from an increased average memory latency of a source may not be equal for each source.
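Conflict misses can be illustrated with a small simulation. The sketch below is hypothetical: the cache geometry (4 sets, 64-byte blocks), the use of LRU replacement within a set, and the function name count_misses are all assumptions chosen for illustration. It counts misses for an address trace in which two blocks map to the same set.

```python
def count_misses(addresses, num_sets=4, ways=1, block_size=64):
    """Count misses in a small set-associative cache.

    ways=1 models a direct-mapped cache. Replacement within a set is
    LRU (an assumption made for this sketch).
    """
    # Each set is a list of block numbers, least recently used first.
    sets = [[] for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        block = addr // block_size
        lines = sets[block % num_sets]
        if block in lines:
            lines.remove(block)        # hit: re-appended below as most recent
        else:
            misses += 1
            if len(lines) == ways:
                lines.pop(0)           # set is full: discard the LRU block
        lines.append(block)
    return misses


# 0x000 and 0x100 are different blocks that map to the same set (set 0).
trace = [0x000, 0x100, 0x000, 0x100]
print(count_misses(trace, ways=1))  # direct-mapped: the two blocks keep evicting each other
print(count_misses(trace, ways=2))  # 2-way: both blocks fit in the set
```

With one way per set, every access in the trace misses because each block displaces the other. With two ways, only the first access to each block misses.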
For example, a set-associative shared cache, which may be an L2 or an L3 cache in a memory hierarchy, may be utilized for data storage by two on-die processor cores, or cores, a graphics processing unit (GPU), and multiple input/output (I/O) devices. To allocate space for new data to be stored in this cache, such as a fill cache line following a cache miss, typical cache implementations allow the Least Recently Used (LRU) way within a chosen set to be replaced. Since this allocation strategy does not distinguish requests from different sources, it may allow one source, such as an I/O source, to occupy a significant portion of the ways in the cache, which may be undesirable. Data blocks of a core may now be evicted within this chosen set, which degrades the average memory latency seen by the core in the system.
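The effect of source-unaware LRU replacement can be sketched for a single set. The class name SharedSet, the 4-way geometry, and the tag and source labels below are hypothetical and chosen only for illustration. The sketch shows a burst of I/O fills displacing every block that a core had placed in the set.

```python
from collections import OrderedDict


class SharedSet:
    """One set of a shared set-associative cache with plain LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> source, least recently used first

    def access(self, tag, source):
        """Access a block; return the evicted (tag, source) pair, or None."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # hit: mark as most recently used
            return None
        victim = None
        if len(self.lines) == self.ways:
            # The LRU way is replaced without regard to which source owns it.
            victim = self.lines.popitem(last=False)
        self.lines[tag] = source
        return victim

    def occupancy(self):
        """Return the number of ways currently held by each source."""
        counts = {}
        for src in self.lines.values():
            counts[src] = counts.get(src, 0) + 1
        return counts


s = SharedSet(ways=4)
for tag in ("c0", "c1"):
    s.access(tag, "core")                     # the core fills two ways
evicted = [s.access(f"io{i}", "io") for i in range(4)]  # I/O burst of four fills
print(evicted)
print(s.occupancy())
```

After the burst, the I/O source holds all four ways and both core blocks have been evicted, so the next core access to either block would miss.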
In view of the above, efficient methods and mechanisms for data allocation in a memory of a computing system are desired.