Many processor architectures implement a hierarchical memory architecture that includes high-speed, on-chip cache units. These cache units temporarily store data located in off-chip memory so that the data can be accessed quickly by on-chip processing units. For example, the processor may be coupled to a synchronous dynamic random access memory (SDRAM) chip from which data may be loaded into the on-chip cache units. The cache units may be arranged hierarchically such that a level 2 (L2) cache is shared among a plurality of cores of the processor, and each core is also associated with its own level 1 (L1) cache. A thread executing on a particular core may access memory in the SDRAM by transmitting a memory access request to the L1 cache for that core. The L1 cache checks the data currently stored in the cache to determine if there is a cache hit (i.e., the data associated with the address in the memory access request currently resides in the cache). If there is a cache hit, then the L1 cache returns the data to the core to be processed by the thread. However, if there is a cache miss (i.e., the data associated with the address in the memory access request is not currently in the cache), then the L1 cache may transmit the memory access request to the L2 cache to determine if the data is stored at the next level of the cache hierarchy. The L2 cache may return the data if there is a cache hit or, if there is a cache miss, may transmit the memory access request to the external memory to retrieve the data. Different architectures may have different numbers of hierarchical levels of cache units (e.g., L1, L2, and L3).
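The lookup path described above may be illustrated with a minimal sketch. The class name `CacheLevel`, the dictionary standing in for the SDRAM, and the address `0x100` are hypothetical and chosen only for illustration; capacity limits and eviction are ignored here.

```python
class CacheLevel:
    """One level of a cache hierarchy; misses are forwarded to `next_level`."""

    def __init__(self, name, next_level=None):
        self.name = name
        self.next_level = next_level  # None: the next stop is external memory
        self.lines = {}               # address -> data (capacity ignored in this sketch)

    def read(self, address, memory):
        """Return (data, source), where source names whatever supplied the data."""
        if address in self.lines:                  # cache hit
            return self.lines[address], self.name
        if self.next_level is not None:            # cache miss: ask the next level
            data, source = self.next_level.read(address, memory)
        else:                                      # last level: go to external memory
            data, source = memory[address], "memory"
        self.lines[address] = data                 # fill this level on the way back
        return data, source


sdram = {0x100: "payload"}                       # stand-in for the off-chip SDRAM
l2 = CacheLevel("L2")                            # shared among the cores
l1_core0 = CacheLevel("L1-core0", next_level=l2)  # private to core 0
l1_core1 = CacheLevel("L1-core1", next_level=l2)  # private to core 1

first = l1_core0.read(0x100, sdram)    # misses in L1 and L2, served by memory
second = l1_core0.read(0x100, sdram)   # now hits in core 0's L1
third = l1_core1.read(0x100, sdram)    # misses in core 1's L1, hits in the shared L2
```

In this sketch the third read shows the effect of sharing the L2 cache: a line fetched on behalf of one core is served to another core without a second access to the external memory.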
Each cache is limited in size and can store only a fixed number of cache lines. For example, an L2 cache may be 2048 KB (kilobytes) while an L1 cache may be 64 KB, with each cache line being 512 B (bytes) or 1024 B in size. Each cache may therefore implement an eviction policy that determines when a particular cache line is evicted from the cache to make room for a new cache line. Example eviction policies are based on access order priority, most recently used (MRU), and least recently used (LRU). Other eviction policies are also well known in the art.
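As a worked illustration of the sizes above, and assuming 1 KB equals 1024 bytes, the number of cache lines follows directly from dividing the cache size by the line size. The sketch below also models an LRU eviction policy, one of the example policies named above; the names `num_lines` and `LRUCache` are hypothetical, and real hardware typically approximates LRU per set rather than tracking a single global order.

```python
from collections import OrderedDict

KB = 1024  # assumption: 1 KB = 1024 bytes


def num_lines(cache_bytes, line_bytes):
    """Number of cache lines that fit in a cache of the given size."""
    return cache_bytes // line_bytes


class LRUCache:
    """Fully associative cache that evicts the least recently used line."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()  # least recently used entry first

    def access(self, tag):
        """Touch `tag`; return the evicted tag, or None if nothing was evicted."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # hit: mark as most recently used
            return None
        evicted = None
        if len(self.lines) >= self.capacity:
            evicted, _ = self.lines.popitem(last=False)  # evict the LRU line
        self.lines[tag] = True
        return evicted


l2_lines = num_lines(2048 * KB, 512)  # 4096 lines of 512 B
l1_lines = num_lines(64 * KB, 512)    # 128 lines of 512 B

cache = LRUCache(capacity_lines=2)
cache.access("A")
cache.access("B")
cache.access("A")           # A becomes the most recently used line
victim = cache.access("C")  # cache is full, so the LRU line B is evicted
```

With 1024 B lines the same caches would hold half as many lines (2048 and 64, respectively).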
It can be difficult for a processor to fetch data at a high rate using regular load instructions. For example, a thread may issue a load instruction that causes a memory access request to be transmitted to the L1 cache. If that memory access request results in a cache miss, then the thread will stall and wait for the data to be fetched into the L1 cache. The thread may stall for hundreds or even thousands of cycles while waiting for the data to be loaded from the memory. In conventional systems, registers are allocated when the load instruction is issued and, therefore, these registers may sit unused while the thread waits for the data to be fetched. In addition to the allocated registers, other execution resources associated with the stalled thread may also remain idle while waiting for the data to be fetched. The inefficiency is compounded in multi-threaded processors, where different threads may issue load instructions with addresses corresponding to the same cache line as a previously issued memory access request.
With prefetching, a first prefetch load instruction may be used to fetch data from a memory into a cache unit, and a second demand load instruction may be used to load the data from the cache unit into a register file. There may be multiple prefetch load memory access requests corresponding to the same cache line; for example, two different threads in a multi-threaded processor may each issue a prefetch load memory access request, and these memory access requests may be coalesced by the L1 cache unit. Once the data is fetched into the L1 cache, each of the two threads may separately load the data from the L1 cache into a register file in order to process the data. The two load operations may be completed many cycles apart because each thread must wait to be activated. Furthermore, each load instruction may increase or decrease the cache line's priority for replacement under the eviction policy.
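The coalescing of prefetch load requests described above may be sketched as follows. This is a simplified model under stated assumptions, not a description of any particular hardware: the class `CoalescingL1`, its methods, the thread identifiers, and the addresses are all hypothetical, and the 512 B line size is taken from the example sizes given earlier.

```python
LINE_BYTES = 512  # assumption: one of the example cache line sizes


def line_address(address):
    """Round a byte address down to the start of its cache line."""
    return address // LINE_BYTES * LINE_BYTES


class CoalescingL1:
    """Toy L1 cache that merges prefetch requests targeting the same line."""

    def __init__(self):
        self.pending = {}        # line address -> thread ids waiting on the fetch
        self.resident = {}       # line address -> data currently in the cache
        self.memory_fetches = 0  # requests actually sent toward memory

    def prefetch(self, thread_id, address):
        line = line_address(address)
        if line in self.resident:
            return                                # already cached, nothing to do
        if line in self.pending:
            self.pending[line].append(thread_id)  # coalesce with in-flight fetch
            return
        self.pending[line] = [thread_id]
        self.memory_fetches += 1                  # first request for this line

    def fill(self, line, data):
        """Memory returned the line; report which threads were waiting on it."""
        self.resident[line] = data
        return self.pending.pop(line, [])

    def demand_load(self, address):
        """Return the line's data on a hit, or None to model a miss."""
        return self.resident.get(line_address(address))


l1 = CoalescingL1()
l1.prefetch(thread_id=0, address=0x1000)
l1.prefetch(thread_id=1, address=0x1040)  # same 512 B line as 0x1000
waiters = l1.fill(0x1000, b"line data")   # a single fetch satisfies both threads
value_for_thread_1 = l1.demand_load(0x1040)
```

Here two prefetch requests produce only one fetch toward memory, after which each thread issues its own demand load against the cached line.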
Because the eviction mechanism is coupled to the load instruction, a particular cache line may be evicted after one thread loads the data into the register file but before the other thread has loaded the data into the register file. The data must then be fetched into the L1 cache a second time, causing the second thread to stall even further. It is also possible that the cache line is prioritized to remain in the cache for longer than necessary after the demand load has completed, wasting valuable cache capacity while other cache lines are evicted. These types of inefficiencies should be avoided. Thus, there is a need for addressing these issues and/or other issues associated with the prior art.
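The first failure mode described above, in which a line is evicted between the demand loads of two threads, may be reproduced with a small model. The sketch assumes an LRU policy and an unrealistically small two-line cache so that the effect is visible in a few steps; the class `TinyLRU` and the line names `X`, `Y`, and `Z` are hypothetical.

```python
from collections import OrderedDict


class TinyLRU:
    """Two-line LRU cache that logs every fetch from the next level."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # least recently used entry first
        self.fetches = []           # lines fetched from the next level, in order

    def load(self, line):
        """Return True on a hit; on a miss, fetch the line and return False."""
        if line in self.lines:
            self.lines.move_to_end(line)    # the load itself updates the priority
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[line] = True
        self.fetches.append(line)
        return False


l1 = TinyLRU(capacity=2)
l1.load("X")          # stands in for the coalesced prefetch that brings X in
hit_a = l1.load("X")  # thread A's demand load hits
l1.load("Y")          # unrelated traffic while thread B waits to be activated
l1.load("Z")          # cache is full: X is now least recently used and is evicted
hit_b = l1.load("X")  # thread B's demand load misses, so X is fetched again
```

In this model line `X` is fetched twice, and thread B stalls on the second fetch even though the data had already been brought into the cache once.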