The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent the work is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
Multiprocessing computer systems employ multi-core processors that include two or more device cores, each device core associated with a corresponding level-one (L1) cache. A device core may be any one of a central processing unit (CPU), a graphics processing unit (GPU), an accelerated processing unit (APU), a video processing unit (VPU), and the like.
A cache is memory that can provide data to a device core with a latency that is lower than a latency associated with obtaining the data from a main memory. An L1 cache is a first level local memory that holds cached data for a corresponding device core.
The L1 caches of these devices may share data. When a device core needs data that is missing in the local L1 cache, such data may be obtained from a remote L1 cache if the remote L1 cache stores such data. In some cases, these multiprocessing computer systems may further include other types of caches such as level-two (L2) caches, level-three (L3) caches, and so forth that may have higher latencies than the L1 caches.
The data that are moved between an L1 cache and the main memory are typically moved in blocks of data referred to as cache lines. Each cache line will typically be associated with an address, which may be the main memory address where the cache line is stored. When a device core loads a cache line into the L1 cache, a load request indicating the address of the cache line may be sent to the main memory.
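The relationship between a byte address and its cache line can be sketched as follows. This is an illustrative example only, assuming a hypothetical 64-byte cache line; actual line sizes vary by processor and are not specified by the present disclosure.

```python
# Minimal sketch of cache-line addressing, assuming (hypothetically)
# 64-byte cache lines; real line sizes vary by processor.
LINE_SIZE = 64  # bytes per cache line (an assumed value)

def line_address(byte_address: int) -> int:
    # Align the byte address down to the start of its cache line;
    # this is the address a load request would carry to main memory.
    return byte_address & ~(LINE_SIZE - 1)

def line_offset(byte_address: int) -> int:
    # Position of the byte within its cache line.
    return byte_address & (LINE_SIZE - 1)
```

For example, with the assumed 64-byte line, byte address 0x1234 falls in the cache line whose address is 0x1200, at offset 0x34 within that line.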
Copies of data from the same main memory addresses may be stored in different caches. In multicore systems, each device core may be paired with a respective L1 cache, and a plurality of the L1 caches may have respective copies of the data from the same main memory address.
When one of the copies is modified without the other copies being similarly modified, the data stored in the different L1 caches becomes incoherent. In order to ensure coherency of data stored in different caches, many systems employ cache coherency protocols whereby cache lines stored in cache memory, and more particularly, the addresses of these cache lines, are tagged with one of several different states.
An example of a coherency protocol is the Modified-Owned-Shared-Invalid (MOSI) protocol, in which a cache line, and more specifically the cache line address, can be tagged with one of four states: Modified (M), Owned (O), Shared (S), and Invalid (I). The state of a cache line is specific to a particular device core/cache pair.
In the MOSI protocol, when a cache line is tagged Modified in the local cache, the cache line has been modified in the local cache by the local device core. In this situation, if a copy of the same cache line, prior to modification, was stored in another cache, that copy (or its associated address) will be tagged as Invalid in the other cache.
When a cache line is tagged Shared in the local cache, the cache line is being shared with one or more other caches and each copy of the data has the same value as the corresponding location in the main memory.
When a cache line is tagged Invalid in the local cache, the cache line that is stored locally in the local cache is invalid (e.g., when a copy of the cache line in another cache has been modified by a remote device core, the local copy will become invalid).
When a cache line is tagged Owned in the local cache, the local device core has the exclusive right to make changes to the cache line and is responsible for responding to any requests for the tagged cache line from, for example, other caches or from direct memory access (DMA) engines. When the local device core writes to the Owned-tagged cache line, the local cache may broadcast those changes to the other caches without updating the main memory. The cache holding the Owned-tagged cache line writes the cache line back to main memory when the cache line is evicted from the cache.
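The state tracking described above can be illustrated with a minimal sketch. This is not any particular hardware implementation; it models only one transition from the description, namely that a local write tags the local copy Modified while tagging copies in remote caches Invalid. The class and function names are hypothetical.

```python
# Illustrative sketch of per-cache MOSI state tagging; each L1 cache
# maps cache-line addresses to one of the four MOSI states.
from enum import Enum

class State(Enum):
    MODIFIED = "M"
    OWNED = "O"
    SHARED = "S"
    INVALID = "I"

class L1Cache:
    def __init__(self):
        self.states = {}  # cache-line address -> State

def local_write(caches, writer, addr):
    # When the local device core modifies a cache line, the local copy
    # is tagged Modified, and any copy of the same line held by a
    # remote cache is tagged Invalid, as described above.
    for cache in caches:
        if cache is writer:
            cache.states[addr] = State.MODIFIED
        elif addr in cache.states:
            cache.states[addr] = State.INVALID
```

For instance, if two caches each hold a Shared copy of the same line and one device core writes to it, the writer's copy becomes Modified and the other copy becomes Invalid.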
When a device core needs to load a cache line that is not currently loaded in the corresponding L1 cache, the device core may need to be stalled while it waits for the L1 cache to fetch the cache line.
There are two sources from which the cache line may be fetched when an L1 cache does not currently have a copy of the cache line. The first source is one of the other L1 caches (e.g., remote caches). The second source is the main memory. Obtaining a copy of the cache line from the main memory may be very time consuming and may reduce system performance.
Computer systems may employ software or hardware prefetching mechanisms in order to reduce the effects of cache miss latencies. These prefetching mechanisms may generate load requests for data before the data is requested by the device core in order to load the data into the cache before the device core actually needs the data. Once the requested data has been loaded to the cache, the device core may access the data in the cache by generating a read or write request.
Some systems may employ hardware-based prefetch methods that follow a fixed sequence, prefetching data in a fixed address pattern. However, such techniques tend to be rigid and not adaptive to changing conditions. As a result, a large amount of data may be prefetched unnecessarily. The unnecessary prefetch requests generated by such mechanisms increase system congestion, and the unnecessarily prefetched data may replace necessary data already in the cache (that is, cache pollution), deteriorating the response time of the device core to subsequent instructions that may need to reuse the necessary cache data.
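The fixed-pattern behavior described above can be sketched as follows. The stride and depth values are illustrative assumptions; the point is that the generated addresses depend only on the fixed pattern, not on whether the device core will actually use the data.

```python
# Sketch of a fixed-stride hardware prefetcher of the kind described
# above: on a miss it unconditionally generates prefetch requests for
# the next DEPTH lines at a fixed stride. DEPTH and the default stride
# are assumed values for illustration.
LINE_SIZE = 64  # assumed cache-line size in bytes
DEPTH = 4       # assumed number of lines prefetched per miss

def fixed_stride_prefetch(miss_address: int, stride: int = LINE_SIZE):
    base = miss_address & ~(LINE_SIZE - 1)
    # A fixed, non-adaptive pattern: the same addresses are generated
    # whether or not the device core ever touches them, which is the
    # source of the congestion and cache pollution discussed above.
    return [base + stride * i for i in range(1, DEPTH + 1)]
```

For a miss at 0x1000 with the assumed parameters, the prefetcher issues requests for 0x1040, 0x1080, 0x10C0, and 0x1100 regardless of the program's actual access pattern.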
In order to improve the performance of currently available prefetching mechanisms, a number of techniques have been proposed to improve or supplement them. For example, one approach is to use a separate prefetching data RAM to avoid cache pollution. However, because of design constraints, the RAM size cannot be very large and, as a result, prefetched cache data has to be replaced frequently. Consequently, a device core request will have little chance of hitting the data in the prefetching data RAM.