1. Field of the Invention
This invention relates to computing systems, and more particularly, to efficient cache line prefetching.
2. Description of the Relevant Art
Modern microprocessors may include one or more processor cores, or processors, wherein each processor is capable of executing instructions of a software application. These processors are typically pipelined, wherein the processors include one or more data processing stages connected in series with storage elements (e.g. registers and arrays) placed between the stages. Ideally, every clock cycle produces useful execution of an instruction for each stage of a pipeline. However, a stall in a pipeline may cause no useful work to be performed during that particular pipeline stage. One example of a stall, which typically is a multi-cycle stall, is a data-cache or an instruction-cache miss. There may be a substantial latency associated with retrieving data from higher level caches and/or system memory. This latency, which is the total number of processor cycles required to retrieve data from memory, has been growing rapidly as processor frequencies have increased faster than system memory access times.
In various embodiments, system memory may comprise two or more levels of cache hierarchy for a processor. Later levels in the hierarchy of the system memory may include access via a memory controller to dynamic random-access memory (DRAM), dual in-line memory modules (DIMMs), a hard disk, or otherwise. Access to these lower levels of memory may require a significant number of clock cycles. The multiple levels of caches that may be shared among multiple cores on a multi-core microprocessor help to alleviate this latency when there is a cache hit. However, as cache sizes increase and later levels of the cache hierarchy are placed farther away from the processor core(s), the latency to determine if a requested memory line exists in a cache also increases. Should a processor core have a memory request followed by a serial or parallel access of each level of cache where there is no hit, followed by a DRAM access, the overall latency to service the memory request may become significant.
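As a rough illustration of how these latencies accumulate, the following Python sketch totals the cost of a request that misses every cache level before reaching DRAM. The cycle counts and function names are hypothetical assumptions chosen only to make the arithmetic concrete; they do not describe any particular processor.

```python
# Hypothetical lookup latencies, in processor cycles (illustrative values only).
LOOKUP_CYCLES = {"L1": 4, "L2": 12, "L3": 40}
DRAM_CYCLES = 200  # assumed cost of the DRAM access itself


def serial_miss_latency(lookup_cycles, dram_cycles):
    """Cycles when each cache level is probed in turn, all miss, then DRAM is accessed."""
    return sum(lookup_cycles.values()) + dram_cycles


def parallel_miss_latency(lookup_cycles, dram_cycles):
    """Cycles when all levels are probed at once: the slowest lookup gates the DRAM access."""
    return max(lookup_cycles.values()) + dram_cycles


serial_total = serial_miss_latency(LOOKUP_CYCLES, DRAM_CYCLES)
parallel_total = parallel_miss_latency(LOOKUP_CYCLES, DRAM_CYCLES)
```

Under these assumed numbers the serial case costs 256 cycles and the parallel case 240 cycles, so in either case the request pays for the cache lookups in addition to the DRAM access.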
One solution for reducing overall performance decline due to the above problem is overlapping a cache line fill transaction resulting from a cache miss with out-of-order execution of multiple instructions per clock cycle. However, a stall of several clock cycles still reduces the performance of the processor due to in-order retirement that may prevent complete overlap of the stall cycles with useful work. Another solution is to use a speculative prefetch request to lower level memory, such as DRAM, of a predetermined number of cache lines ahead of the data currently being processed. This prefetch request may be in series or in parallel with the current memory request to the cache subsystem of one or more levels. Therefore, after the current memory request, the latency to service subsequent memory requests from the memory hierarchy may be greatly reduced. The data may already be residing in the cache, in the memory controller, or may shortly arrive in the memory controller due to the earlier speculative prefetch request.
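The notion of prefetching a predetermined number of cache lines ahead of the current data may be sketched as follows. This is a minimal Python illustration, assuming a 64-byte cache line and a prefetch depth of four lines; both values and the function name prefetch_addresses are hypothetical.

```python
LINE_BYTES = 64      # assumed cache line size
PREFETCH_DEPTH = 4   # assumed "predetermined number" of lines to run ahead


def prefetch_addresses(demand_addr, depth=PREFETCH_DEPTH, line_bytes=LINE_BYTES):
    """Return the line-aligned addresses of the next `depth` cache lines."""
    base = demand_addr - (demand_addr % line_bytes)  # align down to the line start
    return [base + k * line_bytes for k in range(1, depth + 1)]


# A demand access in the middle of the line starting at 0x1000.
targets = prefetch_addresses(0x1010)
```

For the demand address 0x1010, the sketch yields the four lines starting at 0x1040, 0x1080, 0x10C0 and 0x1100, which are the lines a later sequential access would otherwise have to wait for.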
A data stream prefetch unit has been used to detect data streams. A data stream may be defined as a sequence of memory accesses referencing contiguous blocks of data. The contiguous blocks of data may be stored in one or more levels of a memory hierarchy. For example, the blocks of data may be stored in main memory and may be read out and sent to one or more caches in higher levels of a memory hierarchy. The conveying of this contiguous block of data from lower levels to higher levels of a memory hierarchy may be due to a cache line fill transaction. Alternatively, the conveying of this contiguous block of data may be due to a prefetch transaction. In one example, a data stream may be used in the execution of an algorithm for sharpening images or pixels. Such an algorithm may use the following expression in a loop: a[i]=b[i]+c[i].
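To make the loop expression concrete, the following Python sketch lists the cache lines touched when a[i]=b[i]+c[i] walks three arrays. The base addresses, the 4-byte element size and the 64-byte line size are assumptions made only for illustration.

```python
LINE_BYTES = 64   # assumed cache line size
ELEM_BYTES = 4    # assumed element size (for example, 32-bit values)


def line_trace(base_addr, n_elems):
    """Distinct cache line numbers touched, in order, while walking an array."""
    lines = []
    for i in range(n_elems):
        line = (base_addr + i * ELEM_BYTES) // LINE_BYTES
        if not lines or lines[-1] != line:
            lines.append(line)
    return lines


# Hypothetical base addresses for the arrays a, b and c.
A_BASE, B_BASE, C_BASE = 0x10000, 0x20000, 0x30000
N = 64  # loop trip count

a_lines = line_trace(A_BASE, N)  # store stream (a[i] is written)
b_lines = line_trace(B_BASE, N)  # load stream (b[i] is read)
c_lines = line_trace(C_BASE, N)  # load stream (c[i] is read)
```

With these assumptions each array spans four consecutive cache lines, so the loop produces three separate data streams: two made of loads and one made of stores.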
Detecting a data stream may include identifying a sequence of memory accesses referencing a contiguous set of cache lines in a monotonically increasing or decreasing manner. In response to detecting a data stream, a data stream prefetch unit may begin prefetching a predetermined number of cache lines ahead of the currently requested cache line. As used herein, a store operation, or instruction, is a write access while a load operation, or instruction, is a read access. A data stream prefetch unit tracks a data stream with interspersed load and store accesses (henceforth called a mixed access data stream) and ignores the type of access (load or store) of the miss address. Because the access type is ignored, the data stream prefetch unit prefetches all cache lines in a read-only state. This read-only state may, for example, be associated with a MOESI cache coherency protocol. A first store operation in the demand access stream from the processor that hits on a prefetched line is required to issue a state change request. The state change request acquires permission to write to the cache line. Because this request adds latency to the store, it reduces the benefit of having prefetched the cache line.
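The behavior described above may be sketched as follows. This is a simplified Python model, not a description of any actual prefetch unit: the class name, the training threshold, the prefetch depth and the "RO"/"RW" state labels are all hypothetical, and for brevity the model trains on every demand access rather than only on miss addresses.

```python
LINE_BYTES = 64      # assumed cache line size
TRAIN_STEPS = 2      # assumed number of same-direction steps that confirm a stream
PREFETCH_DEPTH = 2   # assumed number of lines prefetched ahead
READ_ONLY, WRITABLE = "RO", "RW"  # simplified stand-ins for coherency states


class StreamPrefetcher:
    def __init__(self):
        self.cache = {}        # line number -> READ_ONLY or WRITABLE
        self.last_line = None
        self.direction = 0     # +1 increasing, -1 decreasing, 0 unknown
        self.run = 0           # consecutive steps seen in `direction`
        self.misses = 0        # demand misses
        self.upgrades = 0      # state change requests caused by stores

    def access(self, addr, is_store):
        line = addr // LINE_BYTES
        state = self.cache.get(line)
        if state is None:
            # Demand miss: the fill knows the access type.
            self.misses += 1
            self.cache[line] = WRITABLE if is_store else READ_ONLY
        elif is_store and state == READ_ONLY:
            # Store hit on a read-only line: must request write permission.
            self.upgrades += 1
            self.cache[line] = WRITABLE
        self._track(line)

    def _track(self, line):
        if line == self.last_line:
            return
        step = 0 if self.last_line is None else line - self.last_line
        if step in (1, -1):
            self.run = self.run + 1 if step == self.direction else 1
            self.direction = step
        else:
            self.run, self.direction = 0, 0
        self.last_line = line
        if self.run >= TRAIN_STEPS:
            # Stream confirmed: prefetch ahead, ignoring the access type.
            for k in range(1, PREFETCH_DEPTH + 1):
                target = line + k * self.direction
                if target >= 0:
                    self.cache.setdefault(target, READ_ONLY)


# Mixed access data stream: loads to even lines, stores to odd lines.
p = StreamPrefetcher()
for n in range(8):
    p.access(n * LINE_BYTES, is_store=(n % 2 == 1))
```

In this run the first three lines miss while the stream is being detected, and the remaining five hit on prefetched lines. Three of those hits are stores (lines 3, 5 and 7), and each one still incurs a state change request because the line was prefetched read-only.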
In view of the above, efficient methods and mechanisms for cache line prefetching are desired.