A conventional processor typically operates at a much faster speed than the main memory to which it is coupled. To overcome the inherent latency of main memory, which usually comprises dynamic random access memory (DRAM), a memory hierarchy is employed. The memory hierarchy includes one or more levels of cache, each cache comprising a relatively fast memory device or circuitry configured to hold data recently accessed, or expected to be accessed, by the processor. The purpose of the cache is to ensure that most data needed by a processor is readily available without accessing main memory, because accessing main memory is very slow in comparison to the speed of the processor or the speed at which the processor can access a cache.
Typically, a memory hierarchy comprises multiple levels of cache, wherein each level is faster than the next lower level and the level closest to the processor exhibits the highest speed and performance. A cache may be located on the processor itself (i.e., an “on-chip” cache) or may comprise an external memory device (i.e., an “off-chip” cache). For example, a processor may include a high level on-chip cache, often referred to as an “L1” cache, wherein the processor is coupled with a lower level off-chip cache, often referred to as an “L2” cache. Alternatively, a processor may include an on-chip L1 cache as well as an on-chip L2 cache. Of course, a memory hierarchy may include any suitable number of caches, each of the caches located on-chip or off-chip.
As noted above, each level of cache may hold data recently accessed by the processor, such recently accessed data being highly likely, due to the principles of temporal and spatial locality, to be needed by the processor again in the near future. However, system performance may be further enhanced, and memory latency reduced, by anticipating the needs of a processor. If data needed by a processor in the near future can be predicted with some degree of accuracy, this data can be fetched in advance (or “prefetched”) such that the data is cached and readily available to the processor. Generally, some type of algorithm is utilized to anticipate the needs of a processor, and the value of any prefetching scheme is dependent upon the degree to which these needs can be accurately predicted.
One conventional type of prefetcher is commonly known as a “stride” prefetcher. A stride prefetcher anticipates the needs of a processor by examining the addresses of data requested by the processor (each such request being a “demand load”) to determine if the requested addresses exhibit a regular pattern. If the processor (or an application executing thereon) is stepping through memory using a constant offset from address to address (i.e., a constant stride), the stride prefetcher attempts to recognize this constant stride and prefetch data according to the recognized pattern. Stride prefetchers do, however, exhibit a significant drawback: a stride prefetcher does not function well when the address pattern of a series of demand loads is irregular (i.e., there is no constant stride), such as may occur during dynamic memory allocation.
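By way of illustration only, the stride detection described above may be sketched as follows. This is a minimal single-stream model and not a description of any particular prefetcher: the class name StridePrefetcher, the method observe, and the threshold parameter (the number of times a stride must repeat before a prefetch is issued) are all assumptions introduced for this example.

```python
from typing import Optional


class StridePrefetcher:
    """Toy single-stream stride detector; all names here are hypothetical."""

    def __init__(self, threshold: int = 2) -> None:
        # Assumed policy: the same stride must repeat `threshold` times
        # before any prefetch address is produced.
        self.threshold = threshold
        self.last_addr: Optional[int] = None
        self.last_stride: Optional[int] = None
        self.repeats = 0

    def observe(self, addr: int) -> Optional[int]:
        """Record one demand-load address; return an address to prefetch, or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                self.repeats += 1   # same offset as the previous step
            else:
                self.repeats = 0    # pattern broken; start over
            self.last_stride = stride
            if self.repeats >= self.threshold:
                prediction = addr + stride
        self.last_addr = addr
        return prediction
```

Under these assumptions, a sequence of demand loads at 0x1000, 0x1040, 0x1080, 0x10C0 produces no prediction for the first three addresses and a prefetch of 0x1100 on the fourth. A sequence with no repeating offset never reaches the threshold and produces no prefetches, which is the drawback noted above.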
Another method of data prefetching utilizes a translation look-aside buffer (TLB), which is a cache for virtual-to-physical address translations. According to this method, the “fill contents” (i.e., the requested data) associated with a demand load are examined. If an address-sized data value matches an address contained in the TLB, the data value likely corresponds to a “pointer load” (i.e., a demand load in which the requested data is an address pointing to a memory location) and is, therefore, deemed to be a candidate address. A prefetch request may then be issued for the candidate address. Because the contents of the requested data, as opposed to the addresses thereof, are being examined, this method may be referred to as content-based, or content-aware, prefetching. Such a content-aware prefetching scheme that references the TLB (or, more generally, that references any external source or index of addresses) has a significant limitation: likely addresses are limited to those cached in the TLB, and this constraint significantly reduces the number of prefetch opportunities. Also, this content-aware prefetching scheme requires a large number of accesses to the TLB; thus, additional ports must be added to the TLB to handle the content prefetcher overhead.
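The TLB-referencing scheme described above may be sketched, purely for illustration, as a scan over the fill contents. Several assumptions are made that do not come from the description itself: addresses are taken to be 32 bits wide and little-endian, only aligned words are examined, pages are 4 KiB, and the TLB is modeled as a plain set of virtual page numbers. The names candidate_addresses, tlb_pages, PAGE_SHIFT, and ADDR_BYTES are hypothetical.

```python
import struct
from typing import List, Set

PAGE_SHIFT = 12  # assumption: 4 KiB pages
ADDR_BYTES = 4   # assumption: 32-bit addresses


def candidate_addresses(fill: bytes, tlb_pages: Set[int]) -> List[int]:
    """Return address-sized values in the fill contents whose page is in the TLB."""
    candidates = []
    # Assumption: only aligned, address-sized words are examined.
    for offset in range(0, len(fill) - ADDR_BYTES + 1, ADDR_BYTES):
        (value,) = struct.unpack_from("<I", fill, offset)
        # One TLB lookup per examined word; this is the access
        # overhead that motivates adding extra TLB ports.
        if (value >> PAGE_SHIFT) in tlb_pages:
            candidates.append(value)
    return candidates
```

In this model, a data value that is in fact a pointer but refers to a page not currently held in tlb_pages is never reported as a candidate, which illustrates how restricting likely addresses to those cached in the TLB reduces the number of prefetch opportunities.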