1. Technical Field
The present invention relates generally to data processing systems and more particularly to fetching data for utilization during data processing. Still more particularly, the present invention relates to data prefetching operations in a data processing system.
2. Description of Related Art
Prefetching of data during data processing is well-known in the art. Conventional computer systems are designed with a memory hierarchy comprising different memory devices with increasing access latency the farther the device is from the processor. The processors typically operate at a very high speed and are capable of executing instructions at such a fast rate that it is necessary to prefetch a sufficient number of cache lines of data from lower level caches (and/or system memory) to avoid the long latencies when a cache miss occurs. This prefetching ensures that the data is ready and available when needed for utilization by the processor.
Data prefetching is a proven, effective way to hide increasing memory latency from the processor's execution units. On these processors, data prefetch requests are issued as early as possible in order to “hide” the cache access latencies and thus allow ensuing dependent data operations (load requests) to execute with minimal delay in the return of data to the execution units. However, early prefetching may result in data being returned prematurely, before the data are required/demanded by the execution units, and the cache line may be replaced in the cache/prefetch buffer before the fetched data is demanded by the processor. The processor then stalls while waiting for the data to be fetched again.
Standard prefetch operations involve a prefetch engine that monitors accesses to the L1 cache and, based on the observed patterns, issues requests for data that is likely to be referenced in the future. If the prefetch request succeeds, the processor's request for data will be resolved by loading the data from the L1 cache on demand, rather than the processor stalling while waiting for the data to be fetched/returned from lower level memory.
When prefetching data, the prefetch engines utilize some set sequence for establishing the stream of cache lines to be fetched. For example, a large number of prefetch engines detect data streams that access cache lines in a sequential manner (e.g., cache line 5, followed by cache line 6, then cache line 7) or in a reverse sequence (cache line 7, followed by cache line 6, then cache line 5). Other prefetch engines detect data streams that are referenced in “strides” (e.g., cache line 5 followed by cache line 8, then cache line 11, where the stride pattern is 3).
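The stride detection described above can be sketched as follows. This is a minimal illustrative model, not the engine's actual implementation; the function names and the prefetch depth are assumptions for the example.

```python
def detect_stride(accesses):
    """Return the constant stride of a sequence of cache-line numbers,
    or None if no constant (nonzero) stride is present.
    A stride of +1 is a sequential stream; -1 is a reverse stream."""
    if len(accesses) < 3:
        return None
    stride = accesses[1] - accesses[0]
    if stride != 0 and all(b - a == stride for a, b in zip(accesses, accesses[1:])):
        return stride
    return None

def next_prefetches(accesses, depth=3):
    """Cache lines an engine would prefetch next for a detected stream.
    `depth` (lines fetched ahead) is an illustrative parameter."""
    stride = detect_stride(accesses)
    if stride is None:
        return []
    last = accesses[-1]
    return [last + stride * i for i in range(1, depth + 1)]
```

For instance, the strided reference pattern 5, 8, 11 from the example above yields a stride of 3, so the next prefetch candidates are lines 14, 17, and 20.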
In order to track information about data streams (i.e., sequences of data references that are somehow correlated), the prefetch engines for some processor configurations, such as the POWER processors of International Business Machines, utilize a series of tables. Specifically, conventional prefetch engines utilize two tables to track current streams of data, including a filter table to identify candidate streams, and a request table to hold currently active streams. Each table provides a queue for holding fetched cache lines, namely the prefetch filter queue and the prefetch request queue. When a miss to cache line A is detected, an entry in the filter table is allocated. If a miss to cache line A+1 follows, the “miss” information is moved to the table that maintains information about current streams, and the prefetch engine begins to issue requests for sequential cache lines (e.g., A+2, A+3 . . . ).
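The two-table promotion scheme above can be sketched in simplified form. The class and method names, the prefetch depth, and the unbounded table sizes are assumptions for illustration; real tables have fixed numbers of slots, as noted below.

```python
class PrefetchEngine:
    """Sketch of the two-table scheme: a filter table holds candidate
    streams, and a request table holds confirmed, active streams."""

    def __init__(self, depth=2):
        self.filter_table = set()   # cache lines of first misses (candidates)
        self.request_table = {}     # active streams: head line -> next line to fetch
        self.depth = depth          # lines prefetched ahead on confirmation

    def on_miss(self, line):
        """Process a cache miss; return the cache lines to prefetch."""
        if line - 1 in self.filter_table:
            # Miss to A+1 after a miss to A confirms the stream:
            # promote it from the filter table to the request table.
            self.filter_table.discard(line - 1)
            prefetches = [line + i for i in range(1, self.depth + 1)]
            self.request_table[line - 1] = line + self.depth
            return prefetches
        # First miss: allocate a filter-table entry for the candidate stream.
        self.filter_table.add(line)
        return []
```

With this sketch, a miss to line A allocates a candidate entry and issues nothing; a following miss to A+1 promotes the stream and issues requests for A+2, A+3, and so on.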
In some scenarios, only one access is required to start a stream. In these situations, some other metric may also be used to start the stream. For example, if the byte accessed is halfway or more through the cache line, the prefetch engine fetches the next line, and if the byte accessed is in the lower half of the cache line, the prefetch engine fetches the previous line. Since there are a limited number of slots in the queues of both the filter and current stream tables, streams may write over other streams and cause prefetches in the replaced stream to stop.
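The half-line direction heuristic above amounts to a one-line test. The cache line size and function name below are illustrative assumptions, not values from the source.

```python
CACHE_LINE_BYTES = 128  # assumed line size for illustration; varies by processor

def single_access_direction(byte_offset):
    """Guess stream direction from a single access, per the half-line
    heuristic: an access in the upper half of the line suggests an
    ascending stream (prefetch the next line, +1); an access in the
    lower half suggests a descending stream (prefetch the previous
    line, -1)."""
    return +1 if byte_offset >= CACHE_LINE_BYTES // 2 else -1
```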
Many modern processors concurrently execute different threads and/or branches within a thread, which requires prefetching of a different stream of data for each thread. With these types of processor and/or processor execution, the prefetch engine has to prefetch data for more than one stream. In some configurations, the prefetch engine performs the prefetching of the different streams in a round-robin manner. This round-robin implementation enables all streams to have consistently equal access to prefetched data. However, some streams are more important than others, particularly those streams whose load data are utilized sooner than the others. With the round-robin implementation, all streams are considered equal, which leads to potential misses for the streams with higher priority (i.e., streams whose data are utilized at a faster rate). Misses may also occur due to data replacement in the small L1 cache for those slower streams whose data are not utilized quickly enough before being replaced by data from the next streams.
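The round-robin scheduling of prefetches across streams can be sketched as below. The function signature, the stream representation, and the slot budget are assumptions for the example; the point is that every stream gets an equal turn regardless of its priority.

```python
from collections import deque

def round_robin_issue(streams, slots):
    """Issue up to `slots` prefetch requests, taking one request from
    each stream in turn. `streams` is a list of (name, iterator) pairs,
    where the iterator yields the stream's next cache-line addresses.
    Note that no stream is favored, even if its data is consumed faster."""
    queue = deque(streams)
    issued = []
    while queue and len(issued) < slots:
        name, lines = queue.popleft()
        try:
            issued.append((name, next(lines)))
            queue.append((name, lines))  # rotate the stream to the back
        except StopIteration:
            pass  # stream exhausted; drop it from the rotation
    return issued
```

With two streams A and B and a budget of three slots, the engine issues A, B, A in turn, so a high-priority stream receives no more bandwidth than a low-priority one.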