1. Field of the Invention
The invention relates generally to digital electronic circuitry and, more particularly, to a data prefetch mechanism for different types of workloads in a microprocessor.
2. Description of the Related Art
A known technique to increase processor performance is to layer a memory subsystem into several levels of memory known as caches. The caches nearest the main central processing unit (CPU) are typically small and fast. The caches further away from the main CPU get larger and slower.
Another known technique to improve processor performance is to prefetch data into a cache that is closest to the main CPU. This technique helps eliminate latency for fetching data from a remote cache or memory. Many instruction set architectures have added instructions used to prefetch data from a memory into the processor's cache hierarchy. If software can predict far enough in advance the memory locations that a program will subsequently use, these instructions can be used to effectively eliminate the cache miss latency.
One way of providing software prefetching has been classified as synchronous, software-directed prefetching. Prefetching is considered synchronous when the prefetch hint specifies a small amount of memory, usually a single cache line, and the instruction can be executed in program order like any other load instruction. The data cache block touch (DCBT) instructions in the PowerPC™ architecture are examples of synchronous prefetch instructions.
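As an illustration only, the synchronous style described above can be sketched in C using the GCC/Clang builtin `__builtin_prefetch`, which a compiler may lower to a single-line prefetch instruction such as DCBT when targeting PowerPC. The function name `sum_with_prefetch`, the 128-byte line size, and the prefetch distance of four lines are assumptions made for this sketch and do not come from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache line size in bytes; real hardware varies. */
#define CACHE_LINE_BYTES 128
/* Hypothetical prefetch distance, measured in cache lines. */
#define PREFETCH_LINES_AHEAD 4

/* Sum an array, issuing one single-line prefetch hint per line's worth
 * of elements. Each hint executes in program order, like a load. */
long sum_with_prefetch(const int *data, size_t n)
{
    const size_t ints_per_line = CACHE_LINE_BYTES / sizeof(int);
    long total = 0;

    assert(data != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (i % ints_per_line == 0) {
            size_t ahead = i + PREFETCH_LINES_AHEAD * ints_per_line;
            if (ahead < n) {
                /* Arguments: address, 0 = prefetch for read,
                 * 3 = high temporal locality (the builtin's default). */
                __builtin_prefetch(&data[ahead], 0, 3);
            }
        }
        total += data[i];
    }
    return total;
}
```

The hint does not change the result of the computation; it only requests that a line be brought closer to the CPU before the demand load reaches it.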
Another class of prefetch instructions is considered asynchronous. These instructions can specify a very large amount of memory to be prefetched in increments of cache line units. A stream controller that runs independently of normal load and store instructions can be used to perform the prefetching.
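The asynchronous style can be pictured with a small software model, offered here as a sketch rather than as a description of any particular hardware. A stream is described by a starting address and a length, and a controller step emits one cache line address at a time, independently of the loads and stores of the program. The type `prefetch_stream`, the functions `stream_start` and `stream_step`, and the 128-byte line size are all hypothetical names and values chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 128u  /* assumed line size (power of two) */

/* Hypothetical descriptor for one asynchronous prefetch stream. */
struct prefetch_stream {
    uintptr_t next_line;  /* line-aligned address of the next prefetch */
    size_t lines_left;    /* cache lines remaining in the stream */
};

/* Describe a stream covering [start, start + bytes), widened to whole
 * cache lines. */
struct prefetch_stream stream_start(uintptr_t start, size_t bytes)
{
    struct prefetch_stream s;
    uintptr_t first = start & ~(uintptr_t)(CACHE_LINE_BYTES - 1);
    uintptr_t end = start + bytes;

    s.next_line = first;
    s.lines_left = (bytes == 0)
        ? 0
        : (size_t)((end - first + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES);
    return s;
}

/* One controller step: write the next line address to *line_out and
 * return 1, or return 0 once the stream is exhausted. */
int stream_step(struct prefetch_stream *s, uintptr_t *line_out)
{
    assert(s != NULL && line_out != NULL);
    if (s->lines_left == 0)
        return 0;
    *line_out = s->next_line;
    s->next_line += CACHE_LINE_BYTES;
    s->lines_left--;
    return 1;
}
```

In this model a single `stream_start` call stands in for one asynchronous prefetch instruction, and repeated `stream_step` calls stand in for the stream controller working through the region one cache line at a time.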
Certain workloads can take advantage of software and hardware prefetch techniques most effectively. One such type of workload in microprocessor systems is referred to as streaming data. In this type of workload, large amounts of data are streamed into the microprocessor core, often used for a one-time computation, and then transferred back out. This is common for graphics workloads. Quite often, the working set of such streaming data is much larger than the capacity of the Level 1 (L1) cache.
Previous streaming mechanisms have used hardware-intensive data prefetch mechanisms to stream data into the L1 cache. These schemes involve extensive additional hardware to detect streams and prefetch ahead of demand loads. Also, when data is streamed into the L1 cache, it often displaces other useful data in the cache.
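The displacement effect noted above can be demonstrated with a toy cache model. The sketch below assumes a direct-mapped cache with 256 sets of 128-byte lines (32 KiB in total); these parameters, and the names `toy_cache`, `toy_access`, and `toy_touch_range`, are assumptions for illustration and are not taken from any particular processor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 128u  /* assumed line size */
#define NUM_SETS   256u  /* assumed: 32 KiB direct-mapped cache */

/* Toy direct-mapped cache; zero-initialize before use. */
struct toy_cache {
    uintptr_t line[NUM_SETS];  /* line number held by each set */
    int valid[NUM_SETS];       /* nonzero once the set holds a line */
};

/* Access one address: return 1 on a hit, or 0 on a miss, in which case
 * the line is filled and replaces whatever the set held before. */
int toy_access(struct toy_cache *c, uintptr_t addr)
{
    uintptr_t line = addr / LINE_BYTES;
    size_t set = (size_t)(line % NUM_SETS);

    assert(c != NULL);
    if (c->valid[set] && c->line[set] == line)
        return 1;
    c->valid[set] = 1;
    c->line[set] = line;
    return 0;
}

/* Touch every line in [base, base + bytes) and return the hit count. */
size_t toy_touch_range(struct toy_cache *c, uintptr_t base, size_t bytes)
{
    size_t hits = 0;
    for (size_t off = 0; off < bytes; off += LINE_BYTES)
        hits += (size_t)toy_access(c, base + off);
    return hits;
}
```

In this model, a 4 KiB region that has been touched once hits on every line when touched again. After a 64 KiB stream, twice the capacity of the toy cache, passes through, the same 4 KiB region misses on every line, because the single-use streaming data has replaced it.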
Therefore, there is a need for a mechanism for providing different data prefetch schemes for different types of workloads.