Advances in semiconductor processing and logic design have permitted an increase in the amount of logic that may be present on integrated circuit devices. As a result, computer system configurations have evolved from a single or multiple integrated circuits in a system to multiple cores, multiple hardware threads, and multiple logical processors present on individual integrated circuits. A processor or integrated circuit typically comprises a single physical processor die, where the processor die may include any number of cores, hardware threads, or logical processors.
The ever-increasing number of processing elements (cores, hardware threads, and logical processors) on integrated circuits enables more tasks to be accomplished in parallel. In addition, new techniques, such as prefetching, have been created to keep the multiple processing elements busy and to optimize execution. Prefetching often provides high performance for many workloads once access patterns are identified and data is prefetched before it is demanded by the program, because cache accesses typically have lower latency the closer the cache is to the execution units.
However, prefetch generation for multiple processing elements is complex due to a number of considerations. First, processing elements may run different types of workloads, some of which may benefit from prefetching and others of which may not. Second, prefetches from multiple processing elements may compete for space in shared caches and may displace important data that other processing elements are about to use. Third, a processing element should make full use of available memory bandwidth without generating inefficient prefetches.
As an example, when multiple processing elements generate excess, inefficient prefetches, interconnect bandwidth, such as memory interconnect bandwidth, may become saturated. Furthermore, the excess prefetches potentially pollute the cache memory, which can lead to a loss of performance and wasted power in comparison to generating a smaller, more accurate set of prefetches. These limitations become more acute as the number of processing elements increases.
Yet it is extremely difficult to design prefetchers that are highly accurate. And even if more accuracy is obtainable, prefetched data may be evicted before use. As a result, “bad prefetches” may be due either to poor address stream generation (address space inaccuracy) or to data being evicted before use (temporal inaccuracy). Unfortunately, previous prefetch throttling systems have throttled prefetches based on a direct or indirect indication of the number of prefetches within the prefetch generators themselves, without taking into account bandwidth congestion and prefetch accuracy.
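The distinction between address space inaccuracy and temporal inaccuracy can be illustrated with a minimal sketch. The event trace format, the `Outcome` names, and the `classify_prefetches` function are hypothetical and introduced only for illustration; they do not describe any particular hardware mechanism. The sketch assumes each address is prefetched at most once and treats a prefetched line that is never demanded as address-space inaccurate.

```python
from enum import Enum


class Outcome(Enum):
    USEFUL = "useful"                                # demanded while still cached
    ADDRESS_INACCURATE = "address_inaccurate"        # never demanded at all
    TEMPORALLY_INACCURATE = "temporally_inaccurate"  # demanded only after eviction


def classify_prefetches(events):
    """Classify each prefetched address in a toy event trace.

    events: iterable of (kind, address) pairs, where kind is one of
    "prefetch", "demand", or "evict" (an assumed, illustrative format).
    """
    resident = set()  # prefetched lines still cached and not yet demanded
    evicted = set()   # prefetched lines evicted before any demand
    outcome = {}
    for kind, addr in events:
        if kind == "prefetch":
            resident.add(addr)
            # Assume the prefetch is unused until a demand access shows otherwise.
            outcome[addr] = Outcome.ADDRESS_INACCURATE
        elif kind == "evict" and addr in resident:
            resident.remove(addr)
            evicted.add(addr)
        elif kind == "demand":
            if addr in resident:
                resident.remove(addr)
                outcome[addr] = Outcome.USEFUL
            elif addr in evicted:
                evicted.remove(addr)
                outcome[addr] = Outcome.TEMPORALLY_INACCURATE
    return outcome


trace = [
    ("prefetch", 0x100), ("prefetch", 0x140), ("prefetch", 0x180),
    ("demand", 0x100),                      # used in time
    ("evict", 0x140), ("demand", 0x140),    # evicted, then demanded
    ("evict", 0x180),                       # evicted and never demanded
]
result = classify_prefetches(trace)
```

In this trace, the line at `0x100` is a useful prefetch, the line at `0x140` was correctly predicted but evicted too early, and the line at `0x180` was never needed.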