Field of the Invention
The present invention relates in general to prefetching information for a processor, and more particularly to a programmable prefetcher that performs prefetch operations according to a programmed prefetch program.
Description of the Related Art
Processors continue to become more powerful with greater performance at higher efficiency levels. The term “processor” as used herein refers to any type of processing unit, including a microprocessor, a central processing unit (CPU), one or more processing cores, a microcontroller, etc. The term “processor” as used herein also includes any type of processor configuration, such as processing units integrated on a chip or integrated circuit (IC) including those incorporated within a system on a chip (SOC) or the like. Semiconductor manufacturing techniques are continually being improved to increase speed, reduce power consumption and reduce the size of circuitry integrated on a processing chip. The reduction of integration size allows additional functionality to be incorporated within the processing unit. Once a conventional processor is manufactured, however, many of its internal functions and operations are essentially fixed.
Memory access latency is a significant factor that impacts processing performance and efficiency. Processing circuitry is often separated from main memory by multiple layers of circuitry and associated access protocols. For example, a processor may be coupled to an external system memory that stores information needed by the processor, such as instructions (e.g., code), data and other information. Access to the external system memory may be relatively slow since the information must often traverse multiple levels of circuitry, such as a bus interface unit and/or a memory controller and the like, and the external devices often operate with a system clock that is slower than the processor or core clock.
In order to improve performance and efficiency, processors typically incorporate one or more levels of cache memory that locally stores information retrieved from external memory for faster access by processing circuitry. Access to an internal cache is substantially faster since the cache is physically closer, has fewer intermediate circuitry levels, and often operates at a faster clock speed. The processor executes load-type instructions with an address for accessing the requested information (e.g., data or instructions). When the requested information is located in an internal cache, resulting in a cache hit, the information is retrieved with minimal latency. Otherwise, a cache miss occurs and the information is retrieved, with greater latency, from higher cache levels and/or system memory located external to the processing core or processor. The retrieved information may be in the form of one or more cache lines incorporating the requested information. As processing continues and as the internal processor caches are filled, an increased percentage of accesses result in cache hits, thereby improving overall processor performance.
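The hit and miss behavior described above can be illustrated with a minimal software sketch. It is not a description of any particular processor: the class name `LineCache`, the 64-byte line size, and the least-recently-used replacement policy are all assumptions made for illustration only.

```python
from collections import OrderedDict


class LineCache:
    """Hypothetical cache model that tracks whole cache lines.

    The 64-byte line size and LRU replacement are assumptions; the text
    does not specify either.
    """

    def __init__(self, capacity_lines: int, line_bytes: int = 64) -> None:
        self.capacity = capacity_lines
        self.line_bytes = line_bytes
        self.lines: "OrderedDict[int, bool]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def load(self, address: int) -> bool:
        """Model a load-type access; return True on a cache hit."""
        line = address // self.line_bytes
        if line in self.lines:
            # Hit: mark the line as most recently used.
            self.lines.move_to_end(line)
            self.hits += 1
            return True
        # Miss: the whole line is brought in from the slower level.
        self.misses += 1
        self.lines[line] = True
        if len(self.lines) > self.capacity:
            # Evict the least recently used line.
            self.lines.popitem(last=False)
        return False
```

In this sketch, a second access to any address within an already loaded line counts as a hit, which mirrors the statement that information is retrieved as whole cache lines.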
Prefetching is a commonly used technique in which blocks of information are retrieved from external system memory in advance and stored into the local processor cache(s) for faster access by the processing circuitry when needed. A “prefetcher” or prefetch engine monitors or tracks information (data and/or code) actually requested by the processor and attempts to anticipate future requests, and then submits requests to retrieve the anticipated information. Performance and efficiency, however, are only improved when the processing circuitry actually requests a significant proportion of the anticipated information in a timely fashion. A prefetching algorithm that does not retrieve the target information or otherwise retrieves too much of the wrong information may not appreciably increase overall performance and efficiency. In fact, inaccurate or otherwise inefficient prefetch algorithms may negatively impact overall performance and efficiency.
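One way to make the notion of a "significant proportion of the anticipated information" concrete is to measure the fraction of prefetched cache lines that are later requested. The sketch below is illustrative only: the function name `prefetch_accuracy` and the use of line numbers as inputs are assumptions, and the measure deliberately ignores timeliness, which the paragraph above also identifies as a requirement.

```python
from typing import Iterable


def prefetch_accuracy(prefetched_lines: Iterable[int],
                      demanded_lines: Iterable[int]) -> float:
    """Fraction of distinct prefetched lines that were actually demanded.

    Hypothetical helper: it does not model whether a prefetched line
    arrived in time, only whether it was ever requested.
    """
    issued = set(prefetched_lines)
    if not issued:
        return 0.0
    useful = issued & set(demanded_lines)
    return len(useful) / len(issued)
```

A low value under this measure corresponds to the case in which a prefetcher retrieves too much of the wrong information, consuming bandwidth and cache capacity without a matching benefit.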
Conventional processors with internal caching mechanisms often include one or more “prefetchers” that are each preconfigured according to a predetermined prefetch algorithm. Many different types of prefetchers are known which vary from relatively simple to somewhat complex. Some prefetchers are based on a relatively simple algorithm, such as determining and fetching based on a stride or cache line offset (e.g., every other cache line or every third or fourth cache line or the like). Other prefetchers are more complex. A bounding box prefetcher, for example, tracks multiple different pattern periods and attempts to identify a clear pattern period used for prefetching. A content-directed prefetcher examines the actual data that has been retrieved in an attempt to identify addresses that will be requested in the near future.
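The simple stride-based approach mentioned above can be sketched in software as follows. This is a minimal illustration under stated assumptions, not any specific hardware design: the class name `StridePrefetcher`, the 64-byte line size, the rule of confirming a stride after seeing it twice in a row, and the `degree` parameter are all hypothetical.

```python
from typing import List, Optional

CACHE_LINE_BYTES = 64  # assumed line size; not specified in the text


class StridePrefetcher:
    """Hypothetical stride detector operating on cache line numbers.

    Once the same nonzero line-to-line stride is seen on two consecutive
    accesses, it predicts the next `degree` lines along that stride.
    """

    def __init__(self, degree: int = 1) -> None:
        self.degree = degree
        self.last_line: Optional[int] = None
        self.last_stride: Optional[int] = None

    def observe(self, address: int) -> List[int]:
        """Record a demand access; return line numbers to prefetch."""
        line = address // CACHE_LINE_BYTES
        prefetches: List[int] = []
        if self.last_line is not None:
            stride = line - self.last_line
            if stride != 0:
                if stride == self.last_stride:
                    # Stride confirmed: request the next lines along it.
                    prefetches = [line + stride * k
                                  for k in range(1, self.degree + 1)]
                self.last_stride = stride
        self.last_line = line
        return prefetches
```

For an access pattern that touches every other cache line, the sketch issues no prefetches on the first two accesses and begins predicting on the third, once the stride of two lines has repeated.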
Although a given prefetcher might work very well for one process (program or application or the like), it may not perform as well, or may even perform very poorly, for another. Some processors may incorporate multiple prefetchers in an attempt to improve performance for a variety of different processes. Although operating multiple prefetchers simultaneously may improve operation for some processes, such improvements may be limited because multiple prefetchers tend to thrash and conflict with each other.