1. Field of the Invention
This invention relates generally to computer system optimization and, more particularly, to data prefetching methods and apparatus using data compression principles for prefetching data into cache.
2. Description of the Related Art
Prefetching data into cache is one approach to preventing Central Processing Unit (CPU) stalls in modern computer systems. Cache is a relatively small high-speed random access memory that is paged out of main memory and holds the instructions and data most likely to be accessed in the future. When the processor issues an instruction, data is fetched from memory. If the data is not in the cache, a request is made to the main memory to fetch the data. If the slower main memory is accessed, the computer system slows down, causing the CPU to stall.
Prefetching involves a prediction process: given a current memory state, the prefetch hardware determines a future referenced memory address and stores the predicted future reference in fast memory, e.g., cache. By doing so, accesses to the slower main memory are minimized, preventing CPU stalls and improving the overall speed of the computer system.
Several prefetching approaches have been developed over the years. One class of prefetching approaches is termed fixed prefetching. Fixed prefetching involves fetching a fixed function of the sequence of past misses. Examples of fixed prefetching algorithms include stride predictors and stream buffers. Stride predictors prefetch data references characterized by regular strides or address offsets. Stream buffers are designed to prefetch sequential streams of cache lines independent of program content.
Stride predictors tend to be more accurate than stream buffers. Stream buffers, on the other hand, tend to capture more cache lines and use resources more efficiently. Fixed prefetching algorithms like stride predictors and stream buffers work reasonably well on structured workloads. However, system performance degrades when fixed prefetching is used in combination with dynamically changing workloads.
Another class of prefetching approaches is termed adaptive prefetching. Adaptive prefetching algorithms learn the behavior of a sequence of memory misses and then predict and prefetch data accordingly. An example of adaptive prefetching is described in U.S. Pat. No. 5,485,609 to Vitter et al., issued Jan. 16, 1996, titled xe2x80x9cOnline Background Predictors and Prefetchers for Locality Management.xe2x80x9d Vitter teaches an online prediction technique based on data compression principles for prefetching pages from memory. The Vitter prediction technique uses a Lempel-Ziv (xe2x80x9cLZxe2x80x9d) data compression algorithm to make predictions based on past history and other information available at the time of the prediction.
One disadvantage of the Vitter invention is that it makes predictions over a set of past events, so that it can only predict events seen before: i.e., it cannot predict unknown or never-before-seen events. Moreover, predicting the next most likely event prevents CPU stalls only if the page is prefetched from memory before it is actually requested by the CPU. Thus, CPU time is only partially spared if the CPU""s request for data occurs before the data is actually loaded into cache, even if the prediction is accurate.
There are a number of sources for prefetching information, including the address reference stream. Using the address reference stream has the disadvantage of requiring the prefetcher hardware to be located on the same chip as the CPU. Additionally, the prefetcher hardware needs to be highly efficient since the predictor analyzes a large number of references.
Accordingly, a need remains for a prefetch method and apparatus that is accurate, timely, and provides greater coverage. A need also remains for a prefetch method and apparatus that can predict events never before observed and that use the available resources more efficiently, and that can completely prefetch the data before the CPU needs it.
An embodiment of the present invention includes a prefetch method for prefetching data into cache. The prefetch method is adaptive, but it also includes a fixed stage that generates a rough estimate of the address to be prefetched, which is subsequently refined. The adaptive stage of the prefetch method includes building a dynamic data structure in tree form, the tree having a root, a plurality of internal nodes, a plurality of branches, and a plurality of leaves. The root is also known as the root node, and leaves are also known as external nodes. Within this application, the term xe2x80x9cnodexe2x80x9d generally refers to the root node, internal nodes, and leaves. The branches emanate from the internal nodes and the root; the leaves have no outgoing branches. The root has no incoming branch; each internal node and leaf has exactly one incoming branch. The branches represent address differences between cache misses and rough predictions; each internal node and leaf stores a corresponding weight of its incoming branch. The method traverses the tree so that the current node corresponds to index j. The method receives an address difference at the current node and updates branches from the current node up toward the root of the tree. The method maintains the data structure in a dynamic fashion. New nodes and branches are created and deleted, and weights are updated at various stages during data processing. The method traverses the tree in the direction of the address difference until the method reaches a new current node. The method uses the weights of the branches at the new current node and prefetches data accordingly.
Another aspect of the present invention is its capacity to predict events that have not been seen before.
Another aspect of the present invention is its capacity to prefetch cache lines far enough in advance to prevent CPU stalls, while simultaneously minimizing the cache lines"" residency in the prefetch cache. Although Vitter teaches cache line prediction, Vitter does not teach a prediction technique that loads cache lines sufficiently early to completely prevent CPU stalls. For example, Vitter will always tend to predict the next cache line needed. In contrast, the present invention will predict cache lines several steps in advance if so dictated by the memory latency.
Another embodiment of the present invention includes a method for optimizing computer system performance by taking actions that sequentially minimize a total loss function. Here, a xe2x80x9clossxe2x80x9d is a quantity whose minimization results in a system performance benefit. For example, some measurement of expected stall time can be taken as a xe2x80x9closs.xe2x80x9d The total loss L is an accumulation of the losses produced by individual outcomes in accordance with       L    =                  ∑                  j          =          0                n            ⁢              xe2x80x83            ⁢              L        ⁢                  xe2x80x83                ⁢                  (                      x            j                    )                                L      ⁢              xe2x80x83            ⁢              (                  x          j                )              =                  ∑                  i          =          0                                      sl            ⁡                          (              j              )                                -          1                    ⁢              xe2x80x83            ⁢                        C          i                ⁢                  xe2x80x83                ⁢                  (                                    b                              j                -                i                                      ,                          x              j                                )                    
where bjxe2x88x92i is the (jxe2x88x92i)th action taken, xj is the jth outcome received, and sl(j) (which may depend on j) is the number of past actions contributing to the loss caused by xj, sl(j)xe2x89xa6j, each action bjxe2x88x92i contributing Ci(bjxe2x88x92i, xj). For every j, 1xe2x89xa6jxe2x89xa6n, action bj may depend on x0, x1, . . . , xjxe2x88x921, but is taken before seeing xj. The possible actions are assumed to be in one-to-one correspondence with the possible outcomes, and for each outcome x there is one (and only one) action b for which Ci(b, x)=0. The method constructs a data structure in tree form, the tree having a root, a plurality of internal nodes, a plurality of branches, and a plurality of leaves. The branches emanate from the internal nodes and the root; the leaves have no outgoing branches. The root has no incoming branch; each internal node and leaf has exactly one incoming branch. The branches represent differences between corresponding outcomes xj and rough predictions yj, where yj depends on x0, x1, . . . , xjxe2x88x921 through a fixed, predetermined function; each internal node and leaf stores the corresponding weight of its incoming branch. The method traverses the tree so that the current node corresponds to the index j. The method receives an outcome xj at the current node and updates branches from the current node up toward the root of the tree. The method traverses the tree in the direction of the branch corresponding to outcome xj until the method reaches a new current node. The method uses the weights of the branches at the new current node and takes an action accordingly. A branch emanating from node j, representing a difference xjxe2x88x92yj, also represents the action bj corresponding to outcome xj.
Updating the branch on the tree includes associating the index j to the current node, associating the indices jxe2x88x92i, 0xe2x89xa6i less than sl, to the nodes on the path from the root to the current node, and incrementing the weights corresponding to the branches associated with xjxe2x88x92yjxe2x88x92i at the nodes associated with the indices jxe2x88x92i by an increment amount. The method uses the rough predictor to determine which branches are incremented. Incrementing the weights includes incrementing the weight for one branch at each node jxe2x88x92i, 0xe2x89xa6i less than sl(j), where the branch incremented corresponds to the action b* for which Ci(b*, xj)=0. Note that the action b* that produces a zero loss for Ci(b*, xj) depends on both the outcome xj and the node jxe2x88x92i; b* can vary for different xj and different jxe2x88x92i, 0xe2x89xa6i less than sl(j).
Another embodiment of the present invention includes a reset function. In a practical implementation, the amount of storage space available for the nodes in the tree data structure is finite. The system starts with a minimal tree (e.g., just the root, or the root and a small number of nodes), and builds it up as new data (i.e., memory addresses) are processed. Eventually, the tree will grow to fill the available storage space. At that point, the data structure needs to be reset, to allow for further adaptation to new data. The reset function selectively deletes nodes from the data structure. Various reset functions are possible, striking a balance between preserving useful information obtained from past data and allowing growth space for new information.
Another embodiment of the present invention includes a victim cache. A xe2x80x9cvictim cachexe2x80x9d is known in the art as a buffer that contains lines removed from the main cache, often but not necessarily because of associativity limits in the main cache, in the hopes that these lines will be needed again soon. The victim cache thus reduces conflict misses, which makes a cache with low associativity appear to have higher associativity. A victim cache includes xe2x80x9cdirtyxe2x80x9d cache lines: i.e., cache lines containing data that have not been written back to main memory. The prefetch cache can be combined with the victim cache and thus yield benefits of both prefetching and apparent higher associativity simultaneously. When the present invention includes a victim cache, the prefetch method must factor in the time needed to write xe2x80x9cdirtyxe2x80x9d cache lines back to main memory.
The foregoing and other objects, features, and advantages of the invention will become more readily apparent from the following detailed description of a preferred embodiment that proceeds with reference to the following drawings.