This invention was made with Government support under Grant No. N00014-92-J-1807 awarded by the Office of Naval Research. The Government has certain rights in this invention.
1. Field of the Invention
The present invention pertains generally to a method for efficiently tracking and predicting events in a computer system primarily for activities such as pre-fetching and replacement decisions in cache systems (e.g., file system buffer caches, distributed system event caches, and disk controller input/output (I/O) caches), and more particularly to reducing computational complexity and memory requirements by placing static limits on subtrie size and by paging subtries with the events with which they are associated. The invention also provides the basis for computationally efficient data compression.
2. Description of the Background Art
Finite multi-order context models for event prediction are widely used in data compression, and Prediction by Partial Match (PPM) is a well-known method for such use. PPM maintains a data structure with a plurality of substructures based on the previously seen nature of a system. An example of such a data structure is a trie having a plurality of nodes in which each node represents a sequence of events that has occurred at least once. Each node can be thought of as representing an individual event that occurred after the sequence represented by its parent, and contains a count of the number of times that its sequence has occurred. As new events occur, the trie is updated by recording the sequence of events that have occurred. At any point in the stream of events, the next event can be described as occurring after the sequence of events that has just occurred. These sequences can also be considered as part of the system's current state and can vary depending upon the length of sequences chosen. For example, for the sequence of events CACBCA, the next event can be described as occurring after A, CA, BCA, CBCA, ACBCA or CACBCA. The length of these sequences is called their "order" and the sequences that describe the conditions under which the next event will occur are called "contexts." In the example above, the sequence BCA would be called a third order context. To prevent the trie from growing too large, the maximum order or length of the sequences tracked is generally limited. For example, where m is used to denote this limit, there would be a total of m+1 contexts (0 through m) that describe the file system's current state at any point in time. In order to update the trie and use it to determine future events, an array of pointers 0 through m is typically maintained to point to the current contexts (C_0 through C_m). With each new event seen, a new current state is determined by updating each context in this array to reflect the addition of the new event.
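The notion of current contexts can be illustrated with a minimal sketch. The function name current_contexts and the use of a string to hold the event history are assumptions made only for this illustration; the order-0 context is taken to be the empty sequence.

```python
def current_contexts(history, max_order):
    """Return the current contexts of order 0 through max_order.

    The order-k context is the last k events of `history`; order 0 is the
    empty sequence. Fewer contexts are returned while the history is
    shorter than max_order.
    """
    top = min(max_order, len(history))
    return [history[len(history) - k:] for k in range(top + 1)]


# Event stream from the example above, with the order limit m = 3.
contexts = current_contexts("CACBCA", max_order=3)
print(contexts)  # ['', 'A', 'CA', 'BCA']
```

With m = 3 the sketch yields m+1 = 4 contexts, the last of which is the third order context BCA.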
A new C_k is then generated by examining the children of the old C_{k-1}, searching for a child that represents the new event. If such a child exists, then the new context (or sequence) C_k has occurred before, and is represented by this child's node. In that case, the k-th element of the array is set to point to this child's node and its count is incremented. If no such child is found, then this is the first time that such a context has occurred, and a child is created to represent this event and the k-th element of the array is set to point to its node. An important property of the trie is that the frequency count for each current context is equal to the sum of its children's counts plus one. As each new context is generated, the children of that context, if any, are examined to determine how likely they are to be the next event. Using the relationship Count_child / (Count_parent - 1), a maximum likelihood estimation of the probability of the child's event occurring can be generated.
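The update and estimation steps described above can be sketched as follows. This is an illustrative sketch only, not a definitive implementation: the names Node, ContextTrie, update and predict are hypothetical, children are held in a dictionary, and the root count is initialized to one (an assumption of this sketch) so that the "sum of children's counts plus one" property also holds for the order-0 context.

```python
class Node:
    """One trie node: an event seen after the sequence of its parent."""

    def __init__(self):
        self.count = 0
        self.children = {}  # event -> Node


class ContextTrie:
    def __init__(self, max_order):
        self.max_order = max_order
        self.root = Node()
        self.root.count = 1  # assumption: the empty context counts once at start
        # contexts[k] is the current order-k context, or None until at
        # least k events have been seen.
        self.contexts = [self.root] + [None] * max_order

    def update(self, event):
        # Go from the highest order down so contexts[k - 1] is still the
        # OLD order-(k-1) context when the new order-k context is built.
        for k in range(self.max_order, 0, -1):
            parent = self.contexts[k - 1]
            if parent is None:
                continue
            child = parent.children.get(event)
            if child is None:  # first occurrence of this context
                child = parent.children[event] = Node()
            child.count += 1
            self.contexts[k] = child
        self.root.count += 1

    def predict(self, k):
        """Estimate next-event probabilities from the order-k context."""
        ctx = self.contexts[k]
        if ctx is None or not ctx.children:
            return {}
        return {e: c.count / (ctx.count - 1) for e, c in ctx.children.items()}


trie = ContextTrie(max_order=2)
for event in "CACBCA":
    trie.update(event)
print(trie.predict(1))  # {'C': 1.0}
```

After the stream CACBCA, the first order context A has a count of two and a single child C with a count of one, so the estimate for C is 1/(2-1). Nodes at the maximum order never acquire children in this sketch, so predict returns an empty result for them.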
With the rapid increase of processor speeds, the bottleneck of input/output (I/O) and network system latency has become a critical issue in computer system performance. Standard least recently used (LRU) based caching techniques offer some assistance but, by ignoring any relationships that exist between system events, they fail to make full use of the information available. As an alternative to LRU-based caching techniques, data compression event modeling techniques such as PPM, described above, have been used to drive a virtual memory cache where a finite number of most probable items are pre-fetched. A significant problem with using models such as PPM, however, is that the size of the data structures needed for such a model can quickly become too large for many applications. The resulting space requirements and associated computational complexity make models such as PPM impractical. As a result, in applications where the number of different possible events (e.g., letters of the alphabet or blocks on a disk) is large, such predictive methods have seen only limited success.
It is also known to accumulate frequency counts of files that fall within a window after each file access, and to use the frequency counts to drive a pre-fetching cache. Another known approach is to have the application program inform the file system which files to pre-fetch. This approach, however, requires the application programs to be modified and fails to make use of any relationships that exist across applications.
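The window-based frequency counting mentioned above might be sketched as follows. The text does not specify how the window is measured; this sketch assumes it is a fixed number of subsequent accesses. The names window_counts and prefetch_candidates are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def window_counts(accesses, window):
    """counts[f][g] = times g was accessed within `window` accesses after f."""
    counts = defaultdict(Counter)
    for i, f in enumerate(accesses):
        for g in accesses[i + 1 : i + 1 + window]:
            counts[f][g] += 1
    return counts


def prefetch_candidates(counts, f, n):
    """Return up to n files most often seen shortly after an access to f."""
    return [g for g, _ in counts.get(f, Counter()).most_common(n)]


# Assumed example trace of file accesses, with a window of two accesses.
trace = ["a", "b", "c", "a", "b", "d"]
counts = window_counts(trace, window=2)
print(prefetch_candidates(counts, "a", 1))  # ['b']
```

In this trace, file b falls within the window after both accesses to a, so it is the first candidate to pre-fetch when a is accessed.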
Therefore, there is a need for a method for predicting sequences of events using predictive models such as PPM but without the inherent computational complexity or memory requirements. The present invention satisfies those needs, as well as others, and overcomes the deficiencies found in previously developed predictive techniques.