The present invention relates to cache management and, more particularly, to managing replacement of data in a cache on a node based on caches of other nodes.
A cluster is a collection of nodes, each with a volatile dynamic memory device ("dynamic memory") that is shared by one or more CPUs. Each node in a cluster is connected to the other nodes in the cluster via an interconnect. Typically, each node in the cluster shares resources with other nodes in the cluster. Such resources may be, for example, data items, such as pages stored on a static memory device (e.g. disk storage devices).
Software, referred to as an application, runs on each of the nodes. The application operates upon data from shared data resources. For example, a database system may run on each node in the cluster and manage data stored on data blocks on the disk storage devices.
To speed access to data items needed by an application running on a node in a cluster, copies of the data items are kept in the cache of the node. A cache is a storage medium used to store a copy of a data item for efficient access, where that data item may be obtained less efficiently from another source. The other, less-efficiently-accessed source is herein referred to as a secondary data resource. For example, a database system running on a node in a cluster typically stores data blocks durably on disk. However, the database system loads copies of the data blocks into a cache in dynamic memory to access them. Similarly, a browser running on a personal computer stores frequently accessed web pages, obtained over a network, in a cache in the form of disk files.
A cache on a node is managed by a cache manager. A cache manager is a system of software that attempts to manage a cache in a manner that reduces input and output between the cache and a secondary data resource. Typically, the cache manager keeps a copy of the most frequently accessed data items in the cache. In addition, a cache manager may track modifications made by an application to a data item, and may arrange, through "write back", "write through", and various logging mechanisms, that each data item is durably stored in a secondary data resource (e.g. disk storage device).
Data items frequently accessed by a particular application are collectively referred to as the working set of the application. When the working set of an application is too big to fit into the cache on a node on which the application is running, the application thrashes. Thrashing involves replacing cached data items in a cache with data items from secondary data resources at an undesirably high frequency. Thrashing occurs when frequently accessed data items continuously replace each other in the cache, causing frequent cache misses and thereby drastically increasing the average data item access time. When thrashing is occurring and an application needs data items, the application is too often delayed by waiting for the retrieval of a needed data item from the secondary data resource. Thrashing thus reduces throughput. On the other hand, when the working set can fit into available dynamic memory, the application does not thrash, and the throughput of the system is improved.
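The effect described above can be illustrated with a small simulation. The following is a minimal sketch, not part of the described mechanism: it assumes a least-recently-used (LRU) replacement policy and a purely cyclic access pattern, neither of which is specified in the text, and the function name `lru_miss_rate` is hypothetical.

```python
from collections import OrderedDict


def lru_miss_rate(capacity, working_set_size, passes=10):
    """Return the fraction of accesses that miss in an LRU cache.

    Assumption: the application touches every item of its working set
    in the same order, over and over (a cyclic scan).
    """
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    accesses = 0
    for _ in range(passes):
        for item in range(working_set_size):
            accesses += 1
            if item in cache:
                cache.move_to_end(item)  # mark as most recently used
            else:
                misses += 1  # must fetch from the secondary data resource
                if len(cache) >= capacity:
                    cache.popitem(last=False)  # evict least recently used
                cache[item] = True
    return misses / accesses


if __name__ == "__main__":
    # Working set fits: only the first pass misses.
    print(lru_miss_rate(capacity=100, working_set_size=100))
    # Working set is one item too large: every access misses.
    print(lru_miss_rate(capacity=100, working_set_size=101))
```

Under these assumptions, a working set that exceeds the cache by a single item is enough to turn every access into a miss, which is the continuous mutual replacement that characterizes thrashing.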
A scalable cluster is a cluster to which resources may be added to proportionally increase the capacity of the cluster as a whole. A resource may be, for example, a computer, a CPU, a storage device, interconnect bandwidth, or dynamic memory per computer. Typically, capacity is measured according to some measure of business growth, such as the number of users, web page hits per second, transactions per period of time, and/or terabytes of data.
As the need for capacity grows in a cluster, usually the need for CPUs grows, and sometimes the working set grows. To accommodate the need for additional CPUs, more nodes may be added to the cluster. However, accommodating an increased working set requires more than just adding computers to the cluster. Rather, the dynamic memory of all the nodes in the cluster must be increased because each node needs enough dynamic memory to store the working set. Each node needs enough dynamic memory to store the entire working set because a large portion of the frequently accessed pages are typically duplicated in the cache of each node.
For any computer, the amount of dynamic memory that may be added is limited. Even before the limit is reached, the total amount of dynamic memory required by the application is equal to its working set multiplied by the number of nodes in the cluster. As the cluster grows, if the working set grows as well, then the price-performance decreases with cluster size, and so the cluster does not scale. If the working set is larger than the limit, then the application thrashes. Adding nodes to the cluster may increase the CPU capacity of the cluster, but will not eliminate the thrashing. Consequently, the throughput of the cluster suffers.
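The arithmetic behind this scaling problem can be sketched as follows. The function name, parameter names, and the example figures are hypothetical and chosen only for illustration; the sketch assumes, as the text does, that each node caches the entire working set.

```python
def cluster_memory_demand(working_set_gb, nodes, per_node_limit_gb):
    """Return (total dynamic memory required, whether the application thrashes).

    Assumption: every node holds a full copy of the working set, so the
    total requirement is the working set multiplied by the node count.
    """
    total_gb = working_set_gb * nodes
    # Thrashing depends only on the per-node limit, not on the node count.
    thrashes = working_set_gb > per_node_limit_gb
    return total_gb, thrashes


if __name__ == "__main__":
    print(cluster_memory_demand(64, 8, 128))    # fits on each node
    print(cluster_memory_demand(200, 8, 128))   # exceeds the per-node limit
    print(cluster_memory_demand(200, 16, 128))  # more nodes, still thrashing
```

Doubling the node count in the last call doubles the total memory required but leaves the thrashing condition unchanged, because that condition depends only on the working set and the per-node limit.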
Based on the foregoing, it is desirable to provide a mechanism to reduce thrashing on a cluster that does not require adding dynamic memory to each node in the cluster.
A mechanism is described for managing the caches on nodes in a cluster. The caches are globally managed so that a data item may be retained in any cache on the nodes. This may be accomplished by, for example, a replacement policy for replacing data items stored in the buffers of the caches, where a buffer is selected for replacement in a manner that accounts for factors that include the state of the caches of other nodes. Some cached data items are designated as globally shared, and assigned (either statically or dynamically) a primary cache. For example, if a buffer holds a copy of a data item whose primary cache is on another node, then the data item in the buffer is favored for replacement over a local data item or a global data item for which the local cache is the primary cache. According to another aspect of the invention, the cache retention values of buffers on different nodes are compared, and the buffer with the lowest cache retention value is selected for replacement. According to yet another aspect of the present invention, the replacement policy accounts for the configuration of other caches in the cluster.
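One way the primary-cache preference might look on a single node is sketched below. This is an illustrative reading of the summary, not the claimed implementation: the `Buffer` fields, the numeric `retention` score, and the decision to treat "favored for replacement" as a strict priority (with retention value as the tie-breaker) are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Buffer:
    item: str
    # Node holding the primary cache for a globally shared item;
    # None marks a local (not globally shared) item. Hypothetical field.
    primary_node: Optional[int]
    # Hypothetical cache retention value: higher means more worth keeping.
    retention: float


def choose_victim(buffers: List[Buffer], this_node: int) -> Buffer:
    """Pick the buffer to replace on this_node.

    Buffers holding a global item whose primary cache is on another node
    are replaced first; within each group the lowest retention value loses.
    """
    def sort_key(buf: Buffer):
        primary_elsewhere = (
            buf.primary_node is not None and buf.primary_node != this_node
        )
        return (0 if primary_elsewhere else 1, buf.retention)

    return min(buffers, key=sort_key)


if __name__ == "__main__":
    cache = [
        Buffer("a", None, 1.0),  # local item
        Buffer("b", 2, 9.0),     # global item, primary cache on node 2
        Buffer("c", 1, 0.5),     # global item, primary cache on node 1
    ]
    print(choose_victim(cache, this_node=1).item)
    print(choose_victim(cache, this_node=2).item)
```

In this sketch node 1 gives up its copy of "b" even though that buffer has the highest retention value, because the item is expected to remain available in its primary cache on node 2. A softer weighting between the two factors would be an equally plausible reading of "favored".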