Distributed computing systems have found application in a number of different computing environments, particularly those requiring high performance and/or high availability and fault tolerance. In a distributed computing system, multiple computers connected by a network are permitted to communicate and/or share workload. Distributed computing systems support practically all types of computing models, including peer-to-peer and client-server computing.
One particular type of distributed computing system is referred to as a clustered computing system. “Clustering” generally refers to a computer system organization where multiple computers, or nodes, are networked together to cooperatively perform computer tasks. An important aspect of a computer cluster is that all of the nodes in the cluster present a single system image—that is, from the perspective of a client or user, the nodes in a cluster appear collectively as a single computer, or entity. In a client-server computing model, for example, the nodes of a cluster collectively appear as a single server to any clients that attempt to access the cluster.
Clustering is often used in relatively large multi-user computing systems where high performance and reliability are of concern. For example, clustering may be used to provide redundancy, or fault tolerance, so that, should any node in a cluster fail, the operations previously performed by that node will be handled by other nodes in the cluster. Clustering is also used to increase overall performance, since multiple nodes can often handle a larger number of tasks in parallel than a single computer otherwise could. Often, load balancing can also be used to ensure that tasks are distributed fairly among nodes, which prevents individual nodes from becoming overloaded and thereby maximizes overall system performance. One specific application of clustering, for example, is in providing multi-user access to a shared resource such as a database or a storage device, since multiple nodes can handle a comparatively large number of user access requests, and since the shared resource is typically still available to users even upon the failure of any given node in the cluster.
In many clustered computer systems, the services offered by such systems are implemented as managed resources. Some services, for example, may be singleton services, which are handled at any given time by one particular node, with automatic failover used to move a service to another node whenever the node currently hosting the service encounters a problem. Other services, often referred to as distributed services, enable multiple nodes to provide a service, e.g., to handle requests for a particular type of service from multiple clients.
Clustered computer systems often establish groups, or clusters, of nodes to provide specific services. The nodes that participate in a group or cluster, or alternatively, the specific instances of the associated service running on such nodes, are typically referred to as “members” of the group or cluster. In some groups or clusters, one member is designated a primary member, with the other members designated as backup or replica members, which are each capable of assuming the responsibility of a primary member whenever a failure occurs in that primary member.
One service that may be supported by a clustered computer system is a cache framework. A cache framework, which is a type of cache, generally provides for rapid access to frequently used data by retaining the data in a volatile memory such as the main storage of one or more computers in a clustered computer system. A cache framework may be object-based and transactional in nature, and often supports very large cache sizes. A cache framework can greatly improve performance in a clustered computer system by reducing the access time for frequently accessed data requested by clients of the cache framework (e.g., other services and applications running in the clustered computer system and/or clients of the clustered computer system).
As with any cache, the performance gains provided by a cache framework are realized when requested data is already stored in the cache framework, such that the overhead associated with retrieving the data from a slower, non-volatile memory such as an external, backend data repository is avoided. Some cache frameworks rely on a technique known as “lazy loading,” where data is not loaded into a cache framework until the first time the data is requested. With lazy loading, the initial request incurs the overhead of loading the data into the cache framework, and as a result, the benefits of the cache framework are not realized until the second and subsequent requests are made for the data. In addition, in some applications that are subject to uneven workloads, e.g., applications that receive a proportionately large number of client requests at particular times of the day, lazy loading can place a severe burden on a cache framework as well as any backend data repository, leading to poor response times and decreased performance.
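By way of illustration only, lazy loading may be sketched as follows. The sketch is a minimal, single-process example and is not drawn from any particular cache framework; the names LazyCache, slow_backend, and backend_reads are hypothetical, and the backend data repository is simulated by a simple function.

```python
from typing import Callable, Dict


class LazyCache:
    """Loads a value from the backend only the first time its key is requested."""

    def __init__(self, load_from_backend: Callable[[str], str]) -> None:
        self._load = load_from_backend
        self._entries: Dict[str, str] = {}
        self.backend_reads = 0  # counts trips to the (slow) backend

    def get(self, key: str) -> str:
        if key not in self._entries:
            # Cache miss: the first request pays the cost of the backend read.
            self.backend_reads += 1
            self._entries[key] = self._load(key)
        # Cache hit (or freshly loaded entry): served from memory.
        return self._entries[key]


def slow_backend(key: str) -> str:
    # Hypothetical stand-in for a lookup in a backend data repository.
    return f"value-for-{key}"


cache = LazyCache(slow_backend)
cache.get("customer:42")  # first request: miss, reads the backend
cache.get("customer:42")  # second request: hit, no backend read
print(cache.backend_reads)  # prints 1
```

In this sketch only the first request for a given key reaches the backend, which mirrors the observation above that the benefit of the cache is not realized until the second and subsequent requests.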
One approach that has been used in a cache framework to address the performance penalties associated with lazy loading is to perform a preload of the cache framework, e.g., during the initialization of a clustered computer system. By preloading frequently used data into a cache framework, the initial performance penalty associated with lazy loading is often avoided, resulting in faster access and better performance, as well as reduced stress on the cache framework and any backend data repositories during periods of heavy workload.
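A preload of this kind may be sketched, under simplifying assumptions, as shown below. The class PreloadedCache, the function slow_backend, and the list of keys are hypothetical and chosen only for illustration; a real cache framework would typically determine which data to preload, and how, in its own manner.

```python
from typing import Callable, Dict, Iterable, List


class PreloadedCache:
    """Cache that can be populated up front, before any client request arrives."""

    def __init__(self, load_from_backend: Callable[[str], str]) -> None:
        self._load = load_from_backend
        self._entries: Dict[str, str] = {}

    def preload(self, keys: Iterable[str]) -> None:
        # Fetch every listed key from the backend, e.g., during initialization.
        for key in keys:
            self._entries[key] = self._load(key)

    def get(self, key: str) -> str:
        # Keys that were not preloaded still fall back to the backend.
        if key not in self._entries:
            self._entries[key] = self._load(key)
        return self._entries[key]


backend_reads: List[str] = []  # records each key read from the backend


def slow_backend(key: str) -> str:
    # Hypothetical stand-in for a lookup in a backend data repository.
    backend_reads.append(key)
    return f"value-for-{key}"


cache = PreloadedCache(slow_backend)
cache.preload(["customer:1", "customer:2", "customer:3"])
reads_during_preload = len(backend_reads)

cache.get("customer:2")  # first client request is already a cache hit
reads_after_first_request = len(backend_reads) - reads_during_preload
print(reads_during_preload, reads_after_first_request)  # prints 3 0
```

All of the backend reads in this sketch occur during the preload, so the first client request for a preloaded key incurs no backend access.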
Preloading a cache framework, however, does require substantial system resources, given that all of the data preloaded into a cache framework typically must be retrieved from a comparatively slow backend data repository. For example, in some systems it may take on the order of several hours to preload a large cache (e.g., with one or more gigabytes of data).
The preloading of a cache framework is typically managed by a primary member in the group or cluster that hosts a cache framework. In some instances, however, a primary member may encounter a failure during a cache framework preload operation, either as a result of a failure in the primary member or another failure in the node upon which the primary member resides, which may necessitate that a failover be performed to a backup or replica member.
In conventional cache frameworks, the failover to a backup or replica member prior to completion of a cache framework preload operation requires that the preload operation be restarted from the beginning. Given the substantial resources that are typically involved in preloading a cache framework, however, a considerable performance penalty can be incurred as a result of having to restart a cache framework preload operation, particularly if the failure in a primary member occurs after a substantial portion of the cache framework preload operation has already been completed. For example, should a primary member fail just before completing a preload operation, the total time required to complete the preload operation using a replica member would be roughly double the time that would be required if no failure had occurred.
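The "roughly double" figure can be checked with a short calculation. The sketch below assumes, purely for illustration, that every item costs the same to load and that the failover itself takes no time; the function name loads_with_restart and the item counts are hypothetical.

```python
def loads_with_restart(total_items: int, loaded_before_failure: int) -> int:
    """Total backend loads when a replica restarts the preload from the beginning.

    The work done by the failed primary is discarded, so the replica loads
    all total_items again on top of what the primary had already loaded.
    """
    if not 0 <= loaded_before_failure <= total_items:
        raise ValueError("loaded_before_failure must be between 0 and total_items")
    return loaded_before_failure + total_items


total_items = 1000

# Primary fails after loading 999 of 1000 items; the replica starts over.
worst_case = loads_with_restart(total_items, 999)
print(worst_case)                # prints 1999
print(worst_case / total_items)  # prints 1.999
```

Under these assumptions, a failure near the end of the preload results in nearly twice the work of an uninterrupted preload, whereas a failure near the start adds comparatively little.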
Therefore, a significant need exists in the art for a manner of reducing the overhead associated with cache preload operations, and particularly for cache preload operations that are interrupted prior to completion as a result of a failure.