Clustering is a common technique for improving the availability of services, such as web hosting services and database services, provided by a computerized system. Clustering refers to using more than one computing node, or computing element, such as multiple server computing devices, in a cooperative manner to provide one or more desired services. For example, a simple clustered system may have two computing elements or nodes. If one of these computing nodes fails, then ideally the other computing node is able to take over, so that the system can continue to provide its services despite the failure.
Clustered systems can be relatively complex. There may be two, three, four, or more computing nodes or elements within such a system. The software run by the system may be divided among the computing elements of the system in a fairly particular manner. For example, a given system may host a number of web-related services and a number of database-related services. A node of the system may be allowed to run only a particular subset of the web-related services and a particular subset of the database-related services, so that no node, or server, is overloaded and the performance of these services is maintained.
As clustered systems become more complex, ensuring that such systems recover gracefully from failures becomes both more important and more difficult. Failures within clustered systems may include hardware failures, such as the hardware of one or more of the nodes or elements failing, as well as software failures, such as the software of one or more of the computing elements of the cluster failing. Designers of clustered systems typically provide recovery rules, or policies, which instruct a clustered system how to recover from failures. For example, if a given computing element fails, then the software running on that computing element may be moved to other computing elements within the system. As another example, if the software on a given computing element fails in a way that causes it to consume too many of that computing element's resources, the other software running on that computing element may be moved to other computing elements within the system, so that the performance of this other software is not impeded.
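The two example recovery rules above can be sketched in code. This is only an illustrative sketch, not the method of any particular clustered system: the `Cluster` class, the function names, and the "move to the least-loaded healthy node" placement choice are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    # Hypothetical model: maps each node name to the services it runs.
    placement: dict[str, set[str]] = field(default_factory=dict)
    failed_nodes: set[str] = field(default_factory=set)

    def healthy_nodes(self) -> list[str]:
        return sorted(n for n in self.placement if n not in self.failed_nodes)


def least_loaded(cluster: Cluster, exclude: str) -> str:
    """Pick the healthy node (other than `exclude`) running the fewest services."""
    candidates = [n for n in cluster.healthy_nodes() if n != exclude]
    if not candidates:
        raise RuntimeError("no healthy node left to receive services")
    # Ties are broken by node name so the result is deterministic.
    return min(candidates, key=lambda n: (len(cluster.placement[n]), n))


def on_node_failure(cluster: Cluster, node: str) -> None:
    """Rule 1: when a node fails, move its software to other nodes."""
    cluster.failed_nodes.add(node)
    for service in sorted(cluster.placement[node]):
        target = least_loaded(cluster, exclude=node)
        cluster.placement[target].add(service)
    cluster.placement[node].clear()


def on_runaway_service(cluster: Cluster, node: str, runaway: str) -> None:
    """Rule 2: when one service hogs a node, move the OTHER services away."""
    for service in sorted(cluster.placement[node] - {runaway}):
        target = least_loaded(cluster, exclude=node)
        cluster.placement[target].add(service)
        cluster.placement[node].discard(service)
```

Even in this small sketch, each rule embeds a placement decision that can interact with earlier failures, for instance when the chosen target node has itself already failed or is already heavily loaded.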
For a simple clustered system having just two nodes, elements, or servers, and a small number of software services running on these nodes, it is relatively easy to construct a set of recovery rules dictating what is to occur for most if not all combinations of different failures that may afflict the system. Furthermore, because a simple clustered system has a relatively small number of things that can go wrong, testing these recovery rules is also a fairly straightforward process. For instance, all of the possible failures can be forced within an actual instance of the clustered system, to verify that the system recovers in the desired manner.
However, for complex clustered systems, it may be difficult to construct a set of recovery rules that allows a system to properly recover from every possible combination of failures. This is because the designer of such a complex clustered system has to envision all the different combinations of failures that are likely to occur, and then fashion the recovery rules accordingly. Furthermore, actual testing of all the different combinations of failures is time- and cost-prohibitive: it can be difficult if not impossible for the designer to force an actual instance of a clustered system to fail in all these different ways.
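To give a rough sense of why exhaustive testing becomes impractical, the following sketch counts failure combinations under a deliberately simplified assumption: each node and each service is either failed or healthy, and the order and timing of failures are ignored. The function names and the example cluster sizes are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def failure_combinations(components):
    """Yield every non-empty set of components that could fail together."""
    for size in range(1, len(components) + 1):
        yield from combinations(components, size)


def count_failure_combinations(num_nodes, num_services):
    """Count the non-empty failure sets without enumerating them."""
    # Each component is either failed or healthy, giving 2**k states;
    # subtract the single state in which nothing has failed.
    return 2 ** (num_nodes + num_services) - 1


# A two-node system with one service has only a handful of cases to force.
small = list(failure_combinations(["node1", "node2", "service1"]))
print(len(small))

# A hypothetical eight-node system with 24 services has billions.
print(count_failure_combinations(8, 24))
```

Under these assumptions the count doubles with every added node or service, and accounting for the order in which failures occur would increase it further.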
Therefore, a designer of a clustered system typically tests only some of the possible failures, by actually failing the clustered system in a limited number of ways. Once it has been shown that the clustered system properly recovers from these failures, testing is finished. However, because such testing is not exhaustive, failures may still occur in the clustered system that the designer did not foresee, and for which his or her recovery rules do not provide proper recovery. This possibility effectively limits the reliability of clustered systems.
For these and other reasons, therefore, there is a need for the present invention.