This invention relates to autonomic computing, and more particularly to a method for clustering processors and assigning service points in a system for efficient, fault-tolerant, self-configuring and self-healing operation.
Autonomic computing, which generally refers to the design of multiprocessor computing systems which are self-monitoring, self-configuring, fault-tolerant and self-healing, is a topic of considerable theoretical and practical interest. One important consideration in building a successful autonomic computing system is to embed the fault tolerance of the system within itself to enhance its self-healing mechanism. The self-healing mechanism requires that, in case of a fault, the system immediately detect the nature of the fault and try to correct it. In a case where the system cannot correct the fault, it minimizes the ensuing performance degradation by assigning the task of the faulty processor to one or more other processors. In a typical computer architecture, this task of fault detection and management is done either by one of the processors or by a master processor.
The processors comprising an autonomic computing system may be distributed over a geographically large area. Furthermore, the processors may be of many different types, running many different types of operating systems, and connected by a distributed network. The various processors are often geographically arranged in clusters. Such an arrangement does not permit having one master processor managing the fault tolerance of the entire system. It is therefore advantageous to have some of the processors do the fault management. These processors will be referred to herein as service points.
A typical system utilizing a service point is shown in FIG. 1A. The system 1 includes several interconnected processors 2, with one of those processors assigned to be the service point 10. Generally, the service point is chosen to be the processor having the smallest distance to the other processors. The term “distance” is used herein as a measure of the communication time required between processors. The service point has several tasks in addition to its own regular computing load: (1) detecting a faulty processor elsewhere in the system; (2) replacing a faulty processor by reassigning that processor's tasks to other processors; (3) monitoring the tasks being performed by the other processors; and (4) balancing the load on the system to ensure optimum performance. FIG. 1B illustrates a situation where a processor 3 has had a fault detected by the service point 10, and has been removed from the system; the remainder of the system continues to operate.
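The selection rule described above can be illustrated with a minimal sketch. It assumes that "smallest distance to the other processors" means the smallest total communication time to all other processors; the text does not say whether a sum, a maximum, or some other aggregate is intended. The function name `choose_service_point` and the distance-matrix representation are hypothetical and are used only for illustration.

```python
def choose_service_point(dist):
    """Return the index of the processor with the smallest total distance
    to all other processors.

    dist[i][j] is the communication time between processors i and j.
    Using the sum as the aggregate is an assumption of this sketch.
    """
    n = len(dist)
    # Total distance from each candidate processor to every other processor.
    totals = [sum(dist[i][j] for j in range(n) if j != i) for i in range(n)]
    # Ties are broken in favor of the lowest index.
    return min(range(n), key=totals.__getitem__)


# Hypothetical three-processor system: processor 1 is closest to the others.
example = [
    [0, 1, 4],
    [1, 0, 2],
    [4, 2, 0],
]
service_point = choose_service_point(example)
```

In this example the totals are 5, 3 and 6, so processor 1 would be chosen as the service point.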
Though fault tolerance using redundant computation has been used for some time, the self-healing and self-configuring features of current autonomic computation systems raise several new concerns, for example:
(1) The self-configuring and self-adjusting features of an autonomic system work much better when all of the processors (including those distantly located) are uniform and interchangeable. This means that the service points should not be special processors, but rather should be chosen from the same set of processors, taking on an extra load.
(2) Usually, in parallel and scalable computer architectures the number of service points is fixed and cannot be specified as a fraction of the number of active processors. However, having too few service points causes the self-healing mechanism to be too slow; having too many service points degrades the overall performance of the system.
(3) Since the autonomic computing system works in a dynamic environment, it is important to adjust the clustering and service point assignment dynamically in order to optimize system performance. It should be noted that in an on-demand computing environment, the total number of processors (and thus the composition of clusters and assignment of service points) is constantly changing in response to the computing load.
In the self-configuring environment of an autonomic computing system, it generally is not possible to preassign the service points. Therefore, depending on the requirements of the situation any current processor can be dynamically assigned to be a service point. On the other hand, creating too many service points leads to a large computational load on the system. It is desirable, therefore, to keep the number of service points limited to a certain fraction of the working processors.
The current problem is, therefore: given a set of processors in a distributed and dynamic environment, and a fraction representing the maximum permitted ratio of service points to working processors, determine the service points and the processors each service point would service.
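The inputs and outputs of this problem can be made concrete with a short sketch. It is illustrative only: the names `max_service_points` and `assign_to_service_points` are hypothetical, rounding the bound down is an assumption, and so is the rule that each processor is serviced by its nearest service point. The text states the problem but does not fix these details.

```python
import math
from fractions import Fraction


def max_service_points(num_processors, fraction):
    """Upper bound on the number of service points.

    Assumptions of this sketch: the bound is rounded down, and at least
    one service point is always allowed. Fraction(str(...)) avoids
    floating-point surprises such as 0.29 * 100 evaluating below 29.
    """
    bound = math.floor(Fraction(str(fraction)) * num_processors)
    return max(1, bound)


def assign_to_service_points(dist, service_points):
    """Map each service point to the list of processors it services.

    Each processor goes to its nearest service point (an assumption).
    dist[p][s] is the communication time between processors p and s.
    """
    clusters = {s: [] for s in service_points}
    for p in range(len(dist)):
        nearest = min(service_points, key=lambda s: dist[p][s])
        clusters[nearest].append(p)
    return clusters


# Hypothetical four-processor system forming two natural pairs.
example = [
    [0, 1, 5, 6],
    [1, 0, 4, 5],
    [5, 4, 0, 1],
    [6, 5, 1, 0],
]
clusters = assign_to_service_points(example, [0, 3])
```

With service points 0 and 3, processors 0 and 1 fall into the first cluster and processors 2 and 3 into the second. The sketch takes the service points as given; choosing them well is the hard part of the problem.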
The idea of clustering has been successfully applied to many other fields. However, in those application areas the number of clusters cannot be specified a priori. In the present problem, it is necessary to put an upper bound on the number of clusters so that the overhead for extra service points is always bounded. The problem of clustering with a fixed limit is generally known to be intractable: that is, no efficient algorithm for finding an optimal solution is known. There is still a need, however, for a solution that is efficient though suboptimal. More particularly, there is a need for an efficient procedure for dynamically assigning the various processors in a system to clusters, and for assigning service points within each cluster, to ensure optimum performance (including self-configuring and self-healing) of the system.
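To illustrate what an efficient though suboptimal solution can look like, the sketch below uses farthest-first traversal, a well-known greedy heuristic for bounded clustering. It is not the procedure of this document and is shown only as a generic example. The function name `greedy_service_points`, the choice of processor 0 as the starting point, and the distance-matrix input are assumptions of the sketch.

```python
def greedy_service_points(dist, k):
    """Pick at most k service points by farthest-first traversal.

    dist[i][j] is the communication time between processors i and j.
    Starting from processor 0 is an arbitrary choice made for this sketch.
    """
    n = len(dist)
    centers = [0]
    # Distance from each processor to its closest chosen service point.
    nearest = list(dist[0])
    while len(centers) < min(k, n):
        # The processor currently worst served becomes the next service point.
        far = max(range(n), key=nearest.__getitem__)
        if nearest[far] == 0:
            break  # every processor already coincides with a service point
        centers.append(far)
        nearest = [min(nearest[p], dist[far][p]) for p in range(n)]
    return centers


# Hypothetical four-processor system forming two natural pairs.
example = [
    [0, 1, 5, 6],
    [1, 0, 4, 5],
    [5, 4, 0, 1],
    [6, 5, 1, 0],
]
chosen = greedy_service_points(example, 2)
```

For this example the heuristic selects processors 0 and 3. Each round costs time proportional to the number of processors, so the procedure stays cheap as the system grows, but the result is generally not optimal.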