1. Field of the Invention
The present invention relates to clusters and, more specifically, to a system and method of providing object messages within the context of managing resources within a compute environment.
2. Introduction
The present invention applies to computer clusters and computer grids. A computer cluster may be defined as a parallel computer that is constructed of commodity components and runs commodity software. FIG. 1 illustrates in a general way an example relationship between clusters and grids. A cluster 110 is made up of a plurality of nodes 108A, 108B, 108C, each containing computer processors, memory that is shared by the processors in the node, and other peripheral devices such as storage discs, connected by a network. A resource manager 106A for the cluster 110 manages jobs submitted by users to be processed by the cluster. Other resource managers 106B, 106C are also illustrated that may manage other clusters (not shown). An example job would be a compute-intensive weather forecast analysis that requires a cluster of computers to be scheduled to process the job in time for the evening news report.
A cluster scheduler 104A may receive job submissions and identify, using information from the resource managers 106A, 106B, 106C, which cluster has available resources. The job is then submitted to that resource manager for processing. Other cluster schedulers 104B and 104C are shown by way of illustration. A grid scheduler 102 may also receive job submissions and identify, based on information from a plurality of cluster schedulers 104A, 104B, 104C, which clusters may have available resources, and then submit the job accordingly.
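The layered submission path described above can be sketched in code. The class names, method names, and the "most available nodes" selection rule below are illustrative assumptions for the sketch only; they are not part of the invention or of any particular scheduler product.

```python
# Illustrative sketch (hypothetical names) of the layered hierarchy of
# FIG. 1: a grid scheduler queries cluster schedulers, which in turn
# query their resource managers for available resources.

class ResourceManager:
    def __init__(self, name, free_nodes):
        self.name = name
        self.free_nodes = free_nodes          # nodes currently available

    def available(self):
        return self.free_nodes

    def start_job(self, job, nodes_needed):
        self.free_nodes -= nodes_needed       # reserve the nodes
        return f"{job} started via {self.name}"

class ClusterScheduler:
    def __init__(self, name, resource_manager):
        self.name = name
        self.rm = resource_manager

    def available(self):
        return self.rm.available()

    def submit(self, job, nodes_needed):
        return self.rm.start_job(job, nodes_needed)

class GridScheduler:
    def __init__(self, cluster_schedulers):
        self.schedulers = cluster_schedulers

    def submit(self, job, nodes_needed):
        # Pick the cluster scheduler reporting the most available resources
        # (one possible policy, assumed here for illustration).
        best = max(self.schedulers, key=lambda s: s.available())
        return best.submit(job, nodes_needed)

grid = GridScheduler([
    ClusterScheduler("104A", ResourceManager("106A", free_nodes=2)),
    ClusterScheduler("104B", ResourceManager("106B", free_nodes=8)),
])
print(grid.submit("weather-forecast", nodes_needed=4))
# → weather-forecast started via 106B
```

The sketch shows only the downward flow of a job; as discussed below, it is the upward flow of failure information through these same layers that is deficient in the prior art.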
Several books provide background information on how to organize and create a cluster or a grid and related technologies. See, e.g., Grid Resource Management, State of the Art and Future Trends, Jarek Nabrzyski, Jennifer M. Schopf, and Jan Weglarz, Kluwer Academic Publishers, 2004; and Beowulf Cluster Computing with Linux, edited by William Gropp, Ewing Lusk, and Thomas Sterling, Massachusetts Institute of Technology, 2003.
There is a problem in the environment described in FIG. 1. When objects are passed down through the multiple layers from the grid level to the local level, and it comes time to actually troubleshoot or diagnose issues or failures, one can only diagnose failures that occur within any given layer. In the logs or through the interface of any given layer, one can only see issues that have occurred at that layer. In a grid scheduling environment or cluster scheduling environment in which objects, or children of objects, pass down through multiple layers, it becomes very difficult to track what an object is doing without seeing the associated messages. The prior art model therefore requires an administrator to look at a certain level. When a failure is detected, the administrator checks to see if that failure is local to that layer. If not, the administrator goes down to the layer below to look at the object and its correlated features and determines whether the failure is relevant to that layer. If not, the administrator continues going down layers until a root cause of the failure is identified, and then works his or her way back up the layers.
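The manual, layer-by-layer diagnosis procedure just described can be expressed as a short sketch. The log contents and the per-layer "is this a local cause" flag are invented purely for illustration; in the prior art this judgment is made by the administrator reading each layer's log.

```python
# Hypothetical sketch of prior-art diagnosis: because each layer logs only
# its own view of a failure, finding the root cause means walking down
# layer by layer until some log entry actually explains the failure.

layer_logs = {                      # illustrative per-layer log contents
    "grid":    [("job 42 failed for unknown reason", False)],
    "cluster": [("job 42 could not start", False)],
    "rm":      [("node 108B rejected job 42", False)],
    "node":    [("job 42: out of memory", True)],
}

def diagnose(job_id):
    """Mirror the manual procedure: inspect each layer top-down and stop
    at the first entry that is a cause local to that layer."""
    for layer in ("grid", "cluster", "rm", "node"):
        for message, is_local_cause in layer_logs[layer]:
            if str(job_id) in message and is_local_cause:
                return layer, message            # root cause found here
    return None, None                            # no layer explains it

print(diagnose(42))   # → ('node', 'job 42: out of memory')
```

Only at the bottom layer does a meaningful cause appear; every layer above records merely that something beneath it failed.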
The issue of failure detection and reporting becomes more pronounced in environments where there are many-to-one relationships between layers as the administrator cascades down them. For instance, a grid scheduler 102 may be communicating with multiple cluster schedulers 104A, 104B and 104C and have a single job that spans multiple cluster schedulers. The cluster schedulers may have a many-to-one relationship between themselves and resource managers 106A, 106B and 106C, causing a single job at the cluster level to be mapped onto multiple resource managers. The resource managers in turn map out to multiple nodes 108A, 108B and 108C, and therefore tasks associated with each resource manager may be scattered across multiple nodes. In addition, direct startup failures can occur on each one of these compute nodes.
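The many-to-one fan-out just described multiplies the number of places an administrator may have to inspect. A minimal sketch, with an invented topology (the mapping below does not come from FIG. 1 and is purely illustrative):

```python
# Rough sketch of the many-to-one fan-out: a single grid job may span
# several cluster schedulers, each mapped onto several resource managers,
# each scattering tasks across several nodes. Hypothetical topology.

topology = {
    "104A": {"106A": ["108A", "108B"]},
    "104B": {"106B": ["108C", "108D"], "106C": ["108E"]},
}

def failure_points(job_clusters):
    """Every scheduler, resource manager, and node an administrator
    might have to inspect when a job spanning these clusters fails."""
    points = []
    for cluster in job_clusters:
        points.append(cluster)
        for rm, nodes in topology[cluster].items():
            points.append(rm)
            points.extend(nodes)
    return points

print(len(failure_points(["104A", "104B"])))   # → 10
```

Even in this tiny example, one spanning job touches ten distinct components, each with its own log; the count grows multiplicatively with the fan-out at each layer.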
When a failure does occur, the system will write information to logs and perform a general failure response routine. FIG. 1 illustrates the logs 112, 114, 116 and 118, each log being related to an individual layer of the cluster or grid.
The problem with this arrangement is that there is a lack of communication between layers in a cluster/grid system. Where a source of failure exists on a node 108A, 108B or 108C, for example an operating system level failure, the reporting and handling of how the cluster or group of clusters should react to that failure is incomplete and deficient. The upper level layers of resource managers, cluster schedulers and grid schedulers cannot receive the information regarding the source of the failure such that they can respond by rescheduling or modifying the cluster environment.
An example can further illustrate the problem. From a job list 120, a user submits a job for processing at either the grid scheduler level or the cluster scheduler level. The grid scheduler 102 communicates with the cluster schedulers 104A, 104B and 104C, and the cluster schedulers command the resource managers 106A, 106B and 106C to start the submitted job. For example, resource manager 106A attempts to start the job on a number of nodes 108A and 108B. Suppose node 108B has an operating system failure wherein it is out of memory. In that case, it will write a detailed message to log 112. The node will then propagate a failure to the resource manager 106A indicating that it cannot start the job. The resource manager writes a message to log 114 that the job cannot start on the node and propagates a message to the cluster scheduler 104A. Cluster scheduler 104A writes a message to log 116 that the job cannot start and informs the grid scheduler 102 that, for some unknown reason, the job cannot start. The grid scheduler writes a message to log 118 regarding the failure of the job to start. The user looking in his or her local queue, however, sees only that the job failed for some unknown reason. What is needed is an improved communication system and method for reporting and handling system failures in a compute environment such as a cluster or a grid.
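The failure path walked through above can be sketched to show exactly where the detail is lost. The function names and log messages are hypothetical; the log reference numerals (112, 114, 116, 118) follow FIG. 1.

```python
# Hypothetical sketch of the prior-art failure path: each layer writes its
# own log entry, but only a vague summary is propagated upward, so the
# user at the top sees none of the detail recorded at the node.

logs = {112: [], 114: [], 116: [], 118: []}    # one log per layer (FIG. 1)

def node_start(job):
    logs[112].append(f"{job}: out of memory")  # detailed local entry
    return "cannot start"                      # vague upward message

def resource_manager_start(job):
    status = node_start(job)
    logs[114].append(f"{job}: node {status}")  # detail already gone
    return "cannot start"

def cluster_scheduler_start(job):
    status = resource_manager_start(job)
    logs[116].append(f"{job}: {status}")
    return "job cannot start, reason unknown"

def grid_submit(job):
    status = cluster_scheduler_start(job)
    logs[118].append(f"{job}: {status}")
    return status                              # what the user sees

print(grid_submit("job-42"))   # → job cannot start, reason unknown
# The "out of memory" detail survives only in log 112 at the node layer.
```

The sketch makes the deficiency concrete: the root cause exists in log 112, but nothing in the prior-art message flow carries it up to logs 114, 116, or 118, or to the user's queue.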