1. Field of the Invention
The present invention relates to a compute resource management system and more specifically to a system and method of managing and monitoring resources within a compute environment such as a cluster and/or grid environment.
2. Introduction
Managers of compute environments such as clusters or grids desire maximum return on investment, which often means high system utilization and the ability to deliver various qualities of service to various users and groups. A cluster is typically defined as a parallel computer that is constructed of commodity components and runs commodity software as its system software. A cluster contains nodes, each containing one or more processors, memory that is shared by all of the processors in the respective node, and additional peripheral devices such as storage disks. The nodes are connected by a network that allows data to move between them. A cluster is one example of a compute environment. Other examples include a grid, which is loosely defined as a group of clusters, and a computer farm, which is another organization of computers for processing.
General background information on clusters and grids may be found in several publications. See, e.g., Grid Resource Management, State of the Art and Future Trends, Jarek Nabrzyski, Jennifer M. Schopf, and Jan Weglarz, Kluwer Academic Publishers, 2004; and Beowulf Cluster Computing with Linux, edited by William Gropp, Ewing Lusk, and Thomas Sterling, Massachusetts Institute of Technology, 2003.
It is generally understood herein that the terms grid and cluster are interchangeable in that there is no specific definition of either. In general, a grid will comprise a plurality of clusters, as will be shown in FIG. 1A. Several general challenges exist when attempting to maximize the use of resources in a grid. First, there are typically multiple layers of grid and cluster schedulers. A grid 100 generally comprises a group of clusters or a group of networked computers. The definition of a grid is very flexible and may mean a number of different configurations of computers. The introduction here is meant to be general given the variety of configurations that are possible. A grid scheduler 102 communicates with a plurality of cluster schedulers 104A, 104B and 104C. Each of these cluster schedulers communicates with a respective resource manager 106A, 106B or 106C. Each resource manager communicates with a respective series of compute resources shown as nodes 108A, 108B, 108C in cluster 110, nodes 108D, 108E, 108F in cluster 112 and nodes 108G, 108H, 108I in cluster 114.
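The layering described above can be pictured with a minimal sketch. The class names, fields, and the all_nodes helper below are hypothetical and chosen only for illustration; the labels are the reference numerals used in the text, and nothing here is meant to represent an actual scheduler implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResourceManager:
    """Hypothetical stand-in for a resource manager (106A-106C)."""
    label: str
    nodes: List[str]  # node labels, e.g. "108A"


@dataclass
class ClusterScheduler:
    """Hypothetical stand-in for a cluster scheduler (104A-104C)."""
    label: str
    cluster: str  # cluster label, e.g. "110"
    resource_manager: ResourceManager


@dataclass
class GridScheduler:
    """Hypothetical stand-in for the grid scheduler (102)."""
    label: str
    cluster_schedulers: List[ClusterScheduler] = field(default_factory=list)

    def all_nodes(self) -> List[str]:
        # The grid scheduler reaches nodes only through each cluster
        # scheduler and its resource manager, never directly.
        return [
            node
            for scheduler in self.cluster_schedulers
            for node in scheduler.resource_manager.nodes
        ]


grid = GridScheduler(
    "102",
    [
        ClusterScheduler("104A", "110", ResourceManager("106A", ["108A", "108B", "108C"])),
        ClusterScheduler("104B", "112", ResourceManager("106B", ["108D", "108E", "108F"])),
        ClusterScheduler("104C", "114", ResourceManager("106C", ["108G", "108H", "108I"])),
    ],
)
```

In this sketch, every path from the grid scheduler to a node passes through two intermediate layers, which is the layering the text identifies as a challenge.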
Local schedulers (which may refer to either the cluster schedulers 104 or the resource managers 106) are closer to the specific resources 108 and may not allow grid schedulers 102 direct access to the resources. Examples of compute resources include data storage devices such as hard drives and computer processors. The grid level scheduler 102 typically does not own or control the actual resources. Therefore, jobs are submitted from the high level grid scheduler 102 to a local set of resources with no more permissions than the user would have. This reduces efficiency and can render the reservation process more difficult.
The managers of such clusters need to understand how the available resources are being delivered to the various users over time, and administrators need the ability to tune ‘cycle delivery’ to satisfy the current site mission objectives.
How well a scheduler succeeds can only be determined if various metrics are established and a means to measure these metrics is available. While statistics are important, their value is limited unless optimal statistical values are also known for the current environment, including workload, resources, and policies. If one could determine that a site's typical workload obtained an average queue time of 3 hours on a particular system, this would be a good statistic. However, if one knew that through proper tuning, the system could deliver an average queue time of 1.2 hours with minimal negative side effects, this would be valuable knowledge. Viewing and getting access to the multitude of statistics that are available in the management of a compute environment can be daunting. Accordingly, what is needed in the art is a way to improve an administrator's ability to tune jobs and reservations and to perform other management of a compute environment.
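The queue-time comparison described above can be illustrated with a minimal sketch. The JobRecord type, the average_queue_time function, and the sample job times are hypothetical and invented for illustration. Only the 3 hour and 1.2 hour figures come from the text, and the 1.2 hour value is simply assumed to be known rather than derived.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class JobRecord:
    """Hypothetical job record; times are hours since an arbitrary origin."""
    submit_hour: float
    start_hour: float


def average_queue_time(jobs: Sequence[JobRecord]) -> float:
    """Mean time, in hours, that jobs waited between submission and start."""
    if not jobs:
        raise ValueError("at least one job record is required")
    return sum(job.start_hour - job.submit_hour for job in jobs) / len(jobs)


# Made-up sample workload whose waits are 2, 3 and 4 hours.
sample_jobs = [
    JobRecord(submit_hour=0.0, start_hour=2.0),
    JobRecord(submit_hour=1.0, start_hour=4.0),
    JobRecord(submit_hour=2.0, start_hour=6.0),
]

observed_hours = average_queue_time(sample_jobs)

# Assumed achievable average after tuning, taken from the example in the text.
achievable_hours = 1.2

# The gap between the observed and achievable values is what gives the
# observed statistic its meaning for an administrator.
tuning_gap_hours = observed_hours - achievable_hours
```

The observed average alone says little. The difference between it and the achievable value is what indicates how much room for tuning remains.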