1. Field
The present invention relates to managing a compute environment and more specifically to a system and method of managing energy consumption within a compute environment with respect to applying power-state job preemption principles to improve workload management.
2. Introduction
Managing consumption of resources in a compute environment such as a grid, cluster farm, or on-demand server is a complex and challenging process. Grid computing may be defined as coordinated resource sharing and problem solving in dynamic, multi-institutional collaborations. Many computing projects require much more computational power and resources than a single computer may provide. Networked computers with peripheral resources such as printers, scanners, I/O devices, storage disks, scientific devices and instruments, etc. may need to be coordinated and utilized to complete a task. The term compute resource generally refers to computer processors, memory, network bandwidth, and any of these peripheral resources as well. A compute farm may comprise a plurality of computers coordinated for such purposes as handling Internet traffic. For example, the web search website Google® uses a compute farm to process its network traffic and Internet searches.
Grid/cluster resource management generally describes the process of identifying requirements, matching resources to applications, allocating those resources, and scheduling and monitoring grid resources over time in order to run grid applications or jobs submitted to the compute environment as efficiently as possible. Each project or job utilizes a different set of resources and thus is typically unique. For example, a job may utilize computer processors and disk space, while another job may require a large amount of network bandwidth and a particular operating system. In addition to the challenge of allocating resources for a particular job or a request for resources, administrators also have difficulty obtaining a clear understanding of the resources available, the current status of the compute environment and available resources, and real-time competing needs of various users. One aspect of this process is the ability to reserve resources for a job. A cluster manager seeks to reserve a set of resources to enable the cluster to process a job at a promised quality of service.
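By way of illustration only, the matching and allocation steps described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular scheduler: the names `Node`, `Job`, `find_match`, and `allocate`, the resource keys such as `cpus` and `net_gbps`, and the first-fit policy are all hypothetical and chosen for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical compute node with its currently free resources."""
    name: str
    free: Dict[str, float]  # e.g. {"cpus": 8, "disk_gb": 500}
    os: str


@dataclass
class Job:
    """Hypothetical job with its resource requirements."""
    name: str
    needs: Dict[str, float]
    os: Optional[str] = None  # None means any operating system is acceptable


def find_match(job: Job, nodes: List[Node]) -> Optional[Node]:
    """Return the first node that satisfies every requirement (first fit)."""
    for node in nodes:
        if job.os is not None and node.os != job.os:
            continue
        if all(node.free.get(key, 0) >= amount for key, amount in job.needs.items()):
            return node
    return None


def allocate(job: Job, nodes: List[Node]) -> Optional[Node]:
    """Match the job to a node and deduct the resources it consumes."""
    node = find_match(job, nodes)
    if node is None:
        return None  # no node can currently satisfy the request
    for key, amount in job.needs.items():
        node.free[key] -= amount
    return node


nodes = [
    Node("n1", {"cpus": 4, "disk_gb": 100}, "linux"),
    Node("n2", {"cpus": 16, "disk_gb": 1000, "net_gbps": 10}, "linux"),
]
# One job needs processors and disk space; another needs bandwidth and a given OS.
job_a = Job("a", {"cpus": 8, "disk_gb": 200})
job_b = Job("b", {"net_gbps": 10}, os="linux")
placed_a = allocate(job_a, nodes)
placed_b = allocate(job_b, nodes)
```

A first-fit policy is used here only because it is the simplest to read; a production scheduler would typically also weigh priorities, reservations, and quality-of-service commitments.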
General background information on clusters and grids may be found in several publications. See, e.g., Grid Resource Management, State of the Art and Future Trends, Jarek Nabrzyski, Jennifer M. Schopf, and Jan Weglarz, Kluwer Academic Publishers, 2004; and Beowulf Cluster Computing with Linux, edited by William Gropp, Ewing Lusk, and Thomas Sterling, Massachusetts Institute of Technology, 2003.
It is generally understood herein that the terms grid and cluster are interchangeable, although they have different connotations. For example, when a grid is referred to as receiving a request for resources and the request is processed in a particular way, the same method may also apply to other compute environments such as a cluster, on-demand center or a compute farm. A cluster is generally defined as a collection of compute nodes organized for accomplishing a task or a set of tasks. In general, a grid comprises a plurality of clusters as shown in FIG. 1. Several general challenges exist when attempting to maximize resources in a grid. First, there are typically multiple layers of grid and cluster schedulers. A grid 100 generally comprises a group of clusters or a group of networked computers. The definition of a grid is very flexible and may mean a number of different configurations of computers. The introduction here is meant to be general given the variety of configurations that are possible. A grid scheduler 102 communicates with a plurality of cluster schedulers 104A, 104B and 104C. Each of these cluster schedulers communicates with a respective resource manager 106A, 106B or 106C. Each resource manager communicates with a respective series of compute resources shown as nodes 108A, 108B, 108C in cluster 110, nodes 108D, 108E, 108F in cluster 112 and nodes 108G, 108H, 108I in cluster 114.
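For illustration, the layered arrangement of FIG. 1 can be modeled as a simple nested data structure. The sketch below is an assumption-laden illustration only: the class names and the helper `build_fig1_grid` are hypothetical, and the strings merely reuse the reference numerals given above as labels.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResourceManager:
    """Hypothetical resource manager that fronts the nodes of one cluster."""
    label: str
    cluster: str
    nodes: List[str] = field(default_factory=list)


@dataclass
class ClusterScheduler:
    """Hypothetical cluster scheduler paired with one resource manager."""
    label: str
    resource_manager: ResourceManager


@dataclass
class GridScheduler:
    """Hypothetical grid scheduler that talks only to cluster schedulers."""
    label: str
    cluster_schedulers: List[ClusterScheduler] = field(default_factory=list)

    def all_nodes(self) -> List[str]:
        # The grid scheduler reaches nodes only through the lower layers.
        return [
            node
            for scheduler in self.cluster_schedulers
            for node in scheduler.resource_manager.nodes
        ]


def build_fig1_grid() -> GridScheduler:
    """Build the three-cluster topology using the numerals of FIG. 1 as labels."""
    layout = [
        ("104A", "106A", "110", ["108A", "108B", "108C"]),
        ("104B", "106B", "112", ["108D", "108E", "108F"]),
        ("104C", "106C", "114", ["108G", "108H", "108I"]),
    ]
    schedulers = [
        ClusterScheduler(cs, ResourceManager(rm, cluster, nodes))
        for cs, rm, cluster, nodes in layout
    ]
    return GridScheduler("102", schedulers)


grid = build_fig1_grid()
node_labels = grid.all_nodes()
```

Modeling each layer as its own type makes the indirection explicit: the grid scheduler holds no direct reference to any node.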
Local schedulers (which may refer to either the cluster schedulers 104 or the resource managers 106) are closer to the specific resources 108 and may not allow grid schedulers 102 direct access to the resources. The grid level scheduler 102 typically does not own or control the actual resources. Therefore, jobs are submitted from the high level grid scheduler 102 to a local set of resources with no more permissions than the submitting user would have. This reduces efficiency and can render the resource reservation process more difficult.
The heterogeneous nature of the shared compute resources also causes a reduction in efficiency. Without dedicated access to a resource, the grid level scheduler 102 is challenged with the high degree of variance and unpredictability in the capacity of the resources available for use. Most resources are shared among users and projects and each project varies from the other. The performance goals for projects differ. Grid resources are used to improve performance of an application but the resource owners and users have different performance goals ranging from optimizing the performance for a single application to getting the best system throughput or minimizing response time. Local policies may also play a role in performance.
As the use of on-demand centers and new Internet services such as music downloads, video on demand, and Internet telephony increases, the number of servers and nodes used within the Internet will continue to increase. As the number of servers increases in on-demand centers, grids, clusters and so forth, the amount of electricity used by such servers also increases. Estimates of the total amount of electricity used by servers in the U.S. and the world have been made by combining measured data and estimates of power used by the most popular servers with data on the installed base. Many recent estimates have been based on more detailed data than previous estimates. Policy makers and businesses are beginning to notice and are attempting to address these issues in the industry.
Aggregate electricity used for servers doubled over the period from 2000 to 2005, both in the U.S. and worldwide. Most of this growth was the result of growth in the number of less expensive servers, with only a small part of that growth being attributed to growth in the power use per unit. For example, total power used by servers represented about 0.6 percent of total U.S. electricity consumption in 2005. However, when cooling and auxiliary infrastructure are included, that number grows to 1.2 percent, which is an amount comparable to that for televisions. The total power demand in 2005, which includes the associated infrastructure, is equivalent to about five 1000 MW power plants for the U.S. and 14 such plants for the world. The total electricity bill for operating these servers and associated infrastructure in 2005 was about 2.7 billion dollars for the U.S. and 7.2 billion dollars for the world. Accordingly, what is needed in the art is an improved mechanism to manage power consumption in compute environments such as clusters and grids or those that are similarly configured.