This invention relates to partitioning of resources in a cluster computing environment. More specifically, it relates to dynamic allocation of resources in response to application and system triggers in a cluster computing environment wherein partitioning of resources is desirable or necessary to support a variety of applications and operating requirements.
Allocation of computer resources to parallel-running tasks is a challenge for systems of all sizes. The parallel processing architecture involves the use of many interconnected processors to access large amounts of data. In a massively parallel processing system, as well as in a network of computers, a relatively large number of separate processing elements are interconnected to simultaneously process a large number of tasks at speeds far exceeding those of conventional computers. Though such computing environments are often composed of many nodes, the nodes are viewed and function as one single resource. The grouping of all nodes into a single resource creates advantages in increased capacity and speed. However, to perform parallel operations efficiently, it is desirable to have the capability of allocating the resources among different tasks as needed.
Carving out or allocating parts of the system to run tasks without interfering with each other is commonly referred to as "partitioning." Partitioning, in general, is the ability to divide up system resources into groups of parts in order to facilitate particular management functions. The structure of massively distributed parallel processing systems provides the opportunity to partition the system into groups of nodes for various purposes.
In the past, some partitioning schemes have been devised which partition a system into persistent, or static, partitions. Persistent partitions are quasi-permanent groupings of resources which survive failure or shutdown of the system. One implements a static or persistent partitioning scheme when the set of potential applications to be executed on the system is not highly variable, so that resources can be dedicated to those applications. In the state of the art of parallel processing and computer networking, however, it is unrealistic in many instances to assume that any system will only be required to run a pre-set number of applications and that static partitioning can be maintained over the lifetime of the system. Moreover, the capabilities of parallel computing systems and computer networks would be underutilized if only a fixed number of applications were to be run on dedicated resources. One looks to parallel processing systems and computer networks to provide the flexibility to run a myriad of parallel scientific and/or commercial applications as needed.
The resource requirements of parallel scientific or commercial applications may differ vastly from one another. Furthermore, the communication and synchronization traits among the constituent tasks of different parallel applications can be equally diverse, from the one extreme, consisting of fine-grained tasks that require frequent communication and synchronization among tasks within an application, to the other extreme, comprising coarse-grained tasks which operate independently. Therefore, parallel computers, such as the IBM RISC System/6000 Scalable Power Parallel System 2 (hereinafter SP2), must support a wide variety of parallel applications, each with its own resource requirements. As a specific example, the interaction, synchronization, and communication among threads within fine-grained applications typically require the simultaneous allocation of their threads on computing nodes, whereas the independent tasks of coarse-grained applications do not require simultaneous resource allocation. Both types of applications are scheduled (i.e., allocated) based upon system status and application characteristics, such as the number of tasks to be performed for the application, the execution time, required disk space, etc.
In order to perform efficient scheduling of resources, several scheduling methods have been devised for parallel scheduling of applications. The first is a "space sharing" scheduling method under which the nodes are partitioned among different parallel jobs. Several space sharing strategies have been proposed in the past. The aforementioned static partitioning of nodes has been utilized in production systems, given its low system overhead and its simplicity from both the system and application perspectives. However, static space sharing of nodes can lead to low system throughput and low resource utilization under nonuniform workloads. System performance can be improved by adaptive partitioning, which determines the number of nodes allocated to a job based on the system state when the job arrives. The performance benefits of adaptive space sharing are somewhat limited, however, because the allocation is fixed at arrival and generally cannot respond to subsequent workload changes. Another scheme, so-called space sharing with dynamic partitioning, will partition and repartition resources upon all entries and exits of applications as well as throughout their execution. This scheme can maintain very efficient resource utilization. However, if the frequency of repartitions is not controlled, the associated overhead can limit, and even eliminate, the potential benefits.
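By way of illustration only, the difference between adaptive partitioning and frequency-controlled dynamic partitioning may be sketched as follows. The fair-share rule, the minimum repartition interval, and the names `adaptive_allocation` and `should_repartition` are assumptions made for this sketch and are not taken from any particular prior system.

```python
def adaptive_allocation(free_nodes: int, waiting_jobs: int, requested: int) -> int:
    """Adaptive space sharing: size a job's allocation once, at arrival.

    Hypothetical rule: split the free nodes evenly between the arriving
    job and the jobs already waiting, never exceeding the request.
    """
    if free_nodes <= 0:
        return 0  # nothing available; the job must wait
    fair_share = max(1, free_nodes // (waiting_jobs + 1))
    return min(requested, fair_share)


def should_repartition(now: float, last_repartition: float,
                       min_interval: float) -> bool:
    """Dynamic partitioning guard: bound the repartition frequency.

    Hypothetical rule: permit a repartition only if at least
    `min_interval` time units have elapsed since the previous one.
    """
    return now - last_repartition >= min_interval


if __name__ == "__main__":
    # Lightly loaded system: the job receives everything it requested.
    print(adaptive_allocation(free_nodes=16, waiting_jobs=0, requested=8))
    # Three jobs already waiting: the job receives a quarter of the free nodes.
    print(adaptive_allocation(free_nodes=16, waiting_jobs=3, requested=8))
    # A repartition requested too soon after the last one is suppressed.
    print(should_repartition(now=12.0, last_repartition=10.0, min_interval=5.0))
```

Under these assumptions the adaptive rule is evaluated only once per job, which is why it cannot follow later workload changes, while the guard shows one simple way of keeping repartition overhead bounded.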
Another scheduling scheme is "time sharing" wherein the nodes are rotated among a set of jobs, which ensures that all jobs gain access to the system resources within a relatively short period of time. Time sharing can be effective for tasks with small processing requirements, but may not be particularly suitable for applications with large data sets. Researchers have recognized the benefits of attempting to combine space and time sharing scheduling methods, in a so-called gang scheduling system.
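The rotation of nodes among jobs may be illustrated by a minimal round-robin sketch. The fixed time quantum, the job names, and the function `round_robin` are hypothetical and serve only to show how every job gains access to the resource within a bounded time.

```python
from collections import deque


def round_robin(work: dict[str, int], quantum: int) -> list[str]:
    """Rotate one resource among jobs; return the jobs in completion order.

    `work` maps a job name to its remaining processing requirement.
    Each job runs for at most `quantum` units before yielding its turn.
    """
    if quantum <= 0:
        raise ValueError("quantum must be positive")
    queue = deque(work.items())
    finished = []
    while queue:
        name, remaining = queue.popleft()
        remaining -= min(quantum, remaining)
        if remaining == 0:
            finished.append(name)
        else:
            queue.append((name, remaining))  # back of the line
    return finished


if __name__ == "__main__":
    print(round_robin({"a": 3, "b": 5, "c": 2}, quantum=2))
```

In this sketch the job with the smallest requirement completes first, consistent with the observation that time sharing favors tasks with small processing requirements.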
Yet another scheduling methodology, for scheduling coarse-grained and sequential applications, is "load sharing" which attempts to balance the load among the nodes and thereby reduce mean response time. As previously noted, coarse-grained applications require less interaction and synchronization than do fine-grained applications, so that the nodes operate independently once the application tasks have been assigned.
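One simple way to balance load among nodes, shown here only as an illustrative sketch, is to assign each independent task to the node that currently carries the least load. The greedy rule, the task costs, and the function `share_load` are assumptions of this sketch and are not a description of any specific load sharing system.

```python
import heapq


def share_load(task_costs: list[int], num_nodes: int) -> dict[int, list[int]]:
    """Greedy load sharing: give each task to the least-loaded node.

    Returns a mapping from node index to the indices of the tasks
    assigned to it. Ties are broken in favor of the lower node index.
    """
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    heap = [(0, node) for node in range(num_nodes)]  # (current load, node)
    heapq.heapify(heap)
    assignment = {node: [] for node in range(num_nodes)}
    for task, cost in enumerate(task_costs):
        load, node = heapq.heappop(heap)
        assignment[node].append(task)
        heapq.heappush(heap, (load + cost, node))
    return assignment


if __name__ == "__main__":
    print(share_load([5, 3, 2, 4], num_nodes=2))
```

Because the tasks are independent, no coordination among nodes is needed after assignment, which matches the coarse-grained case described above.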
All of the foregoing scheduling policies have their advantages and disadvantages. It is generally agreed that there is no single scheduling scheme that is best for all application requirements. A modern parallel system should be able to support many different scheduling schemes simultaneously, one for each different class of application. The resource management system should be able to partition the available resources across the different scheduling schemes in a way that meets system objectives, including, but not limited to, maximizing the utilization of all resources, providing the best overall mean response time, and providing the optimal system throughput. In addition, the resource management system should be able to monitor the state of the system and dynamically and efficiently adjust the resources allocated to the different partitions in response to changes in the workload and changes in the demands on the resources.
What is desirable is an allocation method which supports multiple scheduling schemes and provides for management of schedulable resources across scheduling partitions.
It is additionally desirable that the allocation method be applicable to all computer environments, including shared-memory and distributed-memory systems, scientific and commercial workload environments, and loosely-coupled and tightly-coupled parallel architectures.
The invention provides a resource management mechanism, hereinafter referred to as Flexible Dynamic Partitioning (FDP), to allocate and reallocate resources among scheduling schemes of many types in multi-computing environments. Resources can include, but are not limited to, processors, disks and communications connections. Partitioning of resources can be initiated by both application and system triggers, which are defined by the system administrator and/or the user. Examples of application triggers are application entries/exits or resource demand changes. Examples of system triggers include timers, resource utilization differential functions, and faults. Once dynamic partitioning is triggered, FDP allows a partition to invoke a set of resource reallocation functions associated with that partition. The reallocation function, which is user-defined or defined by the system administrator, may perform a set of resource matchings and determine the necessary resource movement among partitions. The reallocation function can be a generalized function applicable to all scheduling partitions, or a unique function for each group of partitions.
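The relationship between a system trigger and a reallocation function may be sketched as follows. This is a minimal illustration under stated assumptions and not the FDP implementation: the `Partition` record, the utilization-differential threshold, and the surplus-to-deficit matching rule are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    nodes: int   # nodes currently held by the partition
    demand: int  # nodes its scheduler currently wants (assumed known)

    @property
    def utilization(self) -> float:
        return self.demand / self.nodes if self.nodes else float("inf")


def utilization_trigger(partitions: list[Partition], threshold: float) -> bool:
    """Hypothetical system trigger: fire when the utilization gap
    between the busiest and idlest partition exceeds `threshold`."""
    utils = [p.utilization for p in partitions]
    return max(utils) - min(utils) > threshold


def reallocate(partitions: list[Partition]) -> list[tuple[str, str, int]]:
    """Hypothetical generalized reallocation function: match partitions
    with surplus nodes to partitions with a deficit and move nodes.
    Returns the moves as (donor, receiver, node count) tuples."""
    moves = []
    donors = [p for p in partitions if p.nodes > p.demand]
    takers = [p for p in partitions if p.nodes < p.demand]
    for taker in takers:
        for donor in donors:
            need = taker.demand - taker.nodes
            if need == 0:
                break
            moved = min(need, donor.nodes - donor.demand)
            if moved > 0:
                donor.nodes -= moved
                taker.nodes += moved
                moves.append((donor.name, taker.name, moved))
    return moves


def on_trigger(partitions: list[Partition], threshold: float = 0.5):
    """Invoke the reallocation function only when the trigger fires."""
    return reallocate(partitions) if utilization_trigger(partitions, threshold) else []


if __name__ == "__main__":
    parts = [Partition("gang", nodes=8, demand=2),
             Partition("batch", nodes=4, demand=8)]
    print(on_trigger(parts))
    print(parts)
```

In this sketch a single generalized function serves all partitions; a per-partition or per-group function could be substituted by passing a different callable in place of `reallocate`.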