This invention relates to scheduling of resources in a cluster computing environment. More specifically, it relates to providing a general and scalable gang scheduling methodology having time-sharing and space-sharing components integrated together to optimize system/application performance for running reconfigurable applications across distributed systems.
Allocation of computer resources to parallel-running tasks is a challenge for systems of all sizes. In a massively parallel processing system, as well as in a network of computers, a relatively large number of separate processing elements are interconnected to simultaneously process a large number of tasks at speeds far exceeding those of conventional computers. However, to perform parallel operations efficiently, it is necessary to have the capability of allocating the resources among different tasks as needed.
Carving out or allocating parts of the system to run tasks without interfering with each other is commonly referred to as xe2x80x9cspace sharingxe2x80x9d or xe2x80x9cpartitioning.xe2x80x9d Partitioning, in general, is the ability to divide up system resources into groups in order to facilitate particular management functions. The structure of massively distributed parallel processing systems provides the opportunity to partition the system into groups of nodes for various purposes.
The resource requirements for each parallel scientific or commercial application may be vastly different from each other. Furthermore, the communication and synchronization traits among the constituent tasks of different parallel applications can be equally diverse, from the one extreme, consisting of fine-grained tasks that require frequent communication and synchronization among tasks within an application, to the other extreme, comprising coarse-grained tasks which operate independently. Therefore, parallel computers, such as the IBM RISC System/6000 Scalable Power Parallel System family, must support a wide variety of parallel applications, each with its own unique resource requirements. As a specific example, the interaction, synchronization, and communication among tasks within fine-grained applications typically require the simultaneous allocation of their tasks on computing nodes; whereas, the independent tasks of coarse-grained applications do not require simultaneous resource allocation. Both types of applications are scheduled (i.e., allocated) based upon system status, workload characteristics, and application characteristics, such as the number of tasks to be performed for the application, the execution time, required disk space, etc.
In order to perform efficient scheduling of resources, several scheduling methods have been devised for managing the execution of parallel applications. The first is a xe2x80x9cspace sharingxe2x80x9d scheduling method under which the nodes are partitioned among different parallel jobs. Several space sharing strategies have been proposed in the past. Static partitioning of dedicated nodes has been utilized in production systems, given the low system overhead and simplicity from both the system and application perspectives. Static space sharing of nodes can, however, lead to low system throughputs and poor resource utilization under nonuniform workloads. System performance can be improved by adaptively determining the number of nodes allocated to a job based on the system state at the time at which the job arrives, and adaptively determining the number of nodes allocated to a set of waiting jobs when a job departs. The performance benefits of adaptive partitioning can be somewhat limited and such a system generally cannot respond to subsequent workload changes.
Another scheme, so-called dynamic space sharing, partitions and re-partitions resources upon all entries and exits of applications, as well as throughout their execution. Such a scheme can maintain very efficient resource utilizations. However, if the frequency of re-partitioning is not controlled, the associated overhead can limit, and even eliminate, the potential benefits.
Another scheduling scheme is xe2x80x9ctime sharingxe2x80x9d wherein the nodes are rotated among a set of jobs, each executing for a specified amount of time, thereby ensuring that all jobs gain access to the system resources within a relatively short period of time. Time sharing can be effective for tasks with mixed processing requirements, but may not be particularly suitable for applications with large data sets if the frequency of rotation is high.
What is desirable is a system and method for providing gang scheduling which realizes the benefits of both dynamic space sharing and time sharing across distributed computer systems.
The article xe2x80x9cScheduling Techniques for Concurrent Systemsxe2x80x9d by John Ousterhout, Proceedings of the Third International Conference on Distributed Computing Systems, pp. 22-30 (October 1982), details one gang scheduling approach wherein slots of time on all available processors are arranged in a matrix as illustrated in FIG. 1. An incoming job, such as 11, is plugged into the matrix/schedule, 10, wherever sufficient time was available, slot 12. Under the Ousterhout scheduling scheme, there is no subpartition of the matrix such that additions and deletions of processors to the matrix, or additions or terminations of jobs, may affect every node column in the matrix. In addition, central control and requisite global transmission of changes are utilized; and cancellation of a job may result in fragmentation (i.e., open time slots) with underutilized processor time (e.g., cancellation of job, 14, will leave the time slot empty and the processor(s) underutilized).
It is therefore an objective of the present invention to provide a general and scalable gang scheduling methodology with a decentralized approach for implementation on distributed computer systems.
It is yet another objective of the present invention to provide a gang scheduling methodology which adaptively allocates resources to tasks, or gang members, in the manner of dynamic space sharing.
It is also an objective of the present invention to provide a decentralized method to perform coordinated context switching of gang members using local logical clock mechanisms.
In addition to providing a gang scheduling mechanism for a distributed system, it is an objective of the present invention to provide gang scheduling in partitions which can undergo dynamic re-partitioning in accordance with the FDP mechanism set forth in the present authors co-pending application entitled xe2x80x9cFlexible Dynamic Partitioning of Resources in a Cluster Computing Environment,xe2x80x9d Ser. No. 08/862,454, filed May 27, 1997, and assigned to the present assignee, the teachings of which are hereby expressly incorporated by reference.
It is additionally an objective that the scheduling method be applicable to all computer environments, including shared-memory and distributed-memory systems, scientific and commercial workload environments, and loosely-coupled and tightly-coupled parallel architectures.
The invention provides a general and flexible system and method whereby a gang scheduling entity assembles a schedule for application tasks to be run on a plurality of distributed parallel processors. The scheduling information, including at least a mapping of applications to processors, and potentially an ordered set or sub-schedule of tasks to be run and the length of the time slices for context switching, are provided to each of the lower level schedulers for the relevant processors by a higher level scheduler, and the lower level schedulers assure that the tasks are executed. When a detailed sub-schedule of tasks is provided by the higher level scheduler, the lower level schedulers assure that the tasks execute in the preset order in accordance with the subschedule, with context switching between tasks being controlled at the decentralized node level. When only a mapping of applications to processors is provided by the higher level scheduler, there is additional flexibility and generality by allowing the lower level schedulers to make all or any subset of the scheduling decisions. Tasks from multiple applications can be scheduled simultaneously, integrating both time sharing and space sharing schemes in the schedule. The inventive system is dynamically reconfigurable in the event of addition or deletion of processors in the gang scheduling partition. It is also dynamically reconfigurable into subpartitions in response to changing application requirements which may result in reassignment of processors among groups of applications.