Over the past five years, many researchers have developed new software systems for improved communication in multiprocessor systems. The goal has been to achieve low latency and high bandwidth. The approach has been to remove much of the redundancy in protocol stacks, and eliminate buffer copying by integration of kernel-space with user-space buffer management. For details, see for example [1,2,3,14,16,20,21]. This work has demonstrated that impressive improvements in latency can be achieved when sending a single message or for sparse communication patterns. However, for many of the dense communication patterns which arise in practice, e.g., when large coarse-grain parallel applications are run on moderately parallel architectures, this locally controlled approach to communications management has not provided an effective and efficient solution.
Inefficient, locally controlled communication is a legacy of the dominance of synchronous message passing as a parallel programming style throughout the 1980s and early 1990s. If implemented directly, i.e., exactly as the programmer specifies, many message passing programs will be inherently inefficient. This is because any one process in a message passing program cannot efficiently coordinate its actions within the machine with all the other processes which are simultaneously computing, communicating and synchronizing. A process in a message passing program cannot normally combine messages with others for the same destination, nor can it tell whether it is a good time to send because the network is lightly loaded, or a bad time because it is heavily congested.
Bulk synchronous parallelism (BSP), described in [17,19], which are incorporated herein by reference in their entirety, is an alternative programming style in which a program is structured as a sequential composition of parallel “supersteps”. BSP programs can achieve major communication performance improvements over equivalent message passing programs. These improvements are based on the use of global control techniques such as:
Batch communication. Efficient oblivious routing of general communications as a series of structured communication patterns such as total exchange [7,8,9].
Repackaging. All messages for a given destination are combined. The combined message is packaged and sent in the most efficient form for the particular network [11].
Destination scheduling. The order in which messages are sent is globally managed and controlled in order to avoid contention at the destination nodes. The global control is non-oblivious, i.e., the scheduling strategy is dynamically determined by the system, taking account of the particular structure of the communication pattern to be realized. This information is determined from data sent during the BSP barrier synchronization [11,17].
Pacing. The injection of new messages into the network is globally controlled to ensure that the total applied load is maximized, but that it does not exceed the level at which network throughput will start to decrease. As with destination scheduling, a non-oblivious method is used to minimize contention [6].
Global implicit acknowledgments. The structure of a BSP computation allows an implementation to avoid sending unnecessary acknowledgments in certain cases. Suppose that the packets from one superstep are colored red, and those of the next superstep are colored green. If processor i has sent red packets to processor j which are currently unacknowledged, and processor i receives a green packet from processor k, then it knows that all processors must have (logically) passed through a global barrier synchronization, and it can regard all red packets as implicitly acknowledged [4].
These techniques rely on the semantics of the BSP model, and on exploiting knowledge about the global structure of the communications. With BSP, a process that is about to communicate can know exactly which other processes are about to communicate, and how much they plan to send, and can therefore know the global applied load. Exploiting this knowledge has been shown to significantly improve the performance of communications on point-to-point networks. This BSP-based globally controlled communication allows one to achieve low latency and high bandwidth in those situations which arise frequently in practical applications, where traffic is dense and irregular.
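The global implicit acknowledgment rule described above can be illustrated with a minimal sketch. The class name, method names and the use of integer superstep numbers in place of packet colors are assumptions made for illustration only; they are not taken from the cited implementations.

```python
from collections import defaultdict


class ImplicitAckSender:
    """Tracks unacknowledged packets on one processor, grouped by superstep.

    Hypothetical helper: the superstep number plays the role of the packet
    "color" (red = superstep s, green = superstep s + 1).
    """

    def __init__(self):
        # superstep number -> set of (destination, packet_id) not yet acknowledged
        self.unacked = defaultdict(set)

    def sent(self, superstep, dest, packet_id):
        self.unacked[superstep].add((dest, packet_id))

    def explicit_ack(self, superstep, dest, packet_id):
        self.unacked[superstep].discard((dest, packet_id))

    def received(self, packet_superstep):
        # A packet from superstep s shows that every processor has logically
        # passed the barrier that ended superstep s - 1, so packets from all
        # earlier supersteps can be treated as acknowledged.
        for s in [s for s in self.unacked if s < packet_superstep]:
            del self.unacked[s]

    def pending(self):
        return sum(len(packets) for packets in self.unacked.values())


sender = ImplicitAckSender()
sender.sent(0, dest=2, packet_id=1)   # "red" packets to processor j = 2
sender.sent(0, dest=2, packet_id=2)
sender.sent(1, dest=3, packet_id=7)   # a "green" packet of the next superstep
sender.received(packet_superstep=1)   # green packet arrives from some processor k
print(sender.pending())               # only the green packet is still outstanding
```

No acknowledgment traffic is needed for the red packets in this sketch; the arrival of any packet from the next superstep serves as the acknowledgment.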
Globally controlled communication for single program systems has been used to develop high performance native implementations of BSPlib [10], a BSP communications library, for a number of cluster [3,15], symmetric multiprocessor (SMP) [3] and massively parallel processor [3] architectures, in particular, for the following parallel architectures: Silicon Graphics Origin 2000, Silicon Graphics Power Challenge, Cray T3D, Cray T3E, IBM SP2, Convex Exemplar, Sun Multiprocessor, Digital 8400, Digital Alpha Farm, Hitachi SR2001, Fujitsu AP1000. Generic implementations of BSPlib for single program systems have also been produced for any scalable parallel architecture which provides at least one of the following: TCP/IP, UDP/IP, Unix System V Shared Memory, MPI, Cray SHMEM, PARMACS. Papers [4,5,6,7,11,17] describe the above techniques for globally controlled communication in single program systems.
FIG. 1 illustrates the methods of message combining and destination scheduling in a single BSP program. It shows how the messages in a single BSP program can be combined and globally scheduled to avoid unnecessary overheads and minimize destination contention. We briefly describe this operation. In superstep t 100:
Processor P1 105 has four messages, identified by the target numbers corresponding respectively to processors P2, P3, P4, P2.
Processor P2 110 has two messages, one for P1, and one for P3.
Processor P3 115 has two messages, one for P2, and one for P4.
Processor P4 120 has five messages, for processors P2, P1, P2, P2 and P3.
When all of the processors have completed their local computations, and initiated all of their communications, a barrier synchronization 125 is performed.
On each processor, messages destined for the same processor are combined (Step 130). The injection of these combined messages into the network is scheduled to avoid destination contention. As shown in the figure, this can be done by first realizing the permutation
(P1→P2, P2→P3, P3→P4, P4→P1) (Step 135), then realizing
(P1→P3, P2→P4, P3→P1, P4→P2) (Step 140), then finally realizing
(P1→P4, P2→P1, P3→P2, P4→P3) (Step 145).
When a processor has received all of its incoming messages 150 it can begin the local computations of superstep t+1 155.
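The combining and scheduling steps of FIG. 1 can be sketched as follows. The function names and data layout are assumptions made for illustration; the message lists are those given for superstep t, and the schedule is the sequence of cyclic shifts named in Steps 135 to 145.

```python
from collections import Counter


def combine(outgoing):
    """Combine each processor's messages by destination.

    outgoing: {source: [destination, ...]}
    returns:  {source: Counter({destination: number of messages})}
    """
    return {src: Counter(dests) for src, dests in outgoing.items()}


def cyclic_schedule(p):
    """Return p - 1 rounds; in round k, processor i sends to ((i - 1 + k) mod p) + 1.

    Every round is a permutation of the processors, so no destination
    receives from two senders in the same round.
    """
    return [{i: (i - 1 + k) % p + 1 for i in range(1, p + 1)}
            for k in range(1, p)]


# Superstep t of FIG. 1, with processors numbered 1..4.
outgoing = {1: [2, 3, 4, 2], 2: [1, 3], 3: [2, 4], 4: [2, 1, 2, 2, 3]}
combined = combine(outgoing)

for step, perm in enumerate(cyclic_schedule(4), start=1):
    for src, dst in perm.items():
        n = combined[src][dst]  # a Counter yields 0 if src has nothing for dst
        if n:
            print(f"round {step}: P{src} -> P{dst} ({n} message(s) combined)")
```

A fixed cyclic schedule is the simplest contention-free choice for this example. A non-oblivious scheduler, as described earlier, could instead choose the rounds from the actual message counts exchanged at the barrier.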
Efficient global communication in the BSP style is discussed in U.S. Pat. No. 5,083,265, “Bulk-Synchronous Parallel Computer,” issued to Leslie G. Valiant and incorporated herein by reference in its entirety. A global mechanism for realizing complex patterns of communication for which it is efficacious for packets to a common destination to be combined at nodes other than the sender or receiver is discussed in U.S. Pat. No. 5,608,870, “System for combining a plurality of requests referencing a common target address into a single combined request having a single reference to the target address,” also issued to Leslie G. Valiant and incorporated herein by reference in its entirety. The techniques and mechanisms for realizing multiprogrammed BSP machines that are the substance of the present invention are not, however, anticipated in either of these earlier patents.
Work on the BSP model [17,19] has also shown that the resources available on any multiprocessor system can be accurately characterized in terms of a concise “machine signature” which describes the number of processors, processor speed, global communications capacity, global synchronization cost (p,s,g,l respectively), and possibly several other measures such as memory capacity, input/output capacity, and external connectivity.
The BSP model also permits the resources required by any parallel job to be characterized by a concise “job signature” of the form (W,H,S), where W,H,S denote the computation, communication and synchronization requirements respectively [13]. The cost of the job when run on a machine with BSP parameters g and l will be W+H·g+S·l. The job signature may have additional components, characterizing the memory or I/O requirements, for example.
Given the job signature of a program it is straightforward to evaluate its runtime, or some other resource consumption measure, for any machine with a given machine signature. For example, the percentage processor utilization we will achieve when running a BSP code with job signature (W,H,S) on a single program BSP system with parameters g and l is given by the simple formula 100/(1+g/G+l/L) where G=W/H and L=W/S. Since machines with higher values of g and l are in general less expensive, this provides a methodology for quantifying how cost-effective it is to run a particular job on a particular machine.
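The utilization formula follows from dividing the useful computation W by the total cost W+H·g+S·l, which gives 1/(1+g/G+l/L) with G=W/H and L=W/S. The following minimal sketch evaluates both quantities. The function names and the numeric signature values are hypothetical and chosen only for illustration, and the sketch assumes H and S are nonzero.

```python
def bsp_cost(W, H, S, g, l):
    """Cost of a job with signature (W, H, S) on a machine with parameters g and l."""
    return W + H * g + S * l


def utilization_percent(W, H, S, g, l):
    """Percentage processor utilization, 100 / (1 + g/G + l/L).

    Assumes H > 0 and S > 0 so that G = W/H and L = W/S are defined.
    """
    G = W / H
    L = W / S
    return 100 / (1 + g / G + l / L)


# Hypothetical job and machine signatures, for illustration only.
W, H, S = 1_000_000, 10_000, 100
g, l = 10, 1_000

cost = bsp_cost(W, H, S, g, l)
util = utilization_percent(W, H, S, g, l)

# The two expressions agree: utilization is useful work over total cost.
assert abs(util - 100 * W / cost) < 1e-9
print(cost, round(util, 2))
```

Evaluating the same job signature against several machine signatures in this way gives a direct comparison of how well each machine suits the job.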
Reference [13] provides a number of machine signatures for various BSP systems (Table 1), analytic job signatures for BSP algorithms (Table 2), and experimental job signatures for BSP programs (Table 3). It also shows the percentage processor utilization for various program-system pairs (Table 4).
A large number of job scheduling systems for sequential and multiprocessor systems have been developed over the past thirty years. The design of these systems has been based on various assumptions, many of which are no longer relevant. Most such systems assume, for example, that:
It is most efficient to allow jobs to run to completion once they have started.
No accurate description of the resources required by the job is available to the scheduler.
All jobs can be simply and easily classified into a small number of standard categories. For example, many systems classify jobs simply as either interactive or batch.
A simple, statically defined scheduling policy will be adequate in all circumstances.
It is sufficient to monitor and control only the computation resources in a multiprocessor system.
Reference [12] provides an up-to-date, detailed and systematic analysis of the capabilities and limitations of six major job scheduling systems, including systems from leading research and development centers and commercial products from IBM, Platform Computing, GENIAS and Cray Research (now part of Silicon Graphics). The analysis provides detailed information on how well the various systems perform, measuring each one with respect to 62 different criteria. The results show clearly that on multiprogrammed multiprocessor systems in which the processors communicate with each other via some communication hardware, current job scheduling systems have a number of shortcomings. For example, they do not offer any of the flexibility and performance which can be obtained if one is able to suspend and resume a running job some number of times before it runs to completion.
The shared resources (computation, communication, memory, input/output) in a modern multiprocessor system can be regarded as a flexible collection of commodity resources. This view of a multiprocessor system permits a much wider range of types of tasks to coexist in a single shared system than has hitherto been the case. The workload at a particular time may consist of some complex mixture of jobs with widely differing resource requirements (compute intensive, communication intensive, memory-bound, input/output bound, highly synchronized, asynchronous, interactive, highly parallel, sequential, short duration, long running, constantly running, etc.) and with widely differing level of service requirements (absolute priority, relative priority, real-time, interactive, fixed deadline, batch, low cost/best effort, etc.). This new situation requires a much more flexible and dynamic approach to job scheduling on shared multiprocessor systems, one which is able to take account of the widely differing resource and level of service requirements of the jobs, and which is able to guarantee that service levels, once agreed, will be met.
The new design presented here can, for the first time, achieve the above goals.
We describe a new design for a multiprogrammed multiprocessor system with globally controlled communication and signature controlled scheduling. The design is based on a completely new multiprogrammed bulk synchronous parallel (MBSP) approach to scalable parallel computing. An MBSP multiprocessor operates in a strobing style, in which all of the processors in the system periodically barrier synchronize and perform any communication that has accrued from all the processes.
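The strobing style can be pictured with the following minimal sketch of a single node. The class and method names are hypothetical, and batching the accrued messages by destination across all tasks is an assumption made here for illustration, in line with the repackaging technique described earlier; the text above does not prescribe this particular structure.

```python
from collections import defaultdict


class StrobingNode:
    """One processor in a hypothetical MBSP system running several tasks."""

    def __init__(self, node_id):
        self.node_id = node_id
        # (task, destination node, payload) accrued since the last strobe
        self.pending = []

    def send(self, task, dest, payload):
        # A task's send only queues the message; nothing enters the network yet.
        self.pending.append((task, dest, payload))

    def strobe(self):
        """Invoked at the periodic barrier.

        Groups everything accrued from all tasks by destination node and
        returns one batch per destination, then clears the queue.
        """
        batches = defaultdict(list)
        for task, dest, payload in self.pending:
            batches[dest].append((task, payload))
        self.pending = []
        return dict(batches)


node = StrobingNode(1)
node.send("taskA", 2, b"x")
node.send("taskB", 2, b"y")   # a different task, same destination node
node.send("taskA", 3, b"z")
batches = node.strobe()
print(batches)
```

In this sketch, messages from different tasks to the same node leave in one batch, so per-message overheads are shared among all tasks on the node.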
Any communication library, messaging interface or middleware system can be used to communicate between processors in the MBSP system. The whole collection of resulting communications is globally optimized, in a bulk synchronous manner, by the system.
The use of a single, unified run-time system for the management of communications offers major advantages in simplifying system management.
The design can be implemented on clusters, symmetric multiprocessors (SMPs), clusters of SMPs, and on mainframe architectures such as the System/390 architecture. It can be implemented to work with all standard operating systems.
The organization of communications in the design is based on a new five-layer communication model, which differs significantly from the standard OSI seven-layer reference model. The design pushes the error recovery process down into the network layer, and can take advantage of multiple networks and multiple network hardware units.
Generic parallel job signatures provide a concise characterization of the local computation and input/output, communication, memory and synchronization resources required by the job when run on the MBSP system. With such signatures, we can precisely and flexibly control the high level task scheduling in our system, enabling the system to offer a completely new and powerful form of service level management.
The system provides a universal, high-level control mechanism for a signature-driven task scheduler. The control mechanism uses a static or dynamic protocol between the task scheduler and an application agent to determine the quality of service to be provided to an application task when it is run on the system. This flexible and dynamic control mechanism enables individual application tasks, and the system as a whole, to achieve guaranteed quality of service across the whole range of service characteristics, including the ability to guarantee application deadlines, real-time and interactive response, absolute and relative priorities, application cost, or maximal system throughput.
Accordingly, in a preferred embodiment of the present invention, a multiprogrammed system comprises a plurality of processors and some communications resources through which the processors communicate with each other. A plurality of tasks can be executed on the system and the allocation of the communications resources among the tasks running on the system is globally controlled. The communications resources preferably comprise a plurality of networks.
Preferably, the allocation of resources among the tasks running on the system is dependent on the signature of the tasks. One component of a task signature is a measure of the communication resources needed by the task.
Furthermore, the scheduling of a task running on the system may be dependent on the signature of the task.
In a preferred embodiment of the present invention, the allocation of communications resources is globally controlled by one or more of the following techniques: packet injection into the communications resources using periodic strobing, or using global flow control; global implicit acknowledgments; destination scheduling; pacing; or prioritized communication scheduling.
Preferably, the overheads of error recovery are amortized over a plurality of jobs running at one node.
Preferably, a user interface, implemented for example in software, allows a plurality of service level options to be specified by a user, where the system can guarantee that the service levels can be achieved. The user interface allows an application user as well as the system administrator to exercise options that are appropriate to their respective roles. The user interface preferably allows the system administrator to run a scheduling mechanism that distributes communications resources among the tasks according to a market mechanism. The user interface can allow a task to be guaranteed a fixed fraction of the resources independent of the other tasks then running. The user interface can allow a task to be run as an interactive continuous job at one of a plurality of service levels.
Preferably, the user interface allows a system administrator to have system resources subdivided into reserved and unreserved components, where the reserved components may be booked for exclusive use by users and user groups, and where the unreserved components are made available in accordance with some competitive or market mechanism.