Multiprocessor systems have been developed in the past in order to increase processing power. Multiprocessor systems comprise a number of central processing units (CPUs) working generally in parallel on portions of an overall task. Some jobs such as parallel jobs generally have tasks which can be executed in parallel by more than one CPU.
A particular type of multiprocessor system used in the past has been a symmetric multiprocessor (SMP) system. An SMP system generally has a plurality of processors, with each processor having equal access to shared memory and input/output (I/O) devices shared by the processors. An SMP system can execute jobs quickly by allocating to different processors parts of a particular job. To further increase processing power, processing machines have been constructed comprising a plurality of SMP nodes. Each SMP node includes one or more processors and a shared memory. Accordingly, each SMP node is similar to a separate SMP system. In fact, each SMP node need not reside in the same host, but rather could reside in separate hosts.
In the past, SMP nodes have been interconnected in some topology to form a machine having non-uniform memory access (NUMA) architecture. A NUMA machine is essentially a plurality of interconnected SMP nodes located on one or more hosts, thereby forming a cluster of node boards on one or more hosts.
Generally, each SMP node is interconnected and cache coherent so that a processor on any other SMP node can access the memory in an SMP node. However, while a processor can access the shared memory on the same SMP node uniformly, meaning within the same amount of time, processors on different boards cannot access memory on other boards uniformly. Accordingly, an inherent characteristic of NUMA machines and architecture is that not all of the processors can access the same memory in a uniform manner. In other words, while each processor in a NUMA system may access the shared memory in any SMP node in the machine, this access is not uniform.
Computers or machines where the link speed between CPUs or other resources such as memory can change and is not homogeneous for each CPU, such as NUMA machines, can be defined more generally as machines that have topological properties. For example, a machine may be considered to have topological properties if two identical jobs placed on the same machine, but executed by different CPU sets, may exhibit different performance characteristics. Therefore, NUMA machines will, by definition, have topological properties because not all of the processors can access the same memory in a uniform manner. Other types of machines may also have topological properties depending on the link speed between the CPUs, memory and other resources. By contrast, some types of machines which have non-trivial topologies may nevertheless not exhibit topological properties, simply because the link speeds are so high that the resulting latency is trivial and the resources can be considered homogeneous. However, the resources on most present high-performance computing (HPC) machines cannot be treated as homogeneous because of the complexity of the underlying topology of the machines, and therefore most present HPC machines exhibit topological properties.
This non-uniform access results in a disadvantage in machines having topological properties, such as NUMA systems, in that a latency is introduced each time a processor accesses shared memory, depending on the actual topology of the combination of CPUs and nodes upon which a job is scheduled to run. In particular, it is possible for program pages to reside “far” from the processing data, resulting in a decrease in the efficiency of the system by increasing the latency time required to obtain this data. Furthermore, this latency is unpredictable because it depends on the location where the shared memory segments for a particular program may reside in relation to the CPUs executing the program. This affects performance prediction, which is an important aspect of parallel programming. Therefore, without knowledge of the topology, performance problems can be encountered in NUMA machines.
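The placement dependence described above can be illustrated with a minimal sketch. The four-node hop matrix, the 100 ns local access time and the 50 ns per-hop penalty are all assumed values chosen for illustration and do not describe any particular machine; the function names are likewise hypothetical.

```python
from itertools import product

# Hypothetical 4-node machine: HOPS[i][j] is the number of interconnect
# hops between node board i and node board j (assumed linear topology).
HOPS = [
    [0, 1, 2, 3],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [3, 2, 1, 0],
]
LOCAL_NS = 100    # assumed access time to memory on the same node
PER_HOP_NS = 50   # assumed extra delay for each interconnect hop


def access_ns(cpu_node, mem_node):
    """Modelled time for a CPU on cpu_node to reach memory on mem_node."""
    return LOCAL_NS + PER_HOP_NS * HOPS[cpu_node][mem_node]


def mean_access_ns(cpu_nodes, mem_nodes):
    """Average access time over every (CPU node, memory node) pair."""
    pairs = list(product(cpu_nodes, mem_nodes))
    return sum(access_ns(c, m) for c, m in pairs) / len(pairs)


# The same job placed twice: once with its memory on its own nodes,
# once with its memory on nodes away from the CPUs executing it.
near = mean_access_ns(cpu_nodes=[0, 1], mem_nodes=[0, 1])
far = mean_access_ns(cpu_nodes=[0, 3], mem_nodes=[1, 2])
print(near, far)  # 125.0 175.0
```

Under these assumptions the second placement is slower solely because of where its memory resides relative to its CPUs, which is the effect a scheduler unaware of the topology cannot predict.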
Prior art devices have attempted to overcome the deficiencies inherent in NUMA systems and other systems exhibiting topological properties in a number of ways. For instance, programming tools to optimize program page and data processing have been provided. These programming tools for programmers assist a programmer to analyze their program dependencies and employ optimization algorithms to optimize page placement, such as making memory and process mapping requests to specific nodes or groups of nodes containing specific processors and shared memory within a machine. While these prior art tools can be used by a single programmer to optimally run jobs in a NUMA machine, these tools do not service multiple programmers well. Rather, multiple programmers competing for their share of machine resources may conflict with the optimal job placement and optimal utilization of other programmers using the same NUMA host or cluster of hosts.
To address this potential conflict between multiple programmers, prior art systems have provided resource management software to manage user access to the memory and CPUs of the system. For instance, some systems allow a specific programmer to “reserve” CPUs and shared memory within a NUMA machine. One such prior art system is the Miser™ batch queuing system that chooses a time slot when specific resource requirements, such as CPU and memory, are available to run a job. However, these batch queuing systems suffer from the disadvantage that they generally cannot be changed automatically to re-balance the system between interactive and batch environments. Also, these batch queuing systems do not address job topology requirements that can have a measurable impact on the job performance.
Another manner to address this conflict has been to use groups of node boards, which are occasionally referred to as “CPUsets” or “processor sets”. Processor sets specify CPU and memory sets for specific processes and have the advantage that they can be created dynamically out of available machine resources. However, processor sets suffer from the disadvantage that they do not implement any resource allocation policy to improve efficient utilization of resources. In other words, processor sets are generally configured on an ad-hoc basis.
Furthermore, a disadvantage of the above-noted prior art systems is that they do not generally provide for non-trivial scheduling. Non-trivial scheduling can be defined in general as scheduling which requires more than one scheduling cycle in which to schedule a job. Non-trivial scheduling may involve not scheduling a job or suspending a job to execute other jobs.
One typical example of non-trivial scheduling is reserving resources for future use. For instance, if a large job that needs many resources is to be scheduled, the scheduler can commence reserving resources as they become free, so that sufficient resources are available to run the large job once other resources, which may be executing other jobs or are otherwise occupied, become available. In this situation, a large job may be held back for several scheduling cycles until sufficient resources have been reserved and/or are available to execute the job. Another non-trivial scheduling technique is sometimes referred to as “backfill”, which involves running small jobs on resources which have been temporarily reserved for large jobs until sufficient resources become available for the large job to be run. For example, if three slots have already been reserved for a large job which requires six slots in order to operate, the three slots which have been reserved can be “backfilled” with smaller jobs which only require one, two or three slots to execute. In this case, the scheduler can run smaller jobs as backfill on the reserved resources until there are sufficient resources available to run the larger job.
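The reservation and backfill behaviour described above can be sketched as follows. This is a simplified illustration only: the `Job` class and the `schedule_cycle` function are hypothetical names, and the sketch assumes that every backfilled job finishes within a single scheduling cycle, so that the reserved slots are idle again at the start of the next cycle.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    slots: int


def schedule_cycle(large_job, reserved, freed, small_jobs):
    """Run one scheduling cycle.

    Slots freed this cycle are added to the reservation. If the
    reservation now covers the large job, the large job is dispatched;
    otherwise the reserved slots are backfilled with small jobs.
    Returns (names dispatched this cycle, slots held in reserve).
    """
    reserved += freed
    if reserved >= large_job.slots:
        return [large_job.name], reserved
    idle = reserved
    dispatched = []
    for job in small_jobs:          # backfill in queue order
        if job.slots <= idle:
            dispatched.append(job.name)
            idle -= job.slots
    return dispatched, reserved


large = Job("large", 6)
queue = [Job("s1", 2), Job("s2", 2), Job("s3", 1)]

# Cycle 1: only three slots are reserved, so small jobs are backfilled.
first, held = schedule_cycle(large, reserved=0, freed=3, small_jobs=queue)
# Cycle 2: three more slots are freed and the large job can start.
second, held2 = schedule_cycle(large, reserved=held, freed=3, small_jobs=queue)
print(first, second)  # ['s1', 's3'] ['large']
```

Note that the large job is held back for more than one cycle, which is what makes the scheduling non-trivial in the sense defined above. The sketch treats all slots as interchangeable and therefore ignores topological properties entirely.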
Other non-trivial scheduling techniques include pre-emption, which is a policy whereby jobs can be assigned different priorities and a lower priority job that is running may be suspended, or pre-empted, in order to execute a higher priority job. This can be used, for example, when a higher priority job temporarily requires access to resources which are already being used by a lower priority job.
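A pre-emption policy of this kind can be sketched as follows. The function name, the tuple layout and the rule of suspending the lowest priority jobs first are assumptions made for illustration; a real scheduler may select jobs to suspend differently.

```python
def select_preemptions(running, free_cpus, new_priority, new_cpus):
    """Choose running jobs to suspend so that a new job can start.

    running is a list of (name, priority, cpus) tuples, where a larger
    priority value means a more important job. Only jobs with a priority
    strictly lower than new_priority may be suspended, lowest first.
    Returns the names to suspend, or None if the new job cannot be
    accommodated even after pre-emption.
    """
    victims = []
    for name, priority, cpus in sorted(running, key=lambda job: job[1]):
        if free_cpus >= new_cpus:
            break                   # enough CPUs have been released
        if priority >= new_priority:
            break                   # remaining jobs may not be pre-empted
        victims.append(name)
        free_cpus += cpus
    return victims if free_cpus >= new_cpus else None


running = [("a", 1, 2), ("b", 5, 4), ("c", 3, 2)]
print(select_preemptions(running, free_cpus=1, new_priority=4, new_cpus=4))
# ['a', 'c']
print(select_preemptions(running, free_cpus=1, new_priority=2, new_cpus=4))
# None
```

Like the backfill sketch, this counts CPUs without regard to where they are located in the machine.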
While non-trivial scheduling has been used in the past, non-trivial scheduling has not been implemented successfully to date with machines having topological properties. In other words, to the extent that non-trivial scheduling has been used in the past with machines having topological properties, the topological properties have been ignored, potentially resulting in inefficient allocation of jobs and/or conflicting execution of jobs, requiring jobs to be rescheduled. In essence, utilizing non-trivial scheduling of jobs while ignoring topological properties results in “hit or miss” scheduling which may or may not be successful, depending on whether or not the topological properties become a factor.
A further disadvantage common to some prior art resource management software for NUMA machines is that it does not consider the transient state of the NUMA machine. In other words, such prior art systems do not consider how a job being executed by one SMP node or a cluster of SMP nodes in a NUMA machine will affect execution of a new job. To some extent, these disadvantages of the prior art have been addressed by the system disclosed in U.S. patent application Ser. No. 10/053,740 filed on Jan. 24, 2002 and entitled “Topology Aware Scheduling for a Multiprocessor System”, which has been assigned to the same applicant and is hereby incorporated herein by reference. This U.S. application discloses a manner in which the status of various node boards within a host having a certain topology can be used in order to allocate resources, such as by providing the number of free CPUs for different radii, or by calculating the distance between free CPUs in terms of the delay in the various interconnections. However, U.S. patent application Ser. No. 10/053,740 does not specifically address non-trivial scheduling in a multiprocessor system having topological properties.
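One possible reading of “the number of free CPUs for different radii” is sketched below. This is an illustrative assumption and not the method of the referenced application: the radius is taken to be a hop count, the hop matrix and the free CPU counts are invented for the example, and `free_cpus_by_radius` is a hypothetical name.

```python
def free_cpus_by_radius(hops, free):
    """For each radius r, report the largest number of free CPUs that
    lie within r hops of some single node board.

    hops[i][j] is the hop count between node boards i and j, and
    free[i] is the number of currently free CPUs on node board i.
    """
    n = len(free)
    max_radius = max(max(row) for row in hops)
    result = {}
    for r in range(max_radius + 1):
        result[r] = max(
            sum(free[j] for j in range(n) if hops[i][j] <= r)
            for i in range(n)
        )
    return result


# Assumed linear topology of four node boards and a transient state in
# which some CPUs are already occupied by running jobs.
hops = [
    [0, 1, 2, 3],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [3, 2, 1, 0],
]
free = [2, 0, 1, 4]
print(free_cpus_by_radius(hops, free))  # {0: 4, 1: 5, 2: 7, 3: 7}
```

With this summary, a job needing five CPUs can be seen to fit within a radius of one hop, whereas a job needing six or seven CPUs requires a radius of at least two. Because `free` changes as jobs start and finish, the summary reflects the transient state of the machine.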
Furthermore, in the past, programmers had the ability to specify, together with the job, the number of CPUs required to execute the job. However, more advanced systems permit programmers to specify not only the number of CPUs but also characteristics of the placement, such as the number of nodes or the number of CPUs per node to use, whether or not the placement is to be contiguous, whether to start from a specific node and/or whether to use the same CPU indices on all the nodes used for the allocation. For example, more advanced architectures permit programmers to specify not only the plain number of CPUs to use for the job, but also the allocation shape of the resources to be used in two dimensions and, to a certain extent, the allocation efficiency for the job. To date, the prior art systems have failed to permit scheduling of jobs which have been specified by a programmer to have a certain shape, or to do so using non-trivial scheduling with a system having topological properties.
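A two-dimensional allocation shape of the kind described above (a number of nodes by a number of CPUs per node, optionally contiguous) can be sketched as follows. The function name and the first-fit strategy are assumptions for illustration, and contiguity is assumed to mean consecutive node indices.

```python
def find_shape(free, nodes, cpus_per_node, contiguous=True):
    """Find nodes that can host a nodes x cpus_per_node allocation.

    free[i] is the number of free CPUs on node i. Returns the list of
    chosen node indices (first fit), or None if the shape does not fit.
    """
    if not contiguous:
        fits = [i for i, f in enumerate(free) if f >= cpus_per_node]
        return fits[:nodes] if len(fits) >= nodes else None
    for start in range(len(free) - nodes + 1):
        window = range(start, start + nodes)
        if all(free[i] >= cpus_per_node for i in window):
            return list(window)
    return None


free = [4, 1, 2, 2, 0, 4]
print(find_shape(free, nodes=2, cpus_per_node=2))                    # [2, 3]
print(find_shape(free, nodes=2, cpus_per_node=2, contiguous=False))  # [0, 2]
print(find_shape(free, nodes=3, cpus_per_node=2))                    # None
```

The last call shows that a shaped request can fail even though twelve CPUs are free in total, which is why the plain CPU count alone is insufficient for scheduling such jobs.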
Accordingly, there is a need in the art for a scheduling system which can dynamically schedule jobs and allocate resources, but which is nevertheless governed by a policy to improve efficient allocation of resources and accommodate non-trivial scheduling. Also, there is a need in the art for a system and method that can be used by multiple programmers competing for the same resources. Furthermore, there is a need in the art for a method and system to schedule and dispatch jobs which specify the allocation shape of the resources in more than one dimension, while accommodating topological properties of the machine. Furthermore, there is a need in the art for a method, system and computer program product which can dynamically monitor the status of a machine having topological properties and schedule and dispatch jobs in view of transient changes in the status of the topology of the machine.