Computing technology has advanced at a remarkable pace, with each subsequent generation of computing system increasing in performance, functionality, and storage capacity, often at reduced cost. However, despite these advances, many scientific and business applications still demand massive computing power, which can only be met by extremely high performance computing systems. One particular type of computing system architecture that is often used in high performance applications is a parallel processing computing system.
Generally, a parallel processing computing system comprises a plurality of computing nodes and is configured with a distributed application. Some parallel processing computing systems, which may also be referred to as massively parallel processing computing systems, may have hundreds or thousands of individual computing nodes, and provide supercomputer class performance. Each computing node is typically of modest computing power and generally includes one or more processing units, or computing cores. As such, each computing node may be a computing system configured with an operating system and the distributed application. The distributed application provides work for each computing node and is operable to control the workload of the parallel processing computing system. Generally speaking, the distributed application provides the parallel processing computing system with a workload that can be divided into a plurality of jobs. Typically, each computing node, or each computing core, is configured to process one job and therefore to perform a specific function. Thus, the parallel processing architecture enables the parallel processing computing system to receive a workload, then configure the computing nodes to cooperatively perform one or more jobs such that the workload supplied by the distributed application is processed substantially in parallel.
Parallel processing computing systems have found application in numerous different computing scenarios, particularly those requiring high performance and fault tolerance. For instance, airlines rely on parallel processing to process customer information, forecast demand, and decide what fares to charge. The medical community uses parallel processing computing systems to analyze magnetic resonance images and to study models of bone implant systems. Parallel processing computing systems typically perform most efficiently on work that contains several computations that can be performed at once, as opposed to work that must be performed serially. The overall performance of the parallel processing computing system is increased because multiple computing cores can handle a larger number of tasks in parallel than could a single computing system. Other advantages of some parallel processing systems include their scalable nature, their modular nature, and their improved level of redundancy.
When processing a workload, the computing nodes of a parallel processing computing system typically operate to process each job of the workload as fast as possible while keeping as few computing nodes active as possible. During this processing, these computing nodes typically consume a large amount of power and generate a large amount of heat. As such, large and complex air handling systems must be designed and installed to keep the room, or rooms, where a parallel processing computing system is installed at an acceptable temperature. Similarly, large and complex power circuits must be designed and installed to keep the computing nodes supplied with sufficient power to process jobs. However, conventional work scheduling algorithms generally fail to take into account either the heat generated by these computing nodes or the issues associated with the power circuits. Conventional work scheduling algorithms similarly fail to take energy consumption into account. Instead, they generally attempt to keep as many nodes idle for as long as possible, forcing jobs onto as few nodes as possible. As a result, though the average temperature of a data center and/or system may be within an acceptable range, localized areas of heat generation and energy usage typically arise. These heat and energy “islands” often increase the wear on components, and generally result in increased maintenance, component replacement, and cost to use parallel processing computing systems.
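By way of illustration only, the packing behavior attributed above to conventional work scheduling algorithms may be sketched as follows. The function name pack_jobs, its parameters, and the node and core counts are hypothetical and are not taken from any particular scheduler; the sketch merely assumes that jobs are placed on the lowest-numbered computing node that still has a free computing core.

```python
from collections import Counter


def pack_jobs(num_jobs, num_nodes, cores_per_node):
    """Return the number of jobs placed on each node under a packing policy.

    Hypothetical policy: fill every core of node 0 before using node 1,
    and so on, so that as many nodes as possible remain idle.
    """
    if num_jobs > num_nodes * cores_per_node:
        raise ValueError("not enough computing cores for the workload")

    load = Counter()
    for job in range(num_jobs):
        # Integer division maps consecutive jobs onto the same node
        # until that node's cores are exhausted.
        node = job // cores_per_node
        load[node] += 1

    # Nodes that received no job report a load of zero.
    return [load[node] for node in range(num_nodes)]


if __name__ == "__main__":
    # Eight jobs on eight four-core nodes: only the first two nodes are busy.
    print(pack_jobs(num_jobs=8, num_nodes=8, cores_per_node=4))
```

In this assumed configuration all eight jobs land on the first two computing nodes while the remaining six stay idle, which is the kind of concentration that may give rise to the localized heat and energy islands described above.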
Consequently, there is a need to schedule the workload of a parallel processing computing system in a manner that reduces the heat and energy islands that may otherwise arise.