Computing technology has advanced at a remarkable pace, with each subsequent generation of computing system increasing in performance, functionality, and storage capacity, often at reduced cost. However, despite these advances, many scientific and business applications still demand massive computing power, which can only be met by extremely high performance computing (HPC) systems. One particular type of computing system architecture that is often used in high performance applications is a parallel processing computing system.
Generally, a parallel processing computing system comprises a plurality of physical computing nodes and is configured with an HPC application environment, e.g., including a runtime environment that supports the execution of a parallel application across multiple physical computing nodes. Some parallel processing computing systems, which may also be referred to as massively parallel processing computing systems, may have hundreds or thousands of individual physical computing nodes, and provide supercomputer class performance. Each physical computing node is typically of relatively modest computing power and generally includes one or more processors and a set of dedicated memory devices, and is configured with an operating system instance (OSI), as well as components defining a software stack for the runtime environment. To execute a parallel application, a cluster is generally created consisting of physical computing nodes, and one or more parallel tasks are executed within an OSI in each physical computing node and using the runtime environment such that tasks may be executed in parallel across all physical computing nodes in the cluster.
Performance in parallel processing computing systems can be dependent upon the communication costs associated with communicating data between the components in such systems. Accessing a memory directly coupled to a processor in one physical computing node, for example, may be one or more orders of magnitude faster than accessing a memory on a different physical computing node. In addition, retaining the data within a processor and/or directly coupled memory when a processor switches between different tasks can avoid having to reload the data. Accordingly, organizing the tasks executed in a parallel processing computing system to localize operations and data and minimize the latency associated with communicating data between components can have an appreciable impact on performance. For example, tasks can be assigned or bound to particular processors or physical nodes using a concept commonly referred to as affinity, such that the tasks will be scheduled for execution, if at all possible, on the processors or physical nodes to which such tasks have an affinity.
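By way of illustration only, processor affinity of the type described above may be sketched as follows. The sketch assumes a Linux-style operating system on which Python exposes os.sched_getaffinity and os.sched_setaffinity; the helper name pin_to_first_cpu is hypothetical and is not drawn from any particular runtime environment.

```python
import os


def pin_to_first_cpu():
    """Bind the calling process to the lowest-numbered CPU it may run on.

    Returns the chosen CPU number, or None when the platform does not
    expose the scheduler affinity calls (for example, outside Linux).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)   # CPUs currently allowed for this process
    target = min(allowed)               # illustrative choice: lowest-numbered CPU
    os.sched_setaffinity(0, {target})   # restrict this process to that one CPU
    return target


pinned = pin_to_first_cpu()
```

After the call, the operating system scheduler will run the process only on the selected processor, so data cached by that processor is more likely to remain available when the task is next scheduled.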
Likewise, performance can be impacted by the relationship between tasks and other types of components in a parallel processing computing system. As one example, parallel processing computing systems may support multiple input/output (IO) adapters, e.g., network adapters for communication of data over a network. Furthermore, as with distributing processors and memories throughout the multiple physical computing nodes of a parallel processing computing system, distributing network adapters in this manner may result in variations in latency and bandwidth for tasks accessing such network adapters, based upon where the tasks are executed relative to where the network adapters are located. Accordingly, tasks may also be assigned or bound to particular network adapters in a system based upon adapter affinity.
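By way of illustration only, adapter affinity may be sketched as the selection, for each task, of the adapter that is least costly to reach from the physical computing node on which the task executes. The sketch assumes that the node hosting each adapter is known; the Adapter type, its node field, and the two-level cost model are hypothetical simplifications and do not represent any particular system.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Adapter:
    name: str
    node: int  # physical computing node hosting the adapter (hypothetical field)


def access_cost(task_node: int, adapter: Adapter) -> int:
    """Toy cost model: 0 for an adapter on the task's node, 1 otherwise."""
    return 0 if adapter.node == task_node else 1


def pick_adapter(task_node: int, adapters: Sequence[Adapter]) -> Adapter:
    """Return the adapter with the lowest access cost for the given node."""
    return min(adapters, key=lambda a: access_cost(task_node, a))


adapters = [Adapter("net0", node=0), Adapter("net1", node=1)]
task_nodes: Dict[str, int] = {"t0": 0, "t1": 1, "t2": 1}  # task -> node it runs on

# Bind each task to the adapter nearest to the node on which it executes.
assignment = {task: pick_adapter(node, adapters).name
              for task, node in task_nodes.items()}
```

In this sketch each task is bound to the adapter resident on its own node, which is possible only because the location of every adapter is supplied as an input.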
In some parallel processing computing systems, however, the physical locations of network and other IO adapters resident in such systems may not be available for task scheduling purposes. As a result, it may not be possible in such systems to schedule tasks in a manner that optimizes, or at least considers, adapter performance.