Computing system technology has advanced at a remarkable pace, with each generation of computing system increasing in performance, functionality, and storage capacity, often at a reduced cost. Despite these advances, many scientific and business applications still demand massive computing power, which typically can be met only by high performance computing systems. One type of computing system architecture capable of meeting this demand is a parallel processing computing system.
A conventional parallel processing computing system includes a plurality of computing nodes. Some parallel processing computing systems may have hundreds or thousands of individual computing nodes. Each computing node is generally of modest computing power and typically includes one or more processing units, or computing cores. Each computing node may thus be a computing system in its own right, configured with an operating system and at least a portion of a distributed application. The distributed application subdivides a workload into tasks and provides one or more of the tasks to each computing node. Thus, the parallel processing computing system completes a workload by configuring the computing nodes to cooperatively perform one or more tasks such that the workload is processed substantially in parallel.
Each computing node in a parallel computing system is generally configured with various hardware resources. To overcome hardware resource failure, computing nodes may include redundant hardware resources. Those of ordinary skill in the art will recognize that redundancy with respect to a hardware resource means that the computing node includes one or more hardware resources of that type beyond what is generally required for operation. Thus, if a hardware resource failure occurs in a computing node, the computing node is able to use a redundant hardware resource to continue to function. The redundant hardware resources present in a computing node thereby increase its resiliency. Often in parallel computing systems, it is desirable for the distributed application to assign tasks to computing nodes that are redundant with respect to one or more hardware resources.
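By way of illustration only, the notion of redundancy described above may be sketched as a comparison between the hardware resources installed in a computing node and the resources generally required for operation. The function name `redundant_resources`, the resource type names, and the required counts are hypothetical and are chosen solely for this example.

```python
from collections import Counter


def redundant_resources(installed, required):
    """Return the resource types for which a node has more units than required.

    installed: list of resource type names, one entry per installed unit
               (hypothetical representation of a node's hardware).
    required:  dict mapping resource type name to the count generally
               required for operation (hypothetical values).
    """
    have = Counter(installed)
    return {rtype for rtype, need in required.items() if have[rtype] > need}


# Hypothetical node with two power supplies and two network adapters.
node = ["power_supply", "power_supply", "nic", "nic", "cpu"]
required = {"power_supply": 1, "nic": 1, "cpu": 1}

print(sorted(redundant_resources(node, required)))  # prints ['nic', 'power_supply']
```

In this sketch the node is redundant with respect to its power supplies and network adapters, but not with respect to its processor, so a failure of one power supply would not prevent the node from continuing to function.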
In parallel computing systems, it is also desirable to group the computing nodes into virtual system pools. Generally, computing nodes with similar hardware resource configurations may be grouped into the same virtual system pool, such that the distributed application may distribute a task requiring a certain resource configuration to the computing nodes of a virtual system pool whose members have that configuration. Hence, grouping computing nodes into virtual system pools allows the distributed application to more efficiently assign tasks to computing nodes in the computing system. In conventional systems, the assignment of computing nodes to virtual system pools based on the hardware resources configured thereon is performed manually by a system administrator, who evaluates the hardware resource configurations of the computing nodes in the system and assigns each computing node to one or more virtual system pools.
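The grouping described above may be sketched, for purposes of illustration only, as follows. The sketch assumes that each computing node's hardware resource configuration can be represented as a mapping from resource name to quantity, and that nodes with identical configurations belong to the same virtual system pool. The function names `group_into_pools` and `nodes_for_task`, the node names, and the resource names are hypothetical.

```python
from collections import defaultdict


def group_into_pools(nodes):
    """Group nodes with identical hardware configurations into pools.

    nodes: dict mapping node name to a dict of resource name -> quantity.
    Returns a dict mapping a configuration key to the list of node names.
    """
    pools = defaultdict(list)
    for name, resources in nodes.items():
        # A sorted tuple of items serves as a hashable configuration key.
        key = tuple(sorted(resources.items()))
        pools[key].append(name)
    return dict(pools)


def nodes_for_task(pools, needed):
    """Return nodes from every pool whose configuration meets the task's needs."""
    matches = []
    for config, members in pools.items():
        have = dict(config)
        if all(have.get(r, 0) >= n for r, n in needed.items()):
            matches.extend(members)
    return sorted(matches)


# Hypothetical system of three computing nodes.
nodes = {
    "node0": {"cores": 4, "memory_gb": 16},
    "node1": {"cores": 4, "memory_gb": 16},
    "node2": {"cores": 8, "memory_gb": 64},
}
pools = group_into_pools(nodes)

print(len(pools))                            # prints 2
print(nodes_for_task(pools, {"cores": 8}))   # prints ['node2']
```

Because a task's requirements are checked once per pool rather than once per node, the distributed application in this sketch can select candidate nodes without inspecting each node individually.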
While grouping computing nodes of the computing system into virtual system pools may increase the efficiency of the system, manual analysis and assignment by a system administrator becomes very time consuming in large parallel computing systems. Moreover, manual analysis and assignment may often lead to erroneous assignment of computing nodes to a virtual system pool, which may decrease the efficiency of the system. In addition, the system administrator must also update and manage the virtual system pools in light of events that might change the configuration of hardware resources on a computing node (e.g., addition of new hardware resources in a computing node, failure of hardware resources in a computing node, replacement of hardware resources in a computing node, etc.).
As the distributed application assigns tasks to computing nodes in the computing system, the hardware resources of those computing nodes are utilized and the nodes become less available to perform additional tasks, while computing nodes that have not yet been assigned tasks remain comparatively more available. Hence, at any given time, some computing nodes are highly available to perform a task, while other computing nodes are less available.
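The change in availability described above may be illustrated with the following minimal sketch. The sketch assumes, solely for the purposes of the example, that availability can be approximated by a single numeric load value per computing node and that a new task is assigned to the node with the lowest load. The function name `assign_task`, the load values, and the task costs are hypothetical.

```python
def assign_task(load, task_cost):
    """Assign a task to the most available node and record the added load.

    load:      dict mapping node name to its current load (lower is more available).
    task_cost: load added to the chosen node by the new task.
    Returns the name of the chosen node. Ties are broken by node name.
    """
    node = min(load, key=lambda n: (load[n], n))
    load[node] += task_cost
    return node


# Hypothetical system in which no node has yet been assigned a task.
load = {"node0": 0, "node1": 0, "node2": 0}

# Each assignment makes the chosen node less available than its peers.
chosen = [assign_task(load, cost) for cost in (2, 1, 1, 1)]

print(chosen)  # prints ['node0', 'node1', 'node2', 'node1']
print(load)    # prints {'node0': 2, 'node1': 2, 'node2': 1}
```

After the first assignment, node0 carries load and is therefore passed over in favor of the nodes that have not yet been assigned tasks.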
Consequently, there is a continuing need in the art for a way to identify and efficiently group computing nodes.