Computing system technology has advanced at a remarkable pace recently, with each subsequent generation of computing system increasing in performance, functionality, and storage capacity, often at reduced cost. However, individual computing systems are still generally expensive and incapable of providing the raw computing power that modern applications often require. One particular type of computing system architecture that generally fills this requirement is that of a parallel processing computing system.
Generally, a parallel processing computing system comprises a plurality of computing cores and is configured with one or more distributed applications. Some parallel processing computing systems, which may also be referred to as massively parallel processing computing systems, may have hundreds or thousands of individual computing cores, and provide supercomputer class performance. Each computing core is typically of modest computing power and generally includes one or more processing units. Each computing core may be incorporated into a dedicated processing node, or each computing core may be a computing system. A distributed application provides work for each computing core and is operable to control the workload of the parallel processing computing system. Generally speaking, a distributed application provides the parallel processing computing system with a workload that can be divided into a plurality of tasks. Each computing node is typically configured to process one or more tasks. However, each task is typically further divided into one or more execution contexts, where each computing core of each computing node is typically configured to process one execution context and therefore process, or perform, a specific function. Thus, the parallel processing architecture enables the parallel processing computing system to receive a workload, then configure the computing cores to cooperatively perform one or more tasks and/or configure computing cores to each process one execution context such that the workload supplied by the distributed application is processed.
Parallel processing computing systems have found application in numerous different computing scenarios, particularly those requiring high performance and fault tolerance. For instance, airlines rely on parallel processing to process customer information, forecast demand, and decide what fares to charge. The medical community uses parallel processing computing systems to analyze magnetic resonance images and to study models of bone implant systems. As such, parallel processing computing systems typically perform most efficiently on work that contains several computations that can be performed at once, as opposed to work that must be performed serially. The overall performance of the parallel processing computing system is increased because multiple computing cores can handle a larger number of tasks in parallel than could a single computing system. Other advantages of some parallel processing systems include their scalable nature, their modular nature, and their improved level of redundancy.
Resources in a parallel processing computing system are often configured in a hierarchical fashion, with several processing units disposed within a chassis, several chassis disposed within a super-node, and several super-nodes comprising the distributed computing system. In one embodiment of a blade computing system, each computer host, or blade, includes memory and a processor but does not include dedicated I/O modules. Instead, the entire set of hosts within the chassis shares a common set of I/O modules, providing a modular approach to computing by separating out the I/O modules from the CPU. Using this approach, racks of blades can potentially access multiple I/O modules if a physical connection can be made from the blade to the I/O device. This is possible because of new I/O virtualization technologies like Single Root IO Virtualization (SRIOV) and Multi Root IO Virtualization (MRIOV), which allow a single PCI Express (PCIe) adapter to be virtualized and shared by a single host (SRIOV) or multiple hosts (MRIOV) and the Virtual Machines (VMs) running on these hosts. In effect, this creates an environment where VMs (workloads) and their PCIe resources can be re-allocated to different PCIe adapters much more easily than in the past.
This type of system presents some unique challenges for determining which host a workload should run on to best take advantage of the available I/O resources. For example, consider a workload that requires one storage resource and one network adapter that may be executed in an exemplary system having 32 hosts, representing eight blades on each of four chassis, each host with enough capacity to handle the workload. Each of the four chassis in such a system may have, for example, three PCIe devices (twelve in total) also available to any of the hosts through a chassis interconnect fabric using MRIOV technology. Some of the twelve I/O modules may have connections to the I/O needed by the workload while others may not.
Conventionally, in order to properly place the workload, an administrator typically must manually determine which I/O device will perform the best for a given host, a task that requires specific expertise and understanding of the configuration of the underlying system on the part of the administrator. Furthermore, this determination is often made more difficult when the I/O device's physical location is abstracted away through virtualization, since the administrator may not be able to readily ascertain where in a particular system each host and I/O device is physically located.
Often, it is desirable for performance reasons that a host of a particular workload be as physically close as possible to the I/O devices the workload needs. An I/O device that is in the same chassis as a host will typically require fewer relays than an I/O device that is outside the chassis, and thus, co-locating a workload in a host that is in the same chassis as a needed I/O device will generally lead to better performance. Often, the administrator in such a situation would have to figure out how the virtualized I/O devices map to the underlying physical I/O device, and then understand the end-to-end physical connections from the host to the I/O device to understand the affinity to the host in question.
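To make the affinity consideration concrete, the following is a minimal Python sketch of the manual reasoning described above, applied to the exemplary 32-host system. It is illustrative only and not part of any described method: the class names, the helper functions, the sample device list, and in particular the relay counts (one relay within a chassis, three across chassis) are assumptions chosen for the example.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Host:
    chassis: int
    blade: int

@dataclass(frozen=True)
class IODevice:
    chassis: int
    slot: int
    kind: str  # "storage" or "network"

# Hypothetical relay counts; real values depend on the interconnect fabric.
SAME_CHASSIS_RELAYS = 1
CROSS_CHASSIS_RELAYS = 3

def relays(host, device):
    """Estimated relays between a host and an I/O device."""
    if host.chassis == device.chassis:
        return SAME_CHASSIS_RELAYS
    return CROSS_CHASSIS_RELAYS

def placement_cost(host, devices, needed_kinds):
    """Sum of relays to the nearest device of each needed kind, or None."""
    total = 0
    for kind in needed_kinds:
        candidates = [relays(host, d) for d in devices if d.kind == kind]
        if not candidates:
            return None  # no device of this kind is reachable
        total += min(candidates)
    return total

def best_host(hosts, devices, needed_kinds):
    """Host with the lowest cost; ties broken by chassis, then blade."""
    scored = [(placement_cost(h, devices, needed_kinds), h) for h in hosts]
    scored = [(cost, h) for cost, h in scored if cost is not None]
    if not scored:
        return None
    return min(scored, key=lambda p: (p[0], p[1].chassis, p[1].blade))[1]

# Eight blades on each of four chassis gives 32 hosts.
hosts = [Host(c, b) for c, b in product(range(4), range(8))]

# Only the devices assumed to connect to the I/O the workload needs.
devices = [
    IODevice(0, 0, "network"),
    IODevice(1, 0, "network"),
    IODevice(2, 0, "storage"),
    IODevice(2, 1, "network"),
]

choice = best_host(hosts, devices, ["storage", "network"])
```

Under these assumptions, only chassis 2 holds both a storage device and a network device, so a host on that chassis is selected. The sketch presumes the physical chassis of every host and device is already known, which is precisely the information that virtualization tends to hide from the administrator.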
Furthermore, from a high availability (HA) perspective, understanding the affinity of a given I/O device to a host is also important. For example, if one chassis contains two available I/O devices, while another contains one available I/O device, and it is desirable to ensure I/O high availability and redundancy in the case of an I/O device faulting out, the HA workload would be better suited for placement on the former chassis than on the latter. Once again, a manual approach would require the administrator to understand the underlying physical I/O device and its connection topology in relation to each of the hosts. This manual approach is therefore time-consuming, laborious, and prone to errors.
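The high availability comparison above can be sketched in a few lines of Python. This is a hedged illustration rather than a described implementation: the function name, the chassis labels "A" and "B", and the `min_redundancy` parameter are hypothetical.

```python
from collections import Counter

def rank_chassis_for_ha(device_chassis, min_redundancy=2):
    """Rank chassis that hold enough available I/O devices for redundancy.

    device_chassis lists one chassis label per available I/O device.
    Chassis with fewer than min_redundancy devices are excluded; the
    rest are ordered by device count (descending), then by label.
    """
    counts = Counter(device_chassis)
    eligible = [c for c, n in counts.items() if n >= min_redundancy]
    return sorted(eligible, key=lambda c: (-counts[c], c))

# Chassis "A" has two available I/O devices; chassis "B" has one.
ranking = rank_chassis_for_ha(["A", "A", "B"])
```

Here only chassis "A" qualifies, because a single device fault on chassis "B" would leave the workload without local I/O.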
Consequently, there is a continuing need to automate workload allocation tasks in such a manner that each task is hosted as close to its required resources as possible, thereby minimizing latency and maximizing performance.