It is well-accepted in the computer arts that greater computer system performance can be achieved by harnessing the processing power of multiple individual processing units. Multi-processor (MP) computer systems may implement a number of different topologies, of which various ones may be better suited for particular applications depending upon the performance requirements and software environment of each application. One common MP computer architecture is a symmetric multi-processor (SMP) architecture in which multiple processing units, each supported by a multi-level cache hierarchy, share a common pool of resources, such as a system memory and input/output (I/O) subsystem, which are often coupled to a shared system interconnect.
Although SMP computer systems permit the use of relatively simple inter-processor communication and data sharing methodologies, SMP computer systems have limited scalability. For example, many SMP architectures suffer to a certain extent from bandwidth limitations, especially at the system memory, as the system scale increases.
An alternative MP computer system topology known as non-uniform memory access (NUMA) has also been employed to address limitations to the scalability and expandability of SMP computer systems. A conventional NUMA computer system includes a switch or other global interconnect to which multiple nodes, which can each be implemented as a small-scale SMP system, are connected. Processing units in each node enjoy relatively low access latencies for data contained in the node's local system memory, but suffer significantly higher access latencies for data contained in the system memories of remote nodes. Thus, access latencies to system memory are non-uniform. Because each node has its own resources, NUMA systems have potentially higher scalability than SMP systems.
Regardless of whether an SMP, NUMA or other MP data processing system architecture is employed, it is typical that each processing unit accesses data residing in memory-mapped storage locations (whether in physical system memory, cache memory or another system resource) by utilizing real addresses to identify the storage locations of interest. An important characteristic of real addresses is that there is a unique real address for each memory-mapped physical storage location.
Because the one-to-one correspondence between memory-mapped physical storage locations and real addresses necessarily limits the number of storage locations that can be referenced by software, the processing units of most commercial MP data processing systems employ memory virtualization to enlarge the number of addressable locations. In fact, the size of the virtual memory address space can be orders of magnitude greater than the size of the real address space. Thus, in conventional systems, processing units internally reference memory locations by virtual (or effective) addresses and then perform virtual-to-real address translations (often via one or more intermediate logical address spaces) to access the physical memory locations identified by the real addresses.
Given the availability of the above MP systems, one further development in data processing technology has been the introduction of parallel computing. With parallel computing, multiple processor nodes are interconnected to each other via a system interconnect or fabric. These multiple processor nodes are then utilized to execute specific tasks, which may be individual/independent tasks or parts of a large job that is made up of multiple tasks.
In such systems, coordinating communications between the multiple processor nodes is of paramount importance for ensuring fast and efficient handling of workloads. Communication loss between coordinating processes on different computation nodes (e.g., user jobs or OS instances) has been found to lead to delay/loss of job progress, lengthy recovery, and/or jitter in the system, effectively wasting computing resources, power and delaying the eventual result.
Various MP system technologies utilize different types of communication channels to support communication between coordinating processes. For example, in MP systems implemented as high performance computing (HPC) clusters, communication channels may be implemented as “windows” that are available on one or more Host Fabric Interface (HFI) adapters. In other types of HPC clusters, the communication channels may be implemented as Queue Pairs on a Host Channel Adapter (HCA).
To address potential communication losses, some MP systems dispatch multiple identical copies of compute jobs across different computation nodes. However, doing so at least doubles CPU/memory resource and bandwidth usage, and requires merging or discarding the results returned from multiple sources.
Other MP systems utilize multiple active communication channels in an active/active round robin configuration. Doing so, however, requires additional channel resources to be assigned per end-client (compute job), additional resources to manage multiple channels, and additional overhead in user jobs or OS libraries to manage merging communications streams. Moreover, any operations queued to failed hardware will often be lost, as failure of one channel often may only be detected by a long-interval software timer.
In still other MP systems, multiple communication channels may be utilized in an active/passive configuration. However, such solutions require additional channel resources to be assigned per end-client (compute job), most of which are never used. Additional resources are also typically required to manage multiple channels, and any operations queued to the failed hardware will typically be lost. In addition, failure of one channel typically may only be detected with a long-interval software timer.
Therefore, a substantial need exists in the art for an improved manner of handling communication channel failures in an HPC cluster or other MP system, particularly for a manner of handling communication channel failures that reduces the time to fail over, reduces the number of dropped packets, reduces the need for additional dedicated resources, and/or allows for more configuration flexibility than conventional approaches.