1. Field
The present invention relates to fault-tolerant monitoring of networked computing elements. As computing systems grow increasingly large and complex, there is an increased risk that monitoring of a system may be disrupted by faults in individual computing elements. Fault-tolerant monitoring can be useful in a wide range of application areas, ranging from simple computations to sensor networks, image rendering and large-scale, complex simulations, and including both on-the-fly and offline processing. As some important examples, mission-critical jobs (e.g. operational weather forecasting) or systems (e.g. the internet) with very many computing elements can benefit from fault-tolerant monitoring. This invention addresses the whole gamut of these application areas, and is focused particularly on distributed, parallel computer programs running on very large high-performance computing systems with data distributed over a number of CPUs.
2. Description of the Related Art
One example of such a distributed parallel application is simulation. In many simulations, an iterative computation or iterative sets of computations are carried out, each computation corresponding to a single element in the simulation. Simulation elements may be linked in that a computation for one element of the simulation may require values from other elements of the simulation, so that data transfer between processes carrying out the simulation is considerable. Monitoring of a system carrying out such a simulation or other computational application can allow identification not only of computing elements which are faulty but also of computing elements which are overloaded and/or consume excessive amounts of energy. However, once a computing element has failed it may be impossible to recover the data.
Computationally intense applications are usually carried out on high performance computer systems. Such high performance computer (HPC) systems often provide distributed environments in which there is a plurality of processing units or cores each with its own individual memory and on which processing threads of an executable can run autonomously in parallel.
Many different hardware configurations and programming models are applicable to high performance computing. A popular approach to high-performance computing currently is the cluster system, in which a plurality of nodes each having one or more multicore or single core processors (or “chips”) are interconnected by a high-speed network. Each node is assumed to have its own area of memory, which is accessible to all cores within that node. The cluster system can be programmed by a human programmer who writes source code, making use of existing code libraries to carry out generic functions. The source code is then compiled (or compiled and then assembled) to lower-level executable code. The executable form of an application (sometimes simply referred to as an “executable”) is run under supervision of an operating system (OS).
The latest generation of supercomputers contains hundreds of thousands or even millions of cores. The three systems on the November 2012 TOP500 list with sustained performance over 10 Pflop/s contain 560,640 (Titan), 1,572,864 (Sequoia) and 705,024 (K computer) cores. In moving from petascale to exascale, the major performance gains will result from an increase in the total number of cores in the system (flops per core is not expected to increase) to 100 million or more. As the number of nodes in the system increases (and especially if low-cost, low-energy nodes are used to maintain an acceptable power envelope) the mean-time-to-component-failure of the system will decrease, eventually to a time shorter than the average simulation run (or other application execution) on the system. Hence, it will be necessary for monitoring of exascale software to be resilient to component failure.
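The scaling argument above can be illustrated with a minimal sketch. It assumes, as a simplification not stated in this specification, that nodes fail independently and at a constant rate, so that the mean time to the first component failure is the per-node figure divided by the number of nodes. The function names and the five-year per-node figure are hypothetical and chosen only for illustration.

```python
def system_mttf_hours(node_mttf_hours: float, node_count: int) -> float:
    """Mean time to the first component failure in a system of identical nodes.

    Assumption (not from the specification): failures are independent and
    exponentially distributed, so the failure rates of the nodes simply add.
    """
    if node_mttf_hours <= 0:
        raise ValueError("node_mttf_hours must be positive")
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return node_mttf_hours / node_count


def failure_expected_during_run(node_mttf_hours: float,
                                node_count: int,
                                run_hours: float) -> bool:
    """True when the system-level figure is shorter than the application run."""
    return system_mttf_hours(node_mttf_hours, node_count) < run_hours


if __name__ == "__main__":
    # Hypothetical per-node figure: five years, expressed in hours.
    node_mttf = 5 * 365 * 24
    for nodes in (1_000, 100_000, 1_000_000):
        print(nodes, system_mttf_hours(node_mttf, nodes))
```

Under these assumptions the system-level figure falls in inverse proportion to the node count, which is why a run of fixed length becomes increasingly likely to see at least one component failure as the system grows.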
The general principle for fault-tolerant provision of data is redundant storage of data to ensure that in the event of a fault, the data is still available from elsewhere. This principle is used in RAID (Redundant Array of Independent Discs), and could be used in conjunction with iSER (iSCSI extensions for RDMA, Remote Direct Memory Access) for data retrieval.
RAID is an umbrella term for computer data storage schemes that can divide and replicate data among multiple physical drives, such as discs. The array of discs can be accessed by the operating system as one single disc. In practice, this technology primarily addresses large files, which benefit from “striping” across discs. This method of “striping” files across discs can be used to aid fault-tolerant data provision. iSER is a computer network protocol that extends the Internet Small Computer System Interface (iSCSI) protocol to use RDMA. It permits data to be transferred directly into and out of SCSI computer memory buffers without intermediate data copies.
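As an illustration of how striping can aid fault-tolerant data provision, the following minimal sketch stripes a byte string across several in-memory "discs" and adds one XOR parity disc, in the manner of parity-based RAID levels. The choice of XOR parity, the chunk size and all function names are assumptions made for this example; the description above does not prescribe a particular RAID level.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_with_parity(data: bytes, data_discs: int, chunk: int):
    """Distribute data round-robin in fixed-size chunks and append a parity disc."""
    stripe_width = data_discs * chunk
    # Pad with zero bytes so that every disc receives the same amount of data.
    padded_len = -(-len(data) // stripe_width) * stripe_width
    padded = data.ljust(padded_len, b"\0")
    discs = [bytearray() for _ in range(data_discs)]
    for offset in range(0, padded_len, chunk):
        discs[(offset // chunk) % data_discs] += padded[offset:offset + chunk]
    blocks = [bytes(d) for d in discs]
    return blocks + [xor_blocks(blocks)]


def rebuild(discs, lost: int) -> bytes:
    """Recompute the contents of one lost disc from all surviving discs."""
    return xor_blocks([d for i, d in enumerate(discs) if i != lost])


def read_back(discs, data_discs: int, chunk: int, length: int) -> bytes:
    """Reassemble the original byte string from the data discs."""
    out = bytearray()
    for pos in range(0, len(discs[0]), chunk):
        for d in range(data_discs):
            out += discs[d][pos:pos + chunk]
    return bytes(out[:length])


if __name__ == "__main__":
    payload = b"fault-tolerant monitoring"
    discs = stripe_with_parity(payload, data_discs=3, chunk=4)
    discs[1] = rebuild(discs, lost=1)  # simulate losing and rebuilding disc 1
    print(read_back(discs, 3, 4, len(payload)))
```

Because the parity disc is the XOR of the data discs, any single lost disc equals the XOR of the remaining ones, so the data remains available after one fault.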
Remote Direct Memory Access is a technology allowing a computing element to use its network interface controller (or other network access mechanism) to transmit information via the network to modify the storage at a second computing element. This technology is important in high performance computing, where the computing elements may be part of a supercomputer, as it reduces the work placed on the processor of the computing element. RDMA technology is also beneficial to a network-on-chip processor as a computing element in the network is able to modify storage local to a second computing element in a way that minimizes the work placed on the second computing element.
RDMA relies on single-sided communication, also referred to as “third-party I/O” or “zero copy networking”. In single-sided communication, to send data, a source processor or initiator (under control of a program or process being executed by that processor) simply puts that data in the memory of a destination processor or target, and likewise a processor can read data from another processor's memory without interrupting the remote processor. Thus, the operating system of the remote processor is normally not aware that its memory has been read or written to. The writing and reading are handled by the processors' network interface controllers (or equivalent, e.g. network adapter) without any copying of data to or from data buffers in the operating system (hence, “zero copy”). This reduces latency and increases the speed of data transfer, which is beneficial in high performance computing.
Consequently, references in this specification to data being transferred from one computing element or node to another should be understood to mean that the respective network interface controllers (or equivalent) transfer data, without necessarily involving the host processing units of the nodes themselves.
Conventional RDMA instructions include “rdma_put” and “rdma_get”. An “rdma_put” allows one node to write data directly to a memory at a remote node, which node must have granted suitable access rights to the first node in advance, and have a memory (or buffer) ready to receive the data. “rdma_get” allows one node to read data directly from the memory (or memory buffer) of a remote node, assuming again that the required privileges have already been granted.
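The semantics of these two instructions can be sketched with a small in-process model. This is not a real RDMA interface: the Node class, the grant method and the argument lists of rdma_put and rdma_get are hypothetical, and ordinary Python objects stand in for the network interface controllers. The sketch shows only that the target grants rights in advance and that its own code takes no part in the transfer.

```python
class Node:
    """Hypothetical model of a node exposing one registered memory buffer."""

    def __init__(self, name: str, buffer_size: int):
        self.name = name
        self.memory = bytearray(buffer_size)  # buffer ready to receive data
        self.readers = set()                  # peers allowed to rdma_get
        self.writers = set()                  # peers allowed to rdma_put

    def grant(self, peer: "Node", read: bool = False, write: bool = False):
        """Grant access rights to a peer in advance of any transfer."""
        if read:
            self.readers.add(peer.name)
        if write:
            self.writers.add(peer.name)


def _check_range(target: Node, offset: int, length: int):
    if offset < 0 or length < 0 or offset + length > len(target.memory):
        raise ValueError("access outside the registered buffer")


def rdma_put(initiator: Node, target: Node, offset: int, payload: bytes):
    """Write payload directly into the target's buffer (one-sided)."""
    if initiator.name not in target.writers:
        raise PermissionError(f"{initiator.name} may not write to {target.name}")
    _check_range(target, offset, len(payload))
    target.memory[offset:offset + len(payload)] = payload


def rdma_get(initiator: Node, target: Node, offset: int, length: int) -> bytes:
    """Read bytes directly from the target's buffer (one-sided)."""
    if initiator.name not in target.readers:
        raise PermissionError(f"{initiator.name} may not read from {target.name}")
    _check_range(target, offset, length)
    return bytes(target.memory[offset:offset + length])


if __name__ == "__main__":
    a, b = Node("a", 16), Node("b", 16)
    b.grant(a, read=True, write=True)   # rights granted before any transfer
    rdma_put(a, b, 4, b"load=0.7")
    print(rdma_get(a, b, 4, 8))
```

In this model no method of the target runs during a transfer, which mirrors the point that the remote processor is not interrupted; an attempt by a node without granted rights raises an error instead of touching the buffer.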
It is desirable to provide monitoring for networked computing elements which is fault-tolerant.