The present invention relates to a way of emulating faults that are expected in a computer system, such as a high-performance computer system (HPC system), and which occur while an application is executing. This kind of method is referred to herein as fault injection, which essentially equates to insertion of “artificial” errors in the execution of the application.
The present invention finds application particularly in the field of fault-resilient distributed computing, with emphasis on the testing of new algorithms for use on exascale computers.
Fault-resilient computer programs are required in a wide range of application areas, for instance from simple computations to image rendering and large-scale, complex simulations, including on-the-fly and offline processing. As one important example, mission-critical jobs (e.g. operational weather forecasting) or systems (e.g. the internet) must be resilient to failure. This invention addresses the whole gamut of these application areas.
Computationally intense applications are usually carried out on HPC systems, which often provide distributed environments in which there is a plurality of processing units or cores on which processing threads of an executable can run autonomously in parallel.
Many different hardware configurations and programming models are applicable to high-performance computing. A popular approach to high-performance computing currently is the cluster system, in which a plurality of nodes each having one or more multicore processors (or “chips”) are interconnected by a high-speed network. Each node is assumed to have its own area of memory, which is accessible to all cores within that node. The cluster system can be programmed by a human programmer who writes source code, making use of existing code libraries to carry out generic functions, such as hardware control. The source code is then compiled to lower-level executable code, for example code at the ISA (Instruction Set Architecture) level capable of being executed by processor types having a specific instruction set, or to assembly language dedicated to a specific processor. There is often a final stage of assembling (or, in the case of a virtual machine, interpreting) the assembly code into executable machine code. The executable form of an application (sometimes simply referred to as an “executable”) is run under supervision of an operating system (O/S) and uses the O/S and libraries to control hardware. The different layers of software used may be referred to together as a software stack.
The term “software stack” as used herein includes all the software required to run an application: the base level software (the operating system or O/S); libraries interfacing, for example, with hardware components such as an interconnect between nodes, a disc or other memory (these libraries also being a type of system software); and the application itself. The application currently executing may be seen as the top layer of the software stack, above the system software.
Applications for computer systems having multiple cores may be written in a conventional computer language (such as C/C++ or Fortran), augmented by libraries for allowing the programmer to take advantage of the parallel processing abilities of the multiple cores. In this regard, it is usual to refer to “processes” being run on the cores. A (multi-threaded) process may run across several cores within a multi-core CPU. One such library is the Message Passing Interface, MPI, which uses a distributed-memory model (each process being assumed to have its own area of memory), and facilitates communication among the processes. MPI allows groups of processes to be defined and distinguished, and includes routines for so-called “barrier synchronization”, which is an important feature for allowing multiple processes or processing elements to work together.
Alternatively, in shared-memory parallel programming, all processes or cores can access the same memory or area of memory. In a shared-memory model there is no need to explicitly specify the communication of data between processes (as any changes made by one process are visible to all others). However, it may be necessary to use a library to control access to the shared memory to ensure that only one process at a time modifies the data.
Exascale computers (i.e. HPC systems capable of 1 exaflop (10^18 floating point operations per second) of sustained performance) are expected to be deployed by 2020. Several national projects to develop exascale systems in this timeframe have been announced. The transition from petascale (current state-of-the-art, approximately 10^15 flops) to exascale is expected to require disruptive changes in hardware technology. There will be no further increase in processor clock frequency, so the improved performance will result from an increase in parallelism or concurrency (possibly up to approximately 1 billion cores). The requirement to keep the power usage of an exascale system within an acceptable window means that low-power (and low-cost) components are likely to be used, resulting in a reduced mean-time-to-failure for each component. Thus, an exascale system will contain many more components than today's state-of-the-art systems, and each component is likely to fail more frequently than its equivalent today. It is likely that the mean-time-to-component-failure for an exascale system will be measured in minutes (as opposed to days for current systems).
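The scaling argument in the preceding paragraph can be illustrated with a short calculation. This is a minimal sketch, assuming independent components with exponentially distributed lifetimes, so that failure rates add and a system of N components sees a failure every MTTF/N on average. The function name and all numerical figures are hypothetical placeholders chosen to show the trend, not measurements of any real system.

```python
def system_mean_time_between_failures(component_mttf_hours: float,
                                      n_components: int) -> float:
    """Mean interval (hours) between component failures anywhere in the system.

    Assumption: components fail independently with exponentially distributed
    lifetimes, so the system failure rate is n_components / component_mttf.
    """
    if component_mttf_hours <= 0 or n_components <= 0:
        raise ValueError("both arguments must be positive")
    return component_mttf_hours / n_components


# Hypothetical figures: fewer, more reliable components today versus
# many more, less reliable components in an exascale machine.
today_hours = system_mean_time_between_failures(1_000_000, 20_000)
exascale_hours = system_mean_time_between_failures(250_000, 1_000_000)

print(f"today:    {today_hours / 24:.1f} days between failures")
print(f"exascale: {exascale_hours * 60:.0f} minutes between failures")
```

Under these assumed figures the interval between failures drops from days to minutes, which is the regime in which artificially injected faults become a practical way of testing on present-day hardware.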
Therefore, exascale software in particular requires the ability to continue to run through component failure, although this is also a requirement for all other systems, especially HPC systems, whether using shared or distributed memory. The development of new algorithms that are capable of doing this is a topic of ongoing research. In order to test these new algorithms robustly it is useful to run them on a present-day distributed system in the presence of faults. As even the largest current systems typically see intervals of days between component failures, it can be appropriate to artificially inject faults in order to carry out this testing.
The need to artificially inject faults is not new and several classes of fault injection techniques exist. These include:
- Hardware-based fault injection: achieved at the physical level by altering the environment of the system to make faults more likely, e.g., power supply disturbances, exposure to heavy ion radiation or electromagnetic interference or laser fault injection.
- Software-based fault injection: achieved by reproducing the effects of potential hardware failures in software.
- Simulation-based fault injection: achieved by creating a model of a potentially faulty system, including a statistical model of the failures that are expected to occur.
- Emulation-based fault injection: enhancement to simulation-based fault injection that emulates a faulty system at the circuit level on an FPGA (field-programmable gate array) and then injects these into a host system.
However, each of these prior art techniques has deficiencies and thus it is desirable to provide an alternative way of achieving fault injection.
According to embodiments of one aspect of the invention there is provided a method of injecting hardware faults into execution of an application in a distributed computing system comprising hardware components including linked nodes, the method comprising loading an enhanced software stack allowing faults to be injected by deactivating or degrading hardware components as a result of fault triggers; running a fault-trigger daemon on each of the nodes; providing the fault trigger for a degradation or deactivation by using one of the daemons to trigger a part of the software stack directly controlling a hardware component to inject a fault in that hardware component; and continuing execution of the application with the injected fault.
Thus invention embodiments use a daemon (a background level program not under user control) for fault injection. This daemon-based methodology can be seen as an intermediate position between the prior art software-based fault injection and the prior art hardware-based fault injection.
In invention embodiments, the software stack loaded is an enhanced software stack which is modified with respect to the standard parts (or libraries) of the software stack to allow faults to be injected. The modification is only to the stack below the application layer.
The faults can comprise deactivation (or switching off) of one or more hardware components or degrading of one or more hardware components, for example by reducing the speed or other performance attribute of the component(s) in question. A single daemon operable to trigger faults runs on each of the nodes and the application continues execution despite the injected fault.
Invention embodiments are implemented in software, but intended to act independently of the application code under evaluation and to have the ability to effectively inject faults directly into hardware to deactivate or otherwise degrade a hardware component. The embodiments can use the parts of the software stack that directly control the hardware to change the way the hardware will respond when the application is executed. Thus, the embodiments are not closely related to simulation- or emulation-based techniques and differ from hardware-based methods in the method of injecting faults and from software-based methods primarily in the location of the faults injected. The main limitations of existing software-based methods are generally a requirement for close integration with the application (either source code modification or changes to the way in which the application is run, e.g. the use of a debugger), an inability to inject the complete range of faults that may be experienced in an exascale system (generally messages may be intercepted and modified, but there is no facility to test complete failure—and potential recovery—of a node) and/or the need to run heavyweight tools across the whole system in a way that will not scale.
Advantageously in some invention embodiments, each daemon runs as a background process on its node within the operating system. Thus the solution is scalable and no source-code modifications are required in the application to be tested.
Thus preferably, the fault is injected completely independently of the application execution, since it is provided separately from the execution, and at a different level in the software stack.
The daemon may operate in any suitable way to trigger the fault. In some embodiments, the enhanced software stack includes an enhanced version of a library for the application and the daemon triggers the enhanced library controlling the hardware to inject the fault. For example, the library may contain standard instructions and one or more alternative instructions for the hardware, which introduce a fault and the alternative instructions may effectively be selected by the daemon to trigger the fault.
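The idea of a library holding both standard and alternative instructions, with the daemon selecting between them, might be sketched as follows. This is an illustrative model only: the class `EnhancedInterconnectLibrary`, its method names and the three modes are hypothetical, and a real enhanced library would act on actual hardware rather than simulate a delay or raise an error.

```python
import threading
import time


class EnhancedInterconnectLibrary:
    """Hypothetical stand-in for an enhanced library controlling a component."""

    MODES = ("normal", "degraded", "deactivated")

    def __init__(self, degraded_delay_s: float = 0.05):
        self._lock = threading.Lock()
        self._mode = "normal"
        self._degraded_delay_s = degraded_delay_s

    def set_fault(self, mode: str) -> None:
        """Entry point used by the daemon; never called by the application."""
        if mode not in self.MODES:
            raise ValueError(f"unknown mode: {mode}")
        with self._lock:
            self._mode = mode

    def send(self, payload: bytes) -> int:
        """Call made by the unmodified application; same signature in all modes."""
        with self._lock:
            mode = self._mode
        if mode == "deactivated":
            # Alternative instructions: the component no longer responds.
            raise ConnectionError("interconnect unavailable")
        if mode == "degraded":
            # Alternative instructions: the component responds more slowly.
            time.sleep(self._degraded_delay_s)
        # Standard instructions: report the number of bytes sent.
        return len(payload)
```

The application keeps calling `send` exactly as before; only the daemon calls `set_fault`, so the fault is selected outside the application's own code path.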
As one example, an MPI or interconnect library in the enhanced software stack may include functionality to allow injection of faults via the daemons.
In other embodiments, which may be freely combined with the previous embodiments (so that different faults may be injected in parallel and sequentially), the enhanced software stack includes an enhanced version of the operating system for the application and the daemon triggers the operating system controlling the hardware to inject the fault (potentially a different fault from a fault injected by another trigger).
In such examples, the operating system itself is triggered. Wherever the fault occurs, it is triggered by the daemon, which then interacts with the part of the software stack (for example the modified library or operating system) that controls that hardware in order to alter the behaviour of the hardware.
For maximum flexibility and to cover all classes of faults, both the operating system and any applicable libraries can be enhanced. However, in some cases a daemon can simply trigger a standard (non-enhanced) operating system to take action (or omit action), which requires no modification to the operating system. Thus the part of the software stack in which a particular fault is injected need not be enhanced for some types of fault. Other faults, however, do require enhancement of the software stack for injection. The software stack as a whole is therefore always enhanced, even though not every fault relies on the enhancement, so as to provide a wide range of faults and give a realistic testbed.
Whether the operating system or a library is used in the fault trigger may depend on whether the hardware is primarily controlled by the operating system (which is more likely for parts physically located on the nodes, such as the memory or the CPU) or by a library (which is likely to be the case for parts not physically located on the node, such as the interconnect and possibly network memory).
There may be some link between the daemons and there may also be some element of central control, but preferably the daemons run independently of each other and independently of any central control. This provides an accurate model of hardware faults. It is also minimally invasive and scalable. Thus each node may only become aware of a fault injected elsewhere by failed communication.
Also, the daemons may be controlled by any suitable method. For example, they may be controlled by one or more files indicating what faults should be injected on the nodes on which they are running. The timing of faults may also be indicated so that the file(s) show what faults are injected when. Each daemon could be sent a different input file or they could all parse a single file for relevant information and thus the need for a central controller is obviated.
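The single-file arrangement, in which every daemon parses the same file and keeps only the entries for its own node, might look like the sketch below. The file format (whitespace-separated node name, time in seconds, component and action, with `#` comments) and all names in the code are assumptions made for illustration; the text does not prescribe a format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FaultEvent:
    time_s: float    # seconds after the daemon starts
    component: str   # e.g. "interconnect", "memory", "cpu"
    action: str      # e.g. "degrade" or "deactivate"


def faults_for_node(schedule_text: str, node_name: str) -> List[FaultEvent]:
    """Return the faults this node's daemon should inject, ordered by time."""
    events = []
    for raw_line in schedule_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        node, time_s, component, action = line.split()
        if node == node_name:
            events.append(FaultEvent(float(time_s), component, action))
    return sorted(events, key=lambda event: event.time_s)


# Hypothetical schedule shared by all daemons.
SCHEDULE = """\
# node   time_s  component     action
node0    120     interconnect  degrade
node1     30     memory        deactivate
node0     45     cpu           degrade
"""

print(faults_for_node(SCHEDULE, "node0"))
```

Because each daemon filters the file locally, no process needs to hand out work, which is how the shared file removes the need for a central controller.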
The daemon can keep a record of what fault it injects when, particularly if the controlling file or files do not give a timing indication of when faults are to be injected or when there is no controlling file arrangement.
Thus each daemon may determine when a fault occurs, preferably using a statistical model.
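One possible statistical model is sketched below. It assumes that faults on a node arrive as a Poisson process, i.e. with independent, exponentially distributed gaps between consecutive faults; the text does not specify which model is used, and the function name, parameters and seeding scheme are hypothetical.

```python
import random
from typing import List, Optional


def sample_fault_times(mean_interval_s: float, horizon_s: float,
                       seed: Optional[int] = None) -> List[float]:
    """Draw fault injection times (seconds) within [0, horizon_s).

    Assumption: gaps between faults are exponentially distributed with the
    given mean, which is one common choice for modelling random failures.
    """
    if mean_interval_s <= 0 or horizon_s < 0:
        raise ValueError("mean_interval_s must be > 0 and horizon_s >= 0")
    rng = random.Random(seed)  # private generator, independent per daemon
    rate = 1.0 / mean_interval_s
    times = []
    t = rng.expovariate(rate)
    while t < horizon_s:
        times.append(t)
        t += rng.expovariate(rate)
    return times


# Each daemon could seed its generator differently (here: a made-up node
# index) so that nodes draw independent fault times without coordinating.
node_index = 3
injection_times = sample_fault_times(mean_interval_s=600.0,
                                     horizon_s=3600.0, seed=node_index)
print(len(injection_times), "faults scheduled in the first hour")
```

The returned list can also serve as the daemon's record of which fault was injected when, since the times are generated locally rather than read from a controlling file.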
One important example of a part of the hardware which may be affected by faults is the interconnect. Thus in addition or alternatively to being able to control the operating system to inject the fault, advantageously each daemon can control an enhanced message interface, such as an enhanced MPI and/or other enhanced interconnect layers to inject the fault.
The daemon need not take any further action in respect of the particular fault after its injection. However in some embodiments, a daemon can provide a recovery trigger after the fault trigger to instruct a recovery of the degraded or de-activated hardware component. Advantageously, the recovery trigger is provided by the daemon after a time delay. For example, the time delay may reflect the delay after which a hardware fault might be automatically resolved, for instance by rebooting of a node, to recreate a fault from which a hardware component may recover.
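A fault trigger followed by a delayed recovery trigger might be sketched as follows. The `Component` class and the function `inject_with_recovery` are hypothetical, and the component state is only simulated here; in an embodiment the two triggers would act through the part of the software stack that controls the hardware.

```python
import threading


class Component:
    """Hypothetical model of a hardware component's state."""

    def __init__(self, name: str):
        self.name = name
        self.state = "ok"
        self.recovered = threading.Event()


def inject_with_recovery(component: Component, fault_state: str,
                         recovery_delay_s: float) -> threading.Timer:
    """Apply the fault now and schedule the recovery trigger for later."""
    component.recovered.clear()
    component.state = fault_state          # fault trigger

    def recover() -> None:
        component.state = "ok"             # recovery trigger
        component.recovered.set()

    timer = threading.Timer(recovery_delay_s, recover)
    timer.daemon = True                    # do not keep the process alive
    timer.start()
    return timer


node = Component("node3")
inject_with_recovery(node, "deactivated", recovery_delay_s=0.05)
print(node.name, "is", node.state)         # fault is in effect immediately
node.recovered.wait(timeout=2.0)
print(node.name, "is", node.state)
```

The delay would be chosen to mimic the time a real fault takes to clear, for example the duration of a node reboot.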
As a result of the use of the daemons and use of an enhanced software stack in which software below the application level only is modified, the fault injection may be carried out without modification to the source code of the application and without modification to any of the configuration, compilation and execution of the application.
The skilled person will appreciate that the operating system and libraries controlling interconnect layers as well as other parts forming the software stack may be distributed within the system or stored on a physical storage medium or downloaded. Equally, the daemon may be distributed in any way and preferably in the same way as the software stack.
As mentioned above, the enhanced software stack may be provided in any suitable way. In one embodiment, the enhanced software stack is loaded statically or dynamically, for example by a dynamic linker using a modified list of locations to search for libraries, specified, for example, by LD_PRELOAD. LD_PRELOAD is a list of locations that the dynamic linker consults when searching for libraries, and it is through LD_PRELOAD that the locations of the enhanced libraries are specified to the linker.
The purpose of LD_PRELOAD and equivalents is to ensure that the linker looks in these locations first (before the system libraries) allowing the invention to override any functions that are used to control the hardware, no matter when the libraries that contain them would otherwise have been loaded. Further, the libraries specified by LD_PRELOAD need only contain the modified functions for injecting faults. For all other functionalities the application can fall back on the “standard” library, with the location as specified by the standard list of locations LD_LIBRARY_PATH.
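Launching an unmodified application with the enhanced libraries placed ahead of the standard ones could be set up along the following lines. This is a sketch, assuming a Linux-style dynamic linker that reads the LD_PRELOAD environment variable; the helper name and the library and application paths are hypothetical.

```python
import os
from typing import Dict, Iterable, Optional


def environment_with_preload(enhanced_libraries: Iterable[str],
                             base_env: Optional[Dict[str, str]] = None
                             ) -> Dict[str, str]:
    """Return a copy of the environment with the enhanced libraries preloaded.

    The enhanced libraries are placed first so that their versions of the
    hardware-control functions take precedence; any existing LD_PRELOAD
    entries are kept after them.
    """
    env = dict(os.environ if base_env is None else base_env)
    entries = list(enhanced_libraries)
    existing = env.get("LD_PRELOAD", "")
    if existing:
        entries.append(existing)
    env["LD_PRELOAD"] = ":".join(entries)
    return env


# Hypothetical path to a library holding only the modified functions.
env = environment_with_preload(["/opt/fault_injection/libmpi_fi.so"])
print(env["LD_PRELOAD"])

# The application itself would then be started unchanged, for example:
#   subprocess.run(["./application"], env=env)   # hypothetical executable
```

Only the environment of the launched process changes, so the application's source, build and launch command stay as they were.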
According to an embodiment of a further aspect of the invention there is provided a distributed computing system comprising hardware components and a software stack allowing a method of injecting hardware faults into an executing application; the distributed computing system comprising: nodes linked to an interconnect; an enhanced version of a software stack for the application, which is operable to allow one or more hardware components to be deactivated or degraded following a fault trigger; and a daemon associated with each single node; each daemon being operable to provide a fault trigger for a degradation or deactivation by triggering a layer of the software stack controlling a hardware component to inject a fault into the hardware component.
In this context, the distributed computing system includes hardware, as well as a software stack currently provided for the hardware.
This aspect refers to a distributed computing system including nodes linked to an interconnect, however the skilled person would appreciate that the method of injecting hardware faults is applicable to any computing system including hardware such as linked nodes. The only requirement of the system is that it is able to act as a testbed for assessing fault resilience of the application.
According to an embodiment of a still further aspect of the invention there is provided a fault-trigger daemon operable on a single node of a distributed computing system comprising hardware components including linked nodes, the computing system being arranged to carry out a method of injecting hardware faults into execution of an application, wherein the daemon is operable to provide a fault trigger for a degradation or deactivation of a hardware component, by triggering a part of the software stack to deactivate or degrade a hardware component that it (the part of the software stack) is controlling.
According to an embodiment of a yet further aspect of the invention there is provided a software stack for use with an application and including an operating system layer and at least one library layer controlling hardware of a distributed computing system comprising hardware components including linked nodes, wherein the library layer and/or operating system are enhanced to allow injection of hardware faults into execution of the application using a fault-trigger daemon operable on a single node of the computing system which provides a fault trigger for a degradation or deactivation of a hardware component.
Individual features and sub-features of each of the aspects may be provided in the other aspects. Thus for example, the preferred method features set out hereinbefore may be applied to the distributed computing system and/or the fault-trigger daemon and/or the enhanced software stack as described above.
The method steps may be carried out in a different order from their definition in the claim and still achieve desirable results. For example, with a dynamic linker, some or all of the software stack can be loaded dynamically during execution of the application and thus timing of this step may be before the daemons are running, or while the daemons are running.