The invention relates to storing information in response to a fault occurring in a parallel processing system.
Software in a computer system may be made up of many layers. The highest layer is usually referred to as the application layer, followed by lower layers that include the operating system, device drivers (which usually are part of the operating system), and other layers. In a system that is coupled to a network, various transport and network layers may also be present.
During execution of various software routines or modules in the several layers of a system, errors or faults may occur. Such faults may include addressing exceptions, arithmetic faults, and other system errors. A fault handling mechanism is needed to handle such faults so that a software routine or module or even the system can shut down gracefully. For example, clean-up operations may be performed by the fault handling mechanism, and may include the deletion of temporary files and freeing up of system resources. In many operating systems, exception handlers are provided to handle various types of faults (or exceptions). For example, exception handlers are provided in WINDOWS® operating systems and in UNIX operating systems.
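By way of illustration only, the clean-up behavior described above may be sketched as follows. The function name run_with_cleanup and the use of a Python exception to stand in for an arithmetic fault are assumptions made for this sketch; they are not drawn from any particular operating system's exception handling interface.

```python
import os
import tempfile


def run_with_cleanup(work):
    """Run `work` with a temporary file; clean up even if a fault occurs.

    `work` is a hypothetical callable that receives the temporary file path.
    """
    fd, path = tempfile.mkstemp()  # resource acquired before the work starts
    try:
        work(path)
    except ArithmeticError as exc:
        # Stand-in for an arithmetic fault; the finally clause below still
        # runs before this value is returned.
        return "fault handled: " + type(exc).__name__, path
    finally:
        # Clean-up operations: free the descriptor and delete the temp file.
        os.close(fd)
        os.remove(path)
    return "ok", path


status, tmp_path = run_with_cleanup(lambda p: 1 / 0)
```

In this sketch the temporary file is removed whether or not the routine faults, which corresponds to the graceful shutdown described above.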
Software may be run on single processor systems, multiprocessor systems, or multi-node parallel processing systems. Examples of single processor systems include standard desktop or portable systems. A multiprocessor system may include a single node that includes multiple processors running in the node. Such systems may include symmetric multiprocessor (SMP) systems. A multi-node parallel processing system may include multiple nodes that may be connected by an interconnect network.
Faults may occur during execution of software routines or modules in each node of a multi-node parallel processing system. When a fault occurs in a multi-node parallel processing system, it may be desirable to capture the state of each node in the system. A need thus exists for a method and apparatus for coordinating the handling of faults occurring in a system having multiple nodes.
In general, according to one embodiment, a method of handling faults in a system having plural nodes includes detecting a fault condition in the system and starting a fault handling routine in each of the nodes. Selected information collected by each of the fault handling routines is communicated to a predetermined one of the plural nodes.
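By way of illustration only, the method summarized above may be sketched as follows, with threads standing in for nodes and an in-memory queue standing in for the interconnect. The names MASTER_NODE_ID, fault_handler, and handle_fault, and the contents of the collected information, are assumptions made for this sketch and do not describe any particular embodiment.

```python
import queue
import threading

MASTER_NODE_ID = 0  # hypothetical choice of the predetermined node


def fault_handler(node_id, fault, master_inbox):
    """Fault handling routine for one node: collect and send selected info."""
    selected_info = {
        "node": node_id,
        "fault": fault,
        "state": "state-of-node-%d" % node_id,  # placeholder for captured state
    }
    # Communicate the selected information to the predetermined node.
    master_inbox.put(selected_info)


def handle_fault(num_nodes, fault):
    """Called once a fault condition is detected anywhere in the system."""
    master_inbox = queue.Queue()  # inbox owned by node MASTER_NODE_ID
    handlers = [
        threading.Thread(target=fault_handler, args=(n, fault, master_inbox))
        for n in range(num_nodes)
    ]
    for t in handlers:  # start a fault handling routine in each node
        t.start()
    for t in handlers:
        t.join()
    reports = []
    while not master_inbox.empty():  # all handlers have finished by now
        reports.append(master_inbox.get())
    return sorted(reports, key=lambda r: r["node"])


reports = handle_fault(4, "addressing exception")
```

Here the predetermined node ends up holding one report per node, so the state of the whole system at the time of the fault can be examined in one place.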
Other features and embodiments will become apparent from the following description, from the drawings, and from the claims.