Most general purpose computer systems are built around a general-purpose processor, which is typically an integrated circuit operable to perform a wide variety of operations useful for executing a wide variety of software. The processor is able to perform a fixed set of instructions, which collectively are known as the instruction set for the processor. A typical instruction set includes a variety of types of instructions, including arithmetic, logic, and data instructions.
In more sophisticated computer systems, multiple processors are used, and one or more processors run software that is operable to assign tasks to other processors or to split up a task so that it can be worked on by multiple processors at the same time. In such systems, the data being worked on is typically stored in memory that is either centralized or is split up among the different processors working on a task.
Instructions from the instruction set of the computer's processor or processors that are chosen to perform a certain task form a software program that can be executed on the computer system. Typically, the software program is first written in a high-level language such as “C” that is easier for a programmer to understand than the processor's instruction set. A program called a compiler converts the high-level language program code to processor-specific instructions.
In multiprocessor systems, the programmer or the compiler will usually look for tasks that can be performed in parallel, such as calculations where the data used to perform a first calculation are not dependent on the results of certain other calculations such that the first calculation and other calculations can be performed at the same time. The calculations performed at the same time are said to be performed in parallel and can result in significantly faster execution of the program. Although some programs such as web browsers and word processors don't consume a high percentage of even a single processor's resources and don't have many operations that can be performed in parallel, other operations such as scientific simulation can often run hundreds or thousands of times faster in computers with thousands of parallel processing nodes available.
The program runs on multiple processors by passing messages between the processors, such as to share the results of calculations, to share data stored in memory, and to configure or report error conditions within the multiprocessor system. In more sophisticated multiprocessor systems, a large number of processors and other resources can be split up or divided to run different programs or even different operating systems, providing what are effectively several different computer systems made up from a single multiprocessor computer system.
Programmers and system operators often use system tools such as debuggers that enable tracking execution of a software program to confirm proper program execution. But, when system tools such as a debugger or checkpoint/restart tool are applied to a program distributed across a large number of nodes, placement and management of the system tools is a concern.