Today, computers, networks, and clusters of computers are used for all types of applications. For these computers to be utilized efficiently and to their maximum capacity, it is important not only that jobs be scheduled for execution efficiently, but also that the jobs be checkpointed judiciously, so that if they are interrupted by computer failures they need not be rerun from scratch. A checkpoint is a copy of the computer's memory, together with the current register settings, that is periodically saved on disk. In the event of a failure, the last checkpoint serves as a recovery point. In long-running scientific applications, with runtimes on the order of weeks to months, checkpointing schemes are crucial for providing reliable performance. The checkpointing interval is a feature of the applications, not of the system. Presently, applications request checkpoints in a quasi-periodic manner, independent of system health or availability. For multi-node systems, the checkpoint overhead increases linearly with the number of nodes.
Authors of long-running scientific applications typically use checkpointing to help recover from failures. However, it is often difficult to set the right checkpoint interval, because checkpointing depends on system parameters such as the mean time between failures. These failures may include hardware memory problems, such as cache parity errors, or network problems, such as failed communication between ports. Each such failure is also time-stamped when recorded. By considering the mean time between failures and other system parameters, checkpoints should be introduced in a way that is appropriate for a particular system. Ideally, checkpoints should be placed wherever they are cheapest and fastest to perform, as determined by the application designer, and this placement should be made without regard for the particulars of the system.
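To illustrate how the checkpoint interval depends on system parameters such as the mean time between failures, the following sketch applies Young's well-known first-order approximation, I ≈ sqrt(2 × C × MTBF). This formula and the parameter values are illustrative only and are not part of the disclosed system.

```python
import math

def optimal_checkpoint_interval(checkpoint_overhead_s, mtbf_s):
    """Young's first-order approximation: I = sqrt(2 * C * MTBF).

    checkpoint_overhead_s -- time C to write one checkpoint, in seconds
    mtbf_s                -- mean time between failures, in seconds
    """
    return math.sqrt(2.0 * checkpoint_overhead_s * mtbf_s)

# Hypothetical values: a 60-second checkpoint overhead and a 7-day MTBF
# yield an interval of roughly 2.4 hours between checkpoints.
interval_s = optimal_checkpoint_interval(60.0, 7 * 24 * 3600.0)
print(round(interval_s / 3600.0, 1))
```

A shorter MTBF or a larger checkpoint overhead shifts the optimal interval accordingly, which is why a fixed, application-chosen interval cannot suit every system.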
As mentioned above, current checkpointing procedures for any type of computer or computer cluster are typically initiated by the long-running applications. These checkpoints are requested by applications at times during their execution when the application state is minimal, often between iterations of loops. Although checkpointing methods are intended to minimize the loss of application running time due to system failures (which may be any type of hardware or software failure leading to the termination of the application), there is no link between the checkpointing interval, or when to checkpoint, and system health or availability.
With current procedures for checkpointing, there is no knowledge of the behavior of the nodes when an application runs. Further, there is uncertainty as to whether a node will fail, experience too many errors, or require a restart while an application is running. Thus, without knowledge of the behavior of the nodes, more frequent checkpoints must be provided to account for such failures or errors and avoid loss of application running time. For example, if a customer needs a specific application (such as protein folding) to be completed within a specified time, lack of knowledge of the behavior of the nodes forces the application developer to provide more frequent checkpoints to ensure that the application completes within a reasonable time, and the system has to accept the resulting excessive checkpoint overhead. Therefore, there is a need to determine or forecast the behavior of nodes so that the system can make an intelligent decision on when to skip a checkpoint requested by the application.
A currently pending patent application, Ser. No. 10/720,300, assigned to the same assignee as the instant application and incorporated herein by reference, discloses a failure prediction mechanism that determines the probability of failure of the nodes. This determination can be used to compare the probable node downtime with the checkpoint overhead to decide whether to take a requested checkpoint or to skip it.
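The comparison described above can be sketched as follows. The function name, the decision rule, and the numbers are hypothetical illustrations and are not taken from the referenced application.

```python
def should_skip_checkpoint(p_failure, expected_rework_s, checkpoint_overhead_s):
    """Skip a requested checkpoint when the expected loss from a failure
    (the predicted failure probability times the rework that would be lost)
    is smaller than the fixed cost of taking the checkpoint.

    p_failure             -- predicted probability of node failure before
                             the next checkpoint opportunity
    expected_rework_s     -- running time lost if a failure occurs, in seconds
    checkpoint_overhead_s -- overhead C of taking the checkpoint, in seconds
    """
    expected_loss_s = p_failure * expected_rework_s
    return expected_loss_s < checkpoint_overhead_s

# Hypothetical numbers: a 0.1% failure probability with 2 hours of rework
# at risk gives an expected loss of 7.2 s, below a 60 s overhead, so the
# requested checkpoint can safely be skipped.
print(should_skip_checkpoint(0.001, 2 * 3600.0, 60.0))
```

With a higher predicted failure probability (say, 5%), the expected loss exceeds the overhead and the checkpoint would be taken as requested.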
Referring to FIG. 1, there is shown a block diagram of a computer system 100 using a known checkpointing method. The computer system 100 comprises a plurality (in this case, six) of processing nodes (N1-N6) 105, wherein each node represents one or more processors. FIG. 1 also shows an operating environment 103 (i.e., the operating system, such as AIX or Windows), a health monitoring unit 104, partitions 110, and a disk or storage system 102, as primarily affected by checkpointing. The health monitoring unit 104 is commonly known as a problem log unit, which receives information, for example, on failures such as those described above. Each partition can comprise a varying number of nodes according to the application requirements. Each partition is also known as an application running environment. Applications running in an application environment 110 decide the checkpointing interval (I) and are aware of the checkpointing overhead (C). When the checkpointing time approaches, checkpointing is triggered by the application from an application environment 110, and the instruction goes to the operating environment 103, which in turn gives the instruction to start checkpointing, that is, writing to the disk or storage system 102.
Computer systems have one or more health monitoring units 104 for keeping a record of all the health-related information for all the nodes 105 as well as the operating environment 103. The health monitoring unit 104 also optionally includes a hardware diagnostics monitoring unit 106 that provides health-related information for all the hardware components. Known health monitoring units 104 do not have any direct interaction with the checkpointing or backup mechanism of the disk or storage systems 102.
FIG. 2A is a block diagram of the known independent system software and hardware units within a computer system 200. This figure shows the existing units involved in checkpoint or backup mechanisms for computer systems.
A control environment 108 is the central control authority which receives health and diagnostic information from the health monitoring unit 104 and hardware diagnostic unit 106, respectively. The control environment 108 can also receive user specified information from other units 111.
FIG. 2B shows a simple flow diagram for a known checkpointing flow mechanism working independently of the health monitoring unit 104. The application environment 110 instructs the operating environment 103 to start checkpointing by writing the data to the disk or storage system 102 at the specified interval I, incurring a checkpoint overhead of time C.
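The flow of FIG. 2B can be sketched as a simple loop in which the application environment 110 requests a checkpoint every I units of work, independent of system health. The class and method names below are illustrative stand-ins, not part of any actual implementation.

```python
class OperatingEnvironment:
    """Stands in for operating environment 103: writes state to storage 102."""
    def __init__(self, storage):
        self.storage = storage  # list standing in for disk/storage system 102

    def checkpoint(self, state):
        self.storage.append(dict(state))  # write a copy of application state

def run_application(op_env, interval_steps, total_steps):
    """Stands in for application environment 110: requests a checkpoint
    every `interval_steps` units of work (the interval I), regardless of
    node health -- the known behavior this disclosure aims to improve."""
    state = {"step": 0}
    for step in range(1, total_steps + 1):
        state["step"] = step          # one unit of application work
        if step % interval_steps == 0:
            op_env.checkpoint(state)  # each write incurs overhead C
    return state

storage = []
run_application(OperatingEnvironment(storage), interval_steps=10, total_steps=35)
print(len(storage))  # checkpoints were written at steps 10, 20, and 30
```

Note that the loop checkpoints unconditionally at every interval; nothing in this flow consults the health monitoring unit 104.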
FIG. 2C is a flow diagram for the conventional health monitoring unit 104 in a computer system. The health monitoring unit 104 is provided with hardware diagnostics information from the hardware diagnostics unit 106. The hardware diagnostics information may include the memory and communication problems described above. Similarly, the operating environment 103 provides all other software and environmental health monitoring parameters, such as high temperature of hardware units, fan failures, or other conditions that are known to lead to system failures. The diagnostics information and environmental health parameters are then communicated to the user through the control environment 108.
In conventional systems, there is no connection between the health monitoring units and the checkpointing units without human intervention, such as by a system administrator. Therefore, there is a need for a system and method for checkpointing applications that provides a connection between the health monitoring units and the checkpointing units without human intervention.