This invention relates to initializations of computer systems and, more particularly, to warm starts.
When a computer system is turned on, it initiates what is called a cold start. A cold start involves initializing the software and the hardware to an initial state. Every time the cold start happens, the system is initialized to the exact same initial state.
A warm start, in contradistinction, can happen only after the system is up and running. When a warm start happens, the hardware is not reinitialized as in a cold start. In general, the hardware is checked to make sure that it is in a sane state. Also, the software is not initialized after a warm start but, rather, the software checks itself for consistency and sanity and attempts to correct any encountered problems. What is important is that the rest of the system software (drivers, platform, and application software) keeps the system state across the warm start.
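The difference between the two kinds of start can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen only for illustration: the `Controller` class, the `calls` table that stands in for system state, and the `VALID_STATES` set. Dropping entries that fail a sanity check is only one assumed way of "correcting encountered problems"; the text above does not prescribe a particular repair.

```python
from dataclasses import dataclass, field

# Hypothetical set of call states regarded as sane (an assumption for this sketch).
VALID_STATES = {"ringing", "connected"}


@dataclass
class Controller:
    hardware_ready: bool = False
    # Established services, i.e. the system state a warm start must keep.
    calls: dict = field(default_factory=dict)

    def cold_start(self):
        """Bring hardware and software to the same initial state every time."""
        self.hardware_ready = True  # stand-in for full hardware initialization
        self.calls = {}             # all previous state is discarded

    def warm_start(self):
        """Keep existing state; only check it and repair what is inconsistent."""
        if not self.hardware_ready:
            # A warm start can happen only after the system is up and running.
            raise RuntimeError("warm start requires a running system")
        # Assumed repair: drop entries that fail the sanity check, keep the rest.
        self.calls = {cid: state for cid, state in self.calls.items()
                      if state in VALID_STATES}
```

In this sketch, a controller holding one sane call and one corrupted entry keeps the sane call across `warm_start()`, whereas `cold_start()` would discard both.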
Warm start is a technique used to provide redundancy and fault tolerance for a particular computing machine. When a critical/unrecoverable error happens, such a system attempts a warm start that would clear the error without disrupting the established services. For example, if one is talking over the phone to someone else and a warm start happens on one of the switches that are carrying the call, neither party in the phone conversation should notice it. They keep talking to each other without service interruption.
A warm start can be controlled or uncontrolled. In a controlled warm start, a process running on the controller might decide that the system is too unstable to keep running as it is and that a warm start should be performed in order to attempt to correct the errors. Alternatively, there is the uncontrolled warm start. For example, if a system has a primary controller and a standby (redundant) controller, the standby controller becomes active when the primary controller goes bad. This transition can be achieved through the use of the warm start technique (among other choices). If, however, the primary controller suddenly fails, or a user physically takes out the primary controller, that would force the standby controller to undergo a warm start. In this case, the software did not have any control over when the warm start happened, and such a warm start is called an uncontrolled warm start.
Thus, a warm start is a start that does not reset, or initialize, all variables. If the warm start is not done carefully, the system might crash, thus defeating the purpose of the warm start, since the service will be disrupted (in the previous phone call example, both parties will all of a sudden get disconnected from each other and the phone call is terminated).
In accordance with the principles disclosed herein, when a warm start is initiated, control passes to a managing task. The managing task disables all interactions of the switch controller with other switches or routers, informs the switch's I/O modules that a warm start has been initiated, and proceeds with a two-phase boot-up. In the first phase, each process of the controller checks its own internal data structures to make sure that they are consistent. The checks are made seriatim by means of a token that the managing task circulates. During this phase a process is not allowed to talk to any other process except the managing task. When a process gets the token it does its own internal checking, then returns the token to the managing task. If it is determined that there are inconsistencies in a data structure of a process, the process tries to fix them. If that cannot be done, it is concluded that the error is unrecoverable and a cold start is initiated. During this phase, the system checks only for very critical errors, that is, errors that would prevent the system from continuing the warm start if they cannot be recovered. In the second phase, which follows a successful completion of the first phase, checks are again done on the processes, seriatim, by means of a token that the managing task circulates. In this phase each process makes sure that any entity that it is managing is in an acceptable state, in the sense that images of the entity across all processes are consistent. If it is not, then the process deletes the entity from the system.
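The two-phase boot-up described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the disclosed implementation. The names `Process`, `Outcome`, `warm_start`, `managed`, and `images` are hypothetical. The token is modeled simply as the managing task calling one process at a time. "Consistent" is assumed to mean that every process holding an image of an entity holds an equal one. The preliminary steps (disabling interactions with other switches or routers and informing the I/O modules) are omitted.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    WARM_START_COMPLETED = "warm start completed"
    COLD_START_REQUIRED = "cold start required"


@dataclass
class Process:
    """Hypothetical controller process; all fields are illustrative."""
    name: str
    consistent: bool = True   # state of the process's own data structures
    repairable: bool = True   # whether an inconsistency can be fixed in place
    managed: set = field(default_factory=set)   # entities this process manages
    images: dict = field(default_factory=dict)  # this process's view of entities

    def check_internal(self) -> bool:
        """Phase 1: check own data structures, trying a repair if needed."""
        if not self.consistent and self.repairable:
            self.consistent = True  # stand-in for a real repair routine
        return self.consistent

    def check_entities(self, all_processes) -> list:
        """Phase 2: delete managed entities whose images disagree."""
        deleted = []
        for entity in sorted(self.managed):
            seen = [p.images[entity] for p in all_processes if entity in p.images]
            if any(image != seen[0] for image in seen):
                # Assumed meaning of "deletes the entity from the system":
                # remove every process's image of it.
                for p in all_processes:
                    p.images.pop(entity, None)
                self.managed.discard(entity)
                deleted.append(entity)
        return deleted


def warm_start(processes):
    """Managing task: hand the token to one process at a time, in two passes."""
    # Phase 1: an unrecoverable internal error aborts the warm start.
    for holder in processes:
        if not holder.check_internal():
            return Outcome.COLD_START_REQUIRED, []
    # Phase 2: runs only after phase 1 succeeded for every process.
    deleted = []
    for holder in processes:
        deleted.extend(holder.check_entities(processes))
    return Outcome.WARM_START_COMPLETED, deleted
```

As a usage example, if a process managing `call-1` and `call-2` agrees with its peer on `call-1` but not on `call-2`, `warm_start` completes and reports `call-2` as deleted, while a process whose internal inconsistency cannot be repaired causes `Outcome.COLD_START_REQUIRED` to be returned before phase two begins.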