Pre-existing systems provide the feature of tolerating power outages ranging in duration from small fractions of a second to hours. For the shortest outages, ranging up to tens of milliseconds, the tolerance has been absolute. Totally transparent operation has been provided.
Longer outages are not transparent. No service is provided during the power outage, but recovery (resumption of service) after the outage is relatively fast (typically less than one minute) due to preservation of full memory state and transparent resumption of all processes executing at the beginning of the power outage. This type of tolerance might be thought of as "hibernation" during the outage. Typically, this feature tolerates outages up to approximately two hours.
FIG. #_1 illustrates multiple processor subsystems #_110a, #_110b, . . . , #_110n composing a pre-existing multi-processor system #_100. Each processor subsystem #_110 includes two power supplies, IPS #_120 and UPS #_130; and a lost-memory detection circuit (not shown). Each processor subsystem #_110 also includes its respective processor logic #_140, including a memory #_180 and associated memory control logic #_190; a maintenance diagnostic processor (MDP) #_150; I/O controllers #_160; and IPC controllers #_170.
The interruptible power supply (IPS) #_120 supplies power to the processor logic #_140 (excluding the memory #_180 and some of the memory control logic #_190 but including the cache #_1H0, if any), the MDP #_150, and the I/O and inter-processor subsystem communications (IPC) controllers #_160, #_170. The uninterruptible power supply (UPS) #_130 supplies power to the memory #_180 and some of the memory control logic #_190.
The UPS #_130 typically includes a battery, such as the battery #_1A0. During normal operation, an alternating current (AC) power source (not shown) drives both the IPS and the UPS #_120, #_130 and charges the battery supply #_1A0. Should the AC power source fail, the battery power supply #_1A0 supplies power to the UPS #_130, thus enabling the UPS #_130 to keep the contents of the memory #_180 valid during the power outage.
When the power supply control circuitry #_1I0 detects a loss of AC power, it asserts a signal #_1C0 herein termed the power-failure warning signal. This signal #_1C0 connects to the interrupt logic of its respective processor subsystem #_110 so that software notices the loss of AC power via an interrupt.
The capacitance design of the power supply guarantees that the power-failure warning signal #_1C0 occurs at least a predetermined amount of time (5 milliseconds, in one embodiment) before power from the IPS #_120 becomes unreliable. The power supply control circuitry #_1I0 switches the UPS #_130 over to the battery supply #_1A0 and shuts down the IPS #_120 when the IPS #_120 becomes unreliable.
The predetermined time guarantee allows the software to do two things before power is lost. First, the software recognizes the interrupt (even though there may be times when the power-failure warning signal interrupt is masked off, resulting in some delay in recognizing the interrupt). Second, the software saves state as described in more detail below.
Processor subsystems #_110 with no cache or with write-through caches use a first predetermined guaranteed time. However, on processor subsystems #_110 with write-back caches, the time necessary to save cache to the memory #_180 can be substantial. An alternative, larger predetermined guaranteed time is calculated by estimating the worst-case time necessary to save every line of the largest cache.
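The sizing of the guaranteed warning time can be illustrated with a small calculation. This is only a sketch under stated assumptions: the function name, the split into three cost components, and every number below are hypothetical; only the 5 millisecond figure echoes the embodiment mentioned above.

```python
def required_warning_time_us(interrupt_latency_us, state_save_us,
                             cache_lines, per_line_flush_us):
    """Worst-case time the software needs after the warning signal, in microseconds."""
    # Worst case: every line of the largest cache is dirty and must be written back.
    return interrupt_latency_us + state_save_us + cache_lines * per_line_flush_us

# Hypothetical figures, for illustration only.
no_cache = required_warning_time_us(500, 2000, 0, 1)        # no cache or write-through
write_back = required_warning_time_us(500, 2000, 8192, 1)   # write-back cache

FIRST_GUARANTEE_US = 5000  # 5 ms, as in the one embodiment mentioned above
fits_first_guarantee = (no_cache <= FIRST_GUARANTEE_US,
                        write_back <= FIRST_GUARANTEE_US)
```

With these assumed figures, the subsystem without a write-back cache fits within the first guarantee, while the write-back configuration would need the alternative, larger guaranteed time.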
When AC power returns, the power control circuitry resumes IPS-based operation, starts charging the battery supply #_1A0 again, and asserts a power-on signal. This signal causes the MDP #_150 to reset and bootstrap itself and then to control the resetting of the processor #_140.
The lost-memory detection circuit contains a flip-flop (not shown) to determine whether memory contents are valid after a power outage. The power-supply circuitry explicitly sets (e.g., to logical TRUE) the flip-flop whenever power from the UPS #_130 is restored. The processor subsystem #_110 clears the flip-flop (e.g., sets it to logical FALSE) during power-on processing, after saving its value into a reset control word. This flip-flop retains its value as long as power from the UPS #_130 is valid.
Boot code receives the reset control word when the processor #_140 is reset. The boot code uses this information to decide whether to initiate automatic power-on recovery (when memory contents are valid) or to wait in a loop for instructions (when memory contents were lost).
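The interplay of the flip-flop, the reset control word, and the boot decision might be modeled as follows. This is a minimal sketch, not the circuit itself: the class, method, and return-value names are hypothetical, and the initial TRUE value assumes that first application of UPS power counts as a restoration.

```python
class LostMemoryFlipFlop:
    """Software model of the lost-memory detection flip-flop (names are hypothetical)."""

    def __init__(self):
        # Assumption: first application of UPS power counts as "restored".
        self.value = True

    def ups_power_restored(self):
        # Set by the power-supply circuitry: memory contents did not survive.
        self.value = True

    def read_and_clear(self):
        # Power-on processing: save the value for the reset control word, then clear.
        saved = self.value
        self.value = False
        return saved


def boot_action(memory_lost):
    """Decision the boot code makes from the reset control word."""
    return "wait_for_instructions" if memory_lost else "automatic_power_on_recovery"


flip_flop = LostMemoryFlipFlop()
cold_start = boot_action(flip_flop.read_and_clear())
# Outage in which only the IPS failed: the flip-flop kept its cleared value.
after_ips_outage = boot_action(flip_flop.read_and_clear())
# Outage that outlasted the battery: UPS power was lost and later restored.
flip_flop.ups_power_restored()
after_ups_outage = boot_action(flip_flop.read_and_clear())
```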
The power-failure warning signal #_1C0 (if not masked) raises a software interrupt, and the software begins executing a power-fail interrupt handler. The interrupt handler immediately stops all I/O activity. (This early action is necessary on systems without DMA I/O capability because the handling of reconnects could cause the state-saving steps described below to proceed too slowly, resulting in a failure to recover from a power outage.)
The main function of the power-fail interrupt handler, however, is to save such processor state as is necessary for resumption of operations after the power outage ends. While all processors (of known design) would save their working (general purpose) registers, different types of processors #_140 save different state. For example, translation lookaside buffer (TLB) entries and I/O Control (IOC) entries both exist in volatile processor state. Processors #_140 with TLB or IOC entries save such state to memory before power is lost.
After saving the necessary state, the power-fail interrupt handler sets a state-saved variable in system global memory to logical TRUE. This variable is initialized to FALSE at cold load or reload time and is also set to FALSE on a power-on event.
Next, the interrupt handler executes a power-fail shout mechanism, described below.
Finally, the interrupt handler executes the code responsible for somewhat gracefully stopping all I/O and IPC traffic and flushing dirty cache contents (if any) to main memory. For example, in the IPC case, both the sending and receiving DMA engines are instructed to finish handling the current packet and then stop operation. The completion status is saved for later use.
(When the network services return to normal operating mode, if the DMA engine was in operation when the power down was performed, then the saved status of that last operation is examined. If that completion was normal, then the DMA engine is restarted with any queued operations. If that completion was an error termination, then the normal error recovery for that operation is performed (except that notification of the client may be deferred because interrupts may be disabled). At the next opportunity for I/O interrupts, the aborted non-inter-processor-subsystem-communications transfers are delivered to network services clients.)
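The decision made for each DMA engine when network services return to normal operation could be sketched as below. The function and the action names are hypothetical, and the handling of an engine that was idle at power-down is an assumption, since the text does not address that case.

```python
def resume_dma(was_active, completed_normally):
    """Choose the resume action for one DMA engine from its saved completion status."""
    if not was_active:
        # Assumption: an engine that was idle at power-down needs no special handling.
        return "nothing_to_do"
    if completed_normally:
        # The last packet finished cleanly: restart with any queued operations.
        return "restart_with_queued_operations"
    # Error termination: run the normal error recovery for that operation
    # (client notification may be deferred while interrupts are disabled).
    return "normal_error_recovery"


actions = [resume_dma(True, True), resume_dma(True, False), resume_dma(False, True)]
```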
On systems with write-back caches, dirty cache lines are saved to memory because the IPS #_120 supplies the cache with power, and the cache contents are therefore not preserved during the outage.
Control then transfers to the software that signals the hardware to fence the external (I/O bus and IPC path) drivers so that garbage is not driven onto these busses when power becomes unreliable.
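The order of the power-fail steps described above can be summarized in a short sketch. It only records the sequence; the dictionary layout, the step names, and the function name are hypothetical and stand in for hardware-specific code.

```python
def power_fail_handler(cpu):
    """Run the power-fail steps in the order described; return the list of steps taken."""
    steps = ["stop_io_activity"]                 # first, so state saving is not slowed
    cpu["saved"] = {"registers": list(cpu["registers"])}
    for name in ("tlb", "ioc"):                  # saved only by processors that have them
        if name in cpu:
            cpu["saved"][name] = dict(cpu[name])
    steps.append("save_processor_state")
    cpu["state_saved"] = True                    # checked later by the power-on handler
    steps.append("power_fail_shout")
    steps.append("quiesce_io_and_ipc")           # DMA engines finish the current packet
    if cpu.get("write_back_cache"):
        steps.append("flush_dirty_cache_lines")  # the cache is powered by the IPS
    steps.append("fence_external_drivers")       # keep garbage off the I/O bus and IPC path
    return steps


cpu = {"registers": [1, 2, 3], "tlb": {0x1000: 0x8000},
       "write_back_cache": True, "state_saved": False}
steps = power_fail_handler(cpu)
```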
At this point, the software waits for one of two things to happen. One possibility is that this power outage is either very short or a brown-out. In this case, IPS power does not ever go away. If the IPS #_120 never fails, the power-supply hardware eventually stops asserting the power-failure warning signal.
The software monitors the power-failure warning signal. When the software notices the absence of the power-failure warning, it waits some period of time (50 ms in one embodiment) and then treats this situation exactly like a fresh power-on event as described below.
The other possibility is that the power outage is long enough to cause the IPS #_120 to fail. In this case, the software loops, watching the power-failure warning signal #_1C0, until the processor #_140 loses IPS power. The MDP #_150 restarts automatically when IPS power returns and causes the processor #_140 to restart.
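The two outcomes of this wait can be sketched as a polling loop over sampled signal values. This is an illustrative model only: the one-sample-per-millisecond representation, the function name, and the result labels are assumptions; the 50 ms settle time is the figure given above for one embodiment.

```python
def wait_after_power_fail(samples, settle_ms=50):
    """Poll (warning_asserted, ips_ok) pairs, one per millisecond, until an outcome.

    Returns (outcome, time_ms). The sampling model is hypothetical.
    """
    for t_ms, (warning_asserted, ips_ok) in enumerate(samples):
        if not ips_ok:
            # Long outage: IPS power is gone, so the processor simply stops here.
            return ("ips_lost", t_ms)
        if not warning_asserted:
            # Short outage or brown-out: wait, then treat it as a fresh power-on event.
            return ("treat_as_power_on", t_ms + settle_ms)
    return ("still_waiting", len(samples))


brown_out = wait_after_power_fail([(True, True)] * 3 + [(False, True)])
long_outage = wait_after_power_fail([(True, True)] * 2 + [(True, False)])
```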
When IPS power is restored, the processor #_140 initializes itself (including, for example, processor caches) and starts executing boot code. The boot code examines the reset control word to determine what kind of reset has occurred. The reset control word contains the value of the lost-memory detection circuit flip-flop, allowing the processor #_140 to determine whether the contents of the memory #_180 are valid. If the reset control word indicates that memory contents are valid, the boot code starts the operating system (preferably, the Nonstop Kernel®, available from the assignee of the instant invention) executing in a power-on interrupt handler.
The power-on interrupt handler completes the restoration of processor state and resumes execution of the interrupted process(es). Before this, however, it checks the state-saved flag. If this flag is FALSE, the power-fail interrupt handler did not have enough time to save state. In this case, the power-on interrupt handler halts the processor. If, however, the state-saved flag is TRUE, the power-on interrupt handler resets it to FALSE (in preparation for the next power outage).
The power-on interrupt handler wakes up all processes and starts a regroup operation. The regroup synchronizes the processor subsystem #_110 with all of the other processor subsystems #_110.
IO processes (IOP's) are informed of the power-on event for several reasons. First, they may need to download microcode into I/O controllers that lost power during the outage. Second, they may decide to be more tolerant of delays--via longer timeouts or more retries--for a time after the power outage. Such tolerance allows, for example, time for disks to spin up. Alternatively, they may choose to wait until an I/O error has occurred and then inquire about power status.
Third, they may need to cancel or otherwise clean up state for I/O operations that had been started before the power outage.
Two different mechanisms, the power-on interrupt handler and individual I/O controllers, inform IOP's of power-on events. As described above, when a power outage recovery occurs, the power-on interrupt handler wakes up all processes. Any process that waits on this event is thereby notified of the power outage and recovery.
When an individual I/O controller is powered on it generates an appropriate interrupt. The corresponding operating system interrupt handler wakes up the IOP configured to own that I/O controller.
Finally, the power-on interrupt handler gets the program counter from the stack constructed by the power-fail interrupt handler and exits through that address. Execution thus resumes at exactly the point interrupted by the power-failure warning interrupt.
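The power-on path described above can be condensed into a sketch. The dictionary fields, event names, and return convention are hypothetical; the sketch only mirrors the order of the checks and actions.

```python
def power_on_handler(system):
    """Model of the power-on interrupt handler; returns (outcome, resume_address)."""
    if not system["state_saved"]:
        # The power-fail handler ran out of time before saving state.
        return ("halt", None)
    system["state_saved"] = False              # ready for the next power outage
    system["events"].append("restore_processor_state")
    system["events"].append("wake_all_processes")
    system["events"].append("start_regroup_cautious")
    # Exit through the program counter that the power-fail handler left on its stack.
    return ("resume", system["saved_stack"][-1])


ok = {"state_saved": True, "events": [], "saved_stack": [0x4000]}
resumed = power_on_handler(ok)
incomplete = {"state_saved": False, "events": [], "saved_stack": []}
halted = power_on_handler(incomplete)
```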
An inter-processor message system provides services for two power fail procedures: the Regroup operation and the power-fail shout mechanism.
Regroup
The Regroup mechanism ensures at various times that all processor subsystems #_110 have the same image of the system, especially which processor subsystems #_110 are part of the system #_100 and which are not. It is invoked whenever the consistency of the system image is in doubt. For example, the Regroup operation is invoked whenever the periodic IAmAlive messages are missing for some time from some processor. It is also invoked at power-fail recovery time by the power-on interrupt handler.
U.S. patent application Ser. No. 08/789,257, entitled, "Method and Apparatus for Distributed Agreement on Processor Membership in a Multi-Processor System," naming Robert L. Jardine et al. as inventors, filed on the same date as the instant application, under an obligation of assignment to the assignee of the instant invention, with Attorney Docket No. 010577-039800US, describes fully the Regroup operation and is, therefore, incorporated by reference herein. The description of the Regroup operation expressly and directly set forth herein is only a loose summary of the operation.
The Regroup mechanism proceeds through several phases, broadcasting messages to all other processor subsystems #_110 that were known and agreed to be part of the system image prior to the event that caused the Regroup operation to start. The Regroup mechanism results in a new agreement about the system image and a commitment to that image by all of the surviving processor subsystems #_110.
At power-failure recovery time, the Regroup mechanism allows some flexibility in the recovery time for various processor subsystems #_110. Without this flexibility, if some processor subsystems #_110 were to recover from the power outage much more quickly than others, the slower processor subsystems #_110 would be declared down by the faster processor subsystems #_110.
Regroup has two modes of operation: a normal mode and a cautious mode. In the latter mode, the operation allows more time before ostracizing processor subsystems #_110 from the system #_100. A regroup operation that a power outage initiates operates in cautious mode.
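The effect of the two modes can be shown with a toy membership calculation. The wait limits below are invented purely for illustration, as are the function and variable names; the actual Regroup operation proceeds through several phases that this sketch does not model.

```python
# Hypothetical wait limits, in seconds; the real values are not given here.
REGROUP_WAIT_S = {"normal": 2, "cautious": 60}


def surviving_subsystems(check_in_times_s, mode):
    """Return the subsystems that checked in before the mode's deadline."""
    limit = REGROUP_WAIT_S[mode]
    return sorted(cpu for cpu, t in check_in_times_s.items()
                  if t is not None and t <= limit)


# Subsystem 2 recovers slowly after an outage; subsystem 3 never checks in.
check_ins = {0: 1, 1: 1, 2: 20, 3: None}
normal_result = surviving_subsystems(check_ins, "normal")
cautious_result = surviving_subsystems(check_ins, "cautious")
```

Under these assumed limits, the slow subsystem is ostracized in normal mode but retained in cautious mode.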
Power-Fail Shout
In cautious mode, the Regroup mechanism generally allows processor subsystems #_110 enough time to recover from a power outage and continue operation. However, if the power outage is very short, the power outage may not be noticed uniformly by all processors and the Regroup mechanism would not normally be invoked in cautious mode.
This anomaly arises from the fact that each power supply #_120 has some capacitance that allows it to handle very short power outages (up to around a few tens of milliseconds in one embodiment) without generating a power-failure warning interrupt at all. These very short power outages are completely transparent to the software. This situation presents a race condition. If the power outage duration approximately equals the transparent ride-through capacity of the power supplies #_120, then the normal variations in components and configuration differences (for example, memory size) cause some processor subsystems #_110 to get the power-failure warning signal while others do not. Those processor subsystems #_110 that do not experience the power-failure warning will not know about the power outage, and they will not use the cautious mode of the Regroup mechanism.
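The race condition can be made concrete with a small model in which each power supply has a slightly different ride-through capacity. The capacities and outage durations are hypothetical values chosen only to lie in the tens-of-milliseconds range mentioned above.

```python
def subsystems_warned(outage_ms, ride_through_ms):
    """Indices of subsystems whose supply cannot ride through the outage transparently."""
    return [i for i, capacity in enumerate(ride_through_ms) if outage_ms > capacity]


# Hypothetical capacities reflecting component and configuration variation.
capacities_ms = [28, 30, 33]

transparent = subsystems_warned(10, capacities_ms)    # nobody notices
borderline = subsystems_warned(29, capacities_ms)     # only one subsystem is warned
clear_outage = subsystems_warned(100, capacities_ms)  # everyone is warned
```

Only the borderline case is problematic: a single subsystem takes the power-fail path while the others remain unaware of the outage.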
If two or more processor subsystems #_110 experience a power-failure warning, there is no problem. The Regroup mechanism executes in cautious mode when more than one processor subsystem #_110 fails to check in. Avoiding the loss of two or more processor subsystems #_110 is important, so the algorithm allows extra time.
However, when only one processor subsystem #_110 is absent, the reasonable assumption is that it has failed, and the detection and recovery from single-processor subsystem failure is required to be relatively quick. Thus, if only a single processor subsystem #_110 experiences the power-failure warning, then the other processor subsystems #_110 regroup without it. Prior to the power-fail "shout" mechanism, such power outages often resulted in the failure of a single processor subsystem #_110.
U.S. patent application Ser. No. 08/265,585, now U.S. Pat. No. 5,928,368 entitled, "Method and Apparatus for Fault-Tolerant Multi-processing System Recovery from Power Failure or Drop Outs," naming Robert L. Jardine et al. as inventors, filed on Jun. 23, 1994, under an obligation of assignment to the assignee of the instant invention, with Attorney Docket No. 010577-031900US, fully describes the power-fail shout mechanism and is, therefore, incorporated fully herein by reference. The description of the power-fail shout mechanism expressly and directly set forth herein is only a loose summary of the mechanism.
The power-fail shout mechanism causes the broadcast of a shout packet from the power-fail interrupt handler to all other processor subsystems #_110. The receipt of this shout packet by other processor subsystems #_110 serves to inform each of the power outage, in case each does not receive a power-fail warning signal. When they learn of the power outage in this way, they execute any subsequent Regroup operation in cautious mode, which allows enough time for a processor subsystem #_110 to fully initialize itself and join in the regroup.
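How the shout packet changes the choice of Regroup mode might be sketched as follows. The function and parameter names are hypothetical, and the sketch reduces the decision to the three conditions described above.

```python
def regroup_mode(missing_count, local_power_fail, shout_received):
    """Pick the Regroup mode from what this subsystem knows about the outage."""
    if local_power_fail or shout_received:
        # This subsystem knows a power outage occurred, directly or via a shout.
        return "cautious"
    if missing_count > 1:
        # Losing two or more subsystems is costly, so extra time is allowed.
        return "cautious"
    # A single absent subsystem is presumed failed and is dropped quickly.
    return "normal"


without_shout = regroup_mode(missing_count=1, local_power_fail=False, shout_received=False)
with_shout = regroup_mode(missing_count=1, local_power_fail=False, shout_received=True)
```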
Experience has shown, however, that roughly ninety-five percent of all power outages last less than thirty seconds. Therefore, there is a need for a computer system which can transparently ride through power outages of some predefined duration before going to a memory hold-up model such as described above.
Also, the software of a processor may be able to determine more accurately the state of the power-outage-relevant hardware and software, including the state of the backup battery, and from this state may be able to determine how long the processor can function transparently in the face of a power outage before having to switch to the memory hold-up model described above. Accordingly, there is a need for a processor with intelligence to approximate its ability to transparently ride through a power outage of indeterminate length and to effect a switch to a memory hold-up model as that ability is exhausted by a continuing power outage.
Further, there is a need for such a processor with additional intelligence to change its state in order to increase its ability to transparently ride through a power outage of indeterminate length.
The power supply hardware consists of multiple components, interruptible and uninterruptible, and logic for switching between the two. Accordingly, there is a need for a less complicated power supply system that nonetheless provides power from an AC source when available and power from a battery when the AC source fails.