In conventional computer systems, when a system fails, technicians may examine log files to diagnose the problem only after it has occurred. Conventional fault-tolerant systems may include methods for diagnosing faults after a component fails, while preventing the component failure from causing a system failure. For example, conventional fault-tolerant systems may include pair-and-spare systems, where two duplicated components run in lockstep, receiving the same inputs. When the outputs from the pair of components differ, one of the components is known to have failed, although it is not known which one, so both components are shut down and replaced by a spare, possibly without any human intervention. Alternatively, three components may run in lockstep, receiving the same inputs. When one of the three outputs differs from the other two, the component that produced the differing output is considered to have failed, and may be replaced.
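The comparison logic described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `pair_outputs_agree` and `tmr_vote` are hypothetical, and real lockstep systems typically perform this comparison in hardware rather than in software.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple


def pair_outputs_agree(output_a: Hashable, output_b: Hashable) -> bool:
    """Pair-and-spare check (hypothetical helper).

    A mismatch shows that one component of the pair has failed,
    but it cannot show which one, so both would be replaced.
    """
    return output_a == output_b


def tmr_vote(outputs: Sequence[Hashable]) -> Tuple[Optional[Hashable], Optional[int]]:
    """Three-component vote (hypothetical helper).

    Returns (majority_output, faulty_index):
      - all three agree: (value, None)
      - two agree:       (value, index of the differing component)
      - none agree:      (None, None), since no majority exists
    """
    if len(outputs) != 3:
        raise ValueError("exactly three outputs are required")
    value, count = Counter(outputs).most_common(1)[0]
    if count == 3:
        return value, None
    if count == 2:
        faulty = next(i for i, out in enumerate(outputs) if out != value)
        return value, faulty
    return None, None
```

For example, `tmr_vote([7, 9, 7])` returns `(7, 1)`, identifying the second component as the one to replace, whereas `pair_outputs_agree(7, 9)` can only report that the pair disagrees.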
Redundancy and failover mechanisms may be employed to reduce downtime if a primary system fails. A system may be configured in an N+1 or N+i configuration with hot and/or cold standbys. If a primary system fails, the standby system becomes the primary. The amount of downtime caused by such an occurrence may depend on how quickly the system can be failed over to the standby and on how closely the standby was synchronized with the failed primary system. Currently, in telephone communication systems, it generally takes a few seconds to fail over a failed system and restore service after the failure is detected. Telephone communication OEMs (Original Equipment Manufacturers) are seeking lower downtime in their systems.
Individual components in a system may also be fault-tolerant. For example, error correcting codes may correct faults that occur in a memory. When these faults are successfully corrected, they may be invisible to the system as a whole. When these faults continue to build up without being detected or corrected, a system failure may occur. System downtime may then be needed to replace the memory chip.
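As an illustration of how an error correcting code can silently repair a memory fault, the following sketch implements a Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword. The choice of Hamming(7,4) and the function names are assumptions made for this example; the text above does not specify which code a given memory uses.

```python
from typing import List, Sequence, Tuple


def hamming74_encode(data_bits: Sequence[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7).

    Parity bits sit at positions 1, 2 and 4; data bits at 3, 5, 6 and 7.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword: Sequence[int]) -> Tuple[List[int], int]:
    """Return (data_bits, syndrome) after correcting at most one flipped bit.

    The syndrome is the 1-based position of the corrected bit, or 0 if
    no error was detected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome
```

The decoder returns the syndrome alongside the data so that a caller could count corrections. If the syndrome is discarded, the correction is invisible to the rest of the system, as described above.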
An increased frequency of correctable errors may suggest that an uncorrectable failure is imminent, or at least that the risk of such a failure has increased. Predicting component failures before they occur may reduce the chance of system failure and the resultant system downtime, and may also allow maintenance to be performed more efficiently.
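One simple way to act on the frequency of correctable errors is to count them over a sliding time window and raise a warning when the count exceeds a threshold. The sketch below is a minimal illustration under stated assumptions: the class name `CorrectableErrorMonitor`, the window length, and the threshold are all hypothetical and are not taken from any particular system.

```python
from collections import deque


class CorrectableErrorMonitor:
    """Flags a component when correctable errors become too frequent.

    Hypothetical sketch: window_seconds and max_errors are illustrative
    tuning parameters, not values from any real system.
    """

    def __init__(self, window_seconds: float, max_errors: int) -> None:
        self.window_seconds = window_seconds
        self.max_errors = max_errors
        self._timestamps = deque()

    def record_error(self, now: float) -> bool:
        """Record one corrected error at time `now` (in seconds).

        Returns True if the number of errors within the window now
        exceeds max_errors, suggesting an elevated risk of failure.
        """
        self._timestamps.append(now)
        # Drop errors that have aged out of the sliding window.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_errors
```

With `CorrectableErrorMonitor(window_seconds=60.0, max_errors=3)`, a fourth error within one minute returns `True`, which could be used to schedule replacement of the component before an uncorrectable failure occurs.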
Conventional fault handling systems are generally “reactive” in nature. In other words, after a fault happens, an alert is triggered and the system fails over to a known good system, after which diagnosis of the problem can begin. As the demand for uptime increases for applications such as e-commerce and electronic trading, the system design challenges become almost insurmountable with reactive failover architectures. In a cost-conscious environment where lockstep methods may not be cost justifiable, this reactive mode of fault handling is not sufficient to meet these uptime requirements.