When procuring a computer system in a business environment, an important factor is the availability of the computer to perform its intended work. Availability can affect profitability as well as job performance. Four basic design techniques are used, alone or in combination, to improve availability.
One design technique is commonly referred to as "fault tolerant." A computer system employing this technique is designed to withstand a hard fault that would shut down another type of computer system. Such a design typically involves replicating hardware and software so that an application program runs simultaneously on multiple processors. In this way, if a hard fault occurs in one processor or subsystem, the application program running on the other processor(s) or subsystem(s) still provides an output. Thus, from the user's perspective, the computer system has performed its designated task. In addition to the multiple processors, a voting scheme can be implemented in which the outputs from the multiple processors are compared to determine the correct output.
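The voting scheme described above can be illustrated with a minimal sketch. It assumes a strict-majority rule over the replica outputs; the function name `majority_vote` and the choice to return `None` when no majority exists are illustrative assumptions, not details given in the text.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the output agreed on by a strict majority of replicas.

    Returns None when there are no outputs or no strict majority
    (an assumed policy; a real system might retry or halt instead).
    """
    if not outputs:
        return None
    # most_common(1) yields the single most frequent value and its count.
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


# Three replicas ran the same computation; one returned a corrupted value.
result = majority_vote([42, 42, 17])
```

With three replicas, a single faulted processor is outvoted by the two that agree, which is why the user still sees a correct output.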
Fault tolerant systems are complex, essentially require multiple independent processing systems and, as such, are very expensive. Further, although the system is fault tolerant, once a fault occurs a service representative must still arrive on site to diagnose and repair the faulted path or subsystem. This makes maintenance expensive.
Another technique involves designing components so that they are highly reliable and, therefore, unlikely to fail during an operational cycle. This technique is common in space, military and aviation applications, where the size and weight limitations of the intended use (e.g., a satellite) typically restrict the available design techniques. Highly reliable components are typically expensive, and maintenance is also expensive because maintenance activities must preserve these design characteristics.
Such expenses may make a computer system commercially unacceptable for a given application. In any event, once a system has a failure, a service representative must be dispatched to diagnose and repair the failed system. In military and aviation applications, the vehicle or item housing the failed component must be brought to a repair facility. Until the system is repaired, it is unavailable. This increases maintenance costs and makes such repair and replacement activities critical path issues.
A third technique involves clustering multiple independent computer systems together so that when one computer system fails, its work is performed by any one of the other systems in the cluster. This technique is limited to applications in which there are, or there is a need for, a number of independent systems. It is not usable for a stand-alone system. Also, for this type of system to work, each independent computer system must be capable of accessing the data and application program of any of the other systems in the cluster. For example, a central data storage device (e.g., a hard drive) is provided that can be accessed by any of the computer systems. In addition to its limited applicability, this technique is complex, expensive and raises data security issues.
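The failover behavior of such a cluster can be sketched as follows. This is only an illustration under stated assumptions: the text says the failed system's work is taken over by "any one of the other systems," so the least-loaded-survivor policy, the function name `fail_over`, and the job and node names are all hypothetical. The sketch also presumes that every node can reach the shared data and programs, as the paragraph above requires.

```python
from typing import Dict, List


def fail_over(assignments: Dict[str, List[str]], failed: str) -> Dict[str, List[str]]:
    """Return a new job map with the failed node's jobs moved to survivors.

    Each orphaned job goes to the survivor currently holding the fewest
    jobs (an assumed policy). The input mapping is left unmodified.
    """
    survivors = {node: list(jobs) for node, jobs in assignments.items() if node != failed}
    if not survivors:
        raise RuntimeError("no surviving node can take over the work")
    for job in assignments.get(failed, []):
        # Ties are broken by the mapping's insertion order.
        target = min(survivors, key=lambda node: len(survivors[node]))
        survivors[target].append(job)
    return survivors


# Hypothetical three-node cluster in which node "a" fails.
cluster = {"a": ["j1", "j2"], "b": ["j3"], "c": []}
after = fail_over(cluster, "a")
```

Note that the sketch raises an error for a one-node "cluster," which mirrors the point that the technique is not usable for a stand-alone system.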
A fourth technique involves providing redundant power supplies and blowers. Thus, the failure of a blower or power supply does not result in shutdown of the computer system. However, providing redundancy for other computer system components is not viable because a service representative must be brought in to diagnose the cause of failure so the machine can be repaired and returned to operability.
The fourth technique has also included providing a computer system with a mechanism to automatically re-boot the system following a system crash or hang. This technique may allow recovery from transient problems; however, no diagnosis is performed in connection with restoring the system to operability. Thus, if the system is faulted, a service representative must be brought in to diagnose the cause of failure so the machine can be repaired and restored to operation.
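One common way to realize an automatic re-boot mechanism is a watchdog timer, sketched below. The text does not say how the re-boot is triggered, so the watchdog approach, the class name `Watchdog`, and the explicit `now` timestamps (used instead of a real clock so the example is deterministic) are assumptions made for illustration.

```python
class Watchdog:
    """Counts a re-boot whenever the system stops reporting that it is alive."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_kick = 0.0
        self.reboots = 0

    def kick(self, now: float) -> None:
        """Called periodically by a healthy system to show it is not hung."""
        self.last_kick = now

    def poll(self, now: float) -> bool:
        """Return True (and count a re-boot) if the timeout has expired."""
        if now - self.last_kick > self.timeout_s:
            # The re-boot only restarts the system; nothing here records
            # or diagnoses why the system stopped responding.
            self.reboots += 1
            self.last_kick = now
            return True
        return False


wd = Watchdog(timeout_s=5.0)
wd.kick(now=0.0)
hung_at_3s = wd.poll(now=3.0)   # still within the timeout
hung_at_6s = wd.poll(now=6.0)   # no kick for more than 5 s
```

The sketch makes the limitation visible: a persistent hardware fault would simply cause repeated re-boots, since the cause of the hang is never identified.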
As such, there is a need for a computer system that can automatically recover from a large percentage of the potential failure modes (i.e., recover without requiring action by an operator or service representative). In particular, there is a need for a methodology that involves self-diagnosis by a computer of its own functionality and that of its components, and in which the computer is capable of de-configuring and re-configuring system hardware to isolate the failed component(s), thereby allowing the computer to automatically continue system operation, albeit in a possibly degraded condition. There also is a need for a computer system having such high availability design characteristics.
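The kind of self-diagnosis and de-configuration called for above might look, in very rough outline, like the following sketch. It is not the method of any particular system: the function name `self_diagnose`, the per-component self-test callables, the notion of an "essential" component set, and the component names are all assumptions introduced for illustration.

```python
from typing import Callable, Dict, Set


def self_diagnose(components: Dict[str, Callable[[], bool]], essential: Set[str]) -> Set[str]:
    """Run each component's self-test and return the set left configured.

    Components whose test fails or raises are de-configured (left out).
    If an essential component fails, operation cannot continue and a
    RuntimeError is raised (an assumed policy).
    """
    active: Set[str] = set()
    for name, self_test in components.items():
        try:
            healthy = self_test()
        except Exception:
            # A self-test that crashes is treated as a failed component.
            healthy = False
        if healthy:
            active.add(name)
    missing = essential - active
    if missing:
        raise RuntimeError(f"essential components failed: {sorted(missing)}")
    return active


def failing_test() -> bool:
    raise IOError("hypothetical memory module does not respond")


# Hypothetical system: one processor and one memory module have failed.
hardware = {"cpu0": lambda: True, "cpu1": lambda: False, "dimm3": failing_test}
configured = self_diagnose(hardware, essential={"cpu0"})
```

Here the system continues with only `cpu0` configured, which corresponds to continued operation in a degraded condition without operator or service action.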