A multi-device system, such as a multi-processor computer system, can increase system performance by using the devices in parallel. For example, a computer system can use four processors that simultaneously perform logical operations, creating a system that can be much faster than one with a single processor. A multi-device system can also increase system reliability, because when one device fails the remaining devices can keep the system running.
In some multi-processor systems, all processors are started, or "booted," at one time when the system is turned on. In other words, when power is applied to the system all processors begin their normal start-up operations substantially simultaneously. If all of the processors operate normally, the system will be fully operational.
It is possible, however, that one of the processors will not operate normally. Moreover, a single processor can fail in a way that prevents all of the other processors from operating normally. For example, a processor might place erroneous messages on an inter-processor bus, preventing the other processors from communicating. In this case, a single failed processor will cause the entire multi-processor system to fail. This result undermines one of the goals of a multi-processor system, namely, increased system reliability. Because there are multiple processors, and such a failure in any one of them can cause the entire system to fail, the system can be less reliable than it would be with a single processor.
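The reliability argument above can be illustrated with a small calculation. The sketch below rests on assumptions that are not part of the description: each processor is taken to boot normally with the same probability, independently of the others, and every failure is taken to be of the kind that disables the whole system. The function name and the sample probability are hypothetical.

```python
def all_or_nothing_reliability(p_ok: float, n_processors: int) -> float:
    """Probability that a system boots when any one failure disables it.

    Assumes (for illustration only) that each processor boots normally with
    probability p_ok, independently of the others.
    """
    if not 0.0 <= p_ok <= 1.0:
        raise ValueError("p_ok must be a probability between 0 and 1")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    # Every processor must be healthy, so the probabilities multiply.
    return p_ok ** n_processors


single = all_or_nothing_reliability(0.99, 1)
quad = all_or_nothing_reliability(0.99, 4)
print(f"one processor: {single:.4f}, four processors: {quad:.4f}")
```

Under these assumptions, a four-processor system with a per-processor success probability of 0.99 boots with a probability of roughly 0.961, which is lower than that of a single processor.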
In one solution to this problem, fewer than all of the processors are booted up when a failure is detected. For example, if booting up all processors simultaneously fails to make the system operational, each processor in the multi-processor system can be individually booted up, one at a time, until the system becomes operational. Once the system becomes operational, the process is halted to prevent the failed processor from booting up. However, this solution reduces the performance benefit gained by operating the processors in parallel. In a four-processor system, such a solution limits the system to a single processor when any one processor fails. The other two processors that have not failed remain idle.
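The one-at-a-time scheme described above can be sketched in software as follows. This is an illustrative model only, not an actual boot circuit: the `is_operational` helper, the processor numbering, and the assumption that any enabled failed processor disables the system are all hypothetical.

```python
from typing import FrozenSet, Iterable, Optional, Set


def is_operational(enabled: Iterable[int], failed: FrozenSet[int]) -> bool:
    """Hypothetical model: the system runs only if at least one processor is
    enabled and no enabled processor has failed."""
    enabled_set = set(enabled)
    return bool(enabled_set) and not (enabled_set & failed)


def boot_one_at_a_time(n: int, failed: FrozenSet[int]) -> Optional[Set[int]]:
    """Try all processors together, then fall back to single processors."""
    everyone = set(range(n))
    if is_operational(everyone, failed):
        return everyone
    for cpu in range(n):
        # Boot a single processor; stop at the first one that works.
        if is_operational({cpu}, failed):
            return {cpu}
    return None  # no processor could bring the system up


# Processor 2 has failed: the system ends up running on processor 0 alone,
# while healthy processors 1 and 3 stay idle.
print(boot_one_at_a_time(4, frozenset({2})))
```

In this model, a single failure in a four-processor system leaves one processor running and two healthy processors unused, which is the drawback noted above.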
In another solution, when a failure is detected each processor is sequentially removed from the configuration, one at a time, while the remaining processors boot up substantially simultaneously. A different processor is removed on each attempt until the system becomes operational. In this way, a single failed processor in a four-processor system will result in all three operational processors being used. This approach also has a drawback: if two or more processors fail, the system will never become operational, because only a single processor is removed from the system at any given time.
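The remove-one-at-a-time scheme can be modeled in the same way. As before, this is a sketch under stated assumptions rather than a description of a real circuit: the `is_operational` helper and the rule that any enabled failed processor disables the system are hypothetical.

```python
from typing import FrozenSet, Iterable, Optional, Set


def is_operational(enabled: Iterable[int], failed: FrozenSet[int]) -> bool:
    """Hypothetical model: the system runs only if at least one processor is
    enabled and no enabled processor has failed."""
    enabled_set = set(enabled)
    return bool(enabled_set) and not (enabled_set & failed)


def boot_removing_one(n: int, failed: FrozenSet[int]) -> Optional[Set[int]]:
    """Try all processors, then every configuration with exactly one removed."""
    everyone = set(range(n))
    if is_operational(everyone, failed):
        return everyone
    for cpu in range(n):
        # Exclude a single processor and boot the rest together.
        candidate = everyone - {cpu}
        if is_operational(candidate, failed):
            return candidate
    return None  # more than one failure: no single removal is enough


print(boot_removing_one(4, frozenset({2})))     # three healthy processors run
print(boot_removing_one(4, frozenset({1, 2})))  # None: two failures defeat it
```

With one failure the model keeps all three healthy processors in use; with two failures every candidate configuration still contains a failed processor, so the search ends without an operational system.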
More complicated boot-up schemes can be implemented to avoid these problems. For example, programmable elements, such as Programmable Array Logic (PAL) devices, can be designed to enable various combinations of processors until the optimal working configuration is achieved. These components, however, are expensive and must be specially designed for the system.
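The idea of trying combinations until the best working configuration is found can be sketched as an exhaustive search. This models only the search order in software and says nothing about how a PAL device would actually be programmed; the `is_operational` helper and the largest-first ordering are assumptions made for illustration.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Optional, Set


def is_operational(enabled: Iterable[int], failed: FrozenSet[int]) -> bool:
    """Hypothetical model: the system runs only if at least one processor is
    enabled and no enabled processor has failed."""
    enabled_set = set(enabled)
    return bool(enabled_set) and not (enabled_set & failed)


def boot_best_combination(n: int, failed: FrozenSet[int]) -> Optional[Set[int]]:
    """Try combinations from largest to smallest; return the first that works."""
    for size in range(n, 0, -1):
        for combo in combinations(range(n), size):
            if is_operational(combo, failed):
                return set(combo)
    return None  # every processor has failed


# Two failures: the search still finds the two healthy processors.
print(boot_best_combination(4, frozenset({1, 2})))
```

Because larger combinations are tried first, the first working configuration uses as many processors as possible. The cost is the number of configurations: up to 2^n - 1 for n processors, which gives a sense of why a dedicated implementation of such a scheme can become complicated.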
Although a multi-processor system was used to illustrate the disadvantages of these boot circuit designs, other types of multi-device systems and boot circuit designs suffer from similar problems.
In view of the foregoing, it can be appreciated that a substantial need exists for a fault resilient boot circuit that maintains the benefit of increased performance in a multi-device system, at a reasonable cost, without losing the benefit of increased reliability.