Not applicable.
1. Field of the Invention
The present invention generally relates to a multi-processor computer system. More particularly, the invention relates to fault isolation in a multi-processor computer system.
2. Background of the Invention
As the name suggests, multi-processor computer systems are computer systems that contain more than one microprocessor. Data can be passed from one processor to another in such systems. One processor can request a copy of a block of another processor's memory. As such, memory physically connected to or integrated into one processor can be shared by other processors in the system. A high degree of shareability of resources (e.g., memory) generally improves system performance and enhances the capabilities of such a system.
Resource sharing in a multi-processor computer system, although advantageous for performance, increases the risk of a data error propagating through the system and causing widespread harm. For example, multiple processors may need a copy of a data block from a source processor. The requesting processors may need to perform an action dependent upon the value of the data. If the data becomes corrupted as it is retrieved from the source processor's memory (or was already corrupted when it was originally stored in the source processor), the requesting processors may perform unintended actions. Hardware failures in one processor or logic associated with one processor may cause corruption or failures in other parts of the system. Accordingly, techniques for fault containment are needed.
Several fault isolation techniques have been suggested. One suggestion has been to allow controlled memory sharing in a system that is page-based and that relies on a processor with precise memory faults. Such a page-based technique is relatively complex to implement. Although acceptable in that context, a need still exists for a fault isolation technique that is easier to implement than a page-based technique. Further, it would be desirable to have an isolation strategy that works in a multi-processor system in which the processors do not have precise memory exceptions. Despite the advantages such a system would provide, to date no such system is known to exist.
The problems noted above are solved in large part by a multi-processor computer system that permits various types of partitions to be implemented to contain and isolate hardware failures. The various types of partitions include hard, semi-hard, firm, and soft partitions. Each partition can include one or more processors. Upon detecting a failure associated with a processor, the connection to adjacent processors in the system can be severed, thereby precluding corrupted data from contaminating the rest of the system.
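By way of illustration only, the severing of connections described above can be modeled in a few lines of software. The sketch below is not the disclosed hardware. The names Processor, connect, sever_on_failure, and reachable, as well as the four-processor ring topology, are assumptions introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Hypothetical model of one processor and its inter-processor links."""
    pid: int
    failed: bool = False
    links: set = field(default_factory=set)  # ids of adjacent processors


def connect(procs, a, b):
    """Create a bidirectional link between processors a and b."""
    procs[a].links.add(b)
    procs[b].links.add(a)


def sever_on_failure(procs, failed_pid):
    """Mark a processor as failed and cut every link to its neighbours."""
    bad = procs[failed_pid]
    bad.failed = True
    for neighbour in list(bad.links):
        procs[neighbour].links.discard(failed_pid)
    bad.links.clear()


def reachable(procs, start):
    """Return the ids of all processors reachable from start over live links."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in procs[stack.pop()].links:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Assumed topology: four processors in a ring, 0-1-2-3-0.
procs = {i: Processor(i) for i in range(4)}
for a, b in [(0, 1), (1, 2), (2, 3), (3, 0)]:
    connect(procs, a, b)

# A failure is detected on processor 2, so its connections are severed.
sever_on_failure(procs, 2)
print(sorted(reachable(procs, 0)))
```

In this model, once the links are cut, no path remains from the failed processor to the others, so data originating there cannot reach the remaining processors.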
If an inter-processor connection is severed, message traffic in the system can become congested as messages become backed up in other processors. Accordingly, the preferred embodiment of the invention includes various timers in each processor to monitor for traffic congestion that may be due to a severed connection. Rather than letting the processor continue to wait to be able to transmit its messages, the timers will expire at preprogrammed time periods and the processor will take appropriate action, such as simply dropping queued messages, to keep the system from locking up. Each processor preferably includes individual timers for different types of messages (e.g., request, response). These and other advantages will become apparent upon reviewing the following description.
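By way of illustration only, the per-message-type timers described above can be sketched in software as follows. This is a simplified model under stated assumptions, not the disclosed circuitry. The class names MessageQueue and Processor, the tick method, the integer time units, and the timeout values of 3 and 5 are all hypothetical. On expiry the sketch drops every queued message of that type, which is the one action the text names as an example.

```python
from collections import deque


class MessageQueue:
    """Hypothetical outbound queue guarded by a congestion timer."""

    def __init__(self, timeout):
        self.timeout = timeout      # preprogrammed expiry period (assumed units)
        self.pending = deque()
        self.blocked_since = None   # time at which transmission first stalled

    def enqueue(self, msg):
        self.pending.append(msg)

    def tick(self, now, link_up):
        """Attempt to send; return the messages dropped if the timer expired."""
        if not self.pending:
            self.blocked_since = None
            return []
        if link_up:
            self.pending.popleft()          # one message is sent per tick
            self.blocked_since = None
            return []
        if self.blocked_since is None:
            self.blocked_since = now        # start the timer on first stall
            return []
        if now - self.blocked_since >= self.timeout:
            dropped = list(self.pending)    # timer expired: drop queued messages
            self.pending.clear()
            self.blocked_since = None
            return dropped
        return []


class Processor:
    """Hypothetical processor with one timed queue per message type."""

    def __init__(self, timeouts):
        self.queues = {kind: MessageQueue(t) for kind, t in timeouts.items()}

    def tick(self, now, link_up):
        return {kind: q.tick(now, link_up) for kind, q in self.queues.items()}


# Assumed timeouts: requests expire after 3 time units, responses after 5.
cpu = Processor({"request": 3, "response": 5})
cpu.queues["request"].enqueue("read block A")
cpu.queues["response"].enqueue("data for block B")

# The link stays down (severed) for six consecutive time steps.
log = [cpu.tick(now, link_up=False) for now in range(6)]
```

Keeping a separate timer for each message type means a stalled request queue can be flushed without discarding responses that are still within their own time limit.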