1. Field of the Invention
This invention relates to the field of multiprocessor computer systems with built-in redundancy, and more particularly, to systems and methods for testing redundant functional components during normal system operation.
2. Description of the Related Art
Multiprocessor computer systems include two or more processors which may be employed to perform computing tasks. A particular computing task may be performed upon one processor while other processors perform unrelated computing tasks. Alternatively, components of a particular computing task may be distributed among multiple processors to decrease the time required to perform the computing task as a whole. Generally speaking, a processor is a device that executes programmed instructions to produce desired output signals, often in response to user-provided input data.
A popular architecture in commercial multiprocessor computer systems is the symmetric multiprocessor (SMP) architecture. Typically, an SMP computer system comprises multiple processors each connected through a cache hierarchy to a shared bus. Additionally connected to the shared bus is a memory, which is shared among the processors in the system. Access to any particular memory location within the memory occurs in a similar amount of time as access to any other particular memory location. Since each location in the memory may be accessed in a uniform manner, this structure is often referred to as a uniform memory architecture (UMA).
Another architecture for multiprocessor computer systems is a distributed shared memory architecture. A distributed shared memory architecture includes multiple nodes that each include one or more processors and some local memory. The multiple nodes are coupled together by a network. The memory included within the multiple nodes, when considered as a collective whole, forms the shared memory for the computer system.
Distributed shared memory systems are more scalable than systems with a shared bus architecture. Since many of the processor accesses are completed within a node, nodes typically impose much lower bandwidth requirements upon the network than the same number of processors would impose on a shared bus. The nodes may operate at high clock frequency and bandwidth, accessing the network only as needed. Additional nodes may be added to the network without affecting the local bandwidth of the nodes. Instead, only the network bandwidth is affected.
Because of their high performance, multiprocessor computer systems are used for many different types of mission-critical applications in the commercial marketplace. For these systems, downtime can have a dramatic and adverse impact on revenue. Thus, system designs must meet the uptime demands of such mission-critical applications by providing computing platforms that are reliable, available for use when needed, and easy to diagnose and service.
One way to meet the uptime demands of these kinds of systems is to design in fault tolerance, redundancy, and reliability from the inception of the machine design. Reliability features incorporated in most multiprocessor computer systems include environmental monitoring, error correction code (ECC) data protection, and modular subsystem design. More advanced fault-tolerant multiprocessor systems also have several additional features, such as full hardware redundancy, fault-tolerant power and cooling subsystems, automatic recovery after power outage, and advanced system monitoring tools.
For mission-critical applications such as transaction processing, decision support systems, communications services, data warehousing, and file serving, no hardware failure in the system should halt processing and bring the whole system down. Ideally, any failure should be transparent to users of the computer system and quickly isolated by the system. The system administrator must be informed of the failure so remedial action can be taken to bring the computer system back up to 100% operational status. Preferably, the remedial action can be made without bringing the system down.
In many modern multiprocessor systems, fault tolerance is provided by identifying and shutting down faulty processors and assigning their tasks to other functional processors. However, faults are not limited to processors and may occur in other portions of the system such as, e.g., interconnection traces and connector pins. While these are easily tested when the system powers up, testing for faults while the system is running presents a much greater challenge. This may be a particularly crucial issue in systems that are "hot-swappable", i.e., systems that allow boards to be removed and replaced during normal operation so as to permit the system to be always available to users, even while the system is being repaired.
Further, some multiprocessor systems include a system controller, which is a dedicated processor or subsystem for configuring and allocating resources (processors and memory) among various tasks. Fault tolerance for these systems may be provided in the form of a "back-up" system controller. It is desirable for the primary and redundant system controllers to each have the ability to disable the other if the other is determined to be faulty. Further, it is desirable to be able to test either of the two subsystems during normal system operation without disrupting the normal system operation. This would be particularly useful for systems that allow the system controllers to be hot-swapped.
Accordingly, there is disclosed herein a multiprocessor system that employs an apparatus and method for caging a redundant component to allow testing of the redundant component without interfering with normal system operation. In one embodiment, the multiprocessor system includes at least two system controllers and a set of processing nodes interconnected by a network. The system controllers allocate and configure system resources, and the processing nodes each include a node interface that couples the node to the system controllers. The node interfaces can be individually and separately configured in a caged mode and an uncaged mode. In the uncaged mode, the node interface communicates information from either of the system controllers to other components in the processing node. In the caged mode, the node interface censors information from at least one of the system controllers. When all node interfaces censor information from a common system controller, this system controller is effectively "caged" and communications from this system controller are thereby prevented from reaching other node components. This allows the caged system controller, along with all its associated interconnections, to be tested without interfering with normal operation of the system. Normal system configuration tasks are handled by the uncaged system controller. The uncaged system controller can instruct the node interfaces to uncage the caged system controller if the tests are successfully completed.
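By way of illustration only, the caged and uncaged modes described above may be modeled in software as follows. This is a minimal sketch and not the disclosed hardware: the names `Mode`, `NodeInterface`, `set_mode`, `receive`, and `is_caged`, and the controller identifiers `"sc0"` and `"sc1"`, are hypothetical and do not appear in the specification. The sketch assumes that each node interface keeps one mode setting per system controller and that a controller counts as caged only when every node interface censors it.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    UNCAGED = "uncaged"
    CAGED = "caged"


@dataclass
class NodeInterface:
    """Hypothetical model of one node interface (names are illustrative)."""
    # Per-controller mode; a controller with no entry is treated as uncaged.
    modes: dict = field(default_factory=dict)
    # Messages that were passed on to the other components of the node.
    delivered: list = field(default_factory=list)

    def set_mode(self, controller_id: str, mode: Mode) -> None:
        self.modes[controller_id] = mode

    def receive(self, controller_id: str, message: str) -> bool:
        """Forward the message unless its source controller is caged here."""
        if self.modes.get(controller_id, Mode.UNCAGED) is Mode.CAGED:
            return False  # censored: nothing reaches the node components
        self.delivered.append((controller_id, message))
        return True


def is_caged(interfaces: list, controller_id: str) -> bool:
    """A controller is effectively caged only if every interface censors it."""
    return bool(interfaces) and all(
        iface.modes.get(controller_id, Mode.UNCAGED) is Mode.CAGED
        for iface in interfaces
    )


if __name__ == "__main__":
    nodes = [NodeInterface() for _ in range(3)]

    # Cage the redundant controller "sc1" at every node interface.
    for node in nodes:
        node.set_mode("sc1", Mode.CAGED)

    # Test traffic from the caged controller is censored, while
    # configuration traffic from the uncaged controller "sc0" still flows.
    print(nodes[0].receive("sc1", "test pattern"))
    print(nodes[0].receive("sc0", "configure"))

    # After the tests pass, the uncaged controller instructs each
    # interface to uncage "sc1".
    for node in nodes:
        node.set_mode("sc1", Mode.UNCAGED)
    print(is_caged(nodes, "sc1"))
```

Keeping the mode per interface and per controller mirrors the statement that node interfaces are individually and separately configurable. If even one interface remains uncaged, the controller is not treated as caged in this model.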