1. Field of the Invention
The present invention relates to a multiprocessor configuration and, particularly, to fault recovery in a multiprocessor system.
2. Description of the Related Art
Many software applications can benefit from being distributed across a plurality of processors. Using multiple processors helps increase the processing capacity of the system and provide resiliency to the application in case a failure occurs in a processing component. Further, partitioning the application functions across sets of processing elements can simplify the design of the system. In order to distribute the processing of a software application across a multiprocessor arrangement, the processors need to communicate with one another.
In conventional multiprocessor configurations, multiple processors can be implemented in a processing group. Such processing groups include an access point, which is linked to each of the processors in the group. For example, the access point may comprise a switching element capable of channeling incoming and outgoing data to and from any of the connected processors in the processing group.
Multiple processing groups can transfer data amongst each other by connecting the access points of the processing groups with communication lines. For example, the access points may be connected in series (i.e., using a daisy chain connection) by the communication lines, thus providing a series connection between the processing groups.
In one particular example, a processing group may be implemented as a circuit pack that plugs into a chassis, or shelf. A plurality of such shelves can be mounted in a cabinet. As described above, the processing groups of each shelf may be connected in series via the access points, allowing the processing groups in the cabinet to communicate with one another.
The total number of processors in a single processing group may be limited by factors including the number of processors per plug-in board, the number of plug-in boards per shelf, and the number of shelves per cabinet. To further increase the available processing capacity, multiple cabinets can be connected together into a single communications network.
While such multiprocessor configurations can provide a large number of processors, they also increase the number of potential system failures that can affect performance. Such failures can include the failure of a particular processor, the failure of an entire processing group, and the failure of multiple successively connected processing groups (e.g., resulting from the failure of an entire cabinet) in a multiprocessor configuration. Many of these types of failures can cause some of the surviving components to be isolated from each other and therefore unable to communicate with one another.
For example, such isolation may occur when the failure of a processing group renders its access point inoperable. This can result in a discontinuity in the series connection of processing groups. In other words, processing groups connected on one side of the failed processing group in the series connection cannot communicate with those processing groups that are connected on the other side.
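The discontinuity described above can be illustrated with a minimal sketch. It assumes a simplified model that is not taken from any particular system: processing groups are numbered 0 to N-1 along the series connection, and a failed group's access point forwards no traffic. The function name reachable_segments and its parameters are hypothetical and are used only for illustration.

```python
def reachable_segments(num_groups, failed):
    """Split a series (daisy chain) of processing groups into the sets of
    surviving groups that can still communicate with one another.

    Assumption: groups are indexed 0..num_groups-1 in chain order, and a
    failed group's access point passes no data in either direction.
    """
    segments = []
    current = []
    for group in range(num_groups):
        if group in failed:
            # A failed access point breaks the chain at this position.
            if current:
                segments.append(current)
            current = []
        else:
            current.append(group)
    if current:
        segments.append(current)
    return segments


# Five groups in series; the failure of group 2 isolates {0, 1} from {3, 4}.
print(reachable_segments(5, {2}))  # [[0, 1], [3, 4]]
```

In this model, each returned list is a set of surviving processing groups that remain mutually reachable. More than one list indicates that surviving components have been isolated from each other.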
Thus, the interconnection scheme implemented for the processing groups plays a critical role in the degree to which the system can recover from component, shelf, or cabinet failures. Fault recovery algorithms that are executed for the purpose of detecting such failures and recovering the remaining parts of the system of processors also play a critical part in determining the effectiveness of the system's recovery capabilities.