(a) Field of the Invention
The present invention relates to a multi-processor system and, more particularly, to an improvement of the processing for recovering from a failure in the multi-processor system.
(b) Description of the Related Art
In recent multi-processor systems, especially open multi-processor systems running Windows or Unix (trade marks), there is a tendency to enhance the remote access service (RAS) functions of the platform for controlling the system configuration, error logging, and recovery from a failure in association with the operating system, drivers and applications.
In the meantime, the system platform of the multi-processor system has grown in scale to meet diversifying user needs, and there is also a demand for separating the multi-processor system into a plurality of partitions, each capable of operating as an independent system, so that a plurality of operating systems can run thereon.
Under these circumstances, it is expected that in the near future a large-scale multi-processor system will be separated into a plurality of partitions, each providing functions by which resources can be flexibly added or removed depending on the load in the partition, and by which failed resources can be immediately and automatically replaced with backup resources provided in the system for this purpose. It is also expected that demand will increase for a consolidated platform in which a plurality of multi-processor systems are consolidated to reduce system costs.
It is generally important in a multi-processor system to recover accurately from a system failure. Patent Publication JP-A-2001-134546, for example, describes a technique for recovering from a failure in a multi-processor system wherein a single service processor controls a plurality of nodes.
However, the above publication is silent on the control of a consolidated multi-processor system having a plurality of node groups each including a plurality of nodes, wherein a plurality of nodes belonging to different groups are selected to form an independent system. In such a system, a failure may extend over a plurality of node groups, and thus recovery from the failure is not assured by the technique described in the publication.
In view of the above problem of the conventional technique, it is an object of the present invention to provide a large-scale multi-processor system which is capable of immediately and assuredly recovering from a failure, the large-scale multi-processor system including a plurality of node groups, each of which includes a plurality of nodes and a service processor for controlling the plurality of nodes.
It is another object of the present invention to provide a method used in such a large-scale multi-processor system.
The present invention provides, in one aspect thereof, a multi-processor system including: a plurality of node groups each including a plurality of nodes and a service processor for managing the plurality of nodes; a service processor manager for managing the service processors of the plurality of node groups; a network for interconnecting the plurality of nodes of the plurality of node groups; and a partition including a selected number of nodes selected from the plurality of nodes of the plurality of node groups, wherein: a failed node among the selected number of nodes transmits failure information including occurrence of a failure to a corresponding service processor, which prepares first status information of the failed node based on error log information of the failed node and transmits the first status information to the service processor manager; the failed node transmits failure notification data including the failure information to other nodes of the selected number of nodes; the other nodes transmit the failure information to the respective service processors, which prepare second status information based on error log information of the other nodes and transmit the second status information to the service processor manager; and the service processor manager identifies a location of the failed node based on the first and second status information and instructs the service processors in the partition to recover from the failure.
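The arrangement recited above can be illustrated by a minimal data-model sketch. It is not taken from the specification: the class names `Node` and `Partition`, the helper `involved_service_processors`, and the identifiers such as "A0" are hypothetical, and it assumes one service processor per node group, identified by the group's id. It only shows that a partition built from nodes of different node groups involves more than one service processor.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    node_id: str   # hypothetical identifier, e.g. "A0"
    group_id: str  # node group whose service processor manages this node


@dataclass
class Partition:
    name: str
    nodes: list = field(default_factory=list)

    def involved_service_processors(self):
        """Group ids of every service processor that manages a node here."""
        return sorted({node.group_id for node in self.nodes})


# Three node groups (A, B, C), each assumed to have its own service processor.
all_nodes = [Node(f"{g}{i}", g) for g in ("A", "B", "C") for i in range(4)]

# A partition formed from nodes selected out of two different node groups.
partition = Partition(
    name="P1",
    nodes=[n for n in all_nodes if n.node_id in ("A0", "A1", "B2")],
)

print(partition.involved_service_processors())
```

Because the partition spans groups A and B, a failure inside it concerns two service processors, which is the situation the service processor manager is introduced to coordinate.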
The present invention also provides a method for recovering from a failure in a multi-processor system including: a plurality of node groups each including a plurality of nodes and a service processor for managing the plurality of nodes; a service processor manager for managing the service processors of the plurality of node groups; a network for interconnecting the plurality of nodes of the plurality of node groups; and a partition including a selected number of nodes selected from the plurality of nodes of the plurality of node groups, the method including the steps of: transmitting failure information including occurrence of a failure from a failed node among the selected number of nodes to a corresponding service processor, thereby allowing the corresponding service processor to prepare first status information of the failed node based on error log information of the failed node and transmit the first status information to the service processor manager; transmitting failure notification data including the failure information from the failed node to other nodes of the selected number of nodes; transmitting the failure information from the other nodes to the respective service processors, thereby allowing the service processors to prepare second status information based on error log information of the other nodes and transmit the second status information to the service processor manager; and allowing the service processor manager to identify a location of the failed node based on the first and second status information and instruct the service processors in the partition to recover from the failure.
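The steps of the method can be traced with a small simulation. This is a sketch under stated assumptions rather than the claimed implementation: the class and function names, the dictionary representation of a partition, and the error log strings are all hypothetical. The specification does not state here how the service processor manager compares the first and second status information, so the sketch simply treats the node named in the single first status report as the failed node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatusInfo:
    node_id: str    # node whose error log was read
    group_id: str   # node group (and service processor) of that node
    kind: str       # "first" = failed node, "second" = other partition node
    error_log: str  # hypothetical text; the source gives no log format


class ServiceProcessorManager:
    def __init__(self):
        self.reports = []

    def receive(self, status):
        self.reports.append(status)

    def identify_and_recover(self):
        """Return (failed node id, group ids instructed to recover)."""
        first = [r for r in self.reports if r.kind == "first"]
        if len(first) != 1:
            raise ValueError("expected exactly one first status report")
        groups = sorted({r.group_id for r in self.reports})
        return first[0].node_id, groups


class ServiceProcessor:
    def __init__(self, group_id, manager):
        self.group_id = group_id
        self.manager = manager

    def on_failure_info(self, node_id, error_log, is_failed_node):
        # Prepare status information from the node's error log and forward it.
        kind = "first" if is_failed_node else "second"
        self.manager.receive(StatusInfo(node_id, self.group_id, kind, error_log))


def simulate_failure(partition, failed_node_id):
    """partition maps node id -> group id (hypothetical representation)."""
    manager = ServiceProcessorManager()
    sps = {g: ServiceProcessor(g, manager) for g in set(partition.values())}
    # Step 1: the failed node reports to its own service processor.
    sps[partition[failed_node_id]].on_failure_info(
        failed_node_id, "fatal error detected", True)
    # Steps 2 and 3: the other partition nodes are notified and each
    # reports to the service processor of its own node group.
    for node_id, group_id in partition.items():
        if node_id != failed_node_id:
            sps[group_id].on_failure_info(
                node_id, f"notified of failure in {failed_node_id}", False)
    # Step 4: the manager locates the failure and selects who must recover.
    return manager.identify_and_recover()


partition = {"A0": "A", "A1": "A", "B0": "B", "C2": "C"}
print(simulate_failure(partition, "B0"))
```

In this sketch every service processor that manages a node of the partition ends up in the recovery list, including those in node groups other than that of the failed node.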
In accordance with the method and system of the present invention, since the service processor manager receives error log information of the respective nodes from the service processor managing the failed node and the service processors managing the other nodes belonging to the partition to which the failed node belongs, the service processor manager can correctly identify the location and state of the failure and thus allow the system to quickly and assuredly recover from the failure.
The above and other objects, features and advantages of the present invention will be more apparent from the following description, referring to the accompanying drawings.