1. Field of the Invention
The present invention generally relates to a method and system for handling duplicate or invalid service processor identifications (IDs) in a distributed service processor environment.
2. Description of the Related Art
A large computer (system) can contain a Service Processor (SP). The SP is an embedded computer that manages the system. The SP typically initializes and configures various system hardware, initiates the Initial Program Load (IPL) for the system, reports error and event logs to a Hardware Management Console (HMC), controls firmware updates for the system, and continuously monitors system health.
High-end systems can contain a number of SPs connected by a network (e.g., Ethernet). These SPs include a pair of redundant System Controllers (SCs) that manage the whole system, and a number of pairs of redundant Node Controllers (NCs) that control devices in the hardware subsystem in which they are located.
A pair of redundant SPs includes a primary SP and a backup SP. The primary SP carries out most of the tasks, and the backup SP is available to take over in the event that the primary SP fails. Thus, a system with 8 nodes can contain 18 separate SPs: two NCs in each node (16 in total) and a single pair of SCs.
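The SP count given above follows from simple arithmetic, sketched here for illustration. The function name and its parameters are hypothetical and are not taken from any actual firmware.

```python
def total_service_processors(nodes: int,
                             ncs_per_node: int = 2,
                             system_controllers: int = 2) -> int:
    """Count SPs: one redundant NC pair per node plus one redundant SC pair."""
    return nodes * ncs_per_node + system_controllers


print(total_service_processors(8))  # 8 * 2 + 2 = 18
```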
In a typical implementation, a software component on each SC keeps track of which NCs are present and handles communication with the NCs. The NCs are uniquely identified by a number called the Node Controller ID (NCID), which encodes the NC's location (e.g., NC4A for node 4, position A). Other software components on an SC can communicate with a particular NC by calling functions with a parameter specifying the NCID. The NCs flag themselves as present by repeatedly sending NC Present Messages (NPMs) containing their NCID to both SCs until a response is received from each SC. An NC determines its NCID by reading the status of hardware pins on the system backplane, which are hardwired to reflect the NC location.
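The NCID scheme and the NPM handshake described above can be sketched as follows. This is a minimal illustration under stated assumptions: the numeric layout (node number times two, plus a position index), the retry limit `max_attempts`, and all function names are hypothetical, since the text does not specify how the NCID is actually encoded or how NPMs are transported. Each SC is modeled as a callable that returns True once it responds.

```python
from typing import Callable, Optional, Sequence, Tuple

POSITIONS = ("A", "B")  # assumed: two NC positions per node


def encode_ncid(node: int, position: str) -> int:
    """Pack a location into one number (hypothetical layout)."""
    return node * len(POSITIONS) + POSITIONS.index(position)


def decode_ncid(ncid: int) -> Tuple[int, str]:
    """Recover (node, position) from an NCID built by encode_ncid."""
    node, pos_index = divmod(ncid, len(POSITIONS))
    return node, POSITIONS[pos_index]


def ncid_label(ncid: int) -> str:
    """Format an NCID in the 'NC4A' style used in the text."""
    node, position = decode_ncid(ncid)
    return f"NC{node}{position}"


def announce_presence(ncid: int,
                      system_controllers: Sequence[Callable[[int], bool]],
                      max_attempts: int = 5) -> Optional[int]:
    """Resend NPMs to every SC that has not yet responded.

    Returns the number of rounds needed, or None if some SC never
    responded within max_attempts (an assumed limit, for the sketch only).
    """
    pending = set(range(len(system_controllers)))
    for attempt in range(1, max_attempts + 1):
        for index in sorted(pending):
            if system_controllers[index](ncid):  # True means the SC responded
                pending.discard(index)
        if not pending:
            return attempt
    return None


def make_sc(respond_after: int) -> Callable[[int], bool]:
    """Build a stand-in SC that responds on its Nth received NPM."""
    calls = {"count": 0}

    def receive(ncid: int) -> bool:
        calls["count"] += 1
        return calls["count"] >= respond_after

    return receive


ncid = encode_ncid(4, "A")
print(ncid_label(ncid))  # NC4A
rounds = announce_presence(ncid, [make_sc(1), make_sc(2)])
print(rounds)  # 2
```

In the usage lines, the first stand-in SC responds immediately and the second responds only to its second NPM, so the NC needs two rounds before both SCs have acknowledged it.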
A problem occurs when there is a fault with the hardware pins that reflect the NC position. The fault may be caused by a misplug of the NC in the backplane, a short circuit, or a bent pin. As a result, the NC gets its location information, and therefore its NCID, wrong. The SC may then see an NC with either an Invalid NCID (e.g., an NCID outside the range of normal IDs) or a Duplicate NCID (e.g., an NCID that is already known to the SC). The SCs must have a way of dealing with these NCs.
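The distinction between Invalid and Duplicate NCIDs can be illustrated with a short sketch. The valid range of 16 IDs (8 nodes with two positions each), the use of a sender address to tell a retransmission from a genuine duplicate, and all names below are assumptions made for the example, not details given in the text.

```python
from enum import Enum
from typing import Dict


class NcidStatus(Enum):
    VALID = "valid"
    INVALID = "invalid"      # outside the range of normal IDs
    DUPLICATE = "duplicate"  # already claimed by a different NC


VALID_NCIDS = range(16)  # assumed: 8 nodes x 2 positions


def classify_ncid(ncid: int, sender: str, known: Dict[int, str]) -> NcidStatus:
    """Classify the NCID carried in an NPM.

    `known` maps each accepted NCID to the (hypothetical) network address
    of the NC that claimed it. A repeat NPM from the same sender is treated
    as a retransmission rather than as a duplicate.
    """
    if ncid not in VALID_NCIDS:
        return NcidStatus.INVALID
    if ncid in known and known[ncid] != sender:
        return NcidStatus.DUPLICATE
    return NcidStatus.VALID


known_ncs = {8: "addr-1"}
print(classify_ncid(99, "addr-2", known_ncs).value)  # invalid
print(classify_ncid(8, "addr-2", known_ncs).value)   # duplicate
print(classify_ncid(8, "addr-1", known_ncs).value)   # valid
```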
A standard solution may cope with this problem by simply ignoring an NC that sends an NPM with an Invalid or Duplicate NCID. This approach has two main problems. First, the NC continues to send NPMs, which are ignored but still fill up trace tables and consume processor resources. Second, it is difficult to extract debug data from the NC to diagnose the fault, because communication with the NC is never established.
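The first drawback of the ignore-only approach can be seen in a small simulation. This is only a sketch: the trace format, the function name, and the simplification that any repeat of an already accepted NCID counts as a duplicate are all assumptions for illustration.

```python
from typing import Iterable, List


def ignore_policy_trace(npm_ncids: Iterable[int],
                        valid_ncids: Iterable[int]) -> List[str]:
    """Return the trace entries an SC would log if it silently ignored
    every NPM carrying an Invalid or Duplicate NCID."""
    valid = set(valid_ncids)
    known = set()
    trace = []
    for ncid in npm_ncids:
        if ncid not in valid or ncid in known:
            # No response is sent, so the faulty NC keeps retrying.
            trace.append(f"ignored NPM with NCID {ncid}")
            continue
        known.add(ncid)
    return trace


# One good NC (NCID 8) followed by a faulty NC retrying with NCID 99.
trace = ignore_policy_trace([8, 99, 99, 99], range(16))
print(len(trace))  # 3
```

Because the faulty NC never receives a response, every retry adds another trace entry, and the SC never opens a channel through which debug data could be collected.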