The present invention relates to a method and apparatus for facilitating hardware fault management in a computer system, for example a computer server system.
It is known to provide a service controller in a computer system, for example a computer server system. The service controller can be implemented as a microprocessor, a microcontroller, special purpose logic, etc., separate from the main system processor(s), and is responsible for monitoring and reporting on system operation. The service controller can be responsible, for example, for monitoring environmental conditions (temperature, etc.) in the computer system. In addition, or alternatively, the service controller can be responsible for monitoring the system configuration and/or operation. This can include, for example, monitoring and configuring the hardware and software present in the system, for example where the system includes field replaceable units (FRUs). The service controller can also be responsible for monitoring the status of system resources, for example the health of the FRUs, voltage supply levels, fan operating parameters, etc.
In order to increase the reliability of computer systems, for example computer server systems, it is known to provide redundant components, so that if one component fails another like component can take over the functions of the failed component. For example, it is proposed to provide redundant service controllers. As each service controller needs to maintain a record of at least the current system information, there is a need to ensure that the respective records are the same. The process of making them the same is generally termed synchronization. However, synchronizing the stored system information can involve transferring significant quantities of data.
Accordingly, there is a need for an efficient way of maintaining and synchronizing such system information for a computer system comprising redundant service controllers.