A server system is a computer system used in mission-critical applications. Firmware, which is software that directly controls the hardware of the server system, is installed in the server system.
A plurality of CELLs can be mounted in an enterprise server system, and a host operating system (OS) can be run with one or more CELLs bundled together. A CELL is a baseboard corresponding to the motherboard of a personal computer. To control the CELL itself, each CELL carries a management board (MGMT), and the firmware (hereinafter referred to as “BMCFW”) operates on the MGMT. The MGMT may be duplicated so that system operation can continue if a failure occurs in the MGMT.
FIG. 1 is a block diagram illustrating an example of the CELL mounted in a server system according to the background art. FIG. 1 illustrates one CELL 100, and the CELL 100 has one MGMT 101. The MGMT 101 is a management board composed of various hardware components, and FIG. 1 illustrates only some of the hardware components required for a BMCFW 102 to operate.
In the example of FIG. 1, the BMCFW 102 stored in a FLASHROM (non-volatile flash memory) is activated by a service processor (SP) 103 upon power-on of the MGMT 101 and then starts its software operation. The SP 103, which is also referred to as a “baseboard management controller (BMC)”, is a microcontroller serving as the control center of the MGMT 101. The SP 103 includes a Central Processing Unit (CPU) and various connector interfaces such as a serial port, a LAN port, and a USB port.
The operating system of the BMCFW 102 is loaded into a memory 104 and starts program operation on the memory 104. The operating system is, for example, Embedded Linux. The memory 104 is a synchronous dynamic random access memory (SDRAM). The data content of the memory 104 is erased by a reset of the SP 103 or by power disconnection of the MGMT 101, so its retention is not guaranteed.
Stall monitoring of the BMCFW 102 is performed in hardware by a Programmable Logic Device (PLD) 105. Here, a stall means a halted (stopped) state of the BMCFW 102. If a stall is detected, the PLD 105 resets the MGMT 101 and restarts the BMCFW 102.
FIG. 2 is a flowchart illustrating the flow of processing performed in the system of FIG. 1. The BMCFW 102 of FIG. 1 is based on Embedded Linux, in which application programs run on an operating system core called the kernel. If a failure occurs inside the kernel (step S100 in FIG. 2), the kernel halts (step S101 in FIG. 2).
The PLD 105 constantly monitors the BMCFW 102. If no reply is received from the BMCFW 102 for a certain period of time, the PLD 105 determines that the BMCFW 102 has halted (step S102 in FIG. 2). This function of the PLD 105 is referred to as a “Watchdog Timer (WDT)”.
Upon activation of the WDT, the PLD 105 issues a reset to the SP 103 (step S103 in FIG. 2). Then, the BMCFW 102 is released from the halt state and restarted (step S104 in FIG. 2). Upon activation of the BMCFW 102, the system operation is restarted (step S105 in FIG. 2). In this method, the use of the WDT allows the BMCFW 102 to be released from its halt state.
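The flow of steps S100 to S105 can be illustrated with a small software simulation. The following is only a minimal sketch under stated assumptions, not the hardware logic of the PLD 105: the class names `Firmware` and `WatchdogTimer`, the function `simulate`, the tick-based notion of time, and the timeout of 5 ticks are all hypothetical and chosen for illustration.

```python
class Firmware:
    """Hypothetical stand-in for the BMCFW: either running or halted."""

    def __init__(self):
        self.halted = False
        self.reset_count = 0

    def reset(self):
        # Corresponds to S103/S104: a reset releases the halt state and restarts.
        self.halted = False
        self.reset_count += 1


class WatchdogTimer:
    """Hypothetical stand-in for the WDT function of the PLD."""

    def __init__(self, timeout, on_expire):
        self.timeout = timeout        # assumed timeout, in simulation ticks
        self.on_expire = on_expire    # action taken when no reply arrives in time
        self.last_reply = 0

    def kick(self, now):
        # The firmware replies, so the monitoring interval starts over.
        self.last_reply = now

    def poll(self, now):
        # Corresponds to S102: no reply for `timeout` ticks is treated as a halt.
        if now - self.last_reply >= self.timeout:
            self.on_expire()
            self.last_reply = now
            return True
        return False


def simulate(halt_at=3, timeout=5, ticks=12):
    """Run the halt/detect/reset sequence and return the firmware and an event log."""
    fw = Firmware()
    wdt = WatchdogTimer(timeout, on_expire=fw.reset)
    events = []
    for now in range(ticks):
        if now == halt_at:
            fw.halted = True          # S100/S101: kernel failure, firmware halts
            events.append((now, "halt"))
        if not fw.halted:
            wdt.kick(now)             # a running firmware replies on every tick
        if wdt.poll(now):
            events.append((now, "reset"))
    return fw, events


if __name__ == "__main__":
    fw, events = simulate()
    print(events, fw.reset_count, fw.halted)
```

In this sketch the firmware stops replying at tick 3, its last reply having been recorded at tick 2, so the watchdog expires once 5 ticks have passed without a reply and issues a single reset, after which the firmware replies again.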
The following Patent Documents 1 to 4 disclose techniques relating to the above-described server system.