This invention relates to error processing technology in a computer system.
In recent years, vertically integrated systems have drawn attention. A vertically integrated system is composed of hardware such as a storage apparatus, a server apparatus, and a network apparatus; software such as a database, an application, and middleware; and a tool for unified management of the hardware and software.
For a vertically integrated system, a vendor that provides services and apparatuses builds a system suited to the business needs of the user and provides the system to the user. The user can quickly procure an optimum system for the business when needed and begin operating the system immediately.
The vertically integrated system can employ a configuration where the server apparatus is closely connected with the storage apparatus to achieve high I/O performance. In such a configuration, a plurality of computers are connected through devices capable of high-speed data communication using PCI Express (hereinafter, PCIe) or the like. For example, the server apparatus and the storage apparatus are connected via a PCIe-compliant I/O device.
In some systems having the above-described configuration, the power supply unit that supplies power to the device is different from the power supply units that supply power to the apparatuses. In a case where the power supply unit for the device is stopped in a system with such a configuration, the server apparatus or the storage apparatus detects the stop as an error of the device (hardware error).
For example, in a case where a graphics processing unit (GPU) driven by external power is connected to the server apparatus, the power supply unit for the server apparatus is different from the power supply unit for the GPU. As another example, in a case where the server apparatus and the storage apparatus are connected via a PCIe-compliant non-transparent bridge (NTB) device and the NTB device is driven by power supplied by the power supply unit for the server apparatus, the power supply unit for the storage apparatus is different from the power supply unit for the NTB device.
In a system where the server apparatus and the storage apparatus are connected via the NTB device, the PCIe link is disconnected in a case where the power supply unit for the server apparatus is stopped as scheduled. The storage apparatus detects the disconnection as an error of the NTB device and performs predetermined error processing. As a result, processing to block the NTB device is performed or an instruction to replace the NTB device is issued. Usually, once the processing to block the NTB device has been performed, the NTB device is not used until it is replaced with a new one.
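The conventional behavior described above can be illustrated with a minimal sketch. All names here (DeviceState, NtbDevice, on_link_down_conventional, on_link_up) are hypothetical and are not taken from any actual storage firmware; for simplicity the sketch both blocks the device and returns a replacement instruction, although a real apparatus may do only one of the two.

```python
from dataclasses import dataclass
from enum import Enum


class DeviceState(Enum):
    ACTIVE = "active"
    BLOCKED = "blocked"  # not used again until the device is replaced


@dataclass
class NtbDevice:
    state: DeviceState = DeviceState.ACTIVE


def on_link_down_conventional(device: NtbDevice) -> str:
    """Hypothetical conventional handler: every link-down is a device error."""
    device.state = DeviceState.BLOCKED
    return "replace NTB device"


def on_link_up(device: NtbDevice) -> bool:
    """Return True only if the device may be used when the link comes back."""
    return device.state is DeviceState.ACTIVE


ntb = NtbDevice()
instruction = on_link_down_conventional(ntb)  # scheduled stop of server power
usable_after_restart = on_link_up(ntb)        # server power is restored
```

In this sketch the device stays blocked after server power returns, even though nothing was wrong with the device itself.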
However, the disconnection of the PCIe link is caused by the stop of the power supply unit of the server apparatus, and no error has occurred in the NTB device. Accordingly, in a case where the power supply unit of the server apparatus is reactivated, the existing NTB device should be used again automatically without being replaced with a new one. In other words, in a case where the power supply unit supplying power to the NTB device stops, it is necessary to prevent the server apparatus or the storage apparatus from detecting the stop as an error of the NTB device.
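The requirement stated above can be expressed as a small decision rule. The following is only a sketch of the requirement, not the method of this disclosure: it assumes that the detecting apparatus can somehow learn whether the NTB device is still receiving power, which the text so far does not explain. The names LinkDownCause, classify_link_down, and should_block are hypothetical.

```python
from enum import Enum


class LinkDownCause(Enum):
    POWER_STOP = "power_stop"
    DEVICE_ERROR = "device_error"


def classify_link_down(device_power_supplied: bool) -> LinkDownCause:
    # Assumption: the power state of the NTB device is observable here.
    if not device_power_supplied:
        return LinkDownCause.POWER_STOP
    # A link-down while power is still supplied is treated here as a device error.
    return LinkDownCause.DEVICE_ERROR


def should_block(cause: LinkDownCause) -> bool:
    # Only a device error justifies blocking the device until replacement.
    return cause is LinkDownCause.DEVICE_ERROR
```

Under this rule, a link-down caused by a power stop leaves the device unblocked, so it can be used again once power is restored.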
To solve the foregoing problem, a traditional technique hot-removes the NTB device from the server apparatus or the storage apparatus before stopping the power supply unit of the server apparatus. This technique requires that the operations to hot-remove the NTB device from the storage apparatus be performed before the operations to stop the power supply unit of the server apparatus. Teaching this procedure to the user or the maintenance person and ensuring that it is strictly followed are difficult, so there is a possibility of operational mistakes.
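The ordering constraint of the traditional technique can be sketched as a small simulation of what the storage apparatus observes. The function run_shutdown and the step names "hot_remove_ntb" and "stop_server_power" are hypothetical labels chosen for illustration.

```python
def run_shutdown(steps: list[str]) -> str:
    """Simulate the storage apparatus's view of a server shutdown procedure.

    steps is an ordered list of "hot_remove_ntb" and "stop_server_power".
    """
    hot_removed = False
    for step in steps:
        if step == "hot_remove_ntb":
            hot_removed = True
        elif step == "stop_server_power":
            # The power stop drops the PCIe link. It is harmless only if the
            # NTB device was hot-removed beforehand.
            return "ok" if hot_removed else "ntb_error_detected"
        else:
            raise ValueError(f"unknown step: {step}")
    return "server_still_powered"
```

Calling run_shutdown(["hot_remove_ntb", "stop_server_power"]) yields "ok", whereas skipping or reordering the hot-remove step yields "ntb_error_detected", which is the operational mistake the procedure is prone to.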
Another problem is that, if an error occurs in the power supply unit of the server apparatus, the storage apparatus may wrongly detect it as a hardware error of the NTB device.
Accordingly, in a case where the power to the NTB device stops before the hot-remove operations are performed on the storage apparatus, it is necessary to prevent the storage apparatus from detecting the stop as an error of the NTB device.
U.S. Pat. No. 8,140,922 B discloses a method of sending detailed information on an error to the connected server apparatus or storage apparatus when the error occurs.