This invention relates to a virtual machine, and more particularly, to a technology for performing error recovery of an I/O device without stopping a guest OS running on a virtual machine.
With improvements in performance and enhancement in functionality of open-system servers, server virtualization software (a hypervisor or VMM) is widely used as a method of making effective use of the processor cores mounted on physical servers. The hypervisor creates a plurality of virtual machines on one physical server, and causes an OS and applications to run on each of the virtual machines. In recent years, along with improvements in processor performance, it has become common to operate ten or more virtual machines (LPARs) on one physical server. However, as the number of virtual machines running on the physical server increases, two problems with the I/O device have emerged.
Problem 1 (performance problem): The intermediation of the hypervisor is essential in order to realize “I/O sharing”, in which one physical I/O adapter is made to behave as a plurality of virtual I/O adapters. The overhead of this intermediation limits the I/O performance available to the virtual machine.
Problem 2 (reliability problem): Until now, open-system servers have not been provided with a mechanism for transmitting a detected I/O adapter error to software such as an OS. Therefore, if an error occurs in the physical I/O adapter, the entire physical server always goes down because the scale or type of the error cannot be determined, which brings down all the virtual machines running on the physical server.
In order to solve those problems, PCI-SIG has specified two standards: I/O Virtualization (IOV) and Advanced Error Reporting (AER).
The IOV is a system for providing the main portion of the “I/O sharing” in hardware in order to solve the above-mentioned performance problem. With the use of the IOV, the intermediation of the hypervisor is limited to low-frequency processing such as initialization, and hence high I/O performance becomes available to the virtual machine.
Meanwhile, the AER is a system for transmitting information on an I/O adapter error to software such as an OS in order to solve the above-mentioned reliability problem, as described in PCI Express® Base Specification Revision 2.1 (§7.10. Advanced Error Reporting Capability). With the AER, software such as an OS can determine the seriousness of the error, and if the error is minor, the software can recover the I/O adapter by a method such as resetting and continue operating the physical server and the virtual machines.
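The decision that the AER makes possible can be illustrated with a minimal sketch. It assumes the three AER error classes (correctable, uncorrectable non-fatal, and uncorrectable fatal); the names `Severity`, `Action`, `POLICY`, and `choose_action`, as well as the mapping from severity to response, are hypothetical and chosen only for illustration. They are not taken from the specification or from any particular OS or hypervisor.

```python
from enum import Enum


class Severity(Enum):
    """AER error classes reported to the software."""
    CORRECTABLE = 1   # corrected by hardware, reported for information
    NONFATAL = 2      # uncorrectable, non-fatal
    FATAL = 3         # uncorrectable, fatal


class Action(Enum):
    """Hypothetical responses available to the software."""
    CONTINUE = "log the error and continue"
    RESET_ADAPTER = "reset the I/O adapter and continue"
    STOP_SERVER = "stop the physical server"


# Assumed policy, for illustration only: a real system may also try to
# recover from a fatal error before giving up.
POLICY = {
    Severity.CORRECTABLE: Action.CONTINUE,
    Severity.NONFATAL: Action.RESET_ADAPTER,
    Severity.FATAL: Action.STOP_SERVER,
}


def choose_action(severity=None):
    """Pick a response for a reported I/O adapter error.

    With no severity information (the situation without the AER), the
    scale of the error is unknown, so the only safe choice is to stop
    the physical server.
    """
    if severity is None:
        return Action.STOP_SERVER
    return POLICY[severity]
```

In this sketch, calling `choose_action(Severity.NONFATAL)` selects the adapter reset, so the physical server and the virtual machines keep running, whereas calling `choose_action()` with no severity information corresponds to the reliability problem described above.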
It should be noted that, as a technology that utilizes the IOV, US 2009/133016 discloses a basic operation of the hypervisor. As a method of handling an error of a physical function (hereinafter referred to as “PF”) of an I/O adapter compatible with the IOV, US 2009/133016 utilizes hot plugging of the PF, that is, a technology for mounting/removing the PF while the server is running. With the use of the technology described in US 2009/133016, if an error occurs in the PF, maintenance personnel, an administrator, or the like can replace the I/O adapter.