1. Field of the Invention
The present invention is related to Virtual Machines, and more particularly, to handling guest I/O device timeouts generated in the host OS.
2. Description of the Related Art
A Virtual Machine (VM) is a type of isolated Virtual Environment, and multiple VMs can run on the same physical machine simultaneously. Each VM instance has its own set of software components (including an OS) and uses hardware modules of the physical machine where the VM resides.
Often, there are multiple VMs running on a host operating system. In such a system, some resources of the host operating system are isolated and allocated for running each of the VMs. With VM technology, a user can create and run multiple virtual environments on a server at the same time. Each virtual environment, such as a VM, requires its own Guest Operating System (GOS) and can run applications independently.
One common problem that many modern Virtual Machines face is that the guest operating system frequently makes requests to its own virtual hardware devices (which are, in fact, emulated by the VMM or hypervisor) and waits for a response from the device. Examples of devices that are accessed in this way are disk drives, DVD drives, CD-ROM drives, some network access devices, and so on. Any operating system (whether virtualized or not) sets a timeout period for the device to respond, typically on the order of 5 seconds for network devices, 10 seconds for hard drives, 30 seconds for DVD drives, and so on. As far as the guest OS is concerned, the timeout is the same as if real hardware were involved, since the guest OS does not realize that it is a guest and believes that it is working with real hardware.
If the virtual device does not respond within the timeout period, the operating system will typically make one more request to the device, and in some cases two more requests, normally with the same timeout period. If the device, such as the disk drive, still has not responded, the guest OS normally enters some sort of fail mode: as far as it is concerned, its file system is inaccessible or is treated as read-only, and the only way to recover the Virtual Machine is to restart it from scratch or from some previous state once the hardware device in question is back online. Note that although the guest OS assumes that the device has failed, this is not necessarily the case when virtualized systems are involved. For example, the device might be in use by other Virtual Machines or by the host OS, or the device might be a network storage device, i.e., the physical device is actually located remotely and may be temporarily inaccessible due to network connection issues, network protocol issues, and so on.
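The timeout-and-retry behavior described above can be illustrated with a minimal sketch. It is a simplified model rather than the logic of any particular operating system: the function name guest_io_request, its parameters, and the returned status strings are hypothetical, and time is simulated instead of measured.

```python
from typing import List, Tuple


def guest_io_request(
    latencies_s: List[float],
    timeout_s: float = 10.0,   # e.g. the hard-drive timeout mentioned in the text
    extra_attempts: int = 2,   # "one more request ... in some cases two more"
) -> Tuple[str, int, float]:
    """Simulate a guest OS issuing a device request with a fixed timeout.

    latencies_s[i] is the (hypothetical) time the emulated device would take
    to answer attempt i; a missing entry means the device never answers.
    Returns (status, attempts_made, total_time_waited_in_seconds).
    """
    waited = 0.0
    total_attempts = 1 + extra_attempts
    for attempt in range(total_attempts):
        if attempt < len(latencies_s):
            latency = latencies_s[attempt]
        else:
            latency = float("inf")  # no response at all
        if latency <= timeout_s:
            # The device answered within the timeout period.
            return ("ok", attempt + 1, waited + latency)
        # Timeout expired; the guest waits the full period, then retries.
        waited += timeout_s
    # All attempts timed out: the guest treats the device as failed.
    return ("fail_mode", total_attempts, waited)


if __name__ == "__main__":
    print(guest_io_request([3.0]))               # ('ok', 1, 3.0)
    print(guest_io_request([12.0, 4.0]))         # ('ok', 2, 14.0)
    print(guest_io_request([25.0, 25.0, 25.0]))  # ('fail_mode', 3, 30.0)
```

In the last call the device is merely slow (25 seconds per response, for instance because of a network delay on the host side), yet the simulated guest still ends up in fail mode after 30 seconds of waiting.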
The upshot is that, although the inaccessibility of the device is temporary, the guest OS running inside the Virtual Machine assumes (as any operating system would in this case) that the failure to respond is permanent, and will therefore return an error and/or crash. Note that this applies to guest OSes with dynamic translation, and with hardware support for virtualization (where the problem “commands” from the guest are intercepted by the hypervisor and replaced with safe commands). Examples of such virtualization systems include products from VMware and IBM's z/VM. Examples of full virtualization of MICROSOFT Windows may be found in the VMWARE ESX server, MICROSOFT Virtual PC, Parallels Desktop and so on.
The same problem also affects paravirtualization schemes, where the guest OS kernel is modified only in a relatively minor manner and given the ability to access real hardware, and where the hypervisor provides the host OS with a guest API. This is so even though paravirtualized device drivers are aware of the existence of the host OS and of the time lags involved in accessing devices (and therefore do not always post timeouts). Setting a timeout is one way to determine that the physical hardware device is unavailable or is turned off. In this case, however, the paravirtualization software can ask the host OS directly for the reason the device is not responding, and then decide what the guest OS should do: re-send the request to the device, or shut down the guest OS. Examples of paravirtualization systems using LINUX are XEN and UML (User-mode LINUX). KVM, a LINUX-based hypervisor, can likewise run Windows guests with paravirtualized device drivers.
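The decision that paravirtualization software can make after asking the host OS why a device is not responding might look roughly as follows. This is a sketch under stated assumptions: the status values, the action names, and the function decide_on_timeout are hypothetical and do not correspond to the API of XEN, UML, KVM, or any other real system.

```python
from enum import Enum


class HostDeviceStatus(Enum):
    """Hypothetical reasons the host OS could report for a silent device."""
    BUSY = "busy"                    # in use by another VM or by the host OS
    NETWORK_DELAY = "network_delay"  # remote storage temporarily unreachable
    OFFLINE = "offline"              # device unavailable or turned off


class GuestAction(Enum):
    """The two outcomes named in the text."""
    RESEND = "resend"
    SHUT_DOWN = "shut_down"


def decide_on_timeout(status: HostDeviceStatus) -> GuestAction:
    """Pick the guest's next step from the host-reported reason.

    Temporary conditions lead to re-sending the request; a device that is
    really unavailable leads to shutting down the guest OS.
    """
    if status in (HostDeviceStatus.BUSY, HostDeviceStatus.NETWORK_DELAY):
        return GuestAction.RESEND
    return GuestAction.SHUT_DOWN


if __name__ == "__main__":
    for status in HostDeviceStatus:
        print(status.name, "->", decide_on_timeout(status).name)
```

Which conditions count as temporary is an assumption of this sketch; the text only states that the choice is between re-sending the request and shutting down the guest OS.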
Once the guest OS “hangs”, the only way to reanimate it is to reload it from scratch. Such OS hangs usually occur when there are problems with the HDD, while problems with network devices may cause less extreme OS behavior. The 10-second timeout period (for a hard drive, for example) is justified if the operating system in question is the host OS, installed on a host machine, where the host OS addresses the devices using its own native drivers. On the other hand, if the request to access a device actually comes from a guest OS, then the standard timeout periods are frequently insufficient, since many intermediate processes are involved before the request finally reaches the device and is returned with some value or data. This is a particular problem in the context of network file systems and network-based storage devices, where the time lags are even greater.
Accordingly, there is a need in the art for a mechanism to handle device I/O timeouts for Virtual Machines that addresses the uncertainty in device response time due to virtualization issues.