This invention relates to the field of fault management. In particular, the invention relates to fault management in virtual computing environments.
It is common to run operating systems in virtual environments, where they in turn run applications that implement a range of services. Each Virtual Machine (VM) replicates a physical computer but is run under a hypervisor on a physical host machine. A host machine can host several VMs. To maximise host machine utilisation and increase fault tolerance, VMs are often run on a cluster of host machines. If one host machine fails, its VMs can be moved (or migrated) to run on another host machine in the cluster.
Faults may occur on VMs in much the same way as they occur on physical machines. Fault management systems can be used to detect and monitor these faults and report them to an operator, allowing rapid resolution. For example, IBM Netcool is a service level management system that collects enterprise-wide event information, including fault events, from many different network data sources (IBM and Netcool are trade marks of International Business Machines Corporation).
In a virtual environment, faults on a VM may be caused by faults on the host hypervisor system that is running the VM. If a single host is running many VMs, a fault on that host can result in a flood of reported faults that are not caused by faults on the VMs themselves. Such a flood is confusing and time consuming for an operator to work through, which delays resolution. Furthermore, even if hypervisor fault monitoring is also implemented, the (often less severe) root cause fault on the hypervisor can be lost in the flood of VM fault events and overlooked by the operator.
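The following is a minimal illustrative sketch of the problem described above, not part of any existing fault management product. The `FaultEvent` record, its fields, the `summarise` helper and the host and VM names are all hypothetical and chosen only for illustration. It shows how a single hypervisor event can end up below many VM events when an event list is ordered by severity.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultEvent:
    """Hypothetical fault event record (not a real product schema)."""
    source: str                  # VM or host that raised the event
    host: str                    # physical host the source runs on
    severity: int                # higher means more severe
    is_hypervisor: bool = False  # True if raised by the host hypervisor


def summarise(events):
    """Return {host: (vm_event_count, hypervisor_event_count)}."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event.host].append(event)
    summary = {}
    for host, host_events in grouped.items():
        hypervisor_count = sum(1 for e in host_events if e.is_hypervisor)
        summary[host] = (len(host_events) - hypervisor_count, hypervisor_count)
    return summary


# Eight VMs on one host each report a severe fault; the hypervisor on that
# host reports a single, less severe fault that is the actual root cause.
events = [FaultEvent(f"vm{i}", "host-a", severity=5) for i in range(8)]
events.append(FaultEvent("host-a", "host-a", severity=3, is_hypervisor=True))
events.append(FaultEvent("vm9", "host-b", severity=4))

# An event list ordered by severity places the root cause at the bottom.
by_severity = sorted(events, key=lambda e: e.severity, reverse=True)
summary = summarise(events)
```

In this sketch the operator would see nine more severe VM events before reaching the hypervisor event, whereas grouping the same events by host makes it apparent that eight of them share a host with a reported hypervisor fault.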
In addition, one way of resolving some faults on VMs is to move them to a different physical host machine. Migration resolves some problems immediately, but traditional fault monitoring systems running on these VMs can be slow to register the change in status and clear the fault.