In a cloud computing environment, computing is delivered as a service rather than a product, whereby shared resources, software and information are provided to computers and other devices as a metered service over a network, such as the Internet. In such an environment, computation, software, data access and storage services are provided to users without requiring the users to have knowledge of the physical location and configuration of the system that delivers the services.
The functions of the cloud computing environment are performed by a data center, which includes disparate hardware components (e.g., storage controllers, network switches, physical compute machines) that are integrated with one another. Currently, hardware failures, such as central processing unit core failures, dual in-line memory module failures, adapter card failures, etc., are reported to the hardware management components and may later be reported to the customers.
Since the data centers of cloud computing environments can be large (having a large number of hardware and software components) and complex, the failure reporting can be complex and exhaustive. For the same reason, response systems have difficulty in responding to such hardware failures in a manner that ensures continuity of service that meets the customer's service requirements. Such response systems respond to hardware failures by locating alternative devices to continue the processing of the failed hardware, without understanding the context of the software running on the hardware. For example, a response system may respond to a failure of a compute machine by transferring the processing of the failed compute machine to a new compute machine. Because the context of the software running on the hardware is not taken into consideration, other alternatives that may be viable, such as creating a new virtual machine to make up for the lost capacity, are not considered. As a result, such response systems are deficient in responding to hardware failures, thereby degrading system performance.
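The deficiency described above may be illustrated with a minimal sketch. The sketch is not taken from any actual response system; the names `ComputeMachine`, `Workload`, `scalable`, `context_unaware_response` and `context_aware_options` are hypothetical and are introduced solely for illustration. It assumes that a workload's software context can be reduced to a single flag indicating whether its lost capacity could be replaced by a new virtual machine.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ComputeMachine:
    name: str
    free_cores: int  # cores not yet allocated to any workload


@dataclass
class Workload:
    name: str
    cores_needed: int
    # Hypothetical stand-in for "software context": True if lost capacity
    # can be made up by simply starting a new virtual machine.
    scalable: bool


def context_unaware_response(workload: Workload,
                             spares: List[ComputeMachine]) -> List[str]:
    """Locate an alternative machine; ignore what software was running."""
    for machine in spares:
        if machine.free_cores >= workload.cores_needed:
            return [f"migrate {workload.name} to {machine.name}"]
    return []


def context_aware_options(workload: Workload,
                          spares: List[ComputeMachine]) -> List[str]:
    """Also consider alternatives that depend on the software context."""
    options = context_unaware_response(workload, spares)
    if workload.scalable:
        for machine in spares:
            if machine.free_cores >= workload.cores_needed:
                options.append(
                    f"create new VM for {workload.name} on {machine.name}")
                break
    return options
```

Under these assumptions, the first function can only ever propose a transfer to another compute machine, whereas the second function additionally surfaces the creation of a new virtual machine when the software running on the failed hardware permits it.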