Cloud computing environments enable the provisioning of infrastructure, platforms, and software, all of which are generalized as resources available as a service via an Application Programming Interface (API), most commonly over a network protocol or web service API. When these cloud resources are used, Fault rates vary in acquiring, for example, the server instances that will run a user's application.
In existing systems, the process of interacting with resource APIs includes the following steps:
A user client application makes a request, through the Cloud's API, that requires resources from the Cloud.
The servers implementing the Cloud API allocate, from the infrastructure available in the Cloud, all or part of the resources required to fulfill the request.
Once the resources are allocated, the request is fulfilled.
As an example of current systems, a request for server instances, or for a service that requires server instances in order to be implemented, can be fulfilled and the result given to a user. In practice, some of the nodes provided are not functioning properly and would not execute their intended workload due to various potential faults. Specifically, in today's cloud environments, Fault rates of 0.5% to 40% occur for server instances; when faults occur, they cause the requested service, or the systems for which the user is using the resources, to not function properly. This problem becomes particularly acute with larger numbers of requested resources, as even small Fault rates can represent large numbers of faulty resources.
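The scaling effect described above can be illustrated with a short calculation. The following is a minimal sketch that assumes each instance fails independently with the same probability; that independence assumption and the function names are illustrative only and are not taken from the disclosure.

```python
def expected_faulty(requested: int, fault_rate: float) -> float:
    """Expected number of faulty instances among `requested` instances."""
    return requested * fault_rate


def prob_at_least_one_faulty(requested: int, fault_rate: float) -> float:
    """Probability that at least one instance is faulty.

    Assumes independent faults (an illustrative simplification).
    """
    return 1.0 - (1.0 - fault_rate) ** requested


if __name__ == "__main__":
    # The 0.5% and 40% rates are the bounds quoted in the text above.
    for rate in (0.005, 0.40):
        for n in (10, 1000):
            print(
                f"rate={rate:.1%} n={n}: "
                f"expected faulty={expected_faulty(n, rate):.1f}, "
                f"P(at least one)={prob_at_least_one_faulty(n, rate):.3f}"
            )
```

Under these assumptions, a request for 1000 instances would be expected to include about 5 faulty instances at a 0.5% Fault rate and about 400 at a 40% Fault rate.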
The current art in adaptive cloud infrastructures deals with load (U.S. Pat. No. 8,458,717) and disaster recovery based scenarios (U.S. Pat. No. 8,381,015) without considering the health/viability of infrastructure within the cloud in general. The present invention addresses this shortcoming by disclosing a system and method for running checks and resolving errors in the infrastructure automatically, either as part of management software or cloud provider operations, which allows for efficient rerouting across healthy infrastructure resources.
In one aspect of the invention, software that manages the creation of individual infrastructure or clusters of infrastructure responds to a user request for more resources by acquiring them from a cloud provider, checking the resources provided by the cloud provider for faults, and resolving any faults appropriately, either by applying a fix or by requesting new or additional infrastructure. Faulty infrastructure may be held on to before new infrastructure is requested, or scripts may be used to resolve the fault or remove the infrastructure. The client request then receives fully working infrastructure for use.
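The acquire, check, and resolve sequence of this aspect might be sketched as follows. This is a hedged illustration, not the disclosed implementation: the callables `acquire`, `is_healthy`, `try_repair`, and `release`, and the `max_rounds` limit, are hypothetical names standing in for provider calls, health checks, and repair scripts.

```python
from typing import Callable, List, TypeVar

R = TypeVar("R")  # an opaque handle to a cloud resource (hypothetical)


def provision_healthy(
    count: int,
    acquire: Callable[[int], List[R]],
    is_healthy: Callable[[R], bool],
    try_repair: Callable[[R], bool],
    release: Callable[[R], None],
    max_rounds: int = 5,
) -> List[R]:
    """Return `count` resources that passed `is_healthy`, or raise."""
    healthy: List[R] = []
    held_faulty: List[R] = []
    for _ in range(max_rounds):
        needed = count - len(healthy)
        if needed == 0:
            break
        for resource in acquire(needed):
            # First check as delivered; if that fails, run the repair
            # script and check again.
            if is_healthy(resource) or (
                try_repair(resource) and is_healthy(resource)
            ):
                healthy.append(resource)
            else:
                # Hold the faulty resource while replacements are requested.
                held_faulty.append(resource)
    # Remove faulty infrastructure once no further requests are made.
    for resource in held_faulty:
        release(resource)
    if len(healthy) < count:
        raise RuntimeError(
            f"only {len(healthy)} of {count} healthy resources obtained"
        )
    return healthy
```

In this sketch, faulty resources are held until the loop finishes and only then released. One possible motivation, stated here as an assumption, is to reduce the chance that the provider hands the same faulty resource back on the next request.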
In another aspect of the invention, a cloud provider accepts web service requests to acquire virtual machine resource(s) or a platform that is powered by a cluster of virtual machine or bare metal resource(s). After a request for new instances comes in, the infrastructure required to respond to the request is either checked and resolved at request time or picked from an asynchronously determined list of healthy resources. The response to the web request, or the cluster of resources provisioned to provide a working service, would then contain a majority of healthy resources that have been vetted by various checks.
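The asynchronously determined list of healthy resources could be maintained along the following lines. This is a minimal sketch under stated assumptions: the class `HealthyPool`, its methods, and the `is_healthy` callable are hypothetical, and `refresh` is assumed to be invoked periodically off the request path (for example, from a background thread) rather than at request time.

```python
import threading
from typing import Callable, Generic, List, TypeVar

R = TypeVar("R")  # an opaque handle to a provider-side resource (hypothetical)


class HealthyPool(Generic[R]):
    """Keeps a list of resources that passed the most recent health check."""

    def __init__(self, is_healthy: Callable[[R], bool]) -> None:
        self._is_healthy = is_healthy
        self._candidates: List[R] = []  # not yet vetted
        self._healthy: List[R] = []     # vetted by the last refresh
        self._lock = threading.Lock()

    def add_candidates(self, resources: List[R]) -> None:
        """Register unvetted resources for the next refresh."""
        with self._lock:
            self._candidates.extend(resources)

    def refresh(self) -> None:
        """Re-check pooled resources and vet new candidates.

        The lock is held during the checks for simplicity; resources that
        fail are dropped from the pool.
        """
        with self._lock:
            pool = self._healthy + self._candidates
            self._candidates = []
            self._healthy = [r for r in pool if self._is_healthy(r)]

    def take(self, n: int) -> List[R]:
        """Hand out up to `n` vetted resources at request time."""
        with self._lock:
            taken, self._healthy = self._healthy[:n], self._healthy[n:]
            return taken
```

With this arrangement, a request handler only calls `take`, so the cost of running the checks is not added to the response time of the web request. Note that `take` may return fewer than `n` resources if the pool is short.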