The term cloud computing, in its most general sense, describes the use of computing resources (e.g. hardware and software) that are delivered as a service over a network (e.g. the Internet). The name derives from the now-common use of a cloud symbol in system diagrams as an abstraction for the complex infrastructure it contains. Put differently, cloud computing enables convenient, on-demand network access to a shared pool of configurable computing resources (e.g. networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. Cloud computing is typically described as a stack because its services are built on top of each other. These services can be divided into three main models, namely Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS), where IaaS is the most basic model. In brief, the different models can be described as follows:
SaaS cloud providers install and operate application software in the cloud and cloud users access the software from cloud clients. The cloud users do not manage the cloud infrastructure and platform on which the application is running.
PaaS cloud providers deliver a computing platform, typically including an operating system, a programming language execution environment, a database and a web server.
IaaS cloud providers offer computers, as physical or virtual machines, and other resources. IaaS is the focus of the current disclosure and will be described further below. In particular, the current disclosure is aimed at fault management in Infrastructure-as-a-Service (IaaS) cloud systems. Infrastructure-as-a-Service is a delivery model in which an organization outsources the physical equipment used to support operations, including storage, hardware, servers and networking components. Virtualization is usually used as a means to provide client isolation and resource multiplexing. The infrastructure requested by the client may typically scale with time (grow or shrink), and the client typically pays on a per-use basis.
Within cloud services, and in particular for IaaS clouds, one difficulty is how to handle faults that occur in e.g. the data center (physical servers, storage, network, etc.) and how these faults affect the applications or services that utilize the cloud. Since an application or service is unaware of the hardware functionality of the underlying infrastructure, it is difficult for it to implement any form of fault handling. Today there exist a number of different methods of handling faults in a cloud infrastructure.
Rackspace, one of the largest IaaS providers in the USA, has defined a Cloud Monitoring Application Programming Interface (API) [1]. That API is RESTful and allows for the creation of checks, alarms and notification plans, amongst others. If a check condition is detected, an alarm is fired and a notification plan can be used. However, a notification plan is a plain e-mail that is sent to a human operator, who in turn can act on the matter. A webhook is also supported, but it does not notify the application itself either.
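The check, alarm and notification plan flow described above can be illustrated with a minimal sketch. It is a generic model only and does not reproduce the Cloud Monitoring API of [1]; the names `NotificationPlan`, `Alarm`, `process` and the `disk-usage` example are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class NotificationPlan:
    """Hypothetical plan: a list of delivery callbacks (e.g. e-mail, webhook)."""
    targets: List[Callable[[str], None]] = field(default_factory=list)

    def notify(self, message: str) -> int:
        # Deliver the message to every configured target.
        for send in self.targets:
            send(message)
        return len(self.targets)


@dataclass
class Alarm:
    """Hypothetical alarm: fires its plan when the check condition holds."""
    name: str
    condition: Callable[[float], bool]
    plan: NotificationPlan

    def process(self, check_value: float) -> bool:
        # Evaluate the value reported by a check; notify only on a match.
        if self.condition(check_value):
            self.plan.notify(f"{self.name}: check returned {check_value}")
            return True
        return False


if __name__ == "__main__":
    operator_inbox: List[str] = []  # stands in for an operator's mailbox
    plan = NotificationPlan(targets=[operator_inbox.append])
    alarm = Alarm("disk-usage", lambda v: v > 90.0, plan)
    alarm.process(95.0)
    print(operator_inbox)
```

In this sketch the only recipient is the operator's inbox. The application running on the infrastructure is never part of the notification path, which is the limitation noted above.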
Amazon Web Services (AWS) has also provided solutions related to that problem. A good example is Amazon CloudWatch [2], a service that monitors the performance of virtual machines (VMs) through metrics. If a metric is outside of specified levels, the cloud management system may take actions. The defined metrics are CPU utilization, latency and request counts, or custom metrics. The alarm states reported are OK, ALARM, or INSUFFICIENT_DATA (no reading was possible).
However, this solution focuses only on the performance of VMs and does not provide a full framework for fault monitoring and notification.
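The metric-threshold model described above, in which a reading is compared against specified levels and mapped to one of three states, can be sketched as follows. This is an illustrative assumption about the general technique, not the CloudWatch API of [2]; the function `evaluate_alarm` and its `lower` and `upper` parameters are hypothetical.

```python
from enum import Enum
from typing import Optional


class AlarmState(Enum):
    OK = "OK"
    ALARM = "ALARM"
    INSUFFICIENT_DATA = "INSUFFICIENT_DATA"


def evaluate_alarm(reading: Optional[float], lower: float, upper: float) -> AlarmState:
    """Map a single metric reading to an alarm state (hypothetical helper)."""
    if reading is None:
        # No reading was possible for this period.
        return AlarmState.INSUFFICIENT_DATA
    if lower <= reading <= upper:
        # The metric is within the specified levels.
        return AlarmState.OK
    # The metric is outside the specified levels.
    return AlarmState.ALARM


if __name__ == "__main__":
    # CPU utilization in percent; None models a missing sample.
    for cpu in (35.0, 97.5, None):
        print(cpu, evaluate_alarm(cpu, lower=0.0, upper=80.0).value)
```

Such an evaluation only reflects the performance of a VM as seen through its metrics. It carries no information about which underlying physical fault, if any, caused the deviation.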
Another relevant service is the Amazon Simple Notification Service [3]. It provides a way for applications to notify subscribers of events and is typically used for implementing the application logic itself. It builds on top of the Simple Queue Service (a service similar to the one used in this invention). However, this is only a notification bus like any other; it does not provide a solution for fault monitoring.
Consequently, there is a need for methods and arrangements for enabling improved fault monitoring and management in IaaS clouds.