A Cloud Computing model offers inexpensive on-demand computing facilities, giving end users an incentive to move away from managing their own information technology (IT) infrastructures. Such a model offers infrastructure and software services on demand, so that end users need not maintain their own facilities to perform an IT task. Instead, end users use the computing facilities and software supplied by a provider (also referred to as a service provider). This generally requires the end users to have only a computer with minimal processing power that can connect to the Internet or a network and provide a “screen” through which commands can be submitted to the provider's computing facilities. Providers of Cloud Computing services typically use large infrastructures to leverage economies of scale, and use virtualization on top of physical hardware (servers) to improve resource utilization. Managing such large infrastructures is extremely labor-intensive, which does not bode well for a paradigm shift in the industry, because Cloud Computing services should be offered at competitive prices to induce end users to adopt the new model. The answer may be automation of data center, middleware, and application management processes. One very labor-intensive area is incident management in large data centers. Cloud Computing service providers have addressed that problem with automation: monitoring IT infrastructure elements for evidence of faults (or for predicting impending ones) and taking simple corrective actions by leveraging some type of decision support system (e.g., a rule or policy engine or a finite state machine engine). Such a system is used to represent knowledge of how to handle faults (incidents) and to exercise that knowledge in real time to provide automated incident management.
This disclosure describes a framework, based on the formalism of Finite State Machines (FSMs) used in the field of Computer Science, for building an Automated Incident Management System (AIMS) for large Cloud Computing environments. The basic approach is to represent the policies for managing an IT element in an FSM definition (or type), and to track each deployed IT element using an instance of that FSM. Building upon off-the-shelf FSM engines available commercially or as open source software, the disclosure describes how to provide the properties of scalability, persistence, and fault tolerance to an FSM-based AIMS. Automation should be robust and provide certain guarantees about being able to survive faults in its own execution environment. The framework described here provides fault tolerance properties to an FSM engine for a fail-stop fault model. Furthermore, scalability is provided for handling large Cloud Computing infrastructures in which many IT elements (hardware, middleware, application) may be tracked by FSM instances, and persistence is provided for proper modeling and tracing of IT infrastructure elements whose operational life cycle can span months or years.
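By way of illustration only, the distinction between an FSM type and an FSM instance, together with a simple form of persistence, might be sketched as follows. This is a minimal sketch under stated assumptions and not the implementation of the disclosure: the names `FsmType`, `FsmInstance`, `InstanceStore`, the `server` policy, and its states and events are all hypothetical, and a local JSON file stands in for whatever durable store an actual AIMS would use. Scalability across many instances is not addressed here.

```python
import json
import os
import tempfile
from dataclasses import dataclass


@dataclass
class FsmType:
    """Hypothetical FSM definition: the policy for one kind of IT element."""
    name: str
    initial: str
    transitions: dict  # maps (state, event) -> next state


@dataclass
class FsmInstance:
    """Hypothetical FSM instance: tracks one deployed IT element."""
    element_id: str
    fsm_type: FsmType
    state: str = ""

    def __post_init__(self):
        # A new instance starts in the initial state of its type.
        if not self.state:
            self.state = self.fsm_type.initial

    def fire(self, event):
        # Events with no matching transition are ignored in this sketch.
        key = (self.state, event)
        self.state = self.fsm_type.transitions.get(key, self.state)
        return self.state


class InstanceStore:
    """Assumed stand-in for durable storage: one JSON file of instance states."""

    def __init__(self, path):
        self.path = path

    def _load_all(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

    def save(self, instance):
        data = self._load_all()
        data[instance.element_id] = {
            "type": instance.fsm_type.name,
            "state": instance.state,
        }
        # Write to a temporary file, then replace the old file in one step,
        # so a crash mid-write does not leave a partially written file.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(data, f)
        os.replace(tmp, self.path)

    def restore(self, element_id, types):
        record = self._load_all()[element_id]
        return FsmInstance(element_id, types[record["type"]], record["state"])


# Hypothetical incident-handling policy for a server element.
SERVER = FsmType(
    name="server",
    initial="healthy",
    transitions={
        ("healthy", "disk_full"): "cleaning",
        ("cleaning", "cleanup_ok"): "healthy",
        ("cleaning", "cleanup_failed"): "escalated",
    },
)


def demo(path):
    types = {SERVER.name: SERVER}
    store = InstanceStore(path)
    inst = FsmInstance("srv-001", SERVER)
    inst.fire("disk_full")
    store.save(inst)
    # Simulate a fail-stop crash of the engine: the in-memory instance is
    # lost, and a restarted engine rebuilds it from the persisted state.
    del inst
    recovered = store.restore("srv-001", types)
    return recovered.fire("cleanup_ok")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(demo(os.path.join(d, "fsm_state.json")))
```

In this sketch the FSM type holds the policy once, while each instance carries only an element identifier and a current state. That keeps the persisted record per element small, and it allows a restarted engine to resume tracking an element from its last saved state rather than from the initial state.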