1. Field of the Invention
This invention relates to creation of standby servers and near clones for rapid failover provisioning in a networked computing environment.
2. Background of the Invention
As business demand increases, the depth of technology improvements and its intelligence to handle multiple processes becomes highly desirable and crucial. In any enterprise operation, it is difficult to effectively manage the ever-fluctuating resources available, while maximizing the resource utilization.
In fact, Information Technology (“IT”) costs can become very expensive when maintaining sufficient resources to meet peak requirements. Furthermore, user inputs are generally required to facilitate such processes, which incurs additional costs in both time and human resource demand.
To address these needs, many large vendors of enterprise computing systems, such as International Business Machines (“IBM”), Hewlett-Packard (“HP”), Microsoft Corporation, and Sun Microsystems (“Sun”), have begun to develop and deploy infrastructure technologies which are self-managing and self-healing. HP's self-managed computing architecture is referred to “Utility Computing” or “Utility Data Center”, while Sun has dubbed their initiative “N1”. IBM has applied terms such as “Autonomic Computing”, “Grid Computing”, and “On-Demand Computing” to their various architecture and research projects in this area. While each vendor has announced differences in their approaches and architectures, each shares the goal of providing large-scale computing systems which self-manage and self-heal to one degree or another.
For example, IBM's Autonomic Computing is a self-managing computing model which is patterned on the human body's autonomic nervous system, controlling a computing environment's application programs and platforms without user input, similar to the way a human's autonomic nervous system regulates certain body functions without conscious decisions.
Additionally, IBM has defined their On-Demand Computing technology as an enterprise whose business processes, integrated end-to-end across the company and with key partners, suppliers, and customers, can respond quickly to any customer demand, market opportunity, or external threat.
“Provisioning” is a term used to describe various aspects of managing computing environments, and which often implies different things to different parties. Throughout the present disclosure, we will use the term “provision” or “provisioning” to refer to a sequence of activities that need to happen in a specific order in order to realize a computing environment to meet specific needs and requirements. The activities have dependencies on previous activities, and typically include:                (a) selecting appropriately capable hardware for the requirements, including processor speed, memory, disk storage, etc.;        (b) installing operating system(s);        (c) remotely booting networks;        (d) configuring networks such as Virtual Private Networks (“VPN”) and storage environments like Storage Area Network (“SAN”) or Network Attached Storage (“NAS”); and        (e) deprovisioning resources that are no longer needed back into an available pool.        
Operating environments in large data centers have become increasingly complex. These data centers usually require a long time to modify their environments, so most provision for the worst-case scenario, often configuring more hardware than is needed just in case a peak requirement is experienced. As a result, most hardware and software resources are under-used, increasing the costs of the system considerably. Furthermore, the issue of surges beyond what has been provisioned remains unaddressed (e.g. peak demands above the anticipated peak load).
In fact, provisioning is typically a time and labor consuming process consisting of hundreds of distinct and complex steps, and requiring highly skilled system and network administrators. For example, server provisioning is the process of taking a server from “bare metal” to the state of running live business transactions. During this provisioning process, many problems may arise such as increases in resource expense and declines in level of performance, which in turn can lead to dissatisfied customers and unavailability in services.
Because these are predictable issues, automation can be employed to manage these problems. One objective of the various self-managed computing systems being offered by the major vendor is to automate to an extent as great as possible these provisioning activities, and especially to allow for near real-time reactions to changes in system requirements and demands, with little or no human administrator intervention. For example, IBM's Tivoli™ Provisioning Manager (“TPM”) Rapid Provisioning is a modular and flexible set of workflows and scripts for the IBM Tivoli Intelligent Orchestrator product. The workflows have been generalized and packaged for customization by customers seeking to accelerate their provisioning process. They can be used as a starting point for an organization in automating not only their server provisioning process, but also other IT processes.
Other products currently offered by the major vendors include HP's OpenView OS Manager using Radia which is a policy-based provisioning and ongoing automated management tool for a variety of operating systems, and Sun's N1 Grid Service Provisioning System automates to some degree the provisioning of applications.
“Disaster Recovery” is a broad term used in information technology to refer to the actions required to bring computing resources back online after a failure of the existing system, be it a small failure such as the failure of a single heavily loaded server among many servers, or a large failure such as loss of power or communications to an entire computing center. These types of disasters may be caused by failure rates of the components (e.g. hardware and software failures), as well as by non-computing factors such as natural disasters (e.g. tornados, hurricanes, earthquakes, floods, etc.) and other technical disasters (e.g. power outages, virus attacks, etc.).
To recover from a disaster, a computing center must re-provision new servers and systems to replace the processing which was being performed by the previous system(s). Often times, the recovery is performed in a different geographic area, but sometimes the recovery is performed in the same physical or geographic location, depending on the nature of the disaster or failure.
Many businesses which employ or rely upon enterprise computing, create disaster recovery plans to be better prepared when the occasion arise. However, current technology only allows for dedicated servers to be implemented. Each server is typically committed to one purpose or application (e.g. a “solution”), whether it is to meet a new customer requirement (e.g. a “production system”), or to be solely used as a backup server for an existing server that may crash in the near future. When these dedicated servers are not in use, the overall IT maintenance costs increase while excess resources remain idle and unused. It is important to note that in order to save critical time during recovery, when a server is configured as a backup of a production server, the backup server's configuration is usually matched to the production server's configuration so that there is no provisioning time required to bring the backup server online and operational.
Disaster recovery implementation remains challenging even though provisioning with orchestration enables new approaches that are not dependent on high availability operating environments such as IBM's z/OS mainframe operating system, clusters, and addresses. When a disaster occurs, the server will either be reinstalled or once it reaches the end of its usefulness, it will be replaced by a newer version with more features and higher reliability.
During recovery and the process of bringing on line a backup server, often network issues arise, such as Internet Protocol (“IP”) address conflicts, during a period when a degraded or partially operating production server and a newly started backup server operate at the same time.
Further, moving either static configuration data or dynamic state data from a failed or degraded production server to the backup server remains a complicated and difficult procedure, as well.
As a result, once a production server has been deployed in a production environment, it usually is used until a disaster happens, which repeats the provisioning process again while its old implementation problems remains unresolved.
Turning to FIG. 3, a logical view of how one available provisioning manager manages an application cluster (30). The management server (36) gathers the information on resources and then the management services (37, 37′, 37′″) monitor any processes currently being performed or executed. The network pool (31) includes components such as routers, switches, fabrics and load balancers for the network environment. The application pool (32) typically includes a first tier of the applications operating on the servers, such as a databases (e.g. IBM DB2, Oracle, etc.), running on top of server platform for suite (e.g. IBM WebSphere or equivalent).
The application resource pool (33) is a group of available, unassigned, unprovisioned servers that can be provisioned (38) into the active application pool. The back-end resource pool (34) contains any backup servers necessary for the application pool (32), such as another set of database servers or web server. The backend pool (35) serves as the collection or group of available servers that have been provisioned (38′) from the back-end resource pool (34).
As such, during disaster recovery, the aforementioned tedious and laborious provisioning activities may have to be performed to realize many servers and many configures, selected from several pools, in order to restore an enterprise.
Another growing business trend in the industry is “sub-capacity pricing”. Subcapacity licensing permits a customer to license a product for use on less than the full capacity as determined by the number of processors of the computing platform. For example, with certain IBM products such as IBM iSeries and AS/400 hardware platforms platforms running OS/400 V4R4, or later, the system is isolated to a logical partition using IBM-provided tools. With subcapacity pricing, the number of processors which need to be licensed is the aggregation of all processors, including partial shares of a processor, across all partitions of a single machine (where the program is defined), rounded up to the next highest whole number. Thus, a customer can obtain considerable costs savings by sub-licensing, if a platform supports it. However, this presents a specific challenge to configuring backup servers and replacement servers, especially in situations where configuration time is of the essence.
Additionally, some pools of servers may be governed by overall licensing restrictions, such as a restriction as to how many copies of a particular program may be running simultaneously across multiple servers in the pool. Taking into account these limitations can also present a considerable challenge to quickly configuring a replacement or backup server.