1. Field of the Invention
The present invention relates generally to the art of designing multi-tier system architectures, and more particularly to producing a design or set of designs meeting a high-level set of performance and availability requirements.
2. Description of the Related Art
Certain businesses or other organizations deploying Internet and Enterprise services utilize components performing within multiple tiers. In such an environment, service downtime and poor performance either among individual components or within tiers can reduce overall productivity, revenue, and client satisfaction. The challenge in such an environment is to operate at efficient or sufficiently optimal levels of availability, where availability is defined as the fraction of time the service delivers a specified acceptable level of performance. Acceptable levels of performance may vary depending on the organization's business mission.
Component failure within the infrastructure supporting a service can adversely impact service availability. A “service” is a process that may run on one or more computing hardware components, and perhaps a large number of such components, including servers, storage devices, network elements, and so forth. Many of the hardware components run various collections and layers of software components, such as operating systems, device drivers, middleware platforms, and high-level applications. Performance of these components may be characterized by quantifiable statistics, including but not limited to component failure rates. Even if each individual component has a low failure rate in isolation, the combined infrastructure, having many such components, can experience a significant aggregate rate of component failures. This significant component failure rate can in turn lead to frequent or extended periods of unplanned service downtime or poor performance.
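The compounding effect described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that component failures are independent and exponentially distributed; the failure rate, component counts, and function names are hypothetical and are not taken from any referenced system.

```python
import math


def aggregate_failure_rate(per_component_rate: float, count: int) -> float:
    """Expected failures per hour across `count` independent components."""
    return per_component_rate * count


def prob_any_failure(per_component_rate: float, count: int, hours: float) -> float:
    """Probability that at least one component fails within `hours`,
    assuming independent, exponentially distributed times to failure."""
    rate = aggregate_failure_rate(per_component_rate, count)
    return 1.0 - math.exp(-rate * hours)


# Hypothetical figure: each component fails on average once per 100,000 hours.
RATE = 1.0 / 100_000
MONTH_HOURS = 720.0

single = prob_any_failure(RATE, 1, MONTH_HOURS)    # one component, one month
fleet = prob_any_failure(RATE, 1000, MONTH_HOURS)  # 1000 components, one month
print(f"single component: {single:.4f}, 1000 components: {fleet:.4f}")
```

Under these assumptions a single component is unlikely to fail in a given month, while an infrastructure of one thousand such components almost certainly experiences at least one failure in the same period.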
A further challenge in such an environment is to assess service availability and performance as a function of the different design choices, including the type of components to be used, the number of these components, and the associated hardware and software configurations, and then to select the design choice that satisfies the performance and availability requirements of the service at minimum or near-minimum cost.
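The selection problem stated above may be sketched as an exhaustive search over a small design space. This is an illustrative sketch only, not the method of any referenced tool: the component catalog, its cost, capacity, and availability figures, and all names are hypothetical, and component failures are assumed to be independent.

```python
import math
from dataclasses import dataclass
from typing import Optional

# Hypothetical catalog: cost, capacity, and availability of one unit of each type.
CATALOG = {
    "small": {"cost": 1000.0, "capacity": 100.0, "availability": 0.99},
    "large": {"cost": 3500.0, "capacity": 400.0, "availability": 0.995},
}


@dataclass(frozen=True)
class Design:
    component_type: str
    count: int


def design_cost(design: Design) -> float:
    return CATALOG[design.component_type]["cost"] * design.count


def design_availability(design: Design, required_load: float) -> float:
    """Probability that enough units are up to carry `required_load`,
    assuming units fail independently."""
    spec = CATALOG[design.component_type]
    needed = math.ceil(required_load / spec["capacity"])
    n, a = design.count, spec["availability"]
    if needed > n:
        return 0.0
    return sum(math.comb(n, k) * a**k * (1 - a) ** (n - k)
               for k in range(needed, n + 1))


def cheapest_design(required_load: float, min_availability: float,
                    max_count: int = 10) -> Optional[Design]:
    """Enumerate every (type, count) pair and return the lowest-cost design
    that meets the availability requirement, or None if none does."""
    candidates = (Design(t, n) for t in CATALOG for n in range(1, max_count + 1))
    feasible = [d for d in candidates
                if design_availability(d, required_load) >= min_availability]
    return min(feasible, key=design_cost, default=None)


best = cheapest_design(required_load=300.0, min_availability=0.999)
print(best, design_cost(best) if best else None)
```

With these hypothetical figures, the search selects four "small" units (one spare beyond the three needed for the load) over two "large" units, because both meet the availability requirement and the former costs less. An exhaustive enumeration of this kind is practical only for very small design spaces.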
Previously available assessment tools have been unable to automatically find a solution from this multi-dimensional design space that provides an enhanced cost-benefit tradeoff assessment to the user.
Currently available tools to select a design typically enable evaluation of only a single design. Because such tools evaluate only single designs, system design has entailed employing human experts to manually define alternative designs satisfying the specific availability requirements. A primary disadvantage of the current approach is the need to employ an expert to carry out the design. Such experts may be in scarce supply or be relatively expensive. In addition, assessment and design according to the expert process is largely manual and likely slow. Finally, the results of the manual design process are not necessarily optimal, since they are guided mostly by experience and intuition rather than by a systematic algorithm for searching the large, multi-dimensional space of candidate designs.
Automated design and configuration of systems to meet users' availability requirements exists in very few situations. One system, an Oracle database design, implements a function that automatically determines when to flush data and logs to persistent storage such that the recovery time after a failure is likely to meet a user-specified bound. Automated design of storage systems to meet user requirements for data dependability has also been considered, encompassing both data availability and data loss. Such technologies for automating subsystems, such as databases and storage systems, tend to be domain specific and generally cannot be applied to designing multi-tier systems.
Certain previous attempts to manage component and configuration availability have been limited to automated monitoring and automated response to failure events and other such triggers. For example, cluster failover products such as HP MC/Serviceguard, Sun Cluster, and TruCluster detect nodes that fail, automatically transition failed application components to surviving nodes, and reintegrate failed nodes to active service upon recovery from the failure condition. IBM Director detects resource exhaustion in its software components and automates the rejuvenation of these components at appropriate intervals. Various utility computing efforts underway will also automatically detect failed components and automatically replace them with equivalent components from a free pool. Notably, none of these products or processes provides an overall assessment for particular architectures; each merely reacts upon failure of a process, component, or tier.
One solution to providing automated design of multi-tier architectures is provided in U.S. patent application Ser. No. 10/850,784, entitled “Method and Apparatus for Designing Multi-Tier Systems,” inventors Gopalakrishnan Janakiraman et al., filed May 20, 2004 (the “Janakiraman reference”). This design provided for automated design of multi-tier systems, including a searchable and partitionable model and modeling solution usable in, among various scenarios, assessing design costs and selecting a design having a lowest cost.
The foregoing systems and implementations do not, however, account for different service characteristics, where certain services may exhibit different scalability properties. Certain services may only be able to run in a cluster with a predetermined number of resources. Other services may have the ability to run in one of multiple configuration options, each with a different number of resources, but cannot change the number of resources dynamically, that is, while the service is operational. Still other types of services can change the number of resources used dynamically. The previous approaches, including the Janakiraman reference, cannot represent these different types of services.
Previous systems also do not account for a failure in one resource affecting the remaining resources supporting the service. The failure of one resource can cause other resources to fail. For example, a failure of one resource or node in an application that requires communication among nodes can cause the entire application to fail. Such a cluster-wide failure scope has not been addressed in previous solutions, and knowledge and assessment of such characteristics are important to correctly model the availability of services having this type of failure behavior.
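The effect of failure scope on modeled availability can be seen in a simplified comparison. The sketch below is an assumption-laden illustration rather than the model of any referenced solution: it assumes independent node failures, treats availability as a steady-state probability, ignores the time needed to restart the whole application, and uses hypothetical node counts and availability values.

```python
import math


def availability_node_scope(node_availability: float, nodes: int, needed: int) -> float:
    """Failure affects only the failed node: the service is up whenever
    at least `needed` of the `nodes` are up."""
    a = node_availability
    return sum(math.comb(nodes, k) * a**k * (1 - a) ** (nodes - k)
               for k in range(needed, nodes + 1))


def availability_cluster_scope(node_availability: float, nodes: int) -> float:
    """Failure of any one node brings down the entire application:
    the service is up only when every node is up."""
    return node_availability ** nodes


# Hypothetical cluster: 8 nodes, each up 99.9% of the time, 7 needed for the load.
node_scope = availability_node_scope(0.999, nodes=8, needed=7)
cluster_scope = availability_cluster_scope(0.999, nodes=8)
print(f"node-scope: {node_scope:.6f}, cluster-scope: {cluster_scope:.6f}")
```

Under these assumptions, the same hardware yields a markedly lower availability when a single node failure propagates to the whole cluster, which is why a model that ignores failure scope would overestimate the availability of such services.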
Further, previous solutions do not offer the ability to represent certain types of availability mechanisms in assessing the availability of a service. Availability mechanisms are mechanisms that change the availability characteristics of a service, such as times to failure, service levels, and so forth. The Janakiraman reference specifically represents availability mechanisms that affect repair time associated with failures. Other classes of availability mechanisms that affect other attributes, such as software rejuvenation techniques and checkpoint/restart mechanisms, may be employed in certain designs but are not considered in prior solutions.
In addition, the Janakiraman reference represents parameters describing system characteristics using only constant numeric and string values. Neither that solution nor any other known solution can use general functions to describe performance characteristics of services and mechanisms and cost functions of components and mechanisms.
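The distinction between constant-valued parameters and parameters expressed as general functions may be sketched as follows. The type alias, function names, scaling exponent, and pricing figures are all hypothetical and serve only to illustrate the distinction; they do not describe any referenced system.

```python
from typing import Callable, Union

# A parameter is either a constant or a function of the number of resources.
Param = Union[float, Callable[[int], float]]


def evaluate(param: Param, resources: int) -> float:
    """Return the parameter value for a given number of resources."""
    return param(resources) if callable(param) else float(param)


# Constant-valued parameter: the only form a constants-only model can express.
constant_throughput: Param = 200.0


def sublinear_throughput(resources: int) -> float:
    """Hypothetical performance function: throughput scales sublinearly."""
    return 200.0 * resources ** 0.9


def discounted_cost(resources: int) -> float:
    """Hypothetical cost function: unit price drops at ten or more units."""
    unit_price = 1000.0 if resources < 10 else 900.0
    return unit_price * resources


for n in (1, 4, 10):
    print(n, evaluate(constant_throughput, n),
          round(evaluate(sublinear_throughput, n), 1),
          evaluate(discounted_cost, n))
```

A constants-only representation can express the first parameter but not the latter two, whose values depend on the number of resources in the design.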
Based on the foregoing, it would be advantageous to offer a system and method for designing multi-tier systems that improves on previously known solutions by supporting a wider range of services and design options.