1. Field of the Invention
The present invention relates to characterization of availability and/or reliability of systems, and in particular, to methods of calculating and employing availability metrics that include outage and/or unavailability characterizations.
2. Description of the Related Art
Availability has become a universal concern for businesses of all sizes. Companies in every industry have developed an increased dependence on technology and information. Applications that make use of this information, such as data warehousing, data mining, enterprise resource planning, and email, have exploded in corporate infrastructures and have become essential to the conduct of daily business. Globalization of business requires 24-hour application availability and eliminates periods of “acceptable” downtime. In the fast-paced environment of Internet access, downtime for one business becomes an instantaneous opportunity for another. In such a circumstance, application downtime can jeopardize not only the immediate business opportunity but also the customer and its future potential.
Under these pressures, companies must examine the impact each application has on their business. Applications vary in their importance along a continuum from the most important, mission-critical applications to less important, task-critical applications. Mission-critical applications impact revenue or service and cannot tolerate downtime. Task-critical applications, by comparison, can handle some downtime because the primary effect of that downtime is inconvenience. By determining how critical an application is, an appropriate trade-off can be made between cost and availability. For example, a task-critical application can tolerate more downtime because the costs associated with that downtime are relatively low. A mission-critical application, by contrast, requires the highest availability because lost service is extremely costly. As the cost of downtime increases, businesses are challenged to improve their application availability.
To achieve maximum application availability, IT organizations must reduce both planned and unplanned downtime. Planned downtime results from known and predictable events that render the application unavailable for a predetermined amount of time. Examples of planned downtime include software and hardware upgrades. Unplanned downtime, by contrast, cannot be controlled and can occur as a result of human error or system failure. Although planned downtime accounts for the majority of total downtime, it is unplanned downtime that typically has the greatest business impact.
In order to meet the requirements of critical applications, IT managers must use a complete definition of “availability.” From an end user's perspective, application availability is not simply whether it is possible to access an application. The concept of availability must also consider the performance and behavior of the application, or in other words, the service level provided. For example, if an end user can connect to a web site, but it takes several minutes to load each page, he/she may abandon the site and look for an alternative. The end result is the same as if the site had been unavailable for connection. So, complete availability planning should address both application access and the quality of the service provided.
Downtime, whether planned or unplanned, is the result of process-, people- or product-related events and errors. Planned downtime, which includes software and application updates, is usually the result of necessary IT processes or product updates. Unplanned downtime has a different composition. According to industry analysts, process and people errors each account for 40% of unplanned downtime while product errors account for 20% of unplanned downtime. Process-, people- and product-related errors can be defined as follows. Process-related errors include those that result from poorly defined, planned or documented procedures during activities such as backup, change management or problem management. People-related errors can be introduced through any non-automated task that requires human intervention. People-related errors are often the result of inadequate training or lack of expertise. Product-related errors include operating system errors, hardware failure, power outages and disasters. To minimize downtime, companies need to take a comprehensive approach to assess and address all three sources of downtime: process, people and product.
A variety of measures have been used to characterize availability or reliability of systems. For example, availability of a system can be characterized as a function of time, A(t), which is the probability that the system is operational at the instant, t. If A(t) approaches a limit as t goes to infinity, then steady state availability, A, expresses the fraction of time that the system is available to perform useful computations. For example, a system which is available 99.5% of the time is said to have an availability metric, A=0.995. Availability is typically used as a figure of merit in systems in which service can be delayed or denied for short periods of time without serious consequences. Reliability, R(t), is another metric and is typically defined as the conditional probability that the system has survived the interval [0, t], given that it was operational at time t=0.
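By way of illustration only, the fraction-of-time interpretation of steady state availability can be sketched as follows. The function names, the observation window, and the outage durations are hypothetical and chosen only to reproduce the A=0.995 example; they are not taken from any particular system.

```python
def observed_availability(window_hours, outage_hours):
    """Fraction of an observation window during which the system was up.

    window_hours: total length of the observation window (assumed > 0).
    outage_hours: iterable of individual outage durations inside the window.
    """
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    downtime = sum(outage_hours)
    if downtime > window_hours:
        raise ValueError("total downtime exceeds the observation window")
    return (window_hours - downtime) / window_hours


def annual_downtime_hours(availability, hours_per_year=8760.0):
    """Downtime per year implied by a given availability figure."""
    return (1.0 - availability) * hours_per_year


# Hypothetical data: two outages totaling 5 hours in a 1000-hour window.
a = observed_availability(1000.0, [2.0, 3.0])
print(f"A = {a:.3f}")
print(f"implied downtime: {annual_downtime_hours(a):.1f} hours per year")
```

Under these assumed figures, an availability of 0.995 corresponds to roughly 43.8 hours of downtime per year, which illustrates why a single figure of merit says little about how that downtime is distributed or what it costs.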
Other commonly used metrics include various “mean time” measures such as Mean Time To Failure, MTTF, which can be expressed as the integral (over time) of the reliability function, R(t). In some applications, metrics are calculated from probabilistic models of component failure rates. In others, metrics are calculated based on statistical methods using actual failure statistics. Other useful metrics include Mean Time Between Failures (MTBF), Mean Time To Repair (MTTR), etc. See generally, Siewiorek & Swarz, The Theory and Practice of Reliable System Design, Digital Press, pp. 201-297 (1982) for a discussion of evaluation criteria and metrics.
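As a minimal sketch of the integral relationship, MTTF can be approximated numerically from a reliability function. The example below assumes an exponential reliability model, R(t) = exp(-λt), with a hypothetical constant failure rate λ; for that model the integral is known to equal 1/λ, which provides a check on the result. The second function uses the commonly cited relation A = MTTF / (MTTF + MTTR) for a repairable system, which is an assumption introduced here for illustration rather than a relation stated above.

```python
import math


def mttf_from_reliability(reliability, horizon, steps=100_000):
    """Approximate MTTF as the integral of R(t) over [0, horizon].

    Uses the trapezoidal rule; horizon must be large enough that
    R(horizon) is negligible for the result to approximate the
    integral over [0, infinity).
    """
    dt = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    for i in range(1, steps):
        total += reliability(i * dt)
    return total * dt


def steady_state_availability(mttf, mttr):
    """Steady state availability of a repairable system (assumed model)."""
    return mttf / (mttf + mttr)


# Hypothetical failure rate: 0.001 failures per hour.
failure_rate = 0.001
mttf = mttf_from_reliability(lambda t: math.exp(-failure_rate * t),
                             horizon=50_000.0)
print(f"MTTF is approximately {mttf:.2f} hours")
print(f"A = {steady_state_availability(995.0, 5.0):.3f}")
```

In practice R(t) would come either from a probabilistic component model or from observed failure statistics, and the integration horizon would be chosen to suit that data.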
Unfortunately, conventional availability or reliability metrics typically fail to account for business impact of failures. As a result, such metrics are not particularly useful in a feedback process for maximizing a level of customer perceived availability or reliability.