The present invention is related to determination of the availability of computer systems and other complex electronic systems. These complex systems are logically viewed as collections of cooperating components. Components may include physical hardware components, such as disk drives, disk arrays, busses, networks, processors, and other such physical components. Components may, in addition, include software, including computer operating systems (“OSs”), database management systems (“DBMSs”), commercial server software, communications software, and a wide variety of application programs. In general, system availability is computed from component failure and recovery rates.
FIG. 1 is a graphical representation of the time-dependent operation of a system component. The system component may be in an operational state, corresponding to an instantaneous availability of 1, or in a failed state, corresponding to an instantaneous availability of 0. Thus, in FIG. 1, a binary operational state model is employed. It is possible to use instantaneous availability values between 0 and 1 to indicate intermediate operational states between fully operational and fully failed. However, a reasonable system availability determination can be calculated using a binary state model.
In FIG. 1, the vertical axis 101 represents the availability of the component and the horizontal axis 103 represents time. The component is assumed to be fully operational at time t=0 (105 in FIG. 1). The component continues to be fully operational until a time t1 (107 in FIG. 1), at which point the component fails. Between times t1 and t2 (109 in FIG. 1), the failure is detected and the component is repaired so that, at time t2, the component again becomes operational. The component remains operational until a subsequent failure at time t3 (111 in FIG. 1). The time interval 113 between time t=0 and time t1 is the time to first failure of the component. The time interval 115 between time t1 and time t2 is the time to repair the component following the first failure of the component. The time interval 117 between time t2 and time t3 is the time to failure for the component following initial repair. In general, the various times to failure for a component are distributed according to some probability distribution function, as are the times to repair the component once it has failed. For example, in one test, the component may fail after 2000 hours of operation, while in a subsequent test, the component may fail after 4000 hours of operation. Thus, the component failure and repair characteristics are often probabilistically represented by the component's mean time to failure (“MTTF”) and mean time to repair (“MTTR”).
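As a minimal sketch of how MTTF and MTTR might be estimated from observations, the following Python fragment takes the sample means of a set of times to failure and times to repair. The function name and the repair durations are hypothetical and chosen only for illustration; the 2000-hour and 4000-hour values are the two test outcomes mentioned above.

```python
from statistics import mean


def estimate_mttf_mttr(times_to_failure, times_to_repair):
    """Return (MTTF, MTTR) as the sample means of observed durations, in hours."""
    if not times_to_failure or not times_to_repair:
        raise ValueError("at least one observation of each kind is required")
    return mean(times_to_failure), mean(times_to_repair)


# Times to failure from the two tests described in the text. The repair
# times are illustrative assumptions, not values from the source.
observed_failures = [2000.0, 4000.0]
observed_repairs = [4.0, 8.0]

mttf, mttr = estimate_mttf_mttr(observed_failures, observed_repairs)
print(f"MTTF = {mttf} hours, MTTR = {mttr} hours")
```

With only two observations the estimate is very coarse; in practice many more samples would be needed before the means are meaningful.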
FIG. 2 is a graphical representation of component reliability. In FIG. 2, the vertical axis 101 corresponds to the probability that a component survives, or continues operation, and the horizontal axis 103 corresponds to time. Initially, at time t=0 (105 in FIG. 2), the component is operational. At subsequent times, the probability that the component is still operational or, in other words, has not failed, decreases, initially gradually, then more steeply through inflection point 107, and again more gradually as the probability of survival of the component approaches 0 as time increases towards positive infinity. The reliability of a component at time t is the probability that the component has survived to time t without failing. For example, the reliability R(t1) of the component at time t1 (109 in FIG. 2) is represented by the y-coordinate of the point 113 at the intersection of a vertical line 111 from time t1 and the reliability curve 115.
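The shape of the curve described for FIG. 2 can be illustrated with a small Python sketch. The text does not name a particular distribution; the Weibull form below, with a shape parameter greater than 1, is an assumption chosen because it starts at 1, falls gradually, passes through an inflection point, and then approaches 0. The function name and parameter values are hypothetical.

```python
import math


def weibull_reliability(t, scale, shape):
    """R(t) = exp(-(t / scale) ** shape): probability of surviving to time t."""
    if t < 0:
        raise ValueError("time must be non-negative")
    return math.exp(-((t / scale) ** shape))


# Hypothetical parameters: characteristic life of 3000 hours, shape of 2.
SCALE_HOURS = 3000.0
SHAPE = 2.0

for t in (0.0, 1000.0, 3000.0, 10000.0):
    r = weibull_reliability(t, SCALE_HOURS, SHAPE)
    print(f"R({t:.0f} h) = {r:.4f}")
```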
The time-dependent occurrence of failures of a component can also be described by a probability distribution function. The probability density function ƒ(t) for component failures is the first derivative of that probability distribution function with respect to time. FIG. 3 is a graph of the probability density function ƒ(t). The probability density function ƒ(t) is normalized so that the area 301 underneath the curve ƒ(t) 303 is 1. The probability that the component will fail within a time interval is the area beneath the probability density function curve 303 within the time interval. For example, the probability that the component will fail between times t1 (305 in FIG. 3) and t2 (307 in FIG. 3) is shown, in FIG. 3, by the crosshatched area 309.
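The relationship between the density ƒ(t) and the probability of failure within an interval can be checked numerically. The sketch below again assumes a Weibull failure distribution with hypothetical parameters, which the text does not specify. It integrates ƒ(t) between t1 and t2 with the trapezoidal rule and compares the result with the difference of the cumulative distribution function at the two endpoints.

```python
import math


def weibull_pdf(t, scale, shape):
    """Failure density f(t) of a Weibull distribution; assumes shape >= 1."""
    if t < 0:
        return 0.0
    z = t / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))


def weibull_cdf(t, scale, shape):
    """F(t): probability that the component has failed by time t."""
    return 1.0 - math.exp(-((t / scale) ** shape))


def failure_probability(t1, t2, scale, shape, steps=10000):
    """Area under f(t) between t1 and t2, by the trapezoidal rule."""
    h = (t2 - t1) / steps
    total = 0.5 * (weibull_pdf(t1, scale, shape) + weibull_pdf(t2, scale, shape))
    for i in range(1, steps):
        total += weibull_pdf(t1 + i * h, scale, shape)
    return total * h


# Hypothetical parameters and interval, in hours.
SCALE, SHAPE = 3000.0, 2.0
area = failure_probability(1000.0, 2000.0, SCALE, SHAPE)
exact = weibull_cdf(2000.0, SCALE, SHAPE) - weibull_cdf(1000.0, SCALE, SHAPE)
print(f"numerical area = {area:.6f}, F(t2) - F(t1) = {exact:.6f}")
```

Integrating over a sufficiently long interval starting at t=0 should give a value close to 1, which corresponds to the normalization of ƒ(t).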
Instantaneous availability, A(t), is the sum of two probabilities: the probability that the component has survived in an operational state without failing until time t, and the probability that the component has survived in an operational state since its most recent recovery prior to time t, given that one or more failures have occurred prior to time t.
FIG. 4 is a graph of the instantaneous availability A(t) and the steady state availability $A_{ss}$ for a component of a complex system. The instantaneous availability A(t) 401 starts with a value of 1 at time 0, presuming that the component has been correctly installed and verified to be operating correctly, and decreases towards the steady state availability $A_{ss}$ as time increases. In fact, the steady state availability is the limit of the instantaneous availability with increasing time t:

\[ \lim_{t \to \infty} A(t) = A_{ss} \]

It can be mathematically shown that while instantaneous availability depends on the nature of both the failure time and recovery time distributions, steady state availability depends only on the means of the failure and recovery time distributions.
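The convergence of A(t) to the steady state value can be sketched in Python. The steady state availability is computed with the standard relation MTTF / (MTTF + MTTR), which uses only the two means. The closed form for A(t) used here holds for a two-state model with exponentially distributed times to failure and times to repair; that distributional choice is an assumption not made in the text, and the function names and numeric values are hypothetical.

```python
import math


def steady_state_availability(mttf, mttr):
    """Steady state availability from the two means: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


def instantaneous_availability(t, mttf, mttr):
    """A(t) for a component that is operational at t = 0.

    Assumes exponentially distributed times to failure and repair,
    with failure rate 1/MTTF and repair rate 1/MTTR.
    """
    failure_rate = 1.0 / mttf
    repair_rate = 1.0 / mttr
    total_rate = failure_rate + repair_rate
    a_ss = repair_rate / total_rate
    return a_ss + (failure_rate / total_rate) * math.exp(-total_rate * t)


# Hypothetical values, in hours.
MTTF_HOURS, MTTR_HOURS = 3000.0, 6.0

a_ss = steady_state_availability(MTTF_HOURS, MTTR_HOURS)
for t in (0.0, 1.0, 6.0, 24.0, 240.0):
    a_t = instantaneous_availability(t, MTTF_HOURS, MTTR_HOURS)
    print(f"A({t:.0f} h) = {a_t:.6f}")
print(f"steady state = {a_ss:.6f}")
```

Under these assumptions A(t) decays from 1 towards the steady state value at a rate governed by the sum of the failure and repair rates, so the transient is short when repairs are fast relative to failures.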
The hardware components of the system can sometimes be rigorously tested in order to obtain reasonably precise estimates of the MTTF of the components. Alternatively, the MTTF of the components can be empirically determined from either field failure rate data of the same or similar components, or predicted from models based on the physics of the design of the components. In either case, the reliability and expected lifetimes of hardware components are relatively well characterized, in general. The availability of the hardware components of a complex system can be determined by methods that will be discussed in greater detail, below.
System availability calculations are not simple linear combinations of terms related to component failure and recovery rates. Moreover, methods for determining overall system availability are currently applied to complex systems primarily with respect to the hardware-component contribution, without considering many other types of complex system failure modes.
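One way to see why the combination is not linear is the textbook treatment of series and redundant (parallel) arrangements. The sketch below assumes that components fail and recover independently, an assumption the text does not make, and uses hypothetical function names and availability values. It is not the method of the invention, only an illustration of the products that appear in such calculations.

```python
from math import prod


def series_availability(availabilities):
    """All components must be up: product of the component availabilities."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """At least one component must be up: 1 minus the product of unavailabilities."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Two hypothetical components, each with a steady state availability of 0.99.
components = [0.99, 0.99]
print(f"series:   {series_availability(components):.6f}")
print(f"parallel: {parallel_availability(components):.6f}")
```

Neither result is a weighted sum of the component availabilities: the series arrangement is less available than either component, while the redundant arrangement is more available than either.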
Knowing the purely hardware-component-failure behavior of a complex system is an important part of determining the overall availability of the complex system, but is only one part of a complex calculation. Other types of failures include failures of various software components, including OSs, DBMSs, and application programs. Complex systems may also be unavailable due to planned maintenance of both hardware and software components, other planned and unplanned events, such as operator errors, and to various types of catastrophic events. A significant problem, recognized by system administrators, system manufacturers, and system configuration experts, is that the precision with which hardware reliability can be determined is currently not available for determining the reliability of software components, in general. Furthermore, current availability-assessment tools do not provide a deterministic method for combining relatively precise hardware-component reliability data with far less precise software-component reliability data and data related to other types of events and activities that contribute to overall system downtime. Instead, overall availability assessments currently result from rather non-deterministic and non-uniformly applied estimations. Users of complex computer systems and other electronic systems, system designers, system manufacturers, system administrators, and system configuration professionals have recognized the need for a deterministic and accurate method and system for assessing the overall availability of complex computer systems and other complex electronic systems.