System reliability is very important in the computing field. A common approach to providing good reliability or availability in computing is redundancy, in which two or more components are provided to perform a given task. If one of these components fails, the remaining component(s) are still available to ensure that the task is performed. Incorporating redundancy into a system, however, adds to the expense of the system. Consequently, it is undesirable to replicate a component within a system if such replication has little or no impact on the reliability and availability of the resulting system.
An assessment of availability is usually based on a statistic such as the mean time between failures (MTBF) or the average (or annualized) failure rate (AFR). The AFR inherently expresses a probability of failure within a specific time window, normalized to one year, and hence may be considered a more directly useful measure of failure rate than the MTBF. The AFR associated with a hardware product such as a server, compute blade, switch, IO adapter, etc., is an indication of the expected availability of the product. The AFR for various products can therefore be used in making decisions about how much redundancy to include within an overall system, based on the desired level of availability for the system. The AFR for existing products has generally been estimated at product design time, based on statistical data for low-level components and production processes associated with the product, and has therefore provided only a static view of failure probability.
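The relationship between MTBF and AFR can be illustrated with a short sketch. The text does not specify a failure model, so the code below assumes a constant failure rate (an exponential model) and continuous operation of 8760 hours per year; the function names `afr_from_mtbf` and `mtbf_from_afr` are hypothetical and chosen only for illustration.

```python
import math

# Assumption: the product runs continuously, i.e. 365 days x 24 hours per year.
HOURS_PER_YEAR = 8760.0


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Probability of at least one failure within one year.

    Assumes a constant failure rate (exponential model), which the
    source text does not state; other models give other values.
    """
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def mtbf_from_afr(afr: float) -> float:
    """Inverse of afr_from_mtbf under the same exponential assumption."""
    if not 0.0 < afr < 1.0:
        raise ValueError("AFR must lie strictly between 0 and 1")
    return -HOURS_PER_YEAR / math.log(1.0 - afr)


if __name__ == "__main__":
    # Illustrative figure only, not taken from any real product.
    print(f"AFR for an MTBF of 100,000 hours: {afr_from_mtbf(100_000.0):.4f}")
```

Under these assumptions, an MTBF of 100,000 hours corresponds to an AFR of roughly 0.084, which reads directly as a probability of failure within a year.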
Projected availability in terms of aggregate AFR may be used as one of the provisioning parameters when deploying services on a clustered or distributed system. For example, customer billing may be based on a certain level of service availability being provided over a lease period. It is therefore important for the service provider to be able to project the availability of the system even if the configuration on which the service is implemented varies over time while the service is being provided.
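One way an aggregate AFR might be derived from per-component figures is sketched below. The text does not define how the aggregate is computed, so this sketch assumes that component failures are statistically independent and that a failed component is not repaired within the year; the helper names `series_afr` and `redundant_afr` and the AFR values used are hypothetical.

```python
from math import prod
from typing import Sequence


def _check(afrs: Sequence[float]) -> None:
    """Reject empty input and values that are not probabilities."""
    if not afrs:
        raise ValueError("at least one AFR is required")
    if any(not 0.0 <= a <= 1.0 for a in afrs):
        raise ValueError("each AFR must lie between 0 and 1")


def series_afr(afrs: Sequence[float]) -> float:
    """AFR when every component is required: the system fails if any one fails."""
    _check(afrs)
    return 1.0 - prod(1.0 - a for a in afrs)


def redundant_afr(afrs: Sequence[float]) -> float:
    """AFR of a redundant group: the group fails only if all members fail."""
    _check(afrs)
    return prod(afrs)


if __name__ == "__main__":
    server, switch = 0.05, 0.02  # illustrative values, not measured data

    single = series_afr([server, switch])
    duplicated = series_afr([redundant_afr([server, server]), switch])

    print(f"one server behind one switch:  {single:.5f}")
    print(f"two servers behind one switch: {duplicated:.5f}")
```

In this example, duplicating the server leaves the single switch as the dominant contribution to the aggregate figure, which illustrates why replicating a component is only worthwhile where it materially improves the availability of the overall system.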