A computer system provides a collection of hardware resources, such as processors, storage units and network interfaces, for the performance of computational tasks. These resources may be provided by a single machine or may be distributed across multiple machines. Many computer systems include multiple instances of a given resource. For example, a system may incorporate multiple processing nodes to provide redundancy: if one of the processing nodes fails, the other processing nodes remain available to carry out computational tasks. Incorporating redundancy into a system, however, adds to the expense of the system. Consequently, it is undesirable to replicate a component within a system if such replication has little or no impact on the reliability and availability of the resulting system.
An assessment of availability is usually based on a statistic such as the mean time between failures (MTBF) or the average (or annualised) failure rate (AFR). The AFR inherently expresses a probability of failure within a specific time window, normalised to one year, and hence may be considered a more directly useful measure of failure rate than the MTBF. The AFR associated with a hardware product, such as a server, compute blade, switch or IO adapter, is an indication of the expected availability of the product. The AFR for various products can therefore be used in making decisions about how much redundancy to include within an overall system, based on the desired level of availability for the system.
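The relationship between the two statistics can be illustrated with a minimal sketch. It assumes a constant failure rate (an exponential failure model), which the text above does not state, and the function names are hypothetical. The second function further assumes that failures are independent and that no repair takes place within the year, so that a redundant group fails only if every member fails.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours; leap years are ignored


def afr_from_mtbf(mtbf_hours):
    """Probability of at least one failure within a year.

    Assumption: constant failure rate, so the time to failure is
    exponentially distributed with mean mtbf_hours.
    """
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def afr_of_redundant_group(afrs):
    """Probability that every member of a redundant group fails in a year.

    Assumption: failures are independent and nothing is repaired during
    the year, so the group is lost only if all members fail.
    """
    result = 1.0
    for afr in afrs:
        result *= afr
    return result
```

Under these assumptions, duplicating a component with an AFR of 0.1 gives a group AFR of about 0.01, whereas duplicating a component whose AFR is already negligible yields little further benefit for the added expense.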
In some computer systems, the provisioning or allocation of computational tasks to the available resources in the system is controlled by a resource manager. For example, if an application requires a particular amount of processing power and a particular amount of storage, the resource manager can allocate hardware resources to the application to provide the desired capabilities.
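As an illustration only, the allocation step might be sketched as a first-fit search over the available nodes. The Node type, its fields and the allocate function are hypothetical names introduced for this example, and first-fit is just one possible policy, not one described in the text.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_free: int      # free processing units (hypothetical unit)
    storage_free: int  # free storage, in GB


def allocate(nodes, cpu_needed, storage_needed):
    """First-fit allocation: reserve capacity on the first node that can
    satisfy both requirements, and return that node (or None)."""
    for node in nodes:
        if node.cpu_free >= cpu_needed and node.storage_free >= storage_needed:
            node.cpu_free -= cpu_needed
            node.storage_free -= storage_needed
            return node
    return None  # no single node can host the application
```

A request that no single node can satisfy returns None here; a fuller resource manager might instead split the request across several nodes.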
In many installations, the hardware resources belong to or are managed by a service provider, who runs (hosts) applications on behalf of one or more third parties (i.e. customers). The service provider has to match the available resources against the needs of the various applications. In some cases the resources allocated to a particular customer may be fixed, in which case the resource manager just has to balance the allocated resources against the applications for that one customer. In other cases, the resource manager may have to balance the available resources across multiple customers.
Projected availability in terms of AFR may be used as one of the provisioning parameters when deploying services on a clustered or distributed system. For example, the billing to a customer may be based on a certain level of service availability being provided over a lease period. It is therefore important for the service provider to be able to project the availability of the system.
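A projection of this kind can be sketched as follows, again assuming a constant failure rate so that the AFR can be rescaled from one year to an arbitrary lease period. The function names and the notion of a maximum acceptable failure probability are assumptions made for the example.

```python
def failure_probability_over_lease(afr, lease_days):
    """Probability of at least one failure during the lease period.

    Assumption: constant failure rate, so the probability of surviving
    t years is (1 - afr) ** t.
    """
    if not 0.0 <= afr <= 1.0:
        raise ValueError("AFR must be a probability between 0 and 1")
    years = lease_days / 365
    return 1.0 - (1.0 - afr) ** years


def meets_target(afr, lease_days, max_failure_probability):
    """True if the projected failure probability over the lease does not
    exceed the level agreed with the customer (hypothetical parameter)."""
    return failure_probability_over_lease(afr, lease_days) <= max_failure_probability
```

For a lease of exactly 365 days the projected probability equals the AFR itself; longer leases increase it and shorter leases reduce it.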
Service provisioning by the resource manager is complicated by the fact that the available resources may vary with time. For example, there may be a change in configuration, or a particular hardware resource may fail due to some internal problem. In this latter case, the machine may need a service operation in order to repair, replace or upgrade the resource (hardware or software), and this may then render certain other hardware resources within the machine unavailable while the service operation is in progress. Similarly, an upgrade to the operating system of the machine may be followed by a system reboot, during which time the hardware resources of the system may be unavailable. One or more hardware resources might also be taken off-line to facilitate testing or servicing of the electrical power circuits, air conditioning facilities and so on for a machine. The resource manager therefore has the difficult task of matching the available resources against the requirements of various applications on a dynamic basis, taking into account factors such as availability and capacity.
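The time-varying nature of the resource pool can be illustrated with a small sketch in which each resource carries a list of outage intervals (service operations, reboots, facility maintenance). The Resource type, the representation of time as hours from an arbitrary origin, and the half-open interval convention are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    # Outage intervals as (start, end) pairs, in hours from an arbitrary
    # origin; a resource is treated as off-line for start <= t < end.
    outages: list = field(default_factory=list)

    def is_available(self, t):
        return not any(start <= t < end for start, end in self.outages)


def available_resources(resources, t):
    """Return the resources that can be allocated at time t."""
    return [r for r in resources if r.is_available(t)]
```

A resource manager could re-evaluate such a list whenever an outage starts or ends, and re-run its allocation against whatever remains available.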