As systems age, the components within them become increasingly prone to faults. Faults can occur in both the hardware domain and the software domain. Hardware examples include failures of memory, disks and processor cache. In the software domain, which includes system software such as operating systems as well as application and middleware software, performance can degrade because of the impact of new hardware on the existing software. In addition, the probability of encountering a software error in a less frequently used code path increases with time. Failures such as memory leaks also become more evident over time.
The conventional approach to this problem is reactive: corrective action is taken only after faults have occurred.
However, some faults can be corrected while the system is online. Examples include processor cache replacement (current-day processors are manufactured with extra cache lines as a contingency to replace faulty cache lines), and process restarts or migration to another processor in response to a decrease in processing power. The objective of fault correction is to keep systems, and ultimately business processes, running without human intervention. However, these corrections carry an overhead and adversely affect system performance. The greater the number of faults, the greater the impact, in terms of Quality of Service (QoS), on applications and on the services and processes they support, for example business processes.
A proactive approach to fault correction, in which fault prediction leads to corrective action in advance of an actual fault developing, has significant advantages over the reactive approach. For example, it allows quality of service to be maintained above a desired level. However, to implement this approach, there needs to be some way of determining when the quality of service provided by the overall system is likely to fall below the desired or agreed service levels.
Traditional methods of fault tolerance rely heavily on duplication of resources to reduce the non-availability of the system. For example, if the probability of availability of a machine is 0.96, deploying two machines of the same kind reduces the probability of non-availability from 0.04 to 0.04 × 0.04 = 0.0016, assuming the two machines fail independently. Such methods of fault tolerance do not scale well in large environments, since duplicating all resources is not practical. Furthermore, the availability specifications provided by the manufacturer are production-time metrics, often based on an average, and are not indicative of the failure of a specific component in a specific operational environment.
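The duplication arithmetic above can be illustrated with a minimal sketch. It assumes that replicas fail independently and that the system is unavailable only when every replica is down at the same time; the function name `unavailability` and its parameters are hypothetical, introduced here purely for illustration.

```python
def unavailability(availability: float, replicas: int) -> float:
    """Probability that all replicas are down at once.

    Assumes each replica has the same availability and that
    replicas fail independently of one another (an idealization).
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # A single machine is down with probability (1 - availability);
    # n independent machines are all down with that probability to the n-th power.
    return (1.0 - availability) ** replicas


if __name__ == "__main__":
    # Figures from the text: one machine at 0.96, then duplicated machines.
    for n in (1, 2, 3):
        print(f"{n} machine(s): non-availability = {unavailability(0.96, n):.6f}")
```

Under these assumptions, each additional replica multiplies the non-availability by a further 0.04, but it also adds the cost of a whole machine, which is why the approach scales poorly when every resource has to be duplicated.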