Computing devices have various established means of detecting, reporting and frequently correcting particular classes of errors before the errors can harm the computing device or the processes running on it. The corrected errors, by themselves, are ordinarily harmless to the jobs being performed by the computer. However, when corrected errors increase in frequency, they can be used to predict a future uncorrectable error that would force executing computer processes to come to an abrupt and unscheduled halt. The harm from such an unexpected termination can be as simple as the loss of all the computer jobs in progress. The errors can also lead to the creation and propagation of bad results, which can cause even greater harm. In other cases, a high rate of errors can indicate that a part of the computing device is operating inefficiently, such that an end user should be informed so that the part can be replaced or removed if desired.
Beyond correcting individual errors as they occur, computing devices may contain built-in features for preventing a potential uncorrectable error, or for reducing, if not eliminating, that possibility. The cost of the built-in features is often a loss of functionality or performance, or a monetary cost, that is justifiable because of the potential harm from a fatal error. Such built-in features include, but are not limited to, deactivation of the faulty component or functional unit, substitution of a component, or reduction of the voltage or frequency of the entire system. Another form of prevention is informing an end user, such as by message logging, an alert or similar means, that such a problem is predicted. Once notified, the end user can take external action such as gracefully powering down the computing device and manually replacing critical units. Each error event type may have a different means of prevention and a different critical frequency, or threshold value, which may range from events per hour to events per year. The threshold value at which a given type of error becomes a concern will generally also depend on additional factors such as environment and tolerance of risk.
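The idea of a per-event-type threshold can be illustrated with a minimal sketch. It assumes that each threshold is expressed as a number of events over a number of hours, so that values ranging from events per hour to events per year can be compared on one scale. The event type names, the numeric values, the action labels and the `risk_factor` parameter are all hypothetical and are not taken from any particular hardware.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365  # simplifying assumption: a non-leap year


@dataclass(frozen=True)
class Threshold:
    """Critical frequency for one error event type (illustrative values only)."""
    events: float      # number of events ...
    per_hours: float   # ... tolerated within this many hours
    action: str        # hypothetical label for the preventive measure

    def rate_per_hour(self, risk_factor: float = 1.0) -> float:
        # A risk_factor below 1.0 tightens the threshold, for example in a
        # harsh environment or where tolerance of risk is low.
        return self.events / self.per_hours * risk_factor


# Hypothetical event types, each with its own threshold and means of prevention.
THRESHOLDS = {
    "corrected_memory_error": Threshold(10, 1, "deactivate_unit"),
    "cache_parity_error": Threshold(5, 24, "substitute_component"),
    "bus_retry": Threshold(2, HOURS_PER_YEAR, "notify_end_user"),
}


def exceeds(event_type: str, observed_events: int, observed_hours: float,
            risk_factor: float = 1.0) -> bool:
    """Return True if the observed rate is above the threshold for event_type."""
    observed_rate = observed_events / observed_hours
    return observed_rate > THRESHOLDS[event_type].rate_per_hour(risk_factor)
```

For instance, `exceeds("corrected_memory_error", 30, 2)` compares an observed 15 events per hour against the assumed limit of 10 per hour, while passing `risk_factor=0.5` halves that limit for a less risk-tolerant installation.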
Unfortunately, computing devices generally do not have built-in methods for determining when such critical actions should be taken, nor built-in methods for readily determining which error rates are of concern. Further, the computing devices are not able to link a frequency of events to a specific remedial action that is built into the hardware. Doing so would require that the computing devices be able to identify the problem component, calculate an error rate, and establish various thresholds before directing critical actions to the specific units. For the majority of error types that can be reported today, and because of the various methods of implementing computer components and functional units, these tasks are not easily achieved directly by the computer hardware itself. Further, addressing the issue at the operating system (OS) level may result in an unacceptable loss of efficiency and introduce issues related to OS choice, policies and dependencies.
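The three tasks named above (identifying the problem component, calculating an error rate, and applying a threshold before directing an action to a specific unit) can be sketched in a few lines. This is only an illustration of the logic under stated assumptions, not a description of any existing hardware or OS mechanism: the `ErrorRateMonitor` class, the policy tuple layout, the component names and the action labels are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class ErrorRateMonitor:
    """Sliding-window error counter keyed by (component, event type)."""

    def __init__(self, policies):
        # policies maps event_type -> (max_events, window_seconds, action).
        # All three values are illustrative assumptions.
        self.policies = policies
        self.history = defaultdict(deque)

    def record(self, component, event_type, timestamp):
        """Log one corrected error; return (action, component) if the
        threshold for this event type is exceeded, otherwise None."""
        max_events, window, action = self.policies[event_type]
        events = self.history[(component, event_type)]
        events.append(timestamp)
        # Discard events that have fallen out of the time window.
        # Assumes timestamps are non-decreasing.
        while events and events[0] <= timestamp - window:
            events.popleft()
        if len(events) > max_events:
            return (action, component)
        return None


# Usage: more than 3 corrected memory errors within one hour on the same
# (hypothetical) unit "dimm0" triggers the action assumed for that type.
monitor = ErrorRateMonitor({"corrected_memory_error": (3, 3600.0, "deactivate_unit")})
results = [monitor.record("dimm0", "corrected_memory_error", t) for t in (0, 10, 20, 30)]
```

Keying the history by component as well as event type is what ties a rate to a specific unit, so that errors on one unit do not count against another.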