Conventional computer systems run at sufficiently high speeds and are sufficiently complex that, when system errors or system failures occur, it is very difficult to determine the nature of the error or failure. Minor errors in a computer system are corrected or ignored by the computer system without being particularly noticeable to the outside world. It is only when many system errors occur that a system user becomes aware of the system's deterioration. Often, the first time a system user becomes aware of the deterioration of a computer system is when so many small errors have occurred that the computer system suffers a loss of data or a fatal system error.
In many cases, the analysis and diagnosis of computer system errors or breakdowns is sufficiently time consuming and expensive that it is more economical to simply throw away a part or even an entire computer system than to attempt to identify failed components and replace them. Of course, disposing of systems that could readily be repaired if diagnosed represents a considerable waste of resources. Accordingly, it would be desirable to develop a low cost system capable of identifying problems within computer systems so that failing components can be identified. It would further be desirable for such identification to occur during operation, so that analysis would not have to be attempted on an already failed computer system. Failed computer systems may not be readily susceptible to post-failure analysis because of the overall complexity of the computer system and because the computer system must be nearly operational to function to any extent.
A serious difficulty with the failure of conventional computer systems is the expense of such failures. Even very small computer systems can perform mission critical tasks such as functioning as network servers or storing critical data. The failure of a computer system performing such a critical function can be very expensive. To address these problems, various redundancy schemes have been implemented, including redundant hard disk assemblies and entire redundant or mirrored processing systems. Such mirrored processing systems are typified by that described in U.S. Pat. No. 5,153,881 to Bruckert, et al., entitled "Method of Handling Errors in Software." In addition, a variety of fault tolerant strategies have been implemented in the operating system software used to control computer systems. For example, conventional network servers have been developed using both hardware redundancy and software based fault tolerance. Each of these strategies has drawbacks. The addition of redundant hardware increases the expense of a computer system and can greatly reduce the flexibility of the system. Software solutions, including various fault tolerant designs, have had limited success and also reduce the flexibility of the overall computer system. More importantly, software is increasingly a primary source of computer errors. Accordingly, it is undesirable to place excess reliance on software for ensuring the integrity and future operability of a mission critical computer system.