Computer systems employ microprocessors that interpret software or program code instructions extremely fast. Rapid decisions and results follow in split-second, chain-reaction fashion. Sometimes the program code being interpreted may be read incorrectly, due to electronic noise for example, and in such a case an error may be flagged, which slows down the computer's operation. Alternatively, the code may be interpreted to mean something other than what it really means, so that an erroneous action results, which in turn may lead to further erroneous actions. In either case, such misinterpreted code will typically lead to expensive delays and to costs associated with service calls and possibly repairs.
Sometimes service calls are made because of erroneously flagged errors. At the time of service, it may not be apparent that the service call was prompted by a falsely flagged error. The error message may then dictate that a component, such as an electronics board or hard disk drive, be replaced. In such a case the removed component may later be tested to determine the cause of failure, but the testing would probably reveal no trouble found. The component would nevertheless be destroyed or scrapped, because its reputation for quality has been severely damaged, even though it is in fact fine. The result is costly delays to normal production because of the flagged errors, a costly service call, and an expensive repair and removal of a component that was probably never flawed in the first place. Moreover, the original problem or problems are probably still present in the system and ready to recur.
The expensive problems described above may be even more costly in a modern system used for data storage and retrieval. As is known in the art, such computer systems generally include a central processing unit (CPU), a memory subsystem, and a data storage subsystem. According to a network or enterprise model of the computer system, the data storage system, whether associated with or in addition to a local computer system, may include a large number of independent storage devices or disks housed in a single enclosure or cabinet. This array of storage devices is typically connected to several computers over a network or via dedicated cabling. Such a model allows for the centralization of data that is to be shared among many users and also allows for a single point of maintenance for the storage functions associated with the many host processors.
The data storage system stores critical information for an enterprise that must be available for use substantially all of the time. If an error occurs on such a data storage system, it must be fixed as soon as possible, because such information is at the heart of the commercial operations of many major businesses. A recent economic survey from the University of Minnesota, known as the Bush-Kugel study, indicates that many businesses are devastated after just a few days (2 to 6) without access to their critical data. The survey showed that 25% of such businesses were immediately bankrupt after such a critical interruption and that fewer than 7% remained in the marketplace after 5 years.
As described above, it is important to focus error recovery procedures on systems that actually require them, and it is extremely wasteful to focus such efforts on systems not actually in need of repair. What is true for simple computer systems is also true for data storage systems, but magnified in importance and in the scale of potential costs.
Accordingly, there is a need in the general computer arts, and in the more specific data storage and retrieval arts, to manage error handling so that false errors are less likely to occur and so that program code is more likely to be interpreted as it is written rather than as another, similar code message.
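The distinction drawn above between a misread code that is flagged as an error and one that is silently taken for a different, similar code can be illustrated with a minimal sketch. It is not taken from the source and does not represent any particular system's method; the 4-bit command codes, the names `CLOSE_CODES` and `SPACED_CODES`, and the helper functions are all hypothetical, and single-bit noise is an assumed error model.

```python
from itertools import combinations


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two code words differ."""
    return bin(a ^ b).count("1")


def min_distance(codebook: dict) -> int:
    """Smallest Hamming distance between any two valid code words."""
    return min(hamming_distance(a, b) for a, b in combinations(codebook, 2))


def flip_bit(word: int, bit: int) -> int:
    """Model electronic noise as a single inverted bit (an assumption)."""
    return word ^ (1 << bit)


def interpret(received: int, codebook: dict) -> str:
    """Return the command for a valid code word, or flag an error."""
    return codebook.get(received, "ERROR_FLAGGED")


# Hypothetical code assignments, for illustration only.
CLOSE_CODES = {0b0000: "READ", 0b0001: "WRITE"}   # code words differ in 1 bit
SPACED_CODES = {0b0000: "READ", 0b0011: "WRITE"}  # code words differ in 2 bits

# The sender intends READ (0b0000); noise inverts the lowest bit.
noisy = flip_bit(0b0000, 0)

# With closely spaced codes the corrupted word is itself a valid code,
# so it is silently interpreted as a different command.
print(interpret(noisy, CLOSE_CODES))

# With codes at least 2 bits apart, the same corruption matches no valid
# code and is flagged instead of being acted upon.
print(interpret(noisy, SPACED_CODES))
```

Under these assumptions, spacing the valid codes further apart turns a silent misinterpretation into a detectable error; it does not by itself reduce how often errors are flagged.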