The world today is inundated with systems that assist humans in almost every aspect of their lives. The term “system” is defined by Webster's as “a regularly interacting or interdependent group of items forming a unified whole.” Security systems protect buildings from intruders; traffic control systems control the flow of traffic on roads and highways; navigation systems assist operators in navigating everything from something as simple as an automobile to something as complex as the space shuttle; price look-up systems help us move more quickly through a grocery store check-out line; computer systems frequently run all of the above-mentioned systems.
Telecommunications systems are a prime example of systems requiring a high degree of reliability. Telecommunications switching and transmission systems consist of multiple processing elements, must support fast transport and signaling applications, and perform complex tasks such as protocol conversion and signaling. Due to the complexity of the tasks performed by such systems and the need for error-free communications, particularly at very high speeds, “almost perfect” availability is practically required. To achieve this high level of availability, it would be optimal for these systems to adopt methods and techniques that predict silent failures before they occur, based on trends available for a particular system. However, prior art detection schemes do not provide adequate measures for achieving this high level of system availability.
Because there is such a great reliance on the correct operation of systems in our daily lives, the traditional approach towards designing a highly reliable system (e.g., one that is available to operate 99.99% of the time) has been to build redundancy into the architecture and to design a sufficiently rich set of failure diagnostic tools so that the redundancy can be utilized to recover from simple failures. The benefit of redundancy is only realized, however, when the system is able to detect a failure in the system and successfully “promote” (put into operation) a redundant system element (referred to herein as a “Standby Unit” or “SU”) to take the place of a failed primary system element (referred to herein as an “Active Unit” or “AU”). Systems employing this technique are sometimes referred to as “active-standby systems.” An “active-active” system is similar, except that in an active-active system, there are two active units, each performing its own tasks; upon failure of one of the active units, the other takes on the tasks of the failed unit.
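The active-standby promotion described above can be sketched in a few lines of code. This is a hypothetical illustration only, not the system of any particular prior art reference; the `Unit`, `health_check`, and `ActiveStandbyPair` names are invented for this sketch, and the health check stands in for the failure diagnostic tools.

```python
# Illustrative sketch of active-standby failover (names are hypothetical).

class Unit:
    """A system element that can act as an Active Unit (AU) or Standby Unit (SU)."""
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def health_check(self):
        # In a real system this would invoke the failure diagnostic tools;
        # a silent failure is precisely one this check would NOT catch.
        return self.healthy


class ActiveStandbyPair:
    def __init__(self, active, standby):
        self.active = active    # the AU
        self.standby = standby  # the SU

    def monitor(self):
        # If diagnostics detect a failure of the AU, "promote" the SU.
        if not self.active.health_check():
            self.active, self.standby = self.standby, self.active
            return True   # failover occurred
        return False      # AU still healthy


au, su = Unit("AU"), Unit("SU")
pair = ActiveStandbyPair(au, su)
au.healthy = False             # simulate a detected (non-silent) failure
failed_over = pair.monitor()
print(failed_over, pair.active.name)  # -> True SU
```

Note that the sketch only recovers from failures the `health_check` actually reports; a failure it misses leaves the failed unit in service, which is the silent-failure problem discussed below.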
In practice, it is not possible to guarantee that failure diagnostic tools will detect 100% of all unit failures. The reality is that some failure modes cannot be detected without human observation, and the coverage of failure diagnostic tools will typically be somewhat less than 100%. The percentage of unit failures that are covered by system failure diagnostic tools is termed the “coverage factor”. Coverage is a concept that applies to both hardware-related and software-related failures.
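The coverage factor is a simple ratio, which can be stated as follows. The failure counts below are made up purely for illustration.

```python
# Coverage factor: fraction of unit failures detected by the diagnostic tools.
# The numbers here are hypothetical, chosen only to illustrate the arithmetic.

detected_failures = 95
total_failures = 100

coverage_factor = detected_failures / total_failures   # 0.95, i.e., 95% coverage
silent_failures = total_failures - detected_failures   # 5 failures escape detection

print(f"coverage = {coverage_factor:.0%}, silent failures = {silent_failures}")
```

The complement of the coverage factor is exactly the share of failures that become “silent” in the sense defined in the next paragraph.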
Experience has shown that as a product evolves through multiple releases, coverage on the order of 95% can be achieved. However, coverage can be considerably lower in early releases of a product. A good example is the evolution of a software system and the software-related failures connected therewith. With software-related failures, field experience with the product (e.g., actual use by consumers) is necessary to achieve high coverage. Failures that escape (i.e., are not identified by) system failure diagnostic tools are called “silent failures.” Silent failures require human observation to be detected and human intervention to be corrected. Human observation can come in the form of routine system checks by conscientious technicians or, in the worst case, customer complaints indicating that the system is not working. Human intervention can come in the form of sending a technician to a facility to investigate the system up close or in the form of software debugging by a programmer. The recovery duration from silent failures is therefore highly variable. When a system experiences a silent failure, prolonged down time can occur, and prolonged down time can translate to huge monetary and operational losses for both the system operators/sellers and for the users of the system.
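A back-of-the-envelope calculation shows why silent failures are so damaging to an availability target like the 99.99% figure mentioned above. The 8-hour repair time below is an assumed, illustrative number, not a figure from any measured system.

```python
# Hypothetical illustration: the yearly downtime budget implied by an
# availability target, versus the downtime of one silent failure.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes_per_year(availability):
    """Unplanned downtime permitted per year at the given availability."""
    return (1 - availability) * MINUTES_PER_YEAR

# A "four nines" (99.99% available) system may be down only ~53 minutes a year.
budget = downtime_minutes_per_year(0.9999)
print(round(budget, 1))  # -> 52.6

# Assume a single silent failure takes 8 hours of human observation and
# intervention to detect and repair; that alone blows the yearly budget.
silent_failure_downtime = 8 * 60  # minutes
print(silent_failure_downtime > budget)  # -> True
```

This is why even a small percentage of failures escaping the diagnostic tools dominates the availability of an otherwise well-engineered redundant system.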
It would be beneficial to reduce the time it takes to detect silent failures in systems of all kinds. However, prior art mechanisms for reducing the time required to detect silent failures are inadequate.