A. System Availability
As individuals and companies become more dependent upon computers in their daily lives, the reliability of these systems becomes even more important. There are several metrics that can be used to characterize reliability. The most common are:
1. Mean time before failure (MTBF)—The average time that a system will be operational before it fails.
2. Mean time to repair (MTR)—The average time that it takes to restore a failed system to service.
3. Availability (A)—The proportion of time (or the probability) that the system will be operational.
These metrics are simply related by
$$A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTR}} \tag{1}$$
That is, A is the proportion of total time (MTBF+MTR) that the system is operational (MTBF). (1−A) is therefore the proportion of time that the system will be down. For instance, if the system is operational for an average time of 4000 hours (MTBF=4000) and requires 2 hours for repair (MTR=2), then A=4000/4002=0.9995. That is, the system is expected to be operational 99.95% of the time, and will be out of service 0.05% of the time.
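The arithmetic in this example can be checked with a short sketch. The function name `availability` and its argument names are illustrative choices, not terms from the cited references; the formula is Equation (1).

```python
def availability(mtbf_hours, mtr_hours):
    """Return A = MTBF / (MTBF + MTR), per Equation (1)."""
    if mtbf_hours < 0 or mtr_hours < 0:
        raise ValueError("MTBF and MTR must be non-negative")
    total = mtbf_hours + mtr_hours
    if total == 0:
        raise ValueError("MTBF + MTR must be positive")
    return mtbf_hours / total


# Worked example from the text: MTBF = 4000 hours, MTR = 2 hours.
a = availability(4000, 2)
print(f"A = {a:.4f}")          # A = 0.9995
print(f"down = {1 - a:.2%}")   # down = 0.05%
```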
High availabilities are more easily described in terms of their “9s.” For instance, a system with an availability of 99.9% is said to have an availability of three 9s. A system with an availability of 99.998% is said to have an availability of a little less than five 9s, and so forth.
The number of 9s is related to down time as follows:
TABLE 1
9s and Down Time

Nines   % Available   Hours/Year   Minutes/Month
2       99%           87.6         438
3       99.9%         8.76         44
4       99.99%        0.88         4.4
5       99.999%       0.09         0.44
6       99.9999%      0.01         0.04
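The down-time figures for each number of 9s can be reproduced with the sketch below. It assumes a 365-day year (8,760 hours) and a month of one twelfth of a year; those constants, and the helper names `downtime` and `nines_of`, are assumptions made for illustration.

```python
import math

HOURS_PER_YEAR = 365 * 24                     # assumed 365-day year
MINUTES_PER_MONTH = HOURS_PER_YEAR * 60 / 12  # assumed month = 1/12 year


def downtime(nines):
    """Return (hours per year, minutes per month) of down time for n 9s."""
    unavailability = 10.0 ** -nines           # e.g. three 9s -> 0.001
    return (unavailability * HOURS_PER_YEAR,
            unavailability * MINUTES_PER_MONTH)


def nines_of(availability):
    """Return the (possibly fractional) number of 9s for an availability."""
    return -math.log10(1.0 - availability)


for n in range(2, 7):
    hours, minutes = downtime(n)
    print(f"{n} nines: {hours:.2f} hours/year, {minutes:.2f} minutes/month")

# An availability of 99.998% falls a little short of five 9s.
print(f"{nines_of(0.99998):.2f}")
```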
Windows® NT servers are now reporting two 9s or better. Most high-end UNIX servers are striving for three 9s, while HP NonStop® Servers and IBM Sysplex® systems are achieving four 9s.
These concepts are further described in Highleyman, W. et al., “Availability,” Parts 1 through 5, The Connection, Volume 23 No. 6 through Volume 24, No. 4, 2002, 2003.
B. System MTBF
From Equation (1), the system mean time before failure, MTBF, can be expressed as a function of A:
$$\mathrm{MTBF} = \frac{A}{1 - A}\,\mathrm{MTR}$$
Since A is typically very close to one, MTBF can be closely approximated by
$$\mathrm{MTBF} \approx \frac{\mathrm{MTR}}{1 - A} \tag{2}$$
The system mean time to repair, MTR, is usually a function of service agreements and repair capability and can be considered fixed. Therefore, MTBF is inversely proportional to the quantity (1−A), which is the probability of system failure. If the probability of failure can be cut in half, the system's mean time before failure can be doubled.
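A brief sketch illustrates how close the approximation of Equation (2) is to the exact expression, and how halving the probability of failure doubles MTBF when MTR is held fixed. The function names are hypothetical, and the input values reuse the earlier example (MTBF=4000, MTR=2).

```python
def mtbf_exact(a, mtr_hours):
    """Exact form obtained from Equation (1): MTBF = A / (1 - A) * MTR."""
    return a / (1.0 - a) * mtr_hours


def mtbf_approx(a, mtr_hours):
    """Equation (2): MTBF is approximately MTR / (1 - A) when A is near one."""
    return mtr_hours / (1.0 - a)


a = 4000 / 4002          # availability from the earlier example
mtr = 2.0                # hours, assumed fixed

print(round(mtbf_exact(a, mtr), 3))    # 4000.0
print(round(mtbf_approx(a, mtr), 3))   # 4002.0

# Halving the probability of failure (1 - A) doubles the approximate MTBF.
p_fail = 0.0005
ratio = mtbf_approx(1.0 - p_fail / 2, mtr) / mtbf_approx(1.0 - p_fail, mtr)
print(round(ratio, 6))                 # 2.0
```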
C. Current High-Availability Architectures
The most reliable systems, such as the HP NonStop Servers, achieve their high reliability by “n+1 sparing.” Every critical component is provided in n+1 instances, and the system can function unimpeded (except perhaps for reduced processing capacity) as long as at least n instances of each critical component are functioning. Such a system can therefore tolerate any single failure and continue in operation. However, more than one failure can potentially (though not necessarily) cause the system to fail. Critical components include processors, disks, communication lines, power supplies and power sources, fans, and critical software programs (referred to as processes hereafter).
These systems can achieve availabilities on the order of four 9s.
D. Replicating Systems for Availability
As can be seen from Table 1 above, a system with an availability of four 9s can be expected to be down almost an hour a year. In cases where this amount of down time is unacceptable, the systems may be replicated. That is, a hot standby is provided. The active system provides all of the processing for the application and maintains a nearly exact copy of its current database on the standby system. If the active system fails, the standby system can (almost) immediately assume the processing load.
It can be shown that replicating a system (e.g., adding a second node with np processors, so that the system grows to 2np processors, as in a disaster recovery scenario) doubles its 9s. Thus, for instance, one could build a replicated system from two UNIX systems, each with three 9s availability (8.8 hours downtime per year), to achieve an overall system availability of six 9s (32 seconds downtime per year).
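The doubling of 9s can be illustrated with a minimal sketch. It rests on simplifying assumptions that are not stated in the text: the two systems fail independently, failover is instantaneous and always succeeds, and the replicated system is down only when both copies are down at once. The function name and the 365-day year are likewise illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # assumed 365-day year


def replicated_unavailability(a, copies=2):
    """Probability that all copies are down at once.

    Assumes independent failures and perfect, instantaneous failover.
    """
    return (1.0 - a) ** copies


single = 0.999                                  # three 9s per system
pair_down = replicated_unavailability(single)   # about 1e-6, i.e. six 9s
seconds_down = pair_down * SECONDS_PER_YEAR

print(f"pair availability = {1.0 - pair_down:.6f}")     # 0.999999
print(f"down time = {seconds_down:.1f} seconds/year")   # 31.5 seconds/year
```

Under these assumptions the probability of failure is squared, so three 9s per system become six 9s for the pair, consistent with the roughly 32 seconds per year cited above.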
E. What is Needed
For many applications, downtimes on the order of hours per year are unacceptable or even intolerable. The cost of downtime can range from $1,000 per hour to over $100,000 per hour. If a Web store is down often, customers will become frustrated and go to another Web site. If this happens often enough, lost sales will quickly turn into lost customers.
If a major stock exchange is down for just a few minutes, it will make the newspapers. If a 911 system is down for a few minutes, the result could be the loss of life due to a cardiac arrest or a building destroyed by fire. The cost of a few seconds of down time in an in-hospital patient monitoring system could be measured in lives rather than in dollars.
Replicating systems as described above can dramatically improve system availability. However, some of these systems are quite expensive, costing millions of dollars. To provide a standby system costing this much is often simply not financially feasible.
What is needed is a method for substantially achieving the availability of a replicated system at little if any additional cost. The present invention fulfills such a need.