Certain terms used in the “Background of the Invention” are defined in the “Definitions” section below.
1.1 Computer Applications
Much of our daily lives is augmented by computers. The many services upon which we depend, such as banking, communications, air and rail travel, online shopping, credit-card and debit-card purchases, mail and package delivery, and electric-power distribution, are all managed by computer applications.
In its simplest form, as shown in FIG. 1, a typical computer application is generally implemented as a computer program (1) running in a computer (2). A computer program is basically a set of computer-encoded instructions. It often is called an executable because it can be executed by a computer. A computer program running in a computer is called a process, and each process has a unique identification known to the computer. Many copies of the same computer program can be running in a computer as separately distinguishable processes.
An application typically includes multiple interacting processes.
1.2 Application Database
With reference to FIG. 1, an application often depends upon a database (3) of information that the application maintains to record its current state. Often, the information in the database is fundamental to the operation of the application, to the decisions it makes, and to its delivery of services to the end users.
The database may be stored in persistent storage such as a disk for durability, it may be stored in high-speed memory for performance, or it may use a combination of these storage techniques. The database may be resident in the same computer as the application program, it may be resident in another computer, it may be implemented as an independent system, or it may be distributed among many systems.
A database generally includes one or more files or tables, though it may be just a random collection of unorganized data. Each file or table typically represents an entity set such as “employees” or “credit cards.” A file comprises records, each depicting an entity-set member such as an employee. A table comprises rows that define members of an entity set. A record comprises fields that describe entity-set attributes, such as salary. A row comprises columns that depict attributes of the entity set. In this specification, “files” are equivalent to “tables;” “records” are equivalent to “rows;” and “fields” are equivalent to “columns.”
1.3 Requests
With further reference to FIG. 1, incoming end users (4) generate requests (5) to be processed by the computer application. End users may be people, other computer applications, other computer systems, or electronic devices such as electric power meters. In this specification, the term “end user” means any entity that can influence an application and/or can request or use the services that it provides.
An example of an incoming request from an end user is a request for a bank-account balance. Another example is an alert that a circuit breaker in a power substation has just tripped. In some cases, there may be no incoming request. For instance, a computer application may on its own generate random events for testing other applications.
1.4 Request Processing
As shown in FIG. 1, the application receives a request (5) from an incoming end user (4). As part of the processing of this request, the application may make certain modifications to its database (6).
The application can read the contents of its database (7). As part of the application's processing, it may read certain information from its database to make decisions. Based on the request received from its incoming end user and the data in its database, the application delivers certain services (8) to its outgoing end users (9).
1.5 Services
A service may be delivered by an application process as the result of a specific input from an end user, such as providing an account balance in response to an online banking query. Another example of a service is the generation of a report upon a request from an end user or a report that is generated periodically.
Alternatively, the application program may spontaneously deliver a service, either on a timed basis or when certain conditions occur. For instance, an alarm may be generated to operations staff if the load being carried by an electric-power transmission line exceeds a specified threshold.
The end users providing the input to the application may or may not be the same end users as those that receive its services.
1.6 Availability
The availability of a computer system and the services it provides is often of paramount importance. For instance, a computer system that routes payment-card transactions for authorization to the banks that issued the payment cards must always be operational. Should the computer system fail, credit cards and debit cards cannot be used by the card holders. They can only engage in cash transactions until the system is repaired and is returned to service.
The failure of a 911 system could result in the destruction of property or the loss of life. The failure of an air-traffic control system could ground all flights in a wide area.
In mission-critical systems such as these, it is common to deploy two or more computer systems for reliability. Should one computer system fail, the other computer system is available to carry on the provision of services.
1.7 Redundant System
The availability of a computing system can be significantly enhanced by providing a second system that can continue to provide services to the end users should one system fail. The two systems can be configured as an active/backup system or as an active/active system. The systems are interconnected via a computer network so they can interact with each other.
In an active/backup system (FIG. 2), one system (the production system) processes all transactions. It keeps its backup system synchronized by replicating database changes to it so that the backup system is ready to take over processing immediately should the production system fail.
In an active/active system (FIG. 3), both systems are processing transactions. They keep each other synchronized via bidirectional data replication. When one system processes a transaction and makes changes to its database, it immediately replicates those changes to the other system's database. In that way, a transaction can be routed to either system and be processed identically. Should one system fail, all further transactions are routed to the surviving system.
1.8 The Calculation of Availability
1.8.1 The Prior-Art Calculation of System Availability
There is a large body of analytical techniques to calculate the reliability of a system. These techniques depend upon several parameters, such as the mean (average) time between failures of a single system (MTBF) and the mean (average) time to repair the system (MTR) once it has failed.
A common method to determine the availability of a redundant system uses the estimated MTBF and MTR of each system comprising the redundant system. The availability of a single system is defined as the probability that the system will be operational. If the system experiences a failure on average every MTBF hours and requires MTR hours to repair, it will be down MTR/MTBF of the time, and it will be operational (MTBF−MTR)/MTBF of the time. Thus,
$$\text{Availability of a single system} = \frac{\mathrm{MTBF} - \mathrm{MTR}}{\mathrm{MTBF}} = 1 - \frac{\mathrm{MTR}}{\mathrm{MTBF}} \tag{1}$$
Let the availability of a single system be represented by a. Then,
$$a = 1 - \frac{\mathrm{MTR}}{\mathrm{MTBF}} \tag{2}$$
The probability that a single system will be in a failed state is one minus the probability that it is operational. Let f be the probability that a single system is failed:
$$f = 1 - a = \frac{\mathrm{MTR}}{\mathrm{MTBF}} \tag{3}$$
The probability that both systems in a redundant pair will be failed is the probability that one system has failed AND the second system has failed. Assuming that the two systems fail independently, this is the product of the individual failure probabilities. Let the probability of a dual system failure be F. Then, from Equation (3),
$$F = f \cdot f = f^2 = (1 - a)^2 \tag{4}$$
The probability that the redundant system will be operational (that is, at least one of the systems will be operational) is represented by A and is
$$A = 1 - F = 1 - (1 - a)^2 \tag{5}$$
This is the common expression for the availability of a dually redundant system.
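As a worked illustration of Equations (1) through (5), the following Python sketch computes the availability of a single system and of a dually redundant pair. The function names and the sample values (an MTBF of 1,000 hours and an MTR of four hours) are assumptions chosen only for illustration.

```python
def single_availability(mtbf_hours, mtr_hours):
    """Equation (2): a = 1 - MTR/MTBF."""
    return 1.0 - mtr_hours / mtbf_hours


def redundant_availability(a):
    """Equation (5): A = 1 - (1 - a)^2 for a dually redundant system."""
    f = 1.0 - a      # Equation (3): probability that a single system is failed
    F = f * f        # Equation (4): probability that both systems are failed
    return 1.0 - F


# Illustrative values only (assumed, not measured).
a = single_availability(mtbf_hours=1000.0, mtr_hours=4.0)
A = redundant_availability(a)
print(f"single system:  {a:.6f}")
print(f"redundant pair: {A:.6f}")
```

With these sample values, a single system is available 99.6% of the time, and the redundant pair is available about 99.9984% of the time.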
1.8.2 Memoryless Variables
In the above analysis, the time between failures and the time to repair are treated as memoryless random variables, with MTBF and MTR as their means. That means that the probability of an event occurring in some small interval of time, Δt, is independent of what events have occurred in the past, and that the occurrence of an event has no impact on events occurring in the future. The variable is said to be memoryless because no event is affected by the occurrence of any other event.
For instance, assume that MTBF, the mean time between failures, is 1,000 hours. If we look at a system that is operating, we know that, on average, the next failure will occur in 1,000 hours. If we wait 500 hours and the system is still operating, the average time to the next failure still will be 1,000 hours.
Likewise, let the average time to repair the system be four hours. When the system fails, it will take an average of four hours to repair it. However, if the system has been under repair for two hours, and if we ask the service technician what is the estimated time to complete the repair, his answer still will be four hours.
Clearly, memoryless variables for MTBF and MTR do not reflect real-world behavior.
1.8.3 The Exponential Distribution
Memoryless random variables are characterized by the exponential distribution. The exponential distribution for MTBF is shown in FIG. 4 as a probability density function $p_f(t)$ (1) of the form
$$p_f(t) = \frac{e^{-t/\mathrm{MTBF}}}{\mathrm{MTBF}} \tag{6}$$
The product $p_f(t)\,\Delta t$ gives the probability that the system will fail during a time interval Δt beginning at time t, where Δt is arbitrarily small.
As shown in FIG. 4, the probability that the system will fail in the time interval Δt at time $t_i$ (2) is $p_i\,\Delta t$, where $p_i = p_f(t_i)$. The average time that it will take the system to fail is the sum of the possible failure times, each weighted by the probability that the system fails at that time:
$$\text{Average time for the system to fail} = \sum_{i=0}^{\infty} t_i \, p_i \, \Delta t \tag{7}$$
As Δt approaches zero, the summation becomes an integral:
$$\text{Average time for the system to fail} = \int_0^{\infty} t \, p_f(t) \, dt \tag{8}$$
Using our expression for $p_f(t)$ from Equation (6),
$$\text{Average time for the system to fail} = \int_0^{\infty} t \left( \frac{e^{-t/\mathrm{MTBF}}}{\mathrm{MTBF}} \right) dt = \mathrm{MTBF} \tag{9}$$
Thus, we should expect the system to fail in an average time of MTBF.
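The value of the integral in Equation (9) follows from integration by parts. A short derivation is sketched here, writing M for MTBF (a shorthand introduced only for brevity):

```latex
\begin{aligned}
\int_0^{\infty} t \, \frac{e^{-t/M}}{M} \, dt
  &= \Big[ -t \, e^{-t/M} \Big]_0^{\infty} + \int_0^{\infty} e^{-t/M} \, dt \\
  &= 0 + \Big[ -M \, e^{-t/M} \Big]_0^{\infty} \\
  &= M
\end{aligned}
```

The boundary term vanishes because the exponential decays faster than t grows.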
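If we wait for a time T and the system has not yet failed, then the average additional time to the next failure, measured from T, is
$$\text{Average time to next failure} = \int_T^{\infty} (t - T) \left( \frac{e^{-(t - T)/\mathrm{MTBF}}}{\mathrm{MTBF}} \right) dt = \mathrm{MTBF} \tag{10}$$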
The average time to the next failure is still MTBF. Random variables characterized by the exponential distribution are indeed memoryless.
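The memoryless property can also be checked numerically. The following Python sketch draws failure times from an exponential distribution and compares the mean time to failure measured from time zero with the mean remaining time for systems that have already survived for a time T. The values of MTBF and T, the sample count, and the random seed are assumptions chosen only for illustration.

```python
import random

MTBF = 1000.0   # hours; illustrative value
T = 500.0       # hours already survived; illustrative value

rng = random.Random(42)  # fixed seed so that the run is repeatable

# Draw failure times from an exponential distribution with mean MTBF.
samples = [rng.expovariate(1.0 / MTBF) for _ in range(200_000)]

# Mean time to failure measured from time zero.
mean_from_zero = sum(samples) / len(samples)

# Keep only the systems still running at time T and measure the time remaining.
remaining = [t - T for t in samples if t > T]
mean_remaining = sum(remaining) / len(remaining)

print(f"mean time to failure from t=0: {mean_from_zero:.0f} hours")
print(f"mean remaining time after surviving {T:.0f} hours: {mean_remaining:.0f} hours")
```

Both estimates should come out close to MTBF, apart from sampling noise.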
The integral of $p_f(t)$ from 0 to t gives the probability that the system will fail at some time within the time t. This is the cumulative distribution, $P_f(t)$ (3):
$$P_f(t) = \int_0^{t} \left( \frac{e^{-\tau/\mathrm{MTBF}}}{\mathrm{MTBF}} \right) d\tau = 1 - e^{-t/\mathrm{MTBF}} \tag{11}$$
As t becomes large, $P_f(t)$ approaches 1. That is, the probability that the system will fail at some point is 1.
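To make the cumulative distribution concrete, the following Python sketch evaluates the closed form of Equation (11) at a few elapsed times. The function name and the sample MTBF of 1,000 hours are assumptions chosen only for illustration.

```python
import math


def failure_probability_by(t_hours, mtbf_hours):
    """Closed form of Equation (11): P_f(t) = 1 - exp(-t/MTBF)."""
    return 1.0 - math.exp(-t_hours / mtbf_hours)


MTBF = 1000.0  # hours; illustrative value

# Probability that the system has failed by each elapsed time.
for t in (100.0, 1000.0, 5000.0):
    p = failure_probability_by(t, MTBF)
    print(f"t = {t:6.0f} hours: P_f(t) = {p:.4f}")
```

At t equal to MTBF, the probability that the system has already failed is 1 − 1/e, or about 63%.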
1.9 What is Needed
The prior art for calculating estimated availability from any point in time is flawed because it is based on memoryless random variables. The calculated average time to the next failure, MTBF, is always the same regardless of how long a system has been in service.
What is needed are methods to determine the actual availability of components as a function of time. This actual availability then can be monitored, and the MTTF (mean time to failure) for the system (that is, the expected time to the next failure from the current time) can be calculated continuously so that action can be taken should the MTTF fall below a specified threshold.
The MTTF can also be used to estimate the availability of a redundant system. If the system uses staggered starts, the MTTF will be much greater than if the two systems had been started simultaneously.