Enterprises depend on the availability of the systems supporting their day-to-day operations. A system is called available if it is up and running and is producing correct results. In a narrow sense, the availability of a system is the fraction of time it is available. MTBF denotes the mean time before failure of such a system, i.e. the average time a system is available before a failure occurs (this is the reliability of the system). Ideally, the availability of a system is 1. Today, a system can claim high availability if its availability is about 99.999% (it is called fault tolerant if its availability is about 99.99%). J. Gray and A. Reuter, “Transaction processing: Concepts and Techniques”, San Mateo, Calif.: Morgan Kaufmann 1993, give further details on these aspects.
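To make these figures concrete, the following minimal sketch converts an availability fraction into the downtime it permits per year. It also uses the common steady-state relation availability = MTBF / (MTBF + MTTR); the mean time to repair (MTTR) is not defined in the text above and is introduced here as an assumption, as are the function names.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # non-leap year


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a system is available.

    Assumes the common steady-state relation MTBF / (MTBF + MTTR);
    MTTR (mean time to repair) is an assumption of this sketch.
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Downtime per year permitted by a given availability fraction."""
    return (1.0 - avail) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for label, avail in [("fault tolerant", 0.9999), ("high availability", 0.99999)]:
        print(f"{label}: {downtime_minutes_per_year(avail):.2f} minutes of downtime per year")
```

Under these assumptions, an availability of 99.99% corresponds to somewhat less than one hour of downtime per year, and 99.999% to roughly five minutes.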
Availability of a certain system or application has at least two aspects: in a first, narrow sense it relates to the question of whether a certain system is active at all and providing its services; in a second, wider sense it relates to the question of whether this service is provided in a timely fashion, offering sufficient responsiveness.
As outlined in further detail by D. Loshin, “High performance computing demystified”, Academic Press, Inc., 1994, by K. Hwang, “Advanced computer architecture: Parallelism, Scalability, Programmability”, McGraw-Hill, Inc., 1993, and by J. Gray and A. Reuter, “Transaction processing: Concepts and Techniques”, San Mateo, Calif.: Morgan Kaufmann 1993, one fundamental mechanism to improve availability is based on “redundancy”: the availability of hardware is improved by building clusters of machines, and the availability of software is improved by running the same software in multiple address spaces.
With the advent of distributed systems, techniques have been invented which use two or more address spaces on different machines running the same software to improve availability (often called active replication). Further details on these aspects may be found in S. Mullender, “Distributed Systems”, ACM Press, 1993. When two or more address spaces on the same machine run the same software and take their requests from a shared input queue, the technique of warm backups is generalized to the hot pool technique.
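The hot pool idea, several identical workers drawing requests from one shared input queue, can be illustrated with a minimal sketch. Threads stand in for the separate address spaces here purely for brevity; that substitution, the function name `run_hot_pool`, and its parameters are assumptions of this example and are not taken from the cited literature. The sketch also assumes that the handler does not raise.

```python
import queue
import threading
from typing import Any, Callable, List, Sequence


def run_hot_pool(handler: Callable[[Any], Any],
                 payloads: Sequence[Any],
                 pool_size: int = 3) -> List[Any]:
    """Serve all payloads with pool_size identical members sharing one queue."""
    inbox: "queue.Queue" = queue.Queue()   # the shared input queue
    results = {}
    results_lock = threading.Lock()

    def pool_member() -> None:
        # Every member runs the same code; whichever is idle takes the next request.
        while True:
            item = inbox.get()
            if item is None:               # sentinel: shut this member down
                return
            index, payload = item
            outcome = handler(payload)
            with results_lock:
                results[index] = outcome

    members = [threading.Thread(target=pool_member) for _ in range(pool_size)]
    for member in members:
        member.start()
    for index, payload in enumerate(payloads):
        inbox.put((index, payload))
    for _ in members:                      # one sentinel per member
        inbox.put(None)
    for member in members:
        member.join()
    return [results[i] for i in range(len(payloads))]


if __name__ == "__main__":
    print(run_hot_pool(lambda x: x * 2, [1, 2, 3, 4]))
```

Because no request is bound to a particular member, the remaining members keep draining the queue if one of them stalls, which is the availability benefit the technique aims at.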
C. R. Gehr et al., “Dynamic Server Switching for Maximum Server Availability and Load Balancing”, U.S. Pat. No. 5,828,847 teaches a dynamic server switching system relating to availability in the narrow sense defined above. The dynamic server switching system maintains a static and predefined list (a kind of profile) in each client which identifies the primary server for that client and the preferred communication method, as well as a hierarchy of successive secondary server and communication method pairs. In the event that the client does not have its requests served by the designated primary server or the designated communication method, the system traverses the list to ascertain the identity of the first available alternate server and communication method pair. This system enables a client to redirect requests from an unresponsive server to a predefined alternate server. In this manner, the system provides reactive server switching for service availability.
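The list traversal described above can be sketched as follows. This is an illustration of the general idea only, not the implementation disclosed by Gehr; the names `first_available` and `is_reachable`, the profile layout, and the server and method labels in the usage example are all hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# One profile entry: (server, communication method). Layout is hypothetical.
ServerMethod = Tuple[str, str]


def first_available(profile: Sequence[ServerMethod],
                    is_reachable: Callable[[str, str], bool]) -> Optional[ServerMethod]:
    """Walk the static profile in order and return the first pair that responds.

    profile[0] plays the role of the primary server and preferred method;
    later entries are the successive alternates. Returns None if no pair responds.
    """
    for server, method in profile:
        if is_reachable(server, method):
            return (server, method)
    return None


if __name__ == "__main__":
    profile = [("primary", "tcp"), ("backup-1", "tcp"), ("backup-2", "dialup")]
    down = {("primary", "tcp"), ("backup-1", "tcp")}
    # A stand-in probe; a real client would attempt an actual connection.
    print(first_available(profile, lambda s, m: (s, m) not in down))
```

The sketch makes two properties of a static list visible: every unresponsive entry ahead of a working one must be probed in turn, and the traversal can end without finding any server.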
In spite of improvements of availability in the narrow sense defined above, this teaching suffers from several shortcomings. Gehr's teaching provides a reactive response only in case a primary server cannot be reached at all. There are no proactive elements which prevent a client from requesting service from a non-responsive server. As the list of primary and alternate servers is statically predefined, there may be situations in which no server can be found at all, or in which a server is found only after several non-responsive alternate servers have been tested. Moreover, Gehr's teaching does not allow for dynamic workload balancing that would improve availability in the wider sense, i.e. the responsiveness. According to Gehr, different clients might be controlled by different lists of servers, which allows only for a rudimentary and static workload balancing, as different clients might send their requests to different servers. In a highly dynamic, worldwide operating network, where clients and servers permanently enter or leave the network and where the access pattern to the servers may change from one moment to the next, Gehr's teaching is not adequate to improve the responsiveness.
Another area of technology to be mentioned is that of Transaction Processing monitors (TP monitors). TP monitors were invented more than three decades ago to make effective use of expensive system resources (J. Gray and A. Reuter, “Transaction processing: Concepts and Techniques”, San Mateo, Calif.: Morgan Kaufmann 1993): ever increasing numbers of users had to be supported by a system, and it turned out that native operating system functionality did not suffice for this. A TP monitor, as a layer on top of the operating system, manages system resources at a much finer granularity and assigns them with care, only if needed and only for the duration needed. As a result, for one and the same machine and operating system, a given application can support orders of magnitude more users when implemented in a TP monitor than when implemented based on native operating system features.
The very complex and sophisticated TP monitor technology is, however, primarily limited to a single server and thus does not solve the availability problem of a distributed network of application servers.
With the advent of middleware supporting distributed systems, object request brokers (be it a CORBA implementation, DCOM, or one based on the Java beans model) favor commodity cluster environments, i.e. environments which are composed of relatively cheap hardware; refer for instance to G. F. Pfister, “In search of clusters”, 2nd edition, Prentice Hall PTR, 1998. In such environments, the service-providing software components are simply replicated on multiple machines to ensure scalability. But this requires a mechanism which assigns service requests to the various service providers and ensures the effective exploitation of the cluster resources. As a consequence, the middleware implementing the distributed system has to deal with problems similar to those traditional TP monitors dealt with before (this is one of the reasons why such systems are considered “TP monitor like systems” today). In fact, the middleware represents a single and central TP monitor.
These “TP monitor like systems” are much too complicated both to develop and to administer. Moreover, they themselves consume a significant amount of processing resources.
Despite all of this progress, further improvements are urgently required to support enterprises in increasing the availability of their applications and to allow, for instance, for electronic business on a 7 (days) * 24 (hours) basis; due to the ubiquity of worldwide computer networks, at any point in time somebody might be interested in accessing a certain application server.