1. Field of the Invention
The invention generally relates to distributed fault-tolerant, high-availability systems that are especially suited to applications requiring high throughput, scalability, and extremely high availability. The invention relates more particularly to a software architecture that distributes processing load of an application among multiple processors and performs fault recovery and load redistribution.
2. Background
High throughput, high availability, or both are principal concerns in various computer applications. Such applications may, for example, carry out complex scientific calculations or control essential services, such as city water purification or power grid control for a large population region. Telecommunications is another prime example. A large telecommunications network with thousands of concurrent users requires very high throughput to handle extensive telecommunications traffic. A telecommunications network that fails because of a computer fault can create widespread havoc and huge economic losses. The degree of fault-tolerance in a telecommunications network should be measured in numbers of hours of down-time over many years of continuous operation, and preferably over decades of continuous operation. Furthermore, the throughput capacity and rate of throughput should not only meet current requirements, but also be capable of expansion to accommodate future requirements.
In the past, computer systems have provided fault tolerance by using cold standby, hot standby, and warm standby approaches. These systems generally have one active processor and one standby processor for each application. Each of the approaches has advantages and disadvantages that are well understood. Similarly, systems that distribute the processing load of an application across multiple processors are known in the available art. While systems based on a dual-processor fault-tolerant architecture or a multi-processor distributed architecture are known, systems that combine the fault-tolerant and distributed capabilities of the available art to achieve higher throughput, reliability, scalability, and effective usage of hardware are not common. Existing systems that address these throughput and reliability issues are very costly and inflexible because of their complexity. Such systems are usually based on a specific system hardware architecture and assume a specific vertical and horizontal distribution of applications on the processors. For this reason, such solutions cannot be reused from one platform to another without redesigning the system, which results in higher system cost. The architecture used by these systems also limits application operation to one mode: it does not allow different applications to operate in different modes, for example, one application in distributed fault-tolerant mode (n active/1 standby processors or n active/n standby processors) and another application in pure fault-tolerant mode (1 active/1 standby processor). A uniform software architecture capable of handling such high throughput with such high availability, and of addressing the aforementioned issues of existing systems, would be very cost effective and would drastically reduce overall system development time.
This type of architecture could be useful to a large number of equipment vendors and service providers, as well as to others who face such extreme requirements. Thus, such a computer application software architecture must adapt to a variety of different computer hardware platforms and to a variety of different computer operating systems. Furthermore, it must be modular, open, flexible, and designed to permit simple and expeditious customization. It must allow seamless integration into a provider's system, regardless of the hardware platform and operating system. Measured against these requirements, no existing available art has the aforementioned attributes needed for certain demanding applications. Accordingly, a software architecture meeting all of the aforementioned requirements would be highly advantageous.
In addition to the features above, the invented architecture provides other features unknown in the available art. These features include the ability to recover from multiple software and hardware failures in distributed systems and the ability to provide dynamic load balancing and load redistribution when a processor fails or is dynamically introduced into an operational system.
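By way of illustration only, load redistribution upon the failure or introduction of a processor can be sketched as follows. The sketch is not the method of the invention, which is set out in the detailed description; it assumes a hypothetical mapping of work units to processor identifiers and a simple balancing rule (move units from the most-loaded to the least-loaded processor until loads differ by at most one). The function name `redistribute` and the unit and processor names are assumptions introduced for this example.

```python
def redistribute(assignment, active):
    """Return a new unit -> processor mapping that uses only `active` processors.

    Hypothetical policy for illustration: units on processors that are no
    longer active are handed to the least-loaded active processor, then units
    are moved from the most-loaded to the least-loaded processor until the
    loads differ by at most one unit.
    """
    if not active:
        raise ValueError("at least one active processor is required")

    # Group surviving units by processor; collect units left without a host.
    load = {proc: [] for proc in sorted(active)}
    orphans = []
    for unit, proc in sorted(assignment.items()):
        if proc in load:
            load[proc].append(unit)
        else:
            orphans.append(unit)

    # Fault recovery: re-home units whose processor failed.
    for unit in orphans:
        target = min(load, key=lambda p: (len(load[p]), p))
        load[target].append(unit)

    # Load balancing: also covers a processor that was newly introduced.
    while True:
        busiest = max(load, key=lambda p: (len(load[p]), p))
        idlest = min(load, key=lambda p: (len(load[p]), p))
        if len(load[busiest]) - len(load[idlest]) <= 1:
            break
        load[idlest].append(load[busiest].pop())

    return {unit: proc for proc, units in load.items() for unit in units}
```

Calling the function with the current assignment and the set of processors that are presently active covers both cases: a failed processor is simply absent from the set, and a newly introduced processor is present with no units assigned.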
To further explain the invented architecture, the general concepts and terms used in the description are defined below. Concepts specific to the invention are described in the detailed description of the invention.
General Concepts and Terms
The term application refers to any program that is not part of the system software or architecture software.
The term user application denotes an application that uses the services of some other application. In the description, the terms service user and user application are used interchangeably.
The term provider application denotes an application that provides the service to another application. In the description, the terms service provider and provider application are used interchangeably.
The term architecture component denotes a software component that is required by and supplied as part of the invented Distributed Fault-Tolerant/High-Availability architecture.
The term software component refers to a component of a node or processor. A software component may be an application, a software component of the architecture, or a component of the system software.
The term system software denotes a software component that provides operating system services—for example, memory management, timer management, inter/intra processor communication, etc.
The terms processor and node are used interchangeably to mean an executable or binary image containing one or more applications and required system software. This executable must have, but is not limited to, the following attributes:
The executable must contain one or more computer application(s).
The executable must contain system software providing system services required by the application to operate.
The executable must contain software components required by the Distributed Fault-Tolerant/High-Availability architecture.
Software components contained within the executable must be able to exchange information with software components contained within other such executables.
Each such executable must have a unique, globally-known address, which is used to reference the executable. This address is known as a processor identifier.
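The attributes listed above can be modeled, purely as an illustrative sketch, by a small data structure. The class names `Node` and `Network`, the field names, and the in-memory `inbox` used to stand in for inter-processor communication are all assumptions made for this example and are not part of the described architecture.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of a processor/node as defined above."""
    processor_id: int               # unique, globally-known address
    applications: list              # one or more applications
    system_software: list           # e.g. memory or timer management
    architecture_components: list   # components required by the architecture
    inbox: list = field(default_factory=list)

    def missing_attributes(self):
        """Return the names of required attributes this node lacks."""
        missing = []
        if not self.applications:
            missing.append("application")
        if not self.system_software:
            missing.append("system software")
        if not self.architecture_components:
            missing.append("architecture component")
        return missing


class Network:
    """Keeps processor identifiers unique and lets nodes exchange information."""

    def __init__(self):
        self._nodes = {}

    def register(self, node):
        if node.processor_id in self._nodes:
            raise ValueError(f"processor identifier {node.processor_id} already in use")
        missing = node.missing_attributes()
        if missing:
            raise ValueError("node lacks: " + ", ".join(missing))
        self._nodes[node.processor_id] = node

    def send(self, src_id, dst_id, payload):
        """Deliver a payload to the node addressed by its processor identifier."""
        self._nodes[dst_id].inbox.append((src_id, payload))
```

In this sketch the processor identifier is the only handle by which one node refers to another, which mirrors the role the definition gives it.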
The terms interface and API are used interchangeably to denote a collection of functions presented by a software component. Functionality provided by the software component can be accessed via functions defined and provided on the interface. These functions are called interface functions.
The term entity identifier is used to refer to the unique and globally-known name or address of a software component. An entity identifier names the software component itself and does not reflect or refer to any particular copy of the software component in the system.
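The distinction between a software component and its copies can be sketched as follows. This is an assumption-laden illustration: the class `EntityDirectory`, its methods, and the idea of locating a particular copy by pairing an entity identifier with a processor identifier are introduced here for clarity only.

```python
from collections import defaultdict


class EntityDirectory:
    """Hypothetical directory mapping an entity identifier to the processors
    that host a copy of that software component."""

    def __init__(self):
        self._copies = defaultdict(set)

    def add_copy(self, entity_id, processor_id):
        """Record that a copy of the entity runs on the given processor."""
        self._copies[entity_id].add(processor_id)

    def copies_of(self, entity_id):
        """Return the processor identifiers hosting copies, in sorted order."""
        return sorted(self._copies.get(entity_id, ()))
```

Under this model the entity identifier alone never selects a copy; a particular copy is selected only when a processor identifier is supplied as well.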
The term fault refers to a defect in a software/hardware component with the potential to cause a failure in the system.
The term failure indicates incorrect behavior of a system due to the presence of a fault. A failure of a system occurs when the behavior of the system deviates from the specified behavior of the system.
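The difference between a fault and a failure can be illustrated with a short, hypothetical example. The specification assumed here (the mean of an empty list is 0.0) and the function names are inventions for this sketch. The defect in `faulty_mean` is present at all times, but a failure is observed only when an input causes the behavior to deviate from the specified behavior.

```python
def specified_mean(values):
    """Specified behavior (assumed for this example): empty input yields 0.0."""
    return sum(values) / len(values) if values else 0.0


def faulty_mean(values):
    """Implementation containing a fault: the empty-input case is not handled."""
    return sum(values) / len(values)


def observe(values):
    """Return 'ok' if the implementation matches the specification for this
    input, or 'failure' if its behavior deviates from it."""
    expected = specified_mean(values)
    try:
        actual = faulty_mean(values)
    except ZeroDivisionError:
        return "failure"
    return "ok" if actual == expected else "failure"
```

For a non-empty input the fault stays dormant and no failure is observed; only an empty input activates it.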
The following references provide further information and are hereby incorporated by reference:
A Conceptual Framework for System Fault Tolerance (Technical Report), Walter L. Heimerdinger and Charles B. Weinstock, Software Engineering Institute (CMU/SEI-92-033).
Distributed Systems (2e), Sape Mullender, Addison-Wesley, 1993.
Fault Injection Techniques and Tools, Mei-Chen Hsueh et al., April 1997, IEEE Computer.
Fault Tolerance in Distributed Systems, Pankaj Jalote, PTR Prentice Hall, 1994.
Software-Based Replication for Fault Tolerance, Rachid Guerraoui and Andre Schiper, April 1997, IEEE Computer.