A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
The present invention relates to the field of electronic commerce (e-commerce) and particularly to electronic systems in capital markets and other e-commerce applications with high availability and scalability requirements.
Historically, mission-critical applications have been written for and deployed on large mainframes, typically with built-in (hardware) or low-level operating system (software) fault-tolerance. In some prior art, such fault-tolerance mechanisms include schemes where multiple central processing units (CPUs) redundantly compute each operation and the results are reconciled using a vote (in the case of three-way or greater redundancy) or other logical comparison of the redundant outcomes in order to detect and avoid failures. In some cases a fault-stop behavior is implemented where it is preferred to stop and not execute a program operation when an error or other undesired condition would result. This fault-stop operation helps to minimize the propagation of errors to other parts of the system. In other implementations, elaborate fault-recovery mechanisms are implemented. These mechanisms typically recover only from hardware failures, since application failures tend to be specific to the particular application software. To detect errors in application software, vast amounts of error-handling code have been required; certain financial applications have devoted as much as 90% of their code to error detection and correction. Because of the enormous complexity of such software applications, it is nearly impossible to entirely eliminate failures that prevent the attainment of reliable and continuous operation.
Increasingly, systems need to be available on a continuous basis, 24 hours per day, 7 days per week (24/7 operation). In such nonstop environments it is undesirable for a system to be unavailable while system components are being replaced or software and hardware failures are detected. In addition, today's applications must scale to increasing user demands that in many cases exceed the processing capabilities of a single computer, regardless of size from small to mainframe. When the system load cannot be handled on a single machine, it has been difficult and costly to obtain a larger machine and move the application to the larger machine without downtime. Attempts to distribute work over two or more self-contained machines are often difficult because the software typically has not been written to support distributed computation.
For these reasons, the need for computational clusters has increased. In computational clusters, multiple self-contained nodes are used to collaboratively run applications. Such applications are specifically written to run on clusters from the outset and once written for clusters, applications can run on any configuration of clustered machines from low-end machines to high-end machines and any combination thereof. When demand increases, the demand is easily satisfied by adding more nodes. The newly added nodes can utilize the latest generation of hardware and operating systems without requiring the elimination or upgrading of older nodes. In other words, clusters tend to scale up seamlessly while riding the technology curve represented in new hardware and operating systems. Availability of the overall system is enhanced when cluster applications are written so as not to depend on any single resource in the cluster. As resources are added to or removed from a cluster, applications are dynamically rescheduled to redistribute the workload. Even in the case where a significant portion of the cluster is down for service, the application can continue to run on the remaining portion of the cluster. This continued operation has significant advantages particularly when employed to implement a cluster-based component architecture of the type described in the above-identified cross-referenced application entitled MARKET ENGINES HAVING EXTENDABLE COMPONENT ARCHITECTURE.
While clustering technology shows promise at overcoming problems of existing systems, there exists a need for practical clustering systems. In practical clustering systems, it is undesirable for each application in a cluster system to manage its own resources. First, it is inefficient to have each application solve the same resource management problems. Second, scheduling for conflict resolution and load-balancing (which is important for scalability) is more effectively solved by a common flexible (extensible) resource manager that solves the common problem once, instead of solving the problem specifically for each application. Furthermore, failure states tend to be complex when each application behaves differently as a result of failures and with such differences, it is almost impossible to model the impact of such failures from application to application running on the cluster. To overcome these problems, commercial and academic projects have arisen with the objective of providing a clustering architecture that provides isolation between physical systems and the applications they execute.
To date, however, proposed clustering architectures are complex and can handle only a limited number of specific system failures. In addition, proposed clustering software does not appropriately scale up across multiple sites. There is a need, therefore, for a simple and elegant clustering architecture that includes fault-tolerance and load-balancing, that is extendable over many computer systems and that has a flexible interface for applications. In such an architecture, the number of failure states needs to be kept low so that extensive testing is possible, rendering the system more predictable. Hardware as well as software failures need to be detected and resources need to be rescheduled automatically, both locally and remotely. Rescheduling needs to occur when a particular application or resource is in high demand. However, rescheduling should be avoided when unnecessary because rescheduling can degrade application performance. When possible, rescheduling should occur only in response to resource shortages or to avoid near-term anticipated shortages. If the system determines that resource requirements are likely to soon exceed the capacity of a system element, then the software might appropriately reschedule to avoid a sudden near-term crunch. The result of this "anticipatory" rescheduling is avoidance of resource bottlenecks and thereby improvement in overall application performance. The addition and removal of components and resources needs to occur seamlessly in the system.
In view of the above background, it is an object of the present invention to provide an improved fault-tolerance framework for an extendable computer architecture.
The present invention is a computer system having a fault-tolerance framework in an extendable computer architecture. The computer system is formed of clusters of nodes where each node includes computer hardware and operating system software for executing jobs that implement the services provided by the computer system. Jobs are distributed across the nodes under control of a hierarchical resource management unit. The resource management unit includes hierarchical monitors that monitor and control the allocation of resources.
In the resource management unit, a first monitor, at a first level, monitors and allocates elements below the first level. A second monitor, at a second level, monitors and allocates elements at the first level. The framework is extendable from the hierarchy of the first and second levels to higher levels where monitors at higher levels each monitor lower-level elements in a hierarchical tree. If a failure occurs down the hierarchy, a higher level monitor restarts an element at a lower level. If a failure occurs up the hierarchy, a lower-level monitor restarts an element at a higher level. While it may be adequate to have two levels of monitors to keep the framework self-sufficient and self-repairing, more levels may be efficient without adding significant complexity. It is possible to have multiple levels of this hierarchy implemented in a single process.
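The bidirectional restart relationship described above can be illustrated with a minimal in-process sketch. The class and function names here are assumptions for illustration only; a real implementation would involve separate processes and heartbeat messages rather than flags.

```python
# Sketch of the two-level monitor hierarchy: a second-level monitor
# restarts failed first-level elements (down the hierarchy), and live
# first-level elements restart a failed second-level monitor (up the
# hierarchy). All names are illustrative, not from the specification.

class Element:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.restarts = 0

    def restart(self):
        self.alive = True
        self.restarts += 1

def sweep(first_level, second_level):
    """One monitoring pass in each direction of the hierarchy."""
    # Down the hierarchy: restart any failed first-level element.
    if second_level.alive:
        for elem in first_level:
            if not elem.alive:
                elem.restart()
    # Up the hierarchy: restart a failed second-level monitor.
    if not second_level.alive and any(e.alive for e in first_level):
        second_level.restart()

agents = [Element("agent-1"), Element("agent-2")]
coordinator = Element("coordinator")

agents[0].alive = False        # simulate a first-level failure
sweep(agents, coordinator)     # restarted from above

coordinator.alive = False      # simulate a second-level failure
sweep(agents, coordinator)     # restarted from below
```

Because each level watches the other, the framework remains self-repairing with no single monitoring process as a point of failure.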
In some embodiments, each of the monitors includes termination code that causes an element to terminate if duplicate elements have been restarted for the same operation. The termination code in one embodiment includes suicide code whereby an element will self-destruct when the element detects that it is an unnecessary duplicate element.
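One way to picture the suicide-code behavior is a shared claim on each operation: the first element to claim an operation keeps running, and any later duplicate detects the existing owner and terminates itself. The registry and all names below are assumptions for the example, not part of the specification.

```python
# Illustrative sketch of duplicate termination ("suicide code"): an
# element exits when it detects that another element already owns the
# same operation.

class Registry:
    """Tracks which element instance currently owns each operation."""
    def __init__(self):
        self.owners = {}

    def claim(self, operation, instance_id):
        # First claimant wins; later duplicates learn they lost.
        if operation not in self.owners:
            self.owners[operation] = instance_id
        return self.owners[operation] == instance_id

def run_element(registry, operation, instance_id):
    if not registry.claim(operation, instance_id):
        return "terminated"   # self-destruct: unnecessary duplicate
    return "running"

registry = Registry()
print(run_element(registry, "match-orders", "elem-A"))  # running
print(run_element(registry, "match-orders", "elem-B"))  # terminated
```

In a distributed system the registry itself would have to be replicated or lease-based, but the principle is the same: duplicates resolve themselves rather than requiring the monitors to arbitrate.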
In one local level embodiment, the resource management unit includes agents as elements in the first level, where the agents monitor and control the allocation of jobs to nodes, and includes a local coordinator in the second level, where the local coordinator monitors and controls the allocation of jobs to agents. Also, the agents monitor the local coordinator. Failure of a job results in the monitoring agent for the failed job restarting a job to replace the failed job. Failure of an agent results in the restarting of an agent to replace the failed agent. Failure of the local coordinator results in the restarting of a local coordinator to replace the failed local coordinator. In a particular example of a local level embodiment, the agents are implemented as host agents where a host agent monitors only the jobs running on one node.
In a higher level hierarchy, one or more group coordinators are added at a group level above the local level where each group coordinator monitors and controls multiple local coordinators where each local coordinator monitors and controls lower level agents which in turn monitor and control lower level jobs.
In a still higher level hierarchy, one or more universal coordinators are added at a universal level above the group level where each universal coordinator monitors and controls multiple group coordinators, where each group coordinator monitors and controls local coordinators, and where each local coordinator monitors and controls lower level agents which in turn monitor and control lower level jobs.
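The levels described above form a tree of monitoring relationships. The following sketch only shows the shape of that tree; the node representation is an assumption for illustration.

```python
# Sketch of the monitoring hierarchy as a tree: a universal coordinator
# monitors group coordinators, each group coordinator monitors local
# coordinators, each local coordinator monitors agents, and each agent
# monitors jobs.

def build_hierarchy():
    job = {"level": "job", "children": []}
    agent = {"level": "agent", "children": [job]}
    local = {"level": "local-coordinator", "children": [agent]}
    group = {"level": "group-coordinator", "children": [local]}
    universal = {"level": "universal-coordinator", "children": [group]}
    return universal

def levels(node):
    """List the monitoring levels from the root of the tree down."""
    out = [node["level"]]
    for child in node["children"]:
        out.extend(levels(child))
    return out

print(levels(build_hierarchy()))
```

Each added level extends the same parent-monitors-children (and children-restart-parent) pattern, which is why the framework scales without adding new failure semantics.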
The present computer system gives highest priority to maintaining the non-stop operation of the important elements in the processing hierarchy which, in the present specification, are defined as jobs. While other resources such as the computer hardware, the computer operating system software and the communications links are important for any instantiation of a job that provides services, the failure of any particular computer hardware, operating system software, communications link or other element in the system is not critical since, upon such failure, the job is seamlessly restarted using another instantiation of the failing element. The quality of service of the computer system is represented by its ability to keep jobs running independently of whatever resource fails, simply by transferring any job that fails, appears to have failed, or appears about to fail; such transfer is made regardless of the cause and without necessarily diagnosing the cause of failure.
The present computer system utilizes redundancy of simple operations to overcome failures of elements in the system. The redundancy is facilitated using hierarchical monitors that decouple fault-tolerance processes for monitoring failure from the services (executed by application programs that are implemented by jobs).
An indication of the progress of a service is obtained by equipping the applications that provide the service with the capability to process progress messages. The progress messages traverse the vital paths of execution of the service before returning a result to the progress monitor. The progress monitor is independent of the fault-tolerance layer and does not interfere with fault-tolerant operation. Restart of failing jobs is simple and quick, without any need to analyze the cause of failure or measure the progress of the service.
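A progress message of this kind can be sketched as a probe that must pass through every vital stage of the service and come back with a result; if it does not, the job is treated as stalled and restarted, with no diagnosis of which stage broke. The stage names and message format below are assumptions for illustration.

```python
# Sketch of progress-message checking: a message is threaded through
# each vital stage of a service. A returned result indicates the whole
# execution path is live; an exception anywhere means "no progress".

def make_stage(name, healthy=True):
    def stage(message):
        if not healthy:
            raise RuntimeError(f"stage {name} failed")
        message["path"].append(name)
        return message
    return stage

def probe(stages):
    """Send a progress message along the vital path; report outcome."""
    message = {"path": []}
    try:
        for stage in stages:
            message = stage(message)
        return ("ok", message["path"])
    except RuntimeError:
        return ("no-progress", message["path"])

healthy = [make_stage("gateway"), make_stage("matcher"), make_stage("ledger")]
print(probe(healthy))    # ('ok', ['gateway', 'matcher', 'ledger'])

broken = [make_stage("gateway"), make_stage("matcher", healthy=False)]
print(probe(broken))     # ('no-progress', ['gateway'])
```

Note that the probe only answers "is the path live?"; the fault-tolerance layer acts on the answer without needing to know which stage failed.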
The present computer system inherently provides a way to seamlessly migrate operation to new or different hardware and software. Because the present computer system inherently assigns jobs among available resources and automatically transfers jobs when failures occur, the same dynamic transfer capability is used seamlessly, maintaining non-stop operation, for system upgrade, system maintenance or other operation where new or different hardware and software are to be employed.
The present computer system operates such that if any element is in a state that is unknown (such as a partial, possible or imminent failure) then the fault-tolerant operation reacts by assuming a complete failure has occurred and thereby immediately forces the system into a known state. The computer system does not try to analyze the failure or correct the failure for purposes of recovery, but immediately returns to a known good state and recalculates anything that may have happened since the last known good state.
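The revert-and-recalculate behavior can be sketched with a checkpoint plus an event log: on any ambiguous or partial failure, the system discards the suspect state, restores the last known good state, and replays the logged events. The checkpointing scheme below is an illustrative stand-in, not the specification's mechanism.

```python
# Sketch of fail-fast recovery: treat any unknown state as a complete
# failure, revert to the last known good state, and recompute (replay)
# everything that happened since that state.

class Recoverable:
    def __init__(self, state):
        self.state = dict(state)
        self.checkpoint = dict(state)   # last known good state
        self.log = []                   # events since the checkpoint

    def apply(self, key, value):
        self.log.append((key, value))   # record before mutating
        self.state[key] = value

    def commit(self):
        self.checkpoint = dict(self.state)
        self.log.clear()

    def recover(self):
        """On any partial/possible failure: revert, then replay."""
        self.state = dict(self.checkpoint)
        for key, value in self.log:
            self.state[key] = value

system = Recoverable({"position": 0})
system.apply("position", 5)
system.state["position"] = 999    # simulate corruption (unknown state)
system.recover()                  # back to a known state, log replayed
print(system.state["position"])   # 5
```

No attempt is made to analyze the 999 value; the corrupted state is simply abandoned and reconstructed, which keeps the number of failure states the system must handle small.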
The present computer system works well in follow-the-sun operations. For example, the site of actual processing is moved from one location (for example, Europe) to another location (for example, US) where the primary site is Europe during primary European hours and the primary site is US during primary US hours. Such follow-the-sun tends to achieve better performance and lower latency. The decision of when to switch over from one site to another can be controlled by a customer or can be automated.
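Automated follow-the-sun switchover can be reduced to a time-based site-selection rule: the primary site is whichever location is inside its primary business hours at the given time. The site list and hour windows below are assumptions chosen for the Europe/US example in the text.

```python
# Sketch of automated follow-the-sun site selection: pick the primary
# processing site from the current UTC hour. Windows are illustrative.

# (site, start_hour_utc, end_hour_utc) for each site's primary hours
SITES = [
    ("Europe", 7, 15),   # roughly 07:00-15:00 UTC
    ("US", 15, 23),      # roughly 15:00-23:00 UTC
]

def primary_site(hour_utc, default="US"):
    """Return the primary site for a given UTC hour."""
    for site, start, end in SITES:
        if start <= hour_utc < end:
            return site
    return default       # outside both windows, fall back

print(primary_site(9))    # Europe
print(primary_site(18))   # US
```

A customer-controlled switchover would simply override this function's answer; either way, the switch is just another job transfer handled by the same fault-tolerance machinery.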
The present system includes an interface that collects and provides output information and receives input information and commands that allow humans to monitor and control the computer system and each of the components and parts thereof. The interface logs data and processes the logged data to form statistics including up-time, down-time, failure, performance, configuration, versions, through-put and other component and system information. The interface provides data for system availability measurements, transaction tracking and other information that may be useful for satisfying obligations in service agreements with customers.
The present system provides, when desired, customer process isolation. For example, first jobs running on first nodes associated with a first customer are isolated from second jobs associated with a second customer running on second nodes, where the second nodes are different from the first nodes.
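Customer isolation of this kind amounts to a placement constraint on the scheduler: a job may be placed only on nodes reserved for that job's customer, so jobs of different customers never share a node. The node labeling scheme below is an assumption for illustration.

```python
# Sketch of customer process isolation as a scheduling constraint:
# each node is labeled with the customer it is reserved for, and a
# job may only be scheduled onto a matching node.

def schedule(job_customer, nodes):
    """Return a node reserved for this customer, or None if none."""
    for node, customer in nodes:
        if customer == job_customer:
            return node
    return None

nodes = [("node-1", "customer-A"), ("node-2", "customer-A"),
         ("node-3", "customer-B")]

print(schedule("customer-A", nodes))   # node-1
print(schedule("customer-B", nodes))   # node-3
print(schedule("customer-C", nodes))   # None: no isolated capacity
```

Because isolation is expressed as a constraint on job placement rather than as separate infrastructure, the same restart and rescheduling machinery applies unchanged within each customer's node set.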