Fault Tolerance is generally regarded as the ability to mask, or recover from, erroneous conditions in a system once an error has been detected. Fault tolerance is typically required for mission critical systems/applications. Mission critical typically refers to any indispensable operation that cannot tolerate intervention, compromise or shutdown during the performance of its critical function, e.g. any computer process that cannot fail during normal business hours. Exemplary mission critical environments include business-essential process control, finance, health, safety and security. These environments typically monitor, store, support and communicate data that cannot be lost or corrupted without compromising their core function.
One exemplary environment where fault tolerance is essential is in financial markets, and in particular, electronic financial exchanges. The systems that implement an electronic exchange receive and match orders and otherwise consummate trades so as to implement the marketplace and facilitate the exchanges therein. Consistent reliable operation is critical to ensuring market stability, reliability and acceptance.
Fault-tolerant typically describes a computer system or component designed so that, in the event that a component fails, a backup component or procedure can take its place with substantially little or no loss of service. Fault tolerance may be provided with software, or embedded in hardware, or provided by some combination. For example, in a software implementation, the operating system may provide an interface that allows a programmer to “checkpoint” critical data at pre-determined points within a transaction. In a hardware implementation, the programmer need not be aware of the fault-tolerant capabilities of the machine. For example, at a hardware level, fault tolerance may be achieved by duplexing each hardware component, e.g. disks are mirrored, multiple processors are “lock-stepped” together and their outputs are compared for correctness, etc. When an anomaly occurs, the faulty component is determined and taken out of service, but the machine continues to function as usual.
The level of fault tolerance that is required is typically defined by the system requirements, i.e. specifications that clearly state acceptable behavior upon error, e.g. do errors need to be detected and corrected or merely detected, and how quickly must such actions be taken?
One method of providing fault tolerance to a system is to add redundancy to one or more of the critical components of the system. Redundant describes computer or network system components, such as fans, hard disk drives, servers, operating systems, switches, and/or telecommunication links that are installed to back up primary resources in case they fail. Three types of redundancy schemes are commonly used for this purpose:
One-for-N (1:N): There is one standby component for every N active components.
One-for-one (1:1): There is a standby component for each active component.
One-plus-one (1+1): This is similar to the one-for-one scheme except that in the case of one-plus-one, traffic is transmitted simultaneously on both active and standby components. (Traffic is generally ignored on the standby.) An example of one-plus-one redundancy is the 1+1 SONET/SDH APS scheme that avoids loss of data traffic caused by link failure.
When providing redundant operation for processing components, voting logic may be used to compare the results of the redundant logic and choose which component is correct. For example, in Triple Modular Redundancy, three redundant components may be provided wherein, if the result of one component fails to match the other two, which match each other, the ultimate result will be that of the two components that matched.
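The voting logic described above can be illustrated with a minimal sketch. The function name `majority_vote` and the choice to raise an error when no strict majority exists are assumptions made for illustration only; they are not taken from any particular system.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of redundant components."""
    if not outputs:
        raise ValueError("no component outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority is required; with three components this means two matches.
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among redundant outputs")
    return value


# One of three components disagrees; the two matching results win.
result = majority_vote([42, 42, 17])
```

With three components, a single faulty result is outvoted. If all three disagree, the sketch reports the condition rather than guessing.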
A well-known example of a redundant system is the redundant array of independent disks (“RAID”). RAID (originally redundant array of inexpensive disks) is a way of storing the same data in different places (thus, redundantly) on multiple hard disks. By placing data on multiple disks, I/O (input/output) operations can overlap in a balanced way, improving performance. Since the use of multiple disks increases the mean time between failures (MTBF), storing data redundantly also increases fault tolerance. A RAID appears to the operating system to be a single logical hard disk. RAID employs the technique of disk striping, which involves partitioning each drive's storage space into units ranging from a sector (512 bytes) up to several megabytes. The stripes of all the disks are interleaved and addressed in order. In a single-user system where large records, such as medical or other scientific images, are stored, the stripes are typically set up to be small (perhaps 512 bytes) so that a single record spans all disks and can be accessed quickly by reading all disks at the same time. In a multi-user system, better performance requires establishing a stripe wide enough to hold the typical or maximum size record. This allows overlapped disk I/O across drives.
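As a rough illustration of interleaved striping, the following sketch maps a logical byte offset to a disk and an offset on that disk, and reports which disks a record touches. The function names `locate` and `disks_touched` and the simple round-robin layout are assumptions for illustration; actual RAID controllers may lay out data differently.

```python
from typing import Set, Tuple


def locate(logical_offset: int, stripe_unit: int, num_disks: int) -> Tuple[int, int]:
    """Map a logical byte offset to (disk index, byte offset on that disk)."""
    unit_index, within_unit = divmod(logical_offset, stripe_unit)
    stripe_row, disk = divmod(unit_index, num_disks)
    return disk, stripe_row * stripe_unit + within_unit


def disks_touched(offset: int, length: int, stripe_unit: int, num_disks: int) -> Set[int]:
    """Return the set of disks that hold some part of a record (length > 0)."""
    first_unit = offset // stripe_unit
    last_unit = (offset + length - 1) // stripe_unit
    return {unit % num_disks for unit in range(first_unit, last_unit + 1)}


# Small stripe units: a 2048-byte record spans all four disks.
small = disks_touched(0, 2048, stripe_unit=512, num_disks=4)
# Large stripe units: the same record fits on a single disk.
large = disks_touched(0, 2048, stripe_unit=4096, num_disks=4)
```

The two calls mirror the trade-off in the text: small units let one large record be read from all disks at once, while units large enough to hold a whole record leave the other disks free to serve other requests.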
There are at least nine types of RAID plus a non-redundant array (RAID-0):
RAID-0: This technique has striping but no redundancy of data. It offers the best performance but no fault-tolerance.
RAID-1: This type is also known as disk mirroring and consists of at least two drives that duplicate the storage of data. There is no striping. Read performance is improved since either disk can be read at the same time. Write performance is the same as for single disk storage. RAID-1 provides the best performance and the best fault-tolerance in a multi-user system.
RAID-2: This type uses striping across disks with some disks storing error checking and correcting (ECC) information. It has no advantage over RAID-3.
RAID-3: This type uses striping and dedicates one drive to storing parity information. The embedded error checking (ECC) information is used to detect errors. Data recovery is accomplished by calculating the exclusive OR (XOR) of the information recorded on the other drives. Since an I/O operation addresses all drives at the same time, RAID-3 cannot overlap I/O. For this reason, RAID-3 is best for single-user systems with long record applications.
RAID-4: This type uses large stripes, which means you can read records from any single drive. This allows you to take advantage of overlapped I/O for read operations. Since all write operations have to update the parity drive, no I/O overlapping is possible. RAID-4 offers no advantage over RAID-5.
RAID-5: This type includes a rotating parity array, thus addressing the write limitation in RAID-4. Thus, all read and write operations can be overlapped. RAID-5 stores parity information but not redundant data (but parity information can be used to reconstruct data). RAID-5 requires at least three and usually five disks for the array. It's best for multi-user systems in which performance is not critical or which do few write operations.
RAID-6: This type is similar to RAID-5 but includes a second parity scheme that is distributed across different drives and thus offers extremely high fault- and drive-failure tolerance.
RAID-7: This type includes a real-time embedded operating system as a controller, caching via a high-speed bus, and other characteristics of a stand-alone computer.
RAID-10: Combining RAID-0 and RAID-1 is often referred to as RAID-10, which offers higher performance than RAID-1 but at much higher cost. There are two subtypes: In RAID-0+1, data is organized as stripes across multiple disks, and then the striped disk sets are mirrored. In RAID-1+0, the data is mirrored and the mirrors are striped.
RAID-50 (or RAID-5+0): This type consists of a series of RAID-5 groups and striped in RAID-0 fashion to improve RAID-5 performance without reducing data protection.
RAID-53 (or RAID-5+3): This type uses striping (in RAID-0 style) for RAID-3's virtual disk blocks. This offers higher performance than RAID-3 but at much higher cost.
RAID-S (also known as Parity RAID): This is an alternate, proprietary method for striped parity RAID from EMC Symmetrix that is no longer in use on current equipment. It appears to be similar to RAID-5 with some performance enhancements as well as the enhancements that come from having a high-speed disk cache on the disk array.
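The XOR-based recovery used by the parity RAID types can be sketched as follows. This is a minimal illustration with equal-length in-memory blocks; the helper name `xor_blocks` is hypothetical, and real arrays operate on disk sectors rather than Python byte strings.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, as if stored on three drives.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]

# The parity block is the XOR of all data blocks.
parity = xor_blocks(data)

# If the second drive is lost, XOR of the survivors and the parity restores it.
rebuilt = xor_blocks([data[0], data[2], parity])
```

Because XOR is its own inverse, any single missing block can be recomputed from the remaining blocks and the parity, which is why one parity drive (or one rotating parity unit per stripe) tolerates one drive failure.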
Similar to RAID, RAIN (also called channel bonding, redundant array of independent nodes, reliable array of independent nodes, or random array of independent nodes) is a cluster of nodes connected in a network topology with multiple interfaces and redundant storage. RAIN is used to increase fault tolerance. It is an implementation of RAID across nodes instead of across disk arrays. RAIN can provide fully automated data recovery in a local area network (LAN) or wide area network (WAN) even if multiple nodes fail. A browser-based, centralized, secure management interface facilitates monitoring and configuration from a single location. There is no limit to the number of nodes that can exist in a RAIN cluster. New nodes can be added, and maintenance conducted, without incurring network downtime. RAIN originated in a research project for computing in outer space at the California Institute of Technology (Caltech), the Jet Propulsion Laboratory (JPL), and the Defense Advanced Research Projects Agency (DARPA) in the United States. The researchers were looking at distributed computing models for data storage that could be built using off-the-shelf components.
The idea for RAIN came from RAID (redundant array of independent disks) technology. RAID partitions data among a set of hard drives in a single system. RAIN partitions storage space across multiple nodes in a network. Partitioning of storage is called disk striping. Several patents have been granted for various proprietary versions of RAIN.
In databases and processing systems, especially stateful processing systems which store or accumulate state as they continue to process or transact, redundancy presents additional complications of ensuring that the redundant component is synchronized with the primary component so as to be ready to take over should the primary component fail.
A Hot Standby (HS) is a mechanism which supports non-disruptive failover of a database server system. It maintains system availability, i.e. the system's ability to provide desired service when required, by means of a second server system that is ready to take over when the main system is unavailable. In the hot standby replication scheme, servers usually have two different roles: a primary server and a secondary (backup, slave) server. The hot standby configuration provides a way for a secondary database to automatically maintain a mirror image of the primary database. The secondary database on the secondary server is usually of read-only type and it is logically identical to the primary database on the primary server. In case a failure occurs in the primary server, the secondary server can take over and assume the role of a new primary server.
There are several methods for achieving high availability in computer systems that contain databases. One known way to carry out continuous hot standby is to mirror the entire system, i.e. databases and the applications that use the database. All operations of the system are performed on both applications of the system. The applications write each transaction to their respective databases so both systems are completely synchronized at all times. To ensure that the applications and their databases are mutually in synchronization, typically a mechanism called application checkpointing is used. After each executed operation, the application ensures by some means that the other application has executed the same operation. In other words, the secondary database in association with the secondary application precisely mirrors the primary database and application. Application level mirroring is a good choice for real-time applications where everything, including the application processes, needs to be fault tolerant.
The primary process actually performs the work and periodically synchronizes a backup process with the primary process using checkpointing techniques. With prior known checkpointing techniques, the primary sends messages that contain information about changes in the state of the primary process to the backup process. Immediately after each checkpoint, the primary and backup processes are in the same state.
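A minimal sketch of this style of checkpointing is shown below, assuming the process state is a simple key-value mapping and that each checkpoint message carries only the entries changed since the previous checkpoint. The class and method names are hypothetical, and deletions and message loss are not handled.

```python
from typing import Any, Dict


class BackupProcess:
    def __init__(self) -> None:
        self.state: Dict[str, Any] = {}

    def apply_checkpoint(self, delta: Dict[str, Any]) -> None:
        # Merge the state changes reported by the primary.
        self.state.update(delta)


class PrimaryProcess:
    def __init__(self, backup: BackupProcess) -> None:
        self.state: Dict[str, Any] = {}
        self._dirty: Dict[str, Any] = {}  # changes since the last checkpoint
        self._backup = backup

    def write(self, key: str, value: Any) -> None:
        # The primary performs the work and remembers what changed.
        self.state[key] = value
        self._dirty[key] = value

    def checkpoint(self) -> None:
        # Send a message describing the state changes, then start a new interval.
        self._backup.apply_checkpoint(dict(self._dirty))
        self._dirty.clear()
```

Between checkpoints the backup lags behind the primary. Immediately after `checkpoint()` returns, the two states are equal.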
In other prior known checkpointing methods, distinctions between operations that change state (such as write operations) and operations that do not change the state (such as read operations) are not made, and all operations are checkpointed to the backup process. Such a system is shown in U.S. Pat. No. 4,590,554 (Glazer, Parallel Computer Systems) where all inputs to the primary are provided via messages and all messages sent to the primary are made available to the secondary or backup, essentially allowing the backup to “listen in on” the primary's messages. Another such system is described in U.S. Pat. No. 5,363,503 (Gleeson, Unisys Corporation) where checkpointing is provided as described in U.S. Pat. No. 4,590,554.
Other prior art, such as that shown in U.S. Pat. No. 4,228,496 (Katzman, Tandem Computers), describes that the primary receives a message, processes the message, and produces data. The produced data is stored in the primary's data space thereby changing the primary's data space. The change in the primary's data space causes a checkpointing operation of the data space to be made available to the backup. Thus, there is frequent copying of the primary's data space to the backup's data space, which uses a significant amount of time and memory for transferring the state of the primary to the backup. It may also result in the interruption of service upon failure of the primary. The overhead for such checkpointing methods can have considerable performance penalties.
Other prior art examples attempt to update only the portions of the state of the primary that have changed since the previous update, but use complex memory and data management schemes. In others, as shown in U.S. Pat. No. 5,621,885 (Del Vigna, Tandem Computers), the primary and backup, which run on top of a fault tolerant runtime support layer (that is, an interface between the application program and operating system), are resident in memory and accessible by both the primary and backup CPUs used in the described fault-tolerance model. The primary and backup processes perform the same calculations because they include the same code.
U.S. Pat. No. 6,954,877 discloses a system and method for checkpointing a primary computer process to a backup computer process such that if there is a failure of a primary process, the backup process can take over without interruption. In addition, upgrades to different versions of software or equipment can take place without interruption. A lightweight checkpointing method is disclosed that allows checkpointing of only external requests or messages that change the state of the service instance, thereby reducing the overhead and performance penalties.
In particular, a computing system provides a mechanism for checkpointing in a fault-tolerant service. The service is made fault tolerant by using a process pair; the primary process performs the work officially, while one or more backup processes provide a logical equivalent that can be used in the event of failure. The primary and backup are allowed to be logically equivalent at any given point in time, but may be internally different physically or in their implementation.
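The idea of checkpointing only state-changing requests to a logically equivalent backup can be sketched as follows. This is an illustrative assumption, not the method of the cited patent: the request kinds (`"get"`, `"put"`), class names, and the counter `checkpoints_sent` are hypothetical. The backup deliberately uses a different internal representation to show that logical equivalence does not require identical implementations.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class Request:
    kind: str          # "get" (read-only) or "put" (changes state)
    key: str
    value: Any = None


class BackupService:
    """Keeps an append-only history instead of a dictionary."""

    def __init__(self) -> None:
        self._history: List[Tuple[str, Any]] = []

    def handle(self, req: Request) -> Any:
        if req.kind == "put":
            self._history.append((req.key, req.value))
            return "ok"
        if req.kind == "get":
            # The most recent write for the key wins.
            for key, value in reversed(self._history):
                if key == req.key:
                    return value
            return None
        raise ValueError(f"unknown request kind: {req.kind}")


class PrimaryService:
    def __init__(self, backup: BackupService) -> None:
        self._data: Dict[str, Any] = {}
        self._backup = backup
        self.checkpoints_sent = 0

    def handle(self, req: Request) -> Any:
        if req.kind == "put":
            # Only requests that change state are forwarded to the backup.
            self._backup.handle(req)
            self.checkpoints_sent += 1
            self._data[req.key] = req.value
            return "ok"
        if req.kind == "get":
            return self._data.get(req.key)  # reads are never checkpointed
        raise ValueError(f"unknown request kind: {req.kind}")
```

In this sketch, read requests cost nothing in checkpoint traffic, yet the backup can answer any read identically to the primary after a takeover.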
Implementing checkpointing mechanisms requires a great deal of work from application programmers, as application checkpointing is difficult to implement correctly. Another method for processing hot standby replication operations is to create a transaction log of the operations of a transaction run in the primary server, transfer the log to the secondary server and serially run the transferred transaction log on the secondary server. This log is a record of all data items that have been inserted, deleted or updated as a result of processing and manipulation of the data within the transaction. The data needs to be written to both databases before it can be committed in either of the databases. This ensures that data is safely stored in the secondary server before the primary server sends acknowledgement of successful commit to the client application. An example of this kind of data mirroring system is described in the U.S. Pat. No. 6,324,654 where “A primary mirror daemon on a local computer system monitors the writelog device (redundant data storage or memory device) for data updates and feeds the data over a network in the same order in which it is stored to a receiving remote mirror daemon on a remote computer system, which in turns commits the data updates to a mirror device.” In a situation of a failure recovery these primary and secondary mirror daemons transfer the log to the secondary node where the log is run just as it was in the primary node. The replicated operations are run serially in the secondary node which slows down processing speed and hence reduces overall performance.
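The log-shipping approach can be sketched as below, under the assumption that a transaction log is a list of `(operation, key, value)` entries and that the primary acknowledges a commit only after the secondary has stored the log. All names are hypothetical, and the sketch is not the mechanism of the cited patent.

```python
from typing import Any, Dict, List, Tuple

LogEntry = Tuple[str, str, Any]  # (operation, key, value)


def apply_entry(db: Dict[str, Any], entry: LogEntry) -> None:
    op, key, value = entry
    if op in ("insert", "update"):
        db[key] = value
    elif op == "delete":
        db.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")


class SecondaryServer:
    def __init__(self) -> None:
        self.db: Dict[str, Any] = {}
        self._received: List[List[LogEntry]] = []

    def receive(self, log: List[LogEntry]) -> bool:
        # Store the log safely before the primary acknowledges the commit.
        self._received.append(list(log))
        return True

    def replay(self) -> None:
        # Logs are run serially, in the order they arrived.
        for log in self._received:
            for entry in log:
                apply_entry(self.db, entry)
        self._received.clear()


class PrimaryServer:
    def __init__(self, secondary: SecondaryServer) -> None:
        self.db: Dict[str, Any] = {}
        self._secondary = secondary

    def commit(self, log: List[LogEntry]) -> str:
        if not self._secondary.receive(log):
            raise RuntimeError("secondary did not store the log; commit refused")
        for entry in log:
            apply_entry(self.db, entry)
        return "committed"
```

The serial loop in `replay` is the performance bottleneck the text refers to: the secondary cannot overlap the replayed operations.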
Still another mechanism for achieving database fault tolerance is to have an application connect to two databases. Whenever the application executes an application function, it commits the related data changes to both servers. To ensure that the transaction is committed in both databases, the application typically needs to use the so-called two-phase commit protocol. If the transaction fails in either of the databases, it also needs to fail in the other database. The two-phase commit protocol has to be implemented in the application, which makes the application code more complex. Moreover, distributed transactions are a common cause of performance problems, as the transaction cannot be completed before both databases acknowledge the transaction commit. In this scenario, recovery from error situations can also be very difficult.
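A simplified sketch of a two-phase commit driven by the application is given below. The `Database` class, its `fail_on_prepare` flag, and the function `two_phase_commit` are hypothetical names for illustration; timeouts, crashes between the phases, and durable logging, which make real implementations difficult, are omitted.

```python
from typing import Any, Dict, List, Optional


class Database:
    def __init__(self, name: str, fail_on_prepare: bool = False) -> None:
        self.name = name
        self.fail_on_prepare = fail_on_prepare
        self.committed: Dict[str, Any] = {}
        self._staged: Optional[Dict[str, Any]] = None

    def prepare(self, changes: Dict[str, Any]) -> bool:
        # Phase 1: stage the changes and vote on whether commit is possible.
        if self.fail_on_prepare:
            return False
        self._staged = dict(changes)
        return True

    def commit(self) -> None:
        # Phase 2: make the staged changes permanent.
        if self._staged is not None:
            self.committed.update(self._staged)
            self._staged = None

    def rollback(self) -> None:
        self._staged = None


def two_phase_commit(databases: List[Database], changes: Dict[str, Any]) -> bool:
    prepared: List[Database] = []
    for db in databases:
        if db.prepare(changes):
            prepared.append(db)
        else:
            # Any "no" vote aborts the transaction everywhere.
            for other in prepared:
                other.rollback()
            return False
    for db in databases:
        db.commit()
    return True
```

The transaction completes only after every database has voted and then committed, which illustrates why such distributed transactions tend to be slower than a local commit.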
Still another way for processing hot standby replication operations is to copy the transaction rows to the secondary node after they have been committed on the primary node. This method is a mere copying procedure where transactions are run serially in the secondary node. This method is known as asynchronous data replication. This method is not always suitable for real-time database mirroring because all transactions of the primary database may not yet be executed in the secondary database when the fail-over from primary to secondary happens.
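The limitation of asynchronous replication can be shown with a small sketch in which committed rows are queued on the primary and shipped later. The class names and the `limit` parameter, which stands in for whatever delay separates commit and copy, are assumptions for illustration.

```python
from collections import deque
from typing import Any, Deque, Dict, Tuple


class SecondaryNode:
    def __init__(self) -> None:
        self.db: Dict[str, Any] = {}


class PrimaryNode:
    def __init__(self) -> None:
        self.db: Dict[str, Any] = {}
        self._outbox: Deque[Tuple[str, Any]] = deque()

    def commit(self, key: str, value: Any) -> None:
        # The commit completes locally; copying happens afterwards.
        self.db[key] = value
        self._outbox.append((key, value))

    def ship(self, secondary: SecondaryNode, limit: int) -> int:
        # Copy up to `limit` committed rows, serially and in commit order.
        shipped = 0
        while self._outbox and shipped < limit:
            key, value = self._outbox.popleft()
            secondary.db[key] = value
            shipped += 1
        return shipped


primary = PrimaryNode()
secondary = SecondaryNode()
for key, value in [("a", 1), ("b", 2), ("c", 3)]:
    primary.commit(key, value)
primary.ship(secondary, limit=2)

# If the primary fails now, these committed rows are missing on the secondary.
missing = set(primary.db) - set(secondary.db)
```

Any row still in the outbox at the moment of fail-over has been acknowledged to the client but does not exist on the secondary.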
Many database servers are able to execute concurrent transactions in parallel in an efficient manner. For example, the server may execute different transactions on different processors of a multi-processor computer. This way, the processing power of the database server can be scaled up by adding processors to the computer. Moreover, parallel execution of transactions avoids the blocking effect of serially executed long-running transactions, such as creating an index on a large table. To ensure integrity of the database, some concurrency control method such as locking or data versioning needs to be used to manage access to data that is shared between transactions. If two transactions try to obtain write access to the same data item simultaneously and versioning concurrency control is in use, the server returns a “concurrency conflict” error to one of the transactions and the application needs to re-attempt executing the transaction later. If locking concurrency control is in use, the server makes one of the transactions wait until the locked resources are released. However, in this scenario it is possible that a deadlock condition, where two transactions lock resources from each other, occurs and one of the transactions must be killed to clear the deadlock condition. The application that tried to execute the killed transaction must handle the error, e.g. by re-attempting execution of the transaction.
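The versioning case can be sketched as follows, assuming each data item carries a version number that a writer must match. `VersionedStore`, `ConcurrencyConflict`, and `run_with_retry` are hypothetical names, and the sketch is single-threaded for clarity; it models the conflict check, not real parallel execution.

```python
from typing import Any, Callable, Dict, Tuple


class ConcurrencyConflict(Exception):
    """Raised when a data item changed after the transaction read it."""


class VersionedStore:
    def __init__(self) -> None:
        self._rows: Dict[str, Tuple[int, Any]] = {}  # key -> (version, value)

    def read(self, key: str) -> Tuple[int, Any]:
        return self._rows.get(key, (0, None))

    def write(self, key: str, expected_version: int, value: Any) -> None:
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise ConcurrencyConflict(key)
        self._rows[key] = (current_version + 1, value)


def run_with_retry(store: VersionedStore, key: str,
                   update: Callable[[Any], Any], attempts: int = 3) -> bool:
    """Application-side handling: re-read and re-attempt after a conflict."""
    for _ in range(attempts):
        version, value = store.read(key)
        try:
            store.write(key, version, update(value))
            return True
        except ConcurrencyConflict:
            continue
    return False
```

The retry loop lives in the application, which is the burden the text describes: the server only reports the conflict, and the client must decide how to re-attempt.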
These concurrency control methods known in the prior art are suitable for use in the primary server of the Hot Standby database configuration to manage concurrent online transactions of client applications, but they cannot be applied in the secondary server of the system. This is because concurrency conflict errors cannot be allowed in the secondary server, as there is no way to properly handle these error conditions. Because of the absence of a proper Hot Standby concurrency control method, in the prior art replicated hot standby operations are run substantially in a serial form in the secondary node. Because operations cannot be executed in parallel, it is difficult to improve the secondary server's performance without raising problems in data integrity and transaction consistency. Essentially, a mechanism is needed that allows transactions to run in parallel but that ensures that transactions are not started too early and that they are committed before dependent transactions are started.
U.S. Pat. No. 6,978,396 discloses a mechanism to run transaction operations originating from a primary server used to replicate data in parallel in a secondary server and relates to running concurrent or parallel operations in a secondary server for redundancy, recovery and propagated transactions. According to the disclosure, executing parallel operations in a secondary server improves performance and availability while maintaining transaction order and output congruent with the primary server where the transaction operations originated. A set of specific rules is determined. The specific rules are defined on the basis of a “first timestamp” and a “second timestamp” attached to each transaction in the primary server, and the rules form a “timestamp criteria”. When a transaction meets this timestamp criteria, it can be run in parallel in the secondary server with other transactions that meet the same criteria, in accordance with the instructions set in the specific rules, so as to keep the transaction order and output correct.
As can be seen, implementation of fault tolerance in complex processing systems requires complex logic to ensure that the redundant components are synchronized with the primary component so that the backup component is ready to take over should the primary component fail.
Accordingly, there is a need for a simplified mechanism for providing fault tolerance which reduces the complexities related to ensuring that the redundant component is ready to take over for a failed primary component.