Electronic marketplaces may implement fault tolerance systems to help ensure application and/or system uptime and reliability. Fault tolerance is generally regarded as the ability to mask, or recover from, erroneous conditions in a system once an error has been detected. Fault tolerance is typically desired for mission critical systems or applications. “Mission critical” typically refers to an indispensable operation that cannot tolerate intervention, compromise, or shutdown during the performance of its primary function, e.g., any computer process that cannot fail during normal business hours. Exemplary mission critical environments may include business-essential process control, finance, health, safety and security. These environments typically monitor, store, support, and communicate data that cannot be lost or corrupted without compromising their core function.
One exemplary environment where fault tolerance is desirable is in financial markets, and in particular, electronic financial exchanges, such as a futures exchange, such as the Chicago Mercantile Exchange Inc. (CME). Consistent reliable operation is important for ensuring market stability, reliability, and acceptance. Fault tolerance typically describes a computer system or component designed so that, in the event that a component fails, a backup component or procedure can take its place with substantially little or no loss of service. Fault tolerance may be provided with software, hardware, or some combination thereof. For example, in a software implementation, the operating system may provide an interface that allows a programmer to “checkpoint” critical data at pre-determined points within a transaction. In a hardware implementation, the programmer may not need to be aware of the fault tolerant capabilities of the machine. For example, at a hardware level, fault tolerance may be achieved by duplexing each hardware component, e.g., disks are mirrored, multiple processors are “lock-stepped” together, and their outputs are compared for correctness, etc. When an anomaly occurs, the faulty component is determined and taken out of service, but the machine continues to function as usual.
The level of fault tolerance that is required is typically defined by the needs of the system requirements, i.e., specifications that state acceptable behavior upon error. For example, system requirements may specify whether errors should be detected and corrected or merely detected, and how quickly such actions must be taken.
One method of providing fault tolerance to a system is to add redundancy to one or more of the critical components of the system. Redundancy describes computer or network system components, such as fans, hard disk drives, servers, operating systems, switches, and/or telecommunication links that are installed to back up primary resources in case primary resources fail. Redundancy schemes include:
A one-for-N (1:N) redundancy scheme includes one standby component for every N active components.
A one-for-one (1:1) redundancy scheme includes a standby component for each active component.
A one-plus-one (1+1) redundancy scheme is similar to the one-for-one scheme except that in the case of one-plus-one, traffic is transmitted simultaneously on both active and standby components, where the traffic on the standby is generally ignored. An example of one-plus-one redundancy is the 1+1 SONET/SDH APS scheme that avoids loss of data traffic caused by link failure.
When providing redundant operation for processing components, voting logic may be used to compare the results of the redundant logic and choose which result is correct. For example, in triple modular redundancy, three redundant components may be provided; if the result of one component fails to match the other two, which match each other, the ultimate result will be that of the two components that matched.
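The voting logic described above can be illustrated with a minimal sketch. The function name `majority_vote` and the sample values are hypothetical and chosen only for illustration; the sketch assumes each redundant component returns a directly comparable result.

```python
from collections import Counter


def majority_vote(results):
    """Return the result produced by a strict majority of redundant components.

    Raises ValueError when no single result has a strict majority.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority among redundant results")
    return value


# Three redundant components; the second produced a faulty result.
outputs = [42, 41, 42]
voted = majority_vote(outputs)
print(voted)  # prints 42
```

With three components, a single faulty result is outvoted by the two that agree. If all three disagree, the sketch raises an error rather than guessing.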
A well-known example of a redundant system is the redundant array of independent disks (“RAID”), which involves storing the same data in different places (thus, redundantly) on multiple hard disks. By placing data on multiple disks, I/O (input/output) operations can overlap in a balanced way, improving performance. Since multiple disks increase the mean time between failures (MTBF), storing data redundantly also increases fault tolerance. A RAID appears to the operating system to be a single logical hard disk. RAID employs the technique of disk striping, which involves partitioning each drive's storage space into units ranging from a sector (e.g., 512 bytes) up to several megabytes. The stripes of all the disks are interleaved and addressed in order. In a single-user system where large records (such as medical or other scientific images) are stored, the stripes are typically set up to be small (perhaps 512 bytes) so that a single record spans all disks and can be accessed quickly by reading all disks at the same time. In a multi-user system, better performance requires establishing a stripe wide enough to hold the typical or maximum size record. This allows overlapped disk I/O across drives.
There are several types of RAID systems:
RAID-0 has striping but no redundancy of data. It offers very good performance but no fault tolerance.
RAID-1 is also known as disk mirroring and consists of at least two drives that duplicate the storage of data. There is no striping. Read performance is improved since either disk can be read at the same time. Write performance is the same as for single disk storage. RAID-1 provides high performance and very good fault tolerance in a multi-user system.
RAID-2 uses striping across disks with some disks storing error checking and correcting (ECC) information. It has no advantage over RAID-3, discussed next.
RAID-3 uses striping and dedicates one drive to storing parity information. The embedded error checking (ECC) information is used to detect errors. Data recovery is accomplished by calculating the exclusive OR (XOR) of the information recorded on the other drives. Since an I/O operation addresses all drives at the same time, RAID-3 cannot overlap I/O. For this reason, RAID-3 may be best applied in single-user systems with long record applications.
RAID-4 uses large stripes and allows reading records from any single drive and the use of overlapped I/O for read operations. Since all write operations have to update the parity drive, no I/O overlapping is possible. RAID-4 offers no advantage over RAID-5, discussed next.
RAID-5 includes a rotating parity array, thus addressing the write limitation in RAID-4. Thus, all read and write operations can be overlapped. RAID-5 stores parity information but not redundant data (but parity information can be used to reconstruct data). RAID-5 requires at least three and usually five disks for the array, and is most useful in multi-user systems in which performance is not critical or which perform few write operations.
RAID-6 is similar to RAID-5 but includes a second parity scheme that is distributed across different drives and thus offers extremely high fault and drive failure tolerance.
RAID-7 includes a real-time embedded operating system as a controller, caching via a high-speed bus, and other characteristics of a stand-alone computer.
RAID-10 is the result of combining RAID-0 and RAID-1 and offers higher performance than RAID-1 but at a much higher cost. There are two subtypes, RAID-0+1 and RAID-1+0. In RAID-0+1, data is organized as stripes across multiple disks, and then the striped disk sets are mirrored. In RAID-1+0, the data is mirrored and the mirrors are striped.
RAID-50 (or RAID-5+0) consists of a series of RAID-5 groups striped in RAID-0 fashion to improve RAID-5 performance without reducing data protection.
RAID-53 (or RAID-5+3) uses striping (in RAID-0 style) for RAID-3's virtual disk blocks. This offers higher performance than RAID-3 but at much higher cost.
RAID-S (also known as Parity RAID) is an alternate, proprietary method for striped parity RAID from EMC Symmetrix and includes a high-speed disk cache on the disk array.
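The striping and XOR-based recovery described for the parity RAID levels above can be sketched as follows. This is a simplified illustration and not a description of any particular RAID controller: the helper names `stripe` and `xor_blocks`, the two-byte stripe unit, and the three data disks are assumptions made for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe(data, n_disks, unit):
    """Deal fixed-size units of data round-robin across n_disks."""
    disks = [bytearray() for _ in range(n_disks)]
    for i in range(0, len(data), unit):
        disks[(i // unit) % n_disks] += data[i:i + unit]
    return [bytes(d) for d in disks]


data = b"ABCDEFGHIJKL"
data_disks = stripe(data, n_disks=3, unit=2)  # [b"ABGH", b"CDIJ", b"EFKL"]
parity = xor_blocks(data_disks)               # contents of the parity drive

# Simulate losing the second data disk, then rebuild it by XOR-ing the
# surviving data disks with the parity information.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data_disks[1])  # prints True
```

Because XOR is its own inverse, any single missing block equals the XOR of all remaining blocks, including parity. This is why a single dedicated or rotating parity block suffices to recover from one drive failure.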
Similar to RAID, RAIN (also called channel bonding, redundant array of independent nodes, reliable array of independent nodes, or random array of independent nodes) is a cluster of nodes connected in a network topology with multiple interfaces and redundant storage. RAIN is used to increase fault tolerance. It is an implementation of RAID across nodes instead of across disk arrays. RAIN can provide fully automated data recovery in a local area network (LAN) or wide area network (WAN) even if multiple nodes fail. A browser-based, centralized, secure management interface facilitates monitoring and configuration from a single location. There is no limit to the number of nodes that can exist in a RAIN cluster. New nodes can be added, and maintenance conducted, without incurring network downtime. RAIN originated in a research project for computing in outer space at the California Institute of Technology (Caltech), the Jet Propulsion Laboratory (JPL), and the Defense Advanced Research Projects Agency (DARPA) in the United States. The researchers were investigating distributed computing models for data storage that could be built using off-the-shelf components.
The idea for RAIN may be rooted in RAID technology. RAID partitions data among a set of hard drives in a single system. RAIN partitions storage space across multiple nodes in a network. Partitioning of storage is called disk striping.
In databases and processing systems, especially stateful processing systems which store or accumulate state as they continue to process or transact, redundancy presents additional complications of ensuring that the redundant component is synchronized with the primary component so as to be ready to take over should the primary component fail.
A hot standby is a mechanism which supports non-disruptive failover of a database server system, thus maintaining system availability. It provides a desired service via a second server system that is ready to take over if the main system becomes unavailable. The hot standby replication scheme includes a primary server and a secondary backup server. The hot standby configuration provides a way for a secondary database to automatically maintain a mirror image of the primary database. The secondary database on the secondary server is usually of read-only type and it is logically identical to the primary database on the primary server. In case a failure occurs in the primary server, the secondary server can take over and assume the role of a new primary server.
There are several methods for achieving high availability in computer systems that contain databases. One known way to carry out continuous hot standby is to mirror the entire system, i.e., all applications and their associated databases. All operations are performed on both applications of the system, and the applications write each transaction to their respective databases. To ensure that the applications and their databases are synchronized, a mechanism called application checkpointing is typically used. After an operation is executed, each application ensures that the other application has executed the same operation. In other words, the secondary database in association with the secondary application precisely mirrors the primary database and application. Application level mirroring is a good choice for real-time applications where everything, including the application processes, should be fault tolerant.
In one example of checkpointing, a primary process may perform operations and periodically synchronize with a backup process using checkpointing techniques. With certain checkpointing techniques, the primary sends messages that contain information about changes in the state of the primary process to the backup process. Immediately after each checkpoint, the primary and backup processes are in the same state.
In other checkpointing methods, distinctions between operations that change state (such as write operations) and operations that do not change the state (such as read operations) are not made, and all operations are checkpointed to the backup process.
In certain checkpointing systems and methods, a primary receives a message, processes the message, and produces data. The produced data is stored in the primary's data space, thereby changing the primary's data space. The change in the primary's data space causes a checkpointing operation of the data space to be made available to the backup. Thus, there is frequent copying of the primary's data space to the backup's data space, which may use a significant amount of time and memory for transferring the state of the primary to the backup. It may also result in the interruption of service upon failure of the primary. The overhead for such checkpointing methods can have considerable performance penalties.
Other systems and methods attempt to update only portions of the state of the primary that have changed since the previous update, but use complex memory and data management schemes. In certain systems, the primary and backup, which run on top of a fault tolerant runtime support layer (that is, an interface between the application program and operating system), are resident in memory and accessible by both the primary and backup central processing units (CPUs) used in the fault tolerant model. The primary and backup processes include the same code and perform the same calculations.
Yet other checkpointing systems and methods are configured such that if there is a failure of a primary process, the backup process can take over without interruption. In addition, upgrades to different versions of software or equipment can take place without interruption. Some methods are lightweight in that they allow checkpointing of only external requests or messages that change the state of the service instance, thereby reducing the overhead and performance penalties.
For example, a computing system may provide a mechanism for checkpointing in a fault tolerant service. The service is made fault tolerant by using a process pair. The primary process performs operations officially, while one or more backup processes provide a logical equivalent that can be used in the event of failure. The primary and backup are allowed to be logically equivalent at any given point in time, but may be internally different physically or in their implementation.
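A process pair with lightweight checkpointing, in which only state-changing messages are forwarded to the backup, might look like the following sketch. The class names, the message format `(kind, account, amount)`, and the `deposit`/`read` message kinds are hypothetical. Real systems would send checkpoints over a network and handle acknowledgements and failures, which this sketch omits.

```python
class ServiceInstance:
    """Holds service state; the primary and the backup each own one instance."""

    def __init__(self):
        self.balances = {}

    def apply(self, message):
        kind, account, amount = message
        if kind == "deposit":  # the only state-changing message in this sketch
            self.balances[account] = self.balances.get(account, 0) + amount
        return self.balances.get(account, 0)


def changes_state(message):
    """Assumed rule: only deposits change state; reads do not."""
    return message[0] == "deposit"


class Primary(ServiceInstance):
    def __init__(self, backup):
        super().__init__()
        self.backup = backup
        self.checkpoints_sent = 0

    def handle(self, message):
        # Checkpoint only messages that change state, then process locally.
        if changes_state(message):
            self.backup.apply(message)
            self.checkpoints_sent += 1
        return self.apply(message)


backup = ServiceInstance()
primary = Primary(backup)
for msg in [("deposit", "acct-1", 100), ("read", "acct-1", 0),
            ("deposit", "acct-1", 50), ("read", "acct-1", 0)]:
    primary.handle(msg)

print(primary.checkpoints_sent)             # prints 2
print(backup.balances == primary.balances)  # prints True
```

Four messages are handled but only two checkpoints are sent, and the backup still ends in a logically equivalent state. This is the overhead reduction that the lightweight approach aims for.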
Application checkpointing is a difficult task to implement and may thus require a significant amount of work from the application programmers. Alternatively, another method for processing hot standby replication operations is to create a transaction log of the operations of a transaction run in the primary server. This log is a record of all data items that have been inserted, deleted or updated as a result of processing and manipulation of the data within the transaction. This log is then transferred to the secondary server and is run serially on the secondary server.
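The log-based approach can be illustrated with a short sketch in which the databases are modeled as in-memory dictionaries. The log record format `(operation, key, value)` and the function names `apply_log` and `execute_on_primary` are assumptions for illustration only.

```python
def apply_log(database, log):
    """Replay logged operations serially, in the order they were recorded."""
    for op, key, value in log:
        if op in ("insert", "update"):
            database[key] = value
        elif op == "delete":
            database.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return database


primary_db = {}
transaction_log = []


def execute_on_primary(op, key, value=None):
    """Apply one operation to the primary and record it in the log."""
    record = (op, key, value)
    apply_log(primary_db, [record])
    transaction_log.append(record)


execute_on_primary("insert", "order-1", {"qty": 10})
execute_on_primary("update", "order-1", {"qty": 7})
execute_on_primary("insert", "order-2", {"qty": 3})
execute_on_primary("delete", "order-2")

# Transfer the log to the secondary and run it serially there.
secondary_db = apply_log({}, transaction_log)
print(secondary_db == primary_db)  # prints True
```

Replaying the records in their original order is what keeps the secondary logically identical to the primary. It is also why replay on the secondary is inherently serial in this scheme.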
In some cases, data is written to both primary and secondary databases before it can be committed in either of the databases. This ensures that data is safely stored in the secondary server before the primary server sends acknowledgement of a successful commit to the client application. In one such system, a primary mirror daemon on a local computer system monitors the writelog device (redundant data storage or memory device) for data updates and feeds the data over a network in the same order in which it is stored to a receiving remote mirror daemon on a remote computer system, which in turn commits the data updates to a mirror device. In a situation of a failure recovery, these primary and secondary mirror daemons transfer the log to the secondary node where the log is run just as it was in the primary node. The replicated operations are run serially in the secondary node, which may slow down processing speed and reduce overall performance.
Still another mechanism for achieving database fault tolerance is to connect an application to two databases. Whenever the application executes an application function, it commits the related data changes to both servers. To ensure that the transaction is committed in both databases, the application typically uses a so-called two-phase commit protocol to ensure the success of the transaction in both databases. If the transaction fails in either of the databases, it should also fail in the other database. A two-phase commit protocol is implemented in the application, making the application code more complex. Moreover, distributed transactions are a common reason for performance problems, because a transaction cannot be completed until both databases acknowledge a transaction commit. In this scenario, recovery from error situations can also be very difficult.
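A two-phase commit coordinated by the application can be sketched as below. The `Database` class, its `prepare`/`commit`/`rollback` methods, and the `can_commit` flag used to simulate a failing participant are hypothetical stand-ins and do not correspond to any particular database API. The sketch also omits timeouts and coordinator failure, which are major sources of the recovery difficulty noted above.

```python
class Database:
    """Toy participant that stages changes before committing them."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates a participant that must refuse
        self.data = {}
        self.pending = None

    def prepare(self, changes):
        """Phase 1: stage the changes and vote yes (True) or no (False)."""
        if not self.can_commit:
            return False
        self.pending = dict(changes)
        return True

    def commit(self):
        """Phase 2: make the staged changes durable."""
        self.data.update(self.pending)
        self.pending = None

    def rollback(self):
        """Phase 2 (abort path): discard any staged changes."""
        self.pending = None


def two_phase_commit(databases, changes):
    """Commit changes on all databases, or on none of them."""
    votes = [db.prepare(changes) for db in databases]
    if all(votes):
        for db in databases:
            db.commit()
        return True
    for db in databases:
        db.rollback()
    return False


a, b = Database("primary"), Database("secondary")
print(two_phase_commit([a, b], {"order-1": "filled"}))  # prints True

c, d = Database("primary"), Database("secondary", can_commit=False)
print(two_phase_commit([c, d], {"order-2": "filled"}))  # prints False
print(c.data, d.data)                                   # prints {} {}
```

The second run shows the all-or-nothing property: because one participant votes no, neither database applies the change. It also shows why the transaction cannot complete until every participant has responded.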
Still another way for processing hot standby replication operations is to copy the transaction rows to the secondary node after they have been committed on the primary node. This method is a mere copying procedure where transactions are run serially in the secondary node. This method is known as asynchronous data replication. This method is not always suitable for real-time database mirroring because, at the time of failover from the primary to the secondary, some transactions of the primary database may not yet have been executed in the secondary database.
Many database servers are able to execute concurrent transactions in parallel in an efficient manner. For example, a server may execute different transactions on different processors of a multi-processor computer. In this way, the processing power of the database server can be scaled up by adding processors to the computer. Moreover, parallel execution of transactions avoids a blocking effect of serially executed long-running transactions, such as creating an index to a large table. To ensure integrity of the database, some concurrency control method, such as locking or data versioning, may be used to manage access to data that is shared between transactions. If two transactions try to obtain write access to the same data item simultaneously while versioning concurrency control is in use, the server may return a “concurrency conflict” error to one of the transactions, and the application may then attempt to execute the transaction at a later time. If locking concurrency control is in use, the server makes one of the transactions wait until the locked resources are released. However, in this scenario it is possible that a deadlock condition occurs, where two transactions lock resources from each other, and one of the transactions must be killed to clear the deadlock condition. The application that attempted to execute the killed transaction must handle the error, e.g., by re-attempting execution of the transaction.
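The versioning form of concurrency control, together with the application-side retry, can be illustrated with the following single-threaded sketch. The names `VersionedStore`, `ConcurrencyConflict`, and `run_with_retry` are hypothetical, and the two "transactions" are simulated by interleaving reads and writes by hand rather than by real parallel execution.

```python
class ConcurrencyConflict(Exception):
    """Raised when a write is based on a stale version of the data item."""


class VersionedStore:
    def __init__(self):
        self._items = {}  # key -> (version, value)

    def read(self, key):
        return self._items.get(key, (0, None))

    def write(self, key, expected_version, value):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise ConcurrencyConflict(key)
        self._items[key] = (current_version + 1, value)


def run_with_retry(store, key, update, max_attempts=3):
    """Re-read and re-attempt the update when a conflict is reported."""
    for attempt in range(1, max_attempts + 1):
        version, value = store.read(key)
        try:
            store.write(key, version, update(value))
            return attempt
        except ConcurrencyConflict:
            continue
    raise ConcurrencyConflict(key)


store = VersionedStore()
store.write("qty", 0, 10)

# Two transactions read the same version of the item...
version_a, value_a = store.read("qty")
version_b, value_b = store.read("qty")

# ...the first write succeeds, the second is rejected as a conflict.
store.write("qty", version_a, value_a + 5)
try:
    store.write("qty", version_b, value_b - 3)
    conflicted = False
except ConcurrencyConflict:
    conflicted = True

# The application re-attempts the rejected transaction at a later time.
attempts = run_with_retry(store, "qty", lambda q: q - 3)
print(conflicted, store.read("qty"))  # prints True (3, 12)
```

The rejected transaction succeeds on retry because it re-reads the current version before writing. This retry step is the error handling that the application must provide under versioning concurrency control.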
These concurrency control methods may be suitable for use in the primary server of the hot standby database configuration to manage concurrent online transactions of client applications, but may not be applied in the secondary server of the system. Concurrency conflict errors cannot be properly handled, and thus cannot be allowed, in a secondary server. Without a proper hot standby concurrency control method, replicated hot standby operations are run substantially in a serial form in the secondary node. Because operations cannot be executed in parallel, it is difficult to improve a secondary server's performance without raising problems in data integrity and transaction consistency.
As noted above, fault tolerance systems may be implemented in a financial instrument trading system. A financial instrument trading system, such as a futures exchange, such as the Chicago Mercantile Exchange Inc. (CME), provides a contract market where financial instruments, e.g., futures and options on futures, are traded using electronic systems. “Futures” is a term used to designate all contracts for the purchase or sale of financial instruments or physical commodities for future delivery or cash settlement on a commodity futures exchange. A futures contract is a legally binding agreement to buy or sell a commodity at a specified price at a predetermined future time. An option contract is the right, but not the obligation, to sell or buy the underlying instrument (in this case, a futures contract) at a specified price within a specified time. The commodity to be delivered in fulfillment of the contract, or alternatively the commodity for which the cash market price shall determine the final settlement price of the futures contract, is known as the contract's underlying reference or “underlier.” The terms and conditions of each futures contract are standardized as to the specification of the contract's underlying reference commodity, the quality of such commodity, quantity, delivery date, and means of contract settlement. Cash settlement is a method of settling a futures contract whereby the parties effect final settlement when the contract expires by paying/receiving the loss/gain related to the contract in cash, rather than by effecting physical sale and purchase of the underlying reference commodity at a price determined by the futures contract price.
An exchange may provide for a centralized “clearing house” through which trades made must be confirmed, matched, and settled each day until offset or delivered. The clearing house may be an adjunct to an exchange, and may be an operating division of an exchange, which is responsible for settling trading accounts, clearing trades, collecting and maintaining performance bond funds, regulating delivery, and reporting trading data. One of the roles of the clearing house is to mitigate credit risk. Clearing is the procedure through which the clearing house becomes buyer to each seller of a futures contract, and seller to each buyer, also referred to as a novation, and assumes responsibility for protecting buyers and sellers from financial loss due to breach of contract, by assuring performance on each contract. A clearing member is a firm qualified to clear trades through the clearing house.
Current financial instrument trading systems allow traders to submit orders and receive confirmations, market data, and other information electronically via electronic messages exchanged using a network. Electronic trading systems ideally attempt to offer a more efficient, fair and balanced market where market prices reflect a true consensus of the value of traded products among the market participants, where the intentional or unintentional influence of any one market participant is minimized if not eliminated, and where unfair or inequitable advantages with respect to information access are minimized if not eliminated.
Electronic marketplaces attempt to achieve these goals by using electronic messages to communicate actions and related data of the electronic marketplace between market participants, clearing firms, clearing houses, and other parties. The messages can be received using an electronic trading system, wherein an action associated with the messages may be executed. For example, the message may contain information relating to an order to buy or sell a product in a particular electronic marketplace, and the action associated with the message may indicate that the order is to be placed in the electronic marketplace such that other orders which were previously placed may potentially be matched to the order of the received message. Thus the electronic marketplace may conduct market activities through electronic systems.
As can be seen, the use of dedicated backup components for fault tolerance can require complex logic to ensure that backup components are synchronized with primary components, so that a backup component is ready to take over should a primary component fail. Moreover, the use of dedicated backup components results in additional costs due to the extra hardware required. Additionally, a large class of failures could result in both the primary and the backup machines failing. However, if a backup component is removed to free up or reduce the number of resources utilized in a system, the overall system loses some amount of processing power, and so the hardware that is implemented must be more efficiently used. If fault tolerance is still desired, then the removal of dedicated backup components also requires efficiently and accurately determining which machines should provide fault tolerance at what times, and for which applications.
Some fault tolerance systems redistribute jobs to other components or other resources based on the number of jobs being handled by the other resources. For example, the Apache Samza framework includes Apache Hadoop YARN, which can be implemented on a cluster of machines. Samza may include tasks and containers that handle the tasks. In Samza/Hadoop YARN, tasks from a failed machine in a cluster are migrated to another machine. However, Samza does not check or consider a current load of its containers before selecting a container for failover services. Samza monitors container resource usage (e.g., CPU, memory, disk, network) but does not account for fluctuating loads of containers. In financial applications, it may be important to assign failed or orphaned jobs to machines that can handle the additional workload. Selecting the right machine can impact overall system latency and resource allocation.
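To make the idea of load-aware assignment concrete, the following sketch places each orphaned job on the machine with the most spare capacity at that moment. It is purely illustrative: it is not how Samza or YARN behave, and the function name `assign_orphaned_jobs`, the capacity and load units, and the greedy largest-job-first policy are all assumptions.

```python
def assign_orphaned_jobs(machines, orphaned_jobs):
    """Greedily place each orphaned job on the machine with the most spare capacity.

    machines:      dict of machine name -> {"capacity": number, "load": number}
    orphaned_jobs: dict of job name -> load the job is expected to add
    """
    spare = {name: m["capacity"] - m["load"] for name, m in machines.items()}
    placement = {}
    # Place the heaviest jobs first so they get the largest remaining headroom.
    for job, load in sorted(orphaned_jobs.items(), key=lambda kv: -kv[1]):
        target = max(spare, key=spare.get)
        if spare[target] < load:
            raise RuntimeError(f"no machine can absorb job {job!r}")
        placement[job] = target
        spare[target] -= load
    return placement


machines = {
    "A": {"capacity": 100, "load": 70},
    "B": {"capacity": 100, "load": 40},
    "C": {"capacity": 100, "load": 55},
}
orphaned = {"j1": 30, "j2": 20, "j3": 10}
placement = assign_orphaned_jobs(machines, orphaned)
print(placement)  # prints {'j1': 'B', 'j2': 'C', 'j3': 'A'}
```

Because spare capacity is recomputed after every placement, the orphaned jobs are spread across the surviving machines rather than sent to a single target. A production system would also need to refresh the load figures as they fluctuate.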
Even fault tolerance systems designed for long running tasks, like Apache YARN, require that all jobs of a task run in containers of a fixed size. YARN containers, which hold the Samza jobs, have fixed resources (e.g., memory and CPU cores). Apache YARN fails to dynamically allocate jobs based on a fluctuating job load. Thus, resources are not shared efficiently. Instead, the load may be assumed to be predetermined and/or fixed.