Society increasingly relies on computing devices and networks to interact and conduct business. To achieve the high level of availability demanded of critical systems, unplanned downtime caused by software and hardware defects should be reduced.
Several modern applications demand distributed, cooperative systems where computing devices are capable of communicating rapidly with each other, commonly referred to as clustered computing, grid computing, or high-performance computing. Configurations typically consist of a number of loosely-coupled or tightly-coupled computing devices that exchange data with particularly high throughput and/or low latency in order to cooperatively perform a task that is broken down into smaller, often parallel units of work distributed among the members of the cluster. These applications generally exhibit the following characteristics: (1) complex and high speed, low latency data processing, (2) reliable, high volume, low latency data exchange, and (3) high level of availability, i.e. the ability to provide end-user service on a substantially uninterrupted basis. When implemented, however, existing applications tend to trade off among these performance requirements because, owing to their contradictory effects on system behavior, typical designs have difficulty satisfying all three characteristics simultaneously, as outlined in greater detail below.
The financial services industry is one example of an industry that demands highly available systems. Other industries include inventory management (order processing) systems, online gaming, air traffic control, and online reservation and auction systems. Indeed, a large number of data processing activities are supported by computer systems utilizing reliable high speed server cluster communication.
Complex and high speed, low latency data processing refers to the ability to perform, in a timely fashion, a large number of computations, database retrievals/updates, etc. and the ability to reliably produce the results in as short a time interval as possible. This can be implemented through parallel processing, where multiple units of work are executed simultaneously on the same physical machine or on a distributed cluster utilizing high speed communication links. In some systems, the outcome of each interaction depends on the outcomes of previously completed interactions. Therefore, the order in which the messages are sent and received should be maintained. The parallel aspects of linking computing devices using multiple links are, by and large, non-deterministic. For example, non-determinism can result from race conditions, task scheduling by the operating system, or variable network delays. In particular, the time taken to transmit a message across each link is unlikely to be identical, owing at least in part to differences in latency.
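The ordering concern described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the sender stamps each message with a consecutive sequence number and that the receiver holds back messages that arrive early over a faster link. The class name `Resequencer` and its methods are hypothetical and are not drawn from any system discussed here.

```python
class Resequencer:
    """Releases payloads in sequence order regardless of arrival order.

    Hypothetical illustration: assumes the sender numbers messages
    consecutively, starting at first_seq.
    """

    def __init__(self, first_seq=0):
        self.next_seq = first_seq   # next sequence number to deliver
        self.pending = {}           # early arrivals, keyed by sequence number

    def receive(self, seq, payload):
        """Accept one message and return the payloads that are now deliverable."""
        # Ignore duplicates of delivered or already buffered messages.
        if seq < self.next_seq or seq in self.pending:
            return []
        self.pending[seq] = payload
        ready = []
        # Drain every buffered message that continues the unbroken sequence.
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


r = Resequencer()
early = r.receive(1, "second")   # arrives first over the faster link, held back
in_order = r.receive(0, "first") # fills the hole, both messages are released
```

In this sketch a message that never arrives blocks delivery of everything after it, which is one reason ordered delivery and gap recovery are usually considered together.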
Reliable, high-volume, low-latency data exchange refers to the ability to move data between computing devices cooperating in a networked cluster, observing requirements for guaranteed, in-order delivery of messages. TCP/IP is commonly used for this purpose. It is a widely used networking protocol, provides the guarantee of ordered delivery, and with recent technology advances, has become available at higher network clocking frequencies, resulting in lower latency transmission capabilities. However, TCP/IP is a complex, sophisticated protocol designed for Internet and wide-area public network applications, and as such, has a considerable number of features for network routing, congestion avoidance, bandwidth optimization, and so forth. These features require additional processing overhead, making TCP/IP a less suitable choice for applications requiring a dedicated low-latency data link.
Highly available systems attempt to keep a given computer system available for as close as possible to 100% of the time it is expected to be in service. Such availability can be implemented through redundant software and/or hardware, which takes over the functionality when a component failure is detected. For the failover to complete in a short time while remaining transparent to the service running on the system, the failover system needs to replicate the data or state on each computing device using a reliable communication link that can guarantee serialized delivery of replicated data messages. As will be appreciated by those of skill in the art, state replication can be particularly challenging in non-deterministic systems. Additionally, to satisfy the desire for high speed and high reliability, state replication must be performed as rapidly as possible, and its results must be guaranteed. TCP/IP is commonly used for this purpose as well, but its performance is sub-optimal due to the higher latency resulting from substantial processing related to its sophisticated networking features. Examples of low-latency data link technologies are Hypertransport, QPI, NUMAlink, Infiniband, RapidIO and PCI Express (PCIe). Different low-latency data link technologies vary in the design trade-offs between flexibility and extensibility versus latency and communication overhead. Some low-latency data link technologies, such as Hypertransport and QPI, do not support computing device interconnect, and are designed only for processor interconnect on a common circuit board. Of the data link technologies that are designed for device interconnect, some have sacrificed lowest possible latency to provide better scalability and networking features, and some require costly, proprietary hardware implementations.
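To make the notion of availability as a percentage of expected time concrete, the following short sketch converts an availability target into the downtime it permits over a period. The function name and the choice of a 365-day year are assumptions made for illustration only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def allowed_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime, in minutes, permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return period_minutes * (100 - availability_percent) / 100


# Downtime budget per year for two illustrative targets.
three_nines = allowed_downtime_minutes(99.9)
five_nines = allowed_downtime_minutes(99.999)
```

Under these assumptions a 99.9% target leaves roughly 526 minutes of downtime per year, while a 99.999% target leaves a little over five minutes, which is why failover time matters so much at the high end.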
A growing number of modern applications require high availability and low-latency device interconnect but do not require sophisticated networking capabilities among a large number of computing devices. And yet, for these applications to perform effectively, they do require the device interconnect to provide the lowest possible latency of message exchange, and to achieve this on a cost-effective basis. To achieve the lowest possible latency, a device interconnect needs to forego features not required by a clustered application, and to minimize the number of processing steps necessary to achieve data transfer.
Remote memory access is one approach that achieves efficient data transfer. Message data is transferred directly from a sending application's memory to a receiving application's memory, without copying data to and from the computer operating system, and without intervening layers of unneeded network routing protocol processing that increase latency. Two commonly used examples of remote memory access are Remote Direct Memory Access (RDMA) and Programmed Input/Output (PIO). Infiniband and PCIe are two device interconnect technologies that support remote memory access. Infiniband is a widely used cluster interconnect that includes a network layer and adds routing information to data packets allowing support of larger networked clusters. However, Infiniband is not implemented natively on the same silicon as commonly used processors, requiring the additional step of translation between Infiniband and PCIe at each computing device endpoint, resulting in additional latency and reduced throughput. Applications requiring only small clusters of computing devices do not require complex network routing capability and the additional overhead of the network routing layer adds unnecessary latency. PCIe is a high-speed serial computer expansion bus standard that was developed primarily as a printed circuit board-level interconnect to interface expansion cards with processors on computer motherboards. PCIe has become ubiquitous in many categories of computing devices and is now natively implemented on processor silicon, further reducing its latency. 
The use of PCIe over external cables has only recently been developed, and its use as a cluster interconnect is not common. However, the increasing performance of later versions of the PCIe standard, the availability of inexpensive PCIe networking devices, and the exceptionally low latency afforded by PCIe technology make it increasingly attractive as a cost-effective, low-latency device interconnect for small application clusters.
Although remote memory access can achieve extremely low latency data transfer, it comes with some disadvantages. Because message data is delivered directly into application memory, there is no notification to the receiving application that data has arrived. Another disadvantage is that it provides no protocol for the marshalling and un-marshalling of messages at the application level. Provision of conventional queuing mechanisms at the application level to satisfy this need contributes substantial overhead processing and network transfer operations for the exchange of queue management control information, defeating the goal of achieving the lowest possible latency of message transfer. Because low latency data link protocols forego many of the features inherent in higher level protocols like TCP/IP, they tend to lack the ability to recover from transmission gaps and intermittent or complete link failures. The desire for low latency device interconnects within high performance, mission-critical server clusters appears to be at odds with the need for high availability of services provided by such clusters. Low latency device interconnects are often used to link primary computing devices with back-up devices for purposes of data replication as part of a high-availability cluster, and yet the failure of the device interconnect can jeopardize the integrity of that cluster. What is needed is a system and method for achieving fail-over of cost-effective, low-latency, device interconnects that can guarantee serialized, gap-free message transfer between computing devices.
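The notification and gap problems described above can be illustrated with a small simulation. This is only a sketch under stated assumptions and is not the mechanism of any system or patent discussed here. A `bytearray` stands in for a memory region that a remote peer could write into, the single-slot layout (an 8-byte sequence number, a 4-byte length, then the payload) is invented for the example, and `remote_write` and `poll` are hypothetical names. Because nothing tells the receiver that data has arrived, the receiver polls the sequence word to discover new messages and to detect a skipped one.

```python
import struct

SLOT_SIZE = 64                     # size of the simulated receive slot, in bytes
HEADER = struct.Struct("<QI")      # sequence number (8 bytes), payload length (4 bytes)


def remote_write(region, seq, payload):
    """Stand-in for a remote memory write into the receiver's slot."""
    if len(payload) > SLOT_SIZE - HEADER.size:
        raise ValueError("payload does not fit in the slot")
    start = HEADER.size
    # Write the payload first and the header last, so that a reader which
    # sees the new sequence number also sees the complete payload. Real
    # hardware would additionally need ordering guarantees between the writes.
    region[start:start + len(payload)] = payload
    HEADER.pack_into(region, 0, seq, len(payload))


def poll(region, expected_seq):
    """Check the slot once; return (status, payload)."""
    seq, length = HEADER.unpack_from(region, 0)
    if seq == expected_seq:
        return "ok", bytes(region[HEADER.size:HEADER.size + length])
    if seq > expected_seq:
        return "gap", None         # at least one message was missed
    return "empty", None           # nothing new has arrived yet


region = bytearray(SLOT_SIZE)      # zero-filled, so sequence numbers start at 1
before = poll(region, 1)
remote_write(region, 1, b"hello")
after = poll(region, 1)
```

The sketch detects a gap but cannot recover from one, and a single slot offers no flow control. Adding those capabilities with conventional queuing is exactly the kind of overhead the passage above warns about.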
A number of patents attempt to address at least some of the foregoing problems. For example, U.S. Pat. No. 7,356,636 to Torudbakken et al. discloses a link failover facility in a PCI Express switch that allows the host to access all devices connected to the switch even if the link between one of the upstream ports and the host fails. The PCI Express switch focuses on performing a link failover to restore communication between a host and a device. As another example, U.S. Patent Publication No. 2008/0112311 to Hariharan et al. discloses a method for providing a failover of a communication link in a network. As yet another example, U.S. Patent Publication No. 2008/0239945 discloses a PCI switch assembly having an automatic link failover. However, the current state of the art does not meet the speed and reliability requirements of many systems requiring high availability. While the prior art provides for the restoration of communication over PCIe using an external switch, it does not provide a reliable, recoverable mechanism for the guaranteed, gap-free delivery of serialized messages. In particular, the current state of the art does not provide highly available, serialized, guaranteed delivery of messages over a low-latency device interconnect.