1. Field of the Invention
The present invention relates to data exchange between computers using networking protocols, and especially to data exchange between central processing units (CPUs) and functional subsystems of a computer system.
2. Description of the Related Art
A network adapter, also known as a network interface controller (NIC), is a piece of computer system hardware that allows computers to communicate over a computer network. In today's computer systems a network adapter is often implemented as an integrated circuit on a chip which is directly mounted on a printed circuit board of the computer system hardware, e.g. on a so-called motherboard. The network adapter can be implemented as part of the regular chipset of the computer system itself, or it can be implemented as a low-cost dedicated chip. For larger computer server systems a network adapter is often provided in the form of a network adapter card instead (a printed circuit card comprising chips), which is plugged into special slots of the computer system hardware. Such expansion cards are either optional or mandatory for a computer system. In both variants, the network adapter hardware is connected to bus systems of the computer system.
In general, there are four techniques used to transfer data over a computer network using network adapters. Polling is where a processor of the computer system, e.g., a central processing unit (CPU), examines the status of the network adapter under program control. Programmed input/output (I/O) is where a processor of the computer system alerts the network adapter by applying its address to the computer system's address bus. Interrupt driven I/O is where the network adapter alerts a processor of the computer system that it is ready to transfer data. Direct memory access (DMA) is where an intelligent network adapter assumes control of a computer system bus to access computer system memory directly. This removes load from the processors of the computer system but requires a separate processor in the network adapter. A disadvantage is that only trusted network adapter hardware can be used because the direct access to memory of the computer system (e.g., to the main memory) can compromise the security of the computer system.
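The polling technique described above can be sketched as follows. This is a minimal software simulation for illustration only; the status register, the TX_DONE bit, and all names are hypothetical and do not correspond to any particular adapter.

```python
# Sketch of the polling technique: the CPU examines the adapter's
# status under program control until a status bit is set.
# The register and bit names are hypothetical.

TX_DONE = 0x01  # hypothetical "transmit complete" status bit

class SimulatedAdapter:
    """Stands in for a memory-mapped network adapter."""
    def __init__(self):
        self.status = 0

def poll_tx_done(adapter, max_spins):
    """Spin until TX_DONE is set or the spin budget is exhausted."""
    for _ in range(max_spins):
        if adapter.status & TX_DONE:
            return True   # transfer finished
    return False          # timed out

nic = SimulatedAdapter()
assert poll_tx_done(nic, 100) is False   # nothing completed yet
nic.status |= TX_DONE                    # adapter signals completion
assert poll_tx_done(nic, 100) is True
```

The sketch also illustrates why polling loads the processor: the CPU burns cycles in the loop that interrupt-driven I/O or DMA would avoid.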
In case of an Ethernet adapter, a DMA method for sending data typically comprises the following steps. In a first step, data together with a data descriptor is prepared for the Ethernet adapter, and the Ethernet adapter is then triggered by a processor of the computer system. The Ethernet adapter fetches the data descriptor and subsequently the data, based on the information about the data provided in the data descriptor. The Ethernet adapter then sends the data over the computer network. When the sending is completed, the Ethernet adapter prepares an update completion descriptor and informs the processor by sending an interrupt. The problem with this approach is the long round-trip time between the preparation of the data to be sent and the notification of the completion of the sending of the data.
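The Ethernet send sequence above can be sketched as a software simulation. The descriptor fields, class names, and completion format are hypothetical; the point is only the order of the steps: prepare data and descriptor, trigger, adapter fetches descriptor then data, sends, then writes a completion and interrupts.

```python
# Simulation of the descriptor-based Ethernet DMA send flow
# described above. All names are hypothetical.

class DataDescriptor:
    def __init__(self, address, length):
        self.address = address  # where the data lives in system memory
        self.length = length

class SimulatedEthernetAdapter:
    def __init__(self, memory):
        self.memory = memory          # stands in for main memory
        self.wire = []                # frames "sent" over the network
        self.completions = []         # update completion descriptors
        self.interrupt_raised = False

    def trigger(self, descriptor):
        # 1. Adapter fetches the descriptor, then the data via DMA.
        data = self.memory[descriptor.address:descriptor.address + descriptor.length]
        # 2. Adapter sends the data over the computer network.
        self.wire.append(bytes(data))
        # 3. Adapter writes a completion descriptor and interrupts the CPU.
        self.completions.append(("tx-complete", descriptor.address))
        self.interrupt_raised = True

# CPU side: prepare the data and a descriptor, then trigger the adapter.
memory = bytearray(64)
memory[8:13] = b"hello"
nic = SimulatedEthernetAdapter(memory)
nic.trigger(DataDescriptor(address=8, length=5))

assert nic.wire == [b"hello"]
assert nic.interrupt_raised
```

Every step from `trigger` to `interrupt_raised` lies on the round trip the text criticizes: the CPU cannot consider the send complete until the interrupt arrives.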
In case of an InfiniBand network adapter, a DMA method for sending data typically comprises the following steps. In a first step, data is prepared for the InfiniBand adapter, and a processor of the computer system writes the data directly into the memory of the InfiniBand adapter. Then the InfiniBand adapter sends the data over the computer network. The processor of the computer system receives a notification from the InfiniBand adapter only in case of an error. While this approach has advantages compared to the Ethernet send method described above, the disadvantage is that some existing operating systems for a computer system (e.g., IBM z/OS for IBM System z) are not prepared to use this method because it does not fit into the usual send/receive pattern. But changes to an operating system are often not desirable for various reasons, e.g. in order to save implementation costs.
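The InfiniBand-style send above can be sketched the same way. The CPU copies the data directly into adapter-local memory (e.g. via memory-mapped I/O) and the adapter sends it; the CPU hears back only on error. The class, buffer layout, and error reporting are hypothetical, for illustration only.

```python
# Simulation of the InfiniBand-style send described above:
# CPU writes directly into adapter memory; notification on error only.
# All names are hypothetical.

class SimulatedInfiniBandAdapter:
    def __init__(self, buffer_size):
        self.buffer = bytearray(buffer_size)  # adapter-local memory
        self.wire = []                        # data "sent" over the network
        self.error_notifications = []         # raised only on failure

    def cpu_write(self, offset, data):
        """CPU writes directly into the adapter's memory (e.g. via MMIO)."""
        self.buffer[offset:offset + len(data)] = data

    def send(self, offset, length):
        if offset + length > len(self.buffer):
            # Unlike the Ethernet flow, the CPU is informed only here.
            self.error_notifications.append("send out of bounds")
            return
        self.wire.append(bytes(self.buffer[offset:offset + length]))

hca = SimulatedInfiniBandAdapter(buffer_size=32)
hca.cpu_write(0, b"payload")
hca.send(0, 7)

assert hca.wire == [b"payload"]
assert hca.error_notifications == []  # no notification in the success case
```

The absence of a per-send completion is exactly what breaks the usual send/receive pattern that operating systems such as the one mentioned above expect.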
InfiniBand network transport is based on Remote Direct Memory Access (RDMA), which is also referred to as “hardware put/get” or “remote read/write”. For RDMA, the network adapter implements the RDMA protocol. RDMA allows data to move directly from the memory of one system into that of another without involving either one's operating system. This permits high-throughput, low-latency networking. Memory buffer references called region IDs are exchanged between the connection peers via RDMA messages sent over the transport connection. Special RDMA message directives (“verbs”) enable a remote system to read or write memory regions named by the region IDs. The receiving network adapter recognizes and interprets these directives, validates the region IDs, and performs data transfers to or from the named regions. Even for RDMA, network protocols require certain steps to be performed in sequence when interpreting the network protocol data.
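The region-ID handling described above can be sketched as follows: a peer registers a buffer under a region ID, and the receiving adapter validates the ID carried in a remote-write directive before moving any data. The registration table, method names, and return convention are hypothetical simplifications, not the actual InfiniBand verbs interface.

```python
# Sketch of RDMA region-ID validation as described above: the adapter
# checks the region ID named in a remote-write directive before
# transferring data into the named region. Names are hypothetical.

class SimulatedRdmaAdapter:
    def __init__(self):
        self.regions = {}   # region ID -> registered local buffer

    def register_region(self, region_id, buffer):
        """Make a local buffer remotely accessible under a region ID."""
        self.regions[region_id] = buffer

    def rdma_write(self, region_id, offset, data):
        """Handle a remote-write directive; validate the region ID first."""
        region = self.regions.get(region_id)
        if region is None:
            return False    # unknown region ID: reject the access
        region[offset:offset + len(data)] = data
        return True         # data placed without involving the OS

target = SimulatedRdmaAdapter()
buf = bytearray(16)
target.register_region(region_id=42, buffer=buf)

assert target.rdma_write(42, 0, b"remote") is True
assert bytes(buf[:6]) == b"remote"
assert target.rdma_write(99, 0, b"x") is False  # invalid region ID
```

The validation step is what stands between a remote peer and arbitrary memory, which is why the security concerns discussed below center on it.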
A functional subsystem of a computer system is responsible for the provision of dedicated functions within the computer system. In particular, a functional subsystem can execute its own operating system instance, which is often the case for controllers embedded in the computer system. One example of a functional subsystem is an I/O subsystem providing certain I/O functions, e.g. an I/O subsystem providing network access for the CPUs. In this case, the I/O subsystem would typically be encapsulated by firmware components of the computer system or by operating system instances executed on the CPUs, e.g. by their kernels and/or by device drivers.
Another example is an entire general-purpose computer embedded within the computer system, preferably a computer having a different architecture than the CPUs. Such an embedded general-purpose computer could be used to execute certain types of application workloads for which it is better suited than the CPUs. An example scenario is to run a database system on the CPUs and a web server on the functional subsystem, where the web server accesses the database system. In this case, the split between the CPUs and the functional subsystem is done on the application level. Therefore, special tasks in the application level are delegated to the functional subsystem.
For various reasons it is desirable to exchange data between the CPUs and the functional subsystems via networking protocols. For example, this simplifies the implementation of the data transfer significantly. However, a low latency and high bandwidth data exchange between the CPUs and the functional subsystems is often crucial for such computer systems. Therefore, the use of RDMA between the CPUs and the functional subsystem is desirable.
The DMA and RDMA environments are essentially hardware environments. This provides advantages, but it also entails risks and limitations. As described in J. C. Mogul, “TCP offload is a dumb idea whose time has come”, Proc. of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, USENIX Association, RDMA introduces many co-development dependencies between the various hardware and software components involved in the overall computer system.
Further, RDMA introduces several problems, especially in the area of computer system security. For example, an operating system executed on the functional subsystem is typically not as secure and reliable as an operating system executed on the CPUs. But once the operating system on the functional subsystem is compromised, it is also possible to compromise an operating system executed on a CPU.
In order to provide an efficient memory protection mechanism across applications on different nodes within a multi-node computer system, wherein the applications exchange data via RDMA, U.S. Patent Application Publication US 2006/0047771 A1 proposes the use of global translation control entry tables that are accessed/owned by the applications and are managed by a device driver in conjunction with a protocol virtual offset address format. But this mechanism requires a symmetric design, in which RDMA operations can be triggered from both sides of the exchange. For a functional subsystem of a computer system, however, it is desirable that the RDMA is performed by the functional subsystem only, in order to offload RDMA operations from the CPUs. Such offloading provides not only performance benefits, but can also reduce the design complexity of the computer system. For example, it can be complex and expensive to implement the RDMA support on the CPUs. Further, this approach requires adaptations to the operating systems between which data is exchanged.
U.S. Pat. No. 7,181,541 B1 also describes an RDMA approach, wherein a memory protection unit is used to prevent access to unauthorized memory addresses during the RDMA data exchange. However, this approach also requires adaptations to the operating systems between which data is exchanged.