1. Field of the Invention
The present invention relates to an apparatus and a method for integrating or re-integrating host machines which have RDMA functions into a network.
2. Description of the Related Arts
Remote DMA (RDMA: Remote Direct Memory Access) is becoming an important requirement for Network Interface Cards (NICs), especially with recent proposals to add RDMA extensions to popular APIs. NICs supporting RDMA are required to support memory protection, e.g., a translation and protection table (TPT), to ensure that RDMA accesses to local memory are properly validated. Implementing such a table increases the hardware cost and might affect performance, especially for NICs connected to a peripheral bus (e.g., PCI: Peripheral Component Interconnect). An alternative to a hardware-based TPT is to make the NIC drivers responsible for validating RDMA accesses to other NICs. In this case, the whole system must cooperate to ensure that RDMA accesses do not occur in such a way as to crash a host.
Solutions to this problem may take the form of a system-wide protection table, a copy of which is held by each driver running on a host. Drivers update the local portion of the table and push changes to remote drivers. As long as no host reboots, there is no problem. When a host, e.g., HOST_A, reboots, however, the remaining hosts in the system do not (and cannot) become aware of this event immediately. If HOST_A enables RDMA accesses immediately after reboot, RDMA operations initiated from a remote host based on old information (prior to HOST_A rebooting) in the protection table can crash HOST_A.
For example, when a host (a node in a network) is rebooted, the host's internal RDMA settings, i.e., registered memory regions, are cleared and initialized. If the other hosts' RDMA settings have not been updated, however, those hosts may still try to access the memory of the recently rebooted host using RDMA. Because the memory allocation in the rebooted host is re-initialized after the reboot, and this re-initialization is not reflected in the other hosts' settings, an access directed to an address determined by the old settings may overwrite information that is important for the operation of the rebooted host, which could cause the rebooted host to crash.
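The hazard described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `Host` class, the `rdma_write` function, the address `0x1000`, and the string contents are all hypothetical stand-ins, and a real NIC and driver are considerably more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Toy model of a host whose NIC has no hardware TPT (hypothetical)."""
    name: str
    memory: dict = field(default_factory=dict)     # address -> contents
    registered: set = field(default_factory=set)   # addresses exposed for RDMA

    def register(self, addr):
        # Expose one address as an RDMA buffer.
        self.memory[addr] = "rdma-buffer"
        self.registered.add(addr)

    def reboot(self):
        # All registrations are lost and memory is re-allocated from scratch.
        self.registered.clear()
        self.memory = {}


def rdma_write(target, addr, data):
    """Unchecked remote write: the target cannot reject it (assumed model).

    Returns whatever was stored at the address before the write.
    """
    previous = target.memory.get(addr)
    target.memory[addr] = data
    return previous


# HOST_A registers a buffer; HOST_B's copy of the table records it.
host_a = Host("HOST_A")
host_a.register(0x1000)
table_on_host_b = {("HOST_A", 0x1000)}

# HOST_A reboots; the same address now holds unrelated, important data.
host_a.reboot()
host_a.memory[0x1000] = "kernel-data"

# HOST_B still trusts its stale entry and issues the write.
clobbered = None
if ("HOST_A", 0x1000) in table_on_host_b:
    clobbered = rdma_write(host_a, 0x1000, "payload")
```

In this toy run, `clobbered` holds the data that HOST_A placed at the address after rebooting, which is the kind of overwrite that could crash a real host.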
It is therefore essential to ensure that a rebooted host is recognized as such by the other hosts and that any information in the system's protection table relating to the rebooted host be invalidated before a rebooted host is fully re-integrated in the system.
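As a minimal sketch of driver-side validation and invalidation, the following assumes a protection table keyed by a hypothetical `(host, region_id)` pair and mapping to a `(base, size)` range. The layout and the function names `may_access` and `invalidate_host` are illustrative assumptions, not a format taken from any particular driver.

```python
def may_access(table, host, addr, length):
    """Return True if [addr, addr + length) lies inside a region of `host`."""
    for (owner, _region_id), (base, size) in table.items():
        if owner == host and base <= addr and addr + length <= base + size:
            return True
    return False


def invalidate_host(table, rebooted_host):
    """Return a copy of the table without any entry owned by `rebooted_host`."""
    return {key: region for key, region in table.items()
            if key[0] != rebooted_host}


# Local copy of the system-wide table (addresses are made-up examples).
table = {
    ("HOST_A", 0): (0x1000, 0x100),
    ("HOST_B", 0): (0x8000, 0x200),
}

allowed_before = may_access(table, "HOST_A", 0x1010, 16)

# Once HOST_A is recognized as rebooted, its entries are dropped.
table = invalidate_host(table, "HOST_A")
allowed_after = may_access(table, "HOST_A", 0x1010, 16)
```

With the entries for HOST_A removed, the driver would refuse to issue RDMA operations toward HOST_A until fresh registrations arrive, while entries for other hosts are unaffected.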
Typical solutions to this problem might take the form of a user level process (like a daemon process in Unix (™)) on each node that, perhaps in cooperation with a heartbeat user level process, informs drivers of changes in the status of remote hosts. For example, a system may have a NIC control process and a heartbeat process on each host. The NIC control process is responsible for initializing the NIC and informing the NIC driver of changes in the system, e.g., host reboots. Each heartbeat process keeps track of the state of the other hosts by exchanging messages with their heartbeat processes. Suppose that a host, HOST_A, fails, so that it no longer sends heartbeat messages. After a predefined period, the other heartbeat processes determine that HOST_A has failed. Each heartbeat process then informs its NIC control process, which in turn informs the driver to block all access to HOST_A. Once HOST_A recovers, its heartbeat process starts sending messages again. The other heartbeat processes receive these messages and inform their NIC control processes, which in turn inform the drivers to re-enable access to HOST_A. There are some problems with this approach, however. For example, it may take some time before a NIC is initialized, as its initialization depends on a user level process, the NIC control process. If the host is heavily loaded, this process may not be scheduled to run in a timely manner (slow system response). During this period the NIC is unavailable. It is also possible that either the NIC control process or the heartbeat process fails. Failure of the NIC control process compromises the host's response to system events, e.g., access to a rebooted host may not be re-enabled for a very long time. Failure of the heartbeat process may be misinterpreted as a host crash, hence affecting overall system performance. In other words, by using user level processes the overall reliability of the system may be reduced.
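The heartbeat scheme described above can be sketched roughly as follows. The class name `HeartbeatMonitor`, its methods, and the 3-second timeout are hypothetical; timestamps are passed in explicitly so that the example is deterministic, whereas a real process would read a clock and exchange messages over the network.

```python
class HeartbeatMonitor:
    """Tracks remote hosts and decides which ones the driver should block."""

    def __init__(self, timeout):
        self.timeout = timeout    # seconds of silence before a host is presumed failed
        self.last_seen = {}       # host -> time of the last heartbeat received
        self.blocked = set()      # hosts the driver is told to block

    def on_heartbeat(self, host, now):
        # A message from a host marks it alive and re-enables access to it.
        self.last_seen[host] = now
        self.blocked.discard(host)

    def check(self, now):
        # Hosts that have been silent for longer than the timeout are blocked.
        for host, seen in self.last_seen.items():
            if now - seen > self.timeout:
                self.blocked.add(host)
        return set(self.blocked)


monitor = HeartbeatMonitor(timeout=3.0)
monitor.on_heartbeat("HOST_A", now=0.0)

blocked_at_2s = monitor.check(now=2.0)     # still within the timeout
blocked_at_5s = monitor.check(now=5.0)     # HOST_A presumed failed

monitor.on_heartbeat("HOST_A", now=6.0)    # HOST_A recovers
blocked_after_recovery = monitor.check(now=6.5)
```

The sketch also makes the weakness visible: the monitor cannot distinguish a crashed host from a host whose heartbeat process alone has stopped, since both simply fall silent.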
Other problems may result from the user level process potentially using a different communication path for its integration protocol from that used by RDMA requests and hence not being able to ensure that there are no RDMA accesses in transit before the NIC's RDMA functionality is enabled.
Suppose that a node is rebooted. The driver of that node loses all information about its previous state. In particular, it loses all previously registered memory regions and all information about memory regions registered on the other nodes. Remote nodes, however, may or may not be aware that the node has rebooted. The problem to be solved is how to re-integrate the NIC on the rebooted host into a system where some drivers are aware that the host has rebooted while others are not. The solution must be safe in the sense that it must ensure that there are no RDMA operations in transit with the NIC as a target. If such RDMA operations exist, they were initiated before the host was rebooted, are based on old information, and will likely crash the host. At the end of re-integration, the NIC driver must have a current copy of the system's protection table, and all other hosts that are part of the system must be aware that the NIC is a functioning part of the system. Solving this problem with user level monitoring processes delays the entire NIC initialization until a safe time. Moreover, in order for the user level processes to communicate, each host would have to be fitted with other NICs, without RDMA functionality, hence increasing overall hardware cost.