The present invention relates generally to the field of networks, and more particularly to managing a network routing table configuration.
InfiniBand® is an industry-standard specification that defines an input/output architecture used to interconnect servers, communications infrastructure equipment, storage and embedded systems. InfiniBand® is a registered trademark and service mark of the InfiniBand® Trade Association. InfiniBand® is a computer network communications connection used in high-performance computing featuring very high throughput and very low latency. InfiniBand® is used for data interconnect both among and within computers. InfiniBand® is a commonly used interconnect in supercomputers. InfiniBand® is a type of communications connection for data flow between processors and I/O devices that offers throughput of up to 56 gigabits per second and support for up to 64,000 addressable devices.
The internal data flow system in most personal computers (PCs) and server systems is inflexible and relatively slow. As the amount of data coming into and flowing between components in the computer increases, the existing bus system becomes a bottleneck. Instead of sending data in parallel (typically 32 bits at a time, but in some computers 64 bits) across the backplane bus, InfiniBand® specifies a serial (bit-at-a-time) bus. Fewer pins and other electrical connections are required, saving manufacturing cost and improving reliability. The serial bus can carry multiple channels of data at the same time in a multiplexed signal. InfiniBand® also supports multiple memory areas, each of which can be addressed by both processors and storage devices.
With InfiniBand®, data is transmitted in packets that together form a communication called a message. A message can be a remote direct memory access (RDMA) read or write operation, a channel send or receive message, a reversible transaction-based operation or a multicast transmission. Like the channel model many mainframe users are familiar with, transmission begins or ends with a channel adapter. Each processor (your PC or a data center server, for example) has what is called a host channel adapter (HCA) and each peripheral device has a target channel adapter (TCA). HCAs are I/O engines located within a server. TCAs enable remote storage and network connectivity into the InfiniBand® interconnect infrastructure, called a fabric.
InfiniBand® links have physical and logical state properties. The physical state of the link is negotiated in hardware. The logical state of the link is managed by software. When the physical link comes up, the logical state of the link is not yet active: no address is assigned to the port, and applications cannot communicate with the port using arbitrary data protocols. At this point, the only possible communication is sending and receiving subnet management protocol (SMP) unreliable datagrams (UD), which are used to discover and configure the network. InfiniBand® networks require a subnet manager software entity running on one of the nodes.
The Subnet Manager uses SMP datagrams to discover and configure the network. The discovery is done via direct route (i.e., by specifying each hop of the source-to-destination path) and does not require switch routing. The task of the Subnet Manager is to discover the fabric, assign local identifier (LID) addresses to each end-point, configure switch routing tables and put each end-point into the logical Active state. The Subnet Manager is also responsible for removing end-points that are no longer present from the routing tables, and for answering subnet administration (SA) queries, which perform operations on its internal tables and do multicast management. Once the Subnet Manager brings an end-point to the Active state, the end-point can exchange data with other end-points in the fabric that are in the Active state.
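The sequence of Subnet Manager tasks described above (discover, assign LIDs, configure switch routing tables, activate) can be illustrated with a minimal sketch. It is not taken from the InfiniBand® Architecture Specification or from any subnet manager implementation. The toy topology `FABRIC`, the convention that switch names start with "sw", the sequential LID numbering, and all function names are hypothetical assumptions made for illustration only.

```python
from collections import deque

# Hypothetical toy fabric: node name -> {local port number: neighbor node name}.
FABRIC = {
    "hostA": {1: "sw1"},
    "sw1": {1: "hostA", 2: "sw2", 3: "hostB"},
    "sw2": {1: "sw1", 2: "hostC"},
    "hostB": {1: "sw1"},
    "hostC": {1: "sw2"},
}


def discover(fabric, start):
    """Breadth-first sweep from `start`.

    Returns node -> direct route, where a direct route is the list of
    exit ports to take at each hop (no switch routing tables needed).
    """
    routes = {start: []}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for port, peer in sorted(fabric[node].items()):
            if peer not in routes:
                routes[peer] = routes[node] + [port]
                queue.append(peer)
    return routes


def assign_lids(routes, first_lid=1):
    """Assign sequential LIDs in discovery order (a simplifying assumption)."""
    return {node: lid for lid, node in enumerate(routes, start=first_lid)}


def build_forwarding_tables(fabric, lids):
    """For every switch, map destination LID -> output port on a shortest path."""
    tables = {}
    for node in fabric:
        if not node.startswith("sw"):  # assumed naming convention for switches
            continue
        first_hops = discover(fabric, node)
        tables[node] = {
            lids[dest]: route[0] for dest, route in first_hops.items() if route
        }
    return tables


def bring_up(fabric, sm_node):
    """Run the four steps in order and report every discovered node as Active."""
    routes = discover(fabric, sm_node)
    lids = assign_lids(routes)
    tables = build_forwarding_tables(fabric, lids)
    states = {node: "Active" for node in routes}
    return routes, lids, tables, states


routes, lids, tables, states = bring_up(FABRIC, "hostA")
```

In this sketch the direct route to "hostC" is the port list [1, 2, 2]: leave "hostA" on port 1, leave "sw1" on port 2, and leave "sw2" on port 2. Only after the forwarding tables are built can traffic be routed by LID instead of by an explicit per-hop port list.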
The Subnet Manager standard is covered in the InfiniBand® Architecture Specification. Existing standards assume a single Subnet Manager in the master role in the fabric. That assumption, the nature of network environments, and certain requirements of the SMP specifications can cause significant latencies in bringing large, and even not so large, networks up to the Active state. The multi-hop discovery is subject to timeouts and retries. Many existing switches can queue only a very small number of direct route packets (e.g., some have queue sizes of 1) and require a slow software path, which makes discovery serialized. The discovery of each end-point requires several SMP queries, PortInfo and NodeInfo at the minimum, which further slows the discovery. Setting up each end-point also requires multiple requests (e.g., several PortInfo requests and setting up VL arbitration tables and SL to VL mapping). The handling of most SMPs is implemented via a hardware-software-hardware path on the end-points, which requires a software application on the end-point that answers in time. This can be a challenge if the whole network cluster is booting and many software applications with direct access to hardware are initializing, causing resource contention (locking, interrupt configuration, PCI configuration, and so on). Since the Subnet Manager needs to discover the whole fabric in order to build coherent routing tables, it makes a best effort to wait for all end-points to answer its queries, which further increases the network configuration latency.
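The effect of serialized discovery, multiple SMPs per end-point, and timeouts with retries can be illustrated with a rough back-of-the-envelope model. This is a sketch under stated assumptions, not a measurement: the function name, its parameters, and every numeric value below are hypothetical. The model assumes fully serialized SMP exchanges and assumes that the Subnet Manager gives up on an unresponsive end-point after its first query has timed out on every attempt.

```python
def estimate_bringup_ms(endpoints, smps_per_endpoint, round_trip_ms,
                        unresponsive=0, timeout_ms=0, retries=0):
    """Estimate total bring-up time in milliseconds for serialized discovery.

    All parameters are illustrative assumptions:
      endpoints          total number of end-points in the fabric
      smps_per_endpoint  SMP exchanges needed per responsive end-point
      round_trip_ms      time for one SMP request/response pair
      unresponsive       end-points that never answer
      timeout_ms         wait before an unanswered SMP is declared lost
      retries            extra attempts after the first timeout
    """
    responsive = endpoints - unresponsive
    # Serialized path: every SMP exchange waits for the previous one.
    answered_ms = responsive * smps_per_endpoint * round_trip_ms
    # Each silent end-point costs one timeout per attempt.
    waiting_ms = unresponsive * (retries + 1) * timeout_ms
    return answered_ms + waiting_ms


# Hypothetical fabric: 1000 end-points, 6 SMPs each, 1 ms per exchange.
all_answer = estimate_bringup_ms(1000, 6, 1)
# Same fabric, but 10 end-points are still booting and do not answer.
some_silent = estimate_bringup_ms(1000, 6, 1,
                                  unresponsive=10, timeout_ms=200, retries=3)
```

With these assumed values, the 10 silent end-points contribute 8000 ms of waiting, more than the 5940 ms spent on the 990 end-points that do answer, which illustrates why waiting for slow end-points can dominate the configuration latency.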