The Virtual Interface Architecture (VIA) provides a high-speed, low-latency, low-overhead method of cluster communication between computer systems. Although a standard VIA-to-Fibre Channel mapping protocol has been defined, it cannot be implemented efficiently using off-the-shelf Fibre Channel controllers. The invention described herein is a more general VIA-to-Small Computer System Interface (SCSI) mapping, which can be implemented very efficiently using SCSI media controllers (such as Fibre Channel controllers).
The usual method of interconnecting clustered servers is over a TCP/IP network, typically running on Ethernet. Although the performance of the underlying Ethernet technology has steadily progressed over the years, the host-resident TCP/IP protocol stack has remained a bottleneck to overall system performance. On multi-processor systems, this bottleneck becomes particularly apparent when the protocol stack contains single-threaded critical sections.
In addition to multi-processor contention, the sheer overhead of handling the much larger volume of packets delivered by higher-speed networks such as Gigabit Ethernet consumes a larger percentage of the host's processing power.
A significant benefit, measured in host processor utilization, can be realized by using a non-TCP/IP clustering protocol that is non-contentious and uses intelligent network interface cards (NICs) acting as offload engines.
The Virtual Interface Architecture (VIA), developed by Intel and Compaq, is just such a clustering technology. VIA is an API (Application Program Interface) and processing model that allows user applications to issue commands directly to the NICs, without any operating system intervention.
However, VIA is not available to many computer systems without substantial overhead. Much of that overhead is reduced or eliminated by our approach, which includes an adaptation of the VIA semantics to SCSI and, preferably, an improvement to the VIA for use in multiprocessor servers.
Many people in this field are aware of the VIA's features; however, we feel it is useful to mention several that we consider of key importance.
Virtual hardware: The NIC hardware is mapped into each application's virtual memory, thereby giving each application its own virtual set of hardware. (NIC stands for Network Interface Card, a commonly used term for a card that provides an interface from a host computer system to components, data storage devices, or networks through which data communications can occur. Whatever the NIC is connected to can be called a “destination”, and communications through the card can be described as communications between an application on the host computer and that destination.) The various VIA objects, some of which are shared between the application and the NICs, are also mapped directly into the application's memory space. These objects include virtual interfaces (VIs), which are communication endpoints comprising send and receive work queues, and completion queues (CQs), which allow completion notifications from multiple VIs to be aggregated and serviced together.
Another “virtual” aspect of VIA is the use of user-level virtual addresses by the hardware and by partner applications. When an application issues a command to the NIC, it uses its own virtual addresses for referencing buffers and descriptors. (A “descriptor” is a structure defined in the VIA specification that holds all the information needed to perform an operation.) Likewise, an application can publish its virtual addresses to its partner at the other end of the VI, and that partner application can then read from and write to those buffers directly via Remote Direct Memory Access (RDMA) operations.
To accomplish this use of virtual addresses, the memory regions must be registered with the hardware. Registration pins the referenced pages in memory so they cannot be moved, and resolves the physical address of each page. The registration information is made available to the hardware so it can access the memory directly.
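The registration step can be illustrated with a minimal sketch. It is not taken from the VIA specification: the names `RegistrationTable`, `register`, and `translate`, the 4096-byte page size, and the dictionary standing in for the operating system's page table are all assumptions made for illustration.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size, for illustration only


@dataclass
class Registration:
    base: int     # starting virtual address of the registered region
    length: int   # region length in bytes
    frames: list  # physical frame number resolved for each page


class RegistrationTable:
    """Hypothetical model of the registration data shared with the NIC."""

    def __init__(self, page_table):
        # page_table: assumed mapping of virtual page number -> physical frame
        self.page_table = page_table
        self.pinned = set()   # virtual pages that may no longer be moved
        self.regions = []

    def register(self, base, length):
        first = base // PAGE_SIZE
        last = (base + length - 1) // PAGE_SIZE
        pages = range(first, last + 1)
        # Resolve the physical frame of every page and mark it as pinned.
        frames = [self.page_table[p] for p in pages]
        self.pinned.update(pages)
        region = Registration(base, length, frames)
        self.regions.append(region)
        return region

    def translate(self, vaddr):
        # What the hardware does with a user-level virtual address.
        for region in self.regions:
            if region.base <= vaddr < region.base + region.length:
                index = vaddr // PAGE_SIZE - region.base // PAGE_SIZE
                return region.frames[index] * PAGE_SIZE + vaddr % PAGE_SIZE
        raise KeyError("address is not in a registered region")
```

In this model, an address outside every registered region cannot be translated, which mirrors the requirement that memory be registered before the hardware may touch it.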
To accomplish the direct access of the NIC hardware by user applications, a descriptor format is defined by the VIA API. A single descriptor is used to issue commands referencing very large, scattered/gathered data regions, and to store the resulting status.
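As a rough illustration of the idea of one descriptor carrying both a scattered/gathered data list and the resulting status, consider the sketch below. The field names (`operation`, `segments`, `status`, `bytes_transferred`) are hypothetical and do not reproduce the descriptor layout defined by the VIA specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    address: int  # user-level virtual address of one buffer fragment
    length: int   # fragment length in bytes


@dataclass
class Descriptor:
    operation: str                                   # e.g. "send" or "receive"
    segments: List[Segment] = field(default_factory=list)
    status: Optional[str] = None                     # filled in on completion
    bytes_transferred: int = 0

    def total_length(self):
        # One descriptor can span many scattered fragments.
        return sum(segment.length for segment in self.segments)

    def complete(self, status, bytes_transferred):
        # The result is written back into the same descriptor.
        self.status = status
        self.bytes_transferred = bytes_transferred
```

Keeping the status in the descriptor itself means the application can learn the outcome by inspecting memory it already owns, without a call into the operating system.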
Four operations are defined by the architecture: Send, Receive, RDMA-Write, and RDMA-Read. Each send operation consumes exactly one receive descriptor at the other end of the VI. RDMA operations allow applications to share memory regions for read and/or write access.
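The rule that each Send consumes exactly one receive descriptor at the other end can be sketched as follows. The class and method names are hypothetical, and the handling of an oversized payload (rejecting it without consuming a descriptor) is a simplification chosen for this example rather than behavior taken from the VIA specification.

```python
from collections import deque


class ReceiveWorkQueue:
    """Hypothetical model of the receive side of one VI endpoint."""

    def __init__(self):
        self.posted = deque()  # receive descriptors waiting for data
        self.completed = []    # receive descriptors that have been filled

    def post_receive(self, capacity):
        # Each posted descriptor is modeled as a dict with an empty buffer.
        self.posted.append({"capacity": capacity, "data": None})

    def deliver_send(self, payload):
        # An incoming Send needs a receive descriptor posted in advance.
        if not self.posted:
            return False
        # Simplification for this sketch: reject an oversized payload
        # without consuming the descriptor at the head of the queue.
        if len(payload) > self.posted[0]["capacity"]:
            return False
        descriptor = self.posted.popleft()  # exactly one descriptor is consumed
        descriptor["data"] = bytes(payload)
        self.completed.append(descriptor)
        return True
```

The sketch shows why a receiving application has to post receive descriptors ahead of time: a Send that arrives when none is posted has nowhere to land.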
Each VI endpoint is represented by a pair of work queues. One queue is used to post Send operations to the hardware, and the other is for Receives. Applications can wait for a posted descriptor on either work queue to complete, or they can poll the status of the descriptor at the head of each queue.
Work queues may also be associated with a Completion Queue (CQ), on which completion notifications from multiple VIs' work queues can be aggregated. An application can wait on a CQ, which is effectively a wait on any of the work queues associated with it.
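The relationship between work queues and a completion queue can be sketched as below. This is an illustrative model only: `WorkQueue`, `CompletionQueue`, `complete_head`, and `poll` are hypothetical names, and real hardware would complete descriptors asynchronously rather than through an explicit method call.

```python
from collections import deque


class CompletionQueue:
    """Hypothetical CQ that aggregates notifications from many work queues."""

    def __init__(self):
        self.entries = deque()

    def notify(self, vi_name, queue_name):
        self.entries.append((vi_name, queue_name))

    def poll(self):
        # Return the oldest notification, or None if nothing has completed.
        return self.entries.popleft() if self.entries else None


class WorkQueue:
    """Hypothetical send or receive work queue belonging to one VI."""

    def __init__(self, vi_name, queue_name, cq=None):
        self.vi_name = vi_name
        self.queue_name = queue_name
        self.cq = cq
        self.pending = deque()  # posted descriptors, oldest first
        self.done = deque()     # completed descriptors

    def post(self, descriptor):
        self.pending.append(descriptor)

    def complete_head(self):
        # Stand-in for the hardware finishing the descriptor at the head.
        descriptor = self.pending.popleft()
        self.done.append(descriptor)
        if self.cq is not None:
            self.cq.notify(self.vi_name, self.queue_name)
        return descriptor
```

With several VIs sharing one CQ, the application services a single notification stream and learns from each entry which work queue to examine, instead of polling every work queue in turn.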
The VIA specification defines three reliability levels, and applications can select a level on a per-VI basis. The levels are: Reliable-Transmission, which guarantees that the data will be transmitted; Reliable-Reception, which guarantees that the data is received by the remote side; and Reliable-Delivery, which guarantees that the partner application is notified of the message reception.
There are other VIA implementations, including the Giganet cLan and the QLogic FC-VI, but they have their own limitations. In particular, the Giganet cLan is based on ATM (Asynchronous Transfer Mode) and is rather slow; the technology is therefore near the end of its useful existence. The QLogic FC-VI is a simple FC (Fibre Channel) mapping protocol. It uses all single-frame sequences and does not take advantage of current commodity FC controllers' capabilities, such as hardware acceleration of some protocols, including SCSI FCP. It also requires microcode-able or custom hardware in order to perform adequately.