InfiniBand™ is an emerging bus technology that hopes to replace the current PCI bus standard, which only supports up to 133 Mbps (Megabits per second) transfers, with a broader standard that supports a maximum shared bandwidth of 566 Mbps. InfiniBand is the culmination of the combined efforts of about 80 members, led by Intel, Compaq, Dell, Hewlett-Packard, IBM, Microsoft and Sun Microsystems, who collectively call themselves the InfiniBand Trade Association. The InfiniBand Trade Association has published a specification entitled: InfiniBand™ Architecture Specification Release 1.0. The Specification spans three volumes and is incorporated herein by reference.
The InfiniBand Architecture (referred to herein as “IBA”) is a first order interconnect technology, independent of the host operating system (OS) and processor platform, for interconnecting processor nodes and I/O nodes to form a system area network. IBA is designed around a point-to-point, switched I/O fabric, whereby end node devices (which can range from very inexpensive I/O devices like single chip SCSI or Ethernet adapters to very complex host computers) are interconnected by cascaded switch devices. The physical properties of the IBA interconnect support two predominant environments:
i. Module-to-module, as typified by computer systems that support I/O module add-in slots.
ii. Chassis-to-chassis, as typified by interconnecting computers, external storage systems, and external LAN/WAN access devices (such as switches, hubs, and routers) in a data-center environment.
IBA supports implementations as simple as a single computer system, and can be expanded to include: replication of components for increased system reliability, cascaded switched fabric components, additional I/O units for scalable I/O capacity and performance, additional host node computing elements for scalable computing, or any combinations thereof. IBA is scalable to enable computer systems to keep up with the ever-increasing customer requirement for increased scalability, increased bandwidth, decreased CPU utilization, high availability, high isolation, and support for Internet technology. Being designed as a first order network, IBA focuses on moving data in and out of a node's memory and is optimized for separate control and memory interfaces. This permits hardware to be closely coupled or even integrated with the node's memory complex, removing any performance barriers.
IBA uses reliable packet based communication where messages are enqueued for delivery between end nodes. IBA defines hardware transport protocols sufficient to support both reliable messaging (send/receive) and memory manipulation semantics (e.g. remote DMA) without software intervention in the data movement path. IBA defines protection and error detection mechanisms that permit IBA transactions to originate and terminate from either privileged kernel mode (to support legacy I/O and communication needs) or user space (to support emerging interprocess communication demands).
IBA can support bandwidths that are anticipated to remain an order of magnitude greater than current I/O media (SCSI, Fibre Channel, and Ethernet). This enables IBA to act as a common interconnect for attaching I/O media using these technologies. To further ensure compatibility across varying technologies, IBA uses IPv6 headers, supporting extremely efficient junctions between IBA fabrics and traditional Internet and Intranet infrastructures.
FIG. 1 is a block diagram of the InfiniBand architecture layers 100. IBA operation can be described as a series of layers 100. The protocol of each layer is independent of the other layers. Each layer is dependent on the service of the layer below it and provides service to the layer above it.
The physical layer 102 specifies how bits are placed on a wire to form symbols and defines the symbols used for framing (i.e., start of packet & end of packet), data symbols, and fill between packets (Idles). It specifies the signaling protocol as to what constitutes a validly formed packet (i.e., symbol encoding, proper alignment of framing symbols, no invalid or non-data symbols between start and end delimiters, no disparity errors, synchronization method, etc.).
The link layer 104 describes the packet format and protocols for packet operation, e.g. flow control and how packets are routed within a subnet between the source and destination. There are two types of packets: link management packets and data packets.
Link management packets are used to train and maintain link operation. These packets are created and consumed within the link layer 104 and are not subject to flow control. Link management packets are used to negotiate operational parameters between the ports at each end of the link such as bit rate, link width, etc. They are also used to convey flow control credits and maintain link integrity.
Data packets convey IBA operations and can include a number of different headers. For example, the Local Route Header (LRH) is always present and it identifies the local source and local destination ports where switches will route the packet and also specifies the Service Level (SL) and Virtual Lane (VL) on which the packet travels. The VL is changed as the packet traverses the subnet but the other fields remain unchanged. The Global Route Header (GRH) is present in a packet that traverses multiple subnets. The GRH identifies the source and destination ports using a port's Global ID (GID) in the format of an IPv6 address.
There are two CRCs in each packet. The Invariant CRC (ICRC) covers all fields which should not change as the packet traverses the fabric. The Variant CRC (VCRC) covers all of the fields of the packet. The combination of the two CRCs allows switches and routers to modify appropriate fields and still maintain end-to-end data integrity for the transport control and data portion of the packet. The coverage of the ICRC is different depending on whether the packet is routed to another subnet (i.e. contains a global route header).
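The division of labor between the two CRCs can be illustrated with a much-simplified Python sketch. It is not the IBA computation: `zlib.crc32` stands in for both checksums, the packet is modeled as a hypothetical dictionary with a single variant field (`vl`) and an opaque `invariant` byte string, and the choice of 0xFF as the mask value is an assumption made only for illustration. The actual polynomials, widths and masking rules are defined by the Specification.

```python
import zlib

def vcrc(packet: dict) -> int:
    """Stand-in for the Variant CRC: covers every field of the packet."""
    return zlib.crc32(bytes([packet["vl"]]) + packet["invariant"])

def icrc(packet: dict) -> int:
    """Stand-in for the Invariant CRC: the variant field is replaced by a
    fixed value (assumed 0xFF here) so that changing it does not alter the CRC."""
    return zlib.crc32(bytes([0xFF]) + packet["invariant"])

# Hypothetical packet as emitted by the source.
sent = {"vl": 1, "invariant": b"transport headers and payload"}

# A switch moves the packet to another virtual lane; only the variant field changes.
relayed = dict(sent, vl=3)

icrc_unchanged = icrc(sent) == icrc(relayed)   # end-to-end check still valid
vcrc_changed = vcrc(sent) != vcrc(relayed)     # must be recalculated by the switch
```

In this sketch a device that modifies the variant field recalculates only the link-level checksum, while the end-to-end checksum computed by the source remains valid at the destination.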
The network layer 106 describes the protocol for routing a packet between subnets. Each subnet has a unique subnet ID, the Subnet Prefix. When combined with a Port GUID, this combination becomes a port's Global ID (GID). The source places the GID of the destination in the GRH and the LID of the router in the LRH. Each router forwards the packet through the next subnet to another router until the packet reaches the target subnet. Routers forward the packet based on the content of the GRH. As the packet traverses different subnets, the routers modify the content of the GRH and replace the LRH. The last router replaces the LRH using the LID of the destination. The source and destination GIDs do not change and are protected by the ICRC field. Routers recalculate the VCRC but not the ICRC. This preserves end to end transport integrity.
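The formation of a GID from a Subnet Prefix and a Port GUID can be sketched in Python as follows. The sketch assumes that both values are 64 bits wide, with the prefix occupying the upper half of the 128-bit identifier; the function name `make_gid` and the example values are hypothetical.

```python
import ipaddress

def make_gid(subnet_prefix: int, port_guid: int) -> ipaddress.IPv6Address:
    """Combine a 64-bit subnet prefix and a 64-bit port GUID into a
    128-bit GID expressed in IPv6 address format."""
    for value in (subnet_prefix, port_guid):
        if not 0 <= value < 2**64:
            raise ValueError("prefix and GUID must each fit in 64 bits")
    return ipaddress.IPv6Address((subnet_prefix << 64) | port_guid)

# Hypothetical example values.
gid = make_gid(0xFE80000000000000, 0x0002C90300001234)
```

Because the result is an ordinary IPv6-format address, it can be placed directly in the source or destination address field of a GRH.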
While the network layer 106 and the link layer 104 deliver a packet to the desired destination, the transport layer 108 is responsible for delivering the packet to the proper queue pair and instructing the queue pair how to process the packet's data. The transport layer 108 is responsible for segmenting an operation into multiple packets when the message's data payload is greater than the maximum transfer unit (MTU) of the path. The queue pair on the receiving end reassembles the data into the specified data buffer in its memory.
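Segmentation and reassembly of this kind can be sketched in Python as shown below. This is a minimal illustration only: it assumes in-order, loss-free delivery, omits all transport headers, and the function names `segment` and `reassemble` are hypothetical.

```python
from typing import List

def segment(message: bytes, mtu: int) -> List[bytes]:
    """Split a message into payloads no larger than the path MTU."""
    if mtu <= 0:
        raise ValueError("MTU must be positive")
    # An empty message still yields one (empty) packet in this sketch.
    return [message[i:i + mtu] for i in range(0, len(message), mtu)] or [b""]

def reassemble(payloads: List[bytes]) -> bytes:
    """Concatenate in-order payloads back into the original message."""
    return b"".join(payloads)

message = bytes(range(256)) * 10          # 2560-byte message
packets = segment(message, mtu=1024)      # three packets: 1024, 1024, 512 bytes
restored = reassemble(packets)
```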
IBA supports any number of upper layers 110 that provide protocols to be used by various user consumers. IBA also defines messages and protocols for certain management functions. These management protocols are separated into Subnet Management and Subnet Services.
FIG. 2 is a block diagram of an InfiniBand subnet 200. An IBA subnet 200 is composed of endnodes 202, switches 204, a subnet manager 206 and, possibly, one or more router(s) 208. Endnodes 202 may be any one of a processor node, an I/O node, and/or a router (such as the router 208). Switches 204 are the fundamental routing component for intra-subnet communication. The switches 204 interconnect endnodes 202 by relaying packets between the endnodes 202. Routers 208 are the fundamental component for inter-subnet communication. Router 208 interconnects subnets by relaying packets between the subnets.
Switches 204 are transparent to the endnodes 202, meaning they are not directly addressed (except for management operations). Instead, packets traverse the switches 204 virtually unchanged. To this end, every destination within the subnet 200 is configured with one or more unique local identifiers (LID). From the point of view of a switch 204, a LID represents a path through the switch. Packets contain a destination address that specifies the LID of the destination. Each switch 204 is configured with forwarding tables (not shown) that dictate the path a packet will take through the switch 204 based on a LID of the packet. Individual packets are forwarded within a switch 204 to an out-bound port or ports based on the packet's Destination LID and the forwarding table of the switch 204. IBA switches support unicast forwarding (delivery of a single packet to a single location) and may support multicast forwarding (delivery of a single packet to multiple destinations).
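LID-based forwarding can be sketched in Python as a table lookup. The class `Switch`, its method names, the port numbers and the LID values are all hypothetical, and the handling of an unknown Destination LID (returning no ports) is an assumption of the sketch rather than behavior taken from the Specification.

```python
from typing import Dict, List

class Switch:
    """Toy model of a switch that forwards on Destination LID only."""

    def __init__(self, forwarding_table: Dict[int, List[int]]) -> None:
        # Maps a Destination LID to the out-bound port or ports.
        self.forwarding_table = forwarding_table

    def out_ports(self, dlid: int) -> List[int]:
        """Return the out-bound ports for a packet; empty if the LID is unknown."""
        return self.forwarding_table.get(dlid, [])

switch = Switch({
    0x0011: [2],         # unicast: one LID, one out-bound port
    0xC001: [1, 3, 4],   # multicast: one LID, several out-bound ports
})

unicast_ports = switch.out_ports(0x0011)
multicast_ports = switch.out_ports(0xC001)
```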
The subnet manager 206 configures the switches 204 by loading the forwarding tables into each switch 204. To maximize availability, multiple paths between endnodes may be deployed within the switch fabric. If multiple paths are available between switches 204, the subnet manager 206 can use these paths for redundancy or for destination LID based load sharing. Where multiple paths exist, the subnet manager 206 can re-route packets around failed links by re-loading the forwarding tables of switches in the affected area of the fabric.
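One way a subnet manager might derive forwarding tables, and re-derive them after a link failure, is sketched below in Python. The breadth-first shortest-path computation is an assumption chosen for brevity; the source does not prescribe a routing algorithm. The topology, the node names and the helper names `build_tables` and `without_link` are hypothetical.

```python
from collections import deque
from typing import Dict

Links = Dict[str, Dict[int, str]]   # switch -> {port number: neighbor name}

def build_tables(links: Links, lid_home: Dict[int, str]) -> Dict[str, Dict[int, int]]:
    """For every switch, pick the out-bound port on a shortest path to each LID."""
    tables: Dict[str, Dict[int, int]] = {sw: {} for sw in links}
    for lid, endnode in lid_home.items():
        visited = {endnode}
        queue = deque([endnode])
        while queue:                      # breadth-first search from the destination
            node = queue.popleft()
            for sw, ports in links.items():
                if sw in visited:
                    continue
                for port, neighbor in sorted(ports.items()):
                    if neighbor == node:  # this port leads one hop closer
                        tables[sw][lid] = port
                        visited.add(sw)
                        queue.append(sw)
                        break
    return tables

def without_link(links: Links, a: str, b: str) -> Links:
    """Return a copy of the topology with the link between a and b removed."""
    return {sw: {p: n for p, n in ports.items()
                 if not ((sw == a and n == b) or (sw == b and n == a))}
            for sw, ports in links.items()}

# Hypothetical fabric: three switches in a triangle, endnode A on S1, endnode B on S3.
links = {
    "S1": {1: "A", 2: "S2", 3: "S3"},
    "S2": {1: "S1", 2: "S3"},
    "S3": {1: "B", 2: "S1", 3: "S2"},
}
lid_home = {10: "A", 20: "B"}

tables = build_tables(links, lid_home)                              # S1 reaches LID 20 via port 3
rerouted = build_tables(without_link(links, "S1", "S3"), lid_home)  # now via port 2 and S2
```

Reloading the `rerouted` tables into the affected switches corresponds to routing around the failed link between S1 and S3.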
FIG. 3 is a block diagram of an InfiniBand Switch 300. IBA switches, such as the switch 300, simply pass packets along based on the destination address in the packet's LRH. IBA switches do not generate or consume packets (except for management packets). Referring to FIG. 1, IBA switches interconnect the link layers 104 by relaying packets between the link layers 104.
In operation, the switch 300 exposes two or more ports 302a, 302b . . . 302n, between which packets are relayed. Each port 302n communicates with a packet relay 304 via a set of virtual lanes 306a through 306n. The packet relay 304 (sometimes referred to as a “hub” or “crossbar”) redirects the packet to another port 302, via that port's associated virtual lanes 306, for transmission based on the forwarding table associated with the packet relay 304.
During operation, a 32-bit word arrives into an InfiniBand virtual lane 306 at a port 302 of a switch 300 every clock cycle. To maximize bandwidth and minimize switch latency, it is desirable to be able to transfer data through the switch packet relay at the same frequency. In an 8 port switch, it is desirable to provide at least 3 output ports to the packet relay.
As noted above, IBA uses packets as the main unit of communication. An IBA data packet conforms to the format shown in TABLE 1.
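TABLE 1
Word/Bits | 31-24          | 23-16          | 15-8     | 7-0       | Notes
0         | VL, LVER       | SL, rsv, LNH   | DLID                 | LRH
1         | resv 5, PktLen (11 bits)        | SLID                 |
2         | IPVers, Traffic Class, Flow Label                      | GRH
3         | Payload Length                  | Next Hdr | Hop Limit |
4 to 11   | GRH Body                                               |
12        | OpCode         | S, r, Pa, TVER | PKey                 | BTH
13        | resv (variant) | Destination QP                        |
14        | resv 8         | PSN                                   |
. . .     | Other Headers                                          |
n−1       | EOP PYLD                                               |
n         | IRC                                                    |
n+1       | VRC                                                    |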
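The LRH portion of TABLE 1 (words 0 and 1) can be unpacked as in the following Python sketch. It assumes big-endian (network) byte order and the field widths of 4 bits each for VL, LVER and SL and 2 bits each for the reserved field and LNH; the function name `parse_lrh` and the example field values are hypothetical.

```python
import struct

def parse_lrh(header: bytes) -> dict:
    """Extract the Local Route Header fields from the first two 32-bit words."""
    if len(header) < 8:
        raise ValueError("LRH requires at least 8 bytes")
    word0, word1 = struct.unpack(">II", header[:8])  # assumed big-endian
    return {
        "vl":      (word0 >> 28) & 0xF,    # bits 31-28
        "lver":    (word0 >> 24) & 0xF,    # bits 27-24
        "sl":      (word0 >> 20) & 0xF,    # bits 23-20
        "lnh":     (word0 >> 16) & 0x3,    # bits 17-16 (bits 19-18 reserved)
        "dlid":    word0 & 0xFFFF,         # bits 15-0
        "pkt_len": (word1 >> 16) & 0x7FF,  # 11 bits below the 5 reserved bits
        "slid":    word1 & 0xFFFF,         # bits 15-0
    }

# Hypothetical header: VL 2, LVER 0, SL 3, LNH 0b10, DLID 0x0011, PktLen 6, SLID 0x0022.
raw = struct.pack(">II",
                  (2 << 28) | (0 << 24) | (3 << 20) | (0b10 << 16) | 0x0011,
                  (6 << 16) | 0x0022)
lrh = parse_lrh(raw)
```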
As packets pass through the switch 300 they must be checked for errors, a process typically termed error detection. To perform such error detection, the Link Next Header (LNH) field of the packet must be decoded. The LNH field conforms to the format shown in TABLE 2.
TABLE 2
LNH | Packet Type  | Transport | Next Header
11  | IBA Global   | IBA       | GRH
10  | IBA Local    | IBA       | BTH
01  | IP - Non-IBA | Raw       | GRH
00  | Raw          | Raw       | RWH (Ethertype)
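Decoding of the LNH field per TABLE 2 amounts to a four-entry lookup, sketched here in Python. The names `LNH_TABLE` and `decode_lnh` are hypothetical; the table contents are taken directly from TABLE 2.

```python
from typing import Dict, Tuple

# LNH value -> (packet type, transport, next header), as listed in TABLE 2.
LNH_TABLE: Dict[int, Tuple[str, str, str]] = {
    0b11: ("IBA Global",   "IBA", "GRH"),
    0b10: ("IBA Local",    "IBA", "BTH"),
    0b01: ("IP - Non-IBA", "Raw", "GRH"),
    0b00: ("Raw",          "Raw", "RWH (Ethertype)"),
}

def decode_lnh(lnh: int) -> Tuple[str, str, str]:
    """Return (packet type, transport, next header) for a 2-bit LNH value."""
    if not 0 <= lnh <= 0b11:
        raise ValueError("LNH is a 2-bit field")
    return LNH_TABLE[lnh]

packet_type, transport, next_header = decode_lnh(0b10)
```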
The IBA specification discloses and recommends the use of a state machine to perform a multi-step packet error check. The checks are ordered with no consideration as to the order of the incoming packet data, but instead by their precedence. Fields VL, LVer, LNH, DLID, PktLen, IPVers, TVER, ICRC and VCRC are extracted, stored and analyzed by the state machine. This implies that a packet must be fully received, and hence stored, prior to performing error detection.
FIG. 4 is a flow chart of the operation of a data packet check machine as described in the IBA specification. The data packet check machine resides in each port 302 and determines whether a data packet is valid and should be forwarded from the port 302 to the packet relay 304. The method starts in step 400. Subsequently, a series of checks 402 through 414 are made to validate the packet. The order of the states in FIG. 4 does not necessarily represent the chronological order of the checks, but does represent the priority of the error classes. According to the InfiniBand specification, only one error is logged per packet in step 418. This state ordering determines which one, if any, is logged. If the packet satisfies all of the checks (e.g. states) the packet is forwarded to the packet relay 304 in step 416.
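The rule that state ordering determines which single error is logged can be sketched in Python as a prioritized list of checks. The check names, their order and the packet fields used here are hypothetical and do not reproduce the states of FIG. 4; the requirement that LVER equal 0 is likewise an assumption of the sketch.

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[dict], bool]]

# Hypothetical checks, highest-priority error class first.
DATA_PACKET_CHECKS: List[Check] = [
    ("vcrc_error",   lambda p: p["vcrc_ok"]),
    ("lver_error",   lambda p: p["lver"] == 0),
    ("pktlen_error", lambda p: p["pkt_len"] == p["actual_len"]),
    ("icrc_error",   lambda p: p["icrc_ok"]),
]

def check_data_packet(pkt: dict,
                      checks: List[Check] = DATA_PACKET_CHECKS) -> Optional[str]:
    """Return the single error to log, or None if the packet may be forwarded."""
    for name, passes in checks:
        if not passes(pkt):
            return name   # only the highest-priority failure is reported
    return None

good = {"vcrc_ok": True, "lver": 0, "pkt_len": 6, "actual_len": 6, "icrc_ok": True}
bad = dict(good, lver=1, icrc_ok=False)   # two faults present in one packet

good_result = check_data_packet(good)
bad_result = check_data_packet(bad)       # the LVER fault outranks the ICRC fault
```

Note that every predicate in this sketch reads fields of a fully received packet, which mirrors the store-then-check behavior described above.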
FIG. 5 is a flow chart of the operation of a link packet check machine as described in the IBA specification. The Link Packet Check Machine resides in each port and determines whether a link packet meets the rules of the InfiniBand specification, and thus whether or not the link packet should be interrogated for flow control or other information. The method starts in step 500. Subsequently a series of checks 502 through 508 are made to validate the link header. The order of the states in FIG. 5 does not necessarily represent the chronological order of the checks, but does represent the priority of the error classes. According to the InfiniBand specification, only one error is logged per packet in step 512. This state ordering determines which one, if any, is logged. If the packet satisfies all of the checks (e.g. states) the packet is forwarded to flow control circuitry (not shown) in step 416.
The Inventors of the present invention have recognized a need for methods and apparatus that enable error detection to be performed during reception of a packet, thereby eliminating the need to receive and store the entire packet prior to beginning such error detection.