Interconnections between nodes on a data link, e.g., a network, typically include some type of traffic flow control. Credit-based flow control is one such technique. The credit-based flow control techniques currently available and generally known to those skilled in the art are typically designed for flow control between two switch elements, referred to as hubs, at the network level on a one-to-one basis. Hub-to-hub, credit-based flow control typically resolves congestion earlier than end-to-end flow control techniques do, thereby aiding performance.
The need for high performance in information technology systems, particularly high capacity information technology systems, is driven by several factors. In many industries, critical information technology applications require outstanding levels of service. At the same time, the world is experiencing an information explosion as more and more users demand timely access to a huge and steadily growing mass of data including high quality multimedia content. The users also demand that information technology solutions protect data and perform under harsh conditions with minimal data loss.
As is known in the art, large computer systems and data servers sometimes require large capacity data storage systems. One type of data storage system is a magnetic disk storage system. Here, a bank of disk drives and the computer systems and data servers are coupled together through an interface. The interface includes storage processors that operate in such a way that they are transparent to the computer. That is, data is stored in, and retrieved from, the bank of disk drives in such a way that the computer system or data server merely perceives it as operating with a single memory. One type of data storage system is a RAID data storage system. A RAID data storage system includes two or more disk drives in combination for fault tolerance and performance.
An I/O interconnect architecture that is intended to support a wide variety of computing and communications platforms is the Peripheral Component Interconnect (PCI) Express architecture described in the PCI Express Base Specification, Rev. 1.0a, Apr. 15, 2003 (hereinafter, “PCI Express Base Specification” or “PCI Express standard”). The PCI Express architecture describes a fabric topology in which the fabric is composed of point-to-point links that interconnect a set of devices. For example, a single fabric instance (referred to as a “hierarchy”) can include a Root Complex (RC), multiple endpoints (or I/O devices) and a switch. The switch supports communications between the RC and endpoints, as well as peer-to-peer communications between endpoints. The PCI Express architecture is specified in layers, including software layers, a transaction layer, a data link layer and a physical layer. The software layers generate read and write requests that are transported by the transaction layer to the data link layer using a packet-based protocol. The data link layer adds sequence numbers and CRC to the transaction layer packets. The physical layer transports data link packets between the data link layers of two PCI Express agents.
The switch includes a number of ports, with at least one port being connected to the RC and at least one other port being coupled to an endpoint as provided in the PCI Express Base Specification. The RC, switch, and endpoints may be referred to as “PCI Express devices”.
The switch may include ports connected to non-switch ports via corresponding PCI Express links, including a link that connects a switch port to a root complex port. The switch enables communications between the RC and endpoints, as well as peer-to-peer communications between endpoints. A switch port may be connected to another switch as well.
Typically, the switch has a controller subsystem, which serves as a virtual port for the system. The controller subsystem provides the intelligence for the switch and typically contains a microcontroller. The controller subsystem is in communication with the switch's other ports to set the configuration for the ports on power up of the system, to check the status of each of the ports, to process transactions that terminate within the switch itself, and to generate transactions that originate from the switch itself.
As noted above, in PCI Express, information is transferred between devices using packets. To support various transactions, such as a memory write request, a memory read request, an I/O write request and an I/O read request, PCI Express uses both packets that include a header and variable-length data and packets that consist of a header alone, with no data. For example, a memory read request packet that makes a memory read request and an I/O read request packet that makes an I/O read request each include only a header.
Credit-based flow control is used in PCI Express. In this flow control, a receiving device notifies a transmitting device in advance of a credit indicative of the size of the available receiving buffer in the receiving device as flow control information. The transmitting device may then transmit up to the amount of information specified by the credit. In PCI Express, for example, a timer can be used as a method for transmitting credits at regular intervals from the receiving device to the transmitting device.
In particular, according to the PCI Express Link Layer definition a link may be down (DL_Inactive=no transmission or reception of packets of any type), fully active (DL_Active), i.e., fully operational and capable of transmitting and receiving packets of any type, or in the process of being initialized (DL_Init). Link states may be communicated between link partners via DLLPs (Data Link Layer Packets), which are 6-byte packets that communicate link management specific information between the two devices sharing the link. Link state DLLPs have strict priority over all packets (transaction layer packets (TLPs) and DLLPs) except packets that are in-flight. Link state acknowledgements are sent as early as possible, i.e., as soon as the transmission of the packet currently occupying the link is completed.
The PCI Express architecture supports the establishment of direct endpoint-to-endpoint logical paths known as Virtual Channels (VCs). This enables a single switched fabric network to service multiple, independent logical interconnects simultaneously, each VC interconnecting end nodes for control, management, and data. Each VC provides its own queue so that blocking in one VC does not cause blocking in another. Since each VC has independent packet ordering requirements, each VC may be scheduled without dependencies on the other VCs.
The architecture defines three VC types: Bypass Capable Unicast (BVC); Ordered-Only Unicast (OVC); and Multicast (MVC). BVCs have two queues—an ordered queue and a bypass queue. The bypass queue provides BVCs with bypass capability, which may be necessary for deadlock free tunneling of protocols. OVCs are single queue unicast VCs, which may be suitable for message oriented “push” traffic. MVCs are single queue VCs for multicast “push” traffic.
When the fabric is powered up, link partners in the fabric may negotiate the largest common number of VCs of each VC type. During link training, the largest common sets of VCs of each VC type are initialized and activated prior to any non-DLLP packets being injected into the fabric.
The architecture provides a number of congestion management techniques, one of which is the credit-based flow control (FC) technique used to prevent packets from being lost due to congestion. Link partners (e.g., an endpoint and a switch element) in the network exchange FC credit information, e.g., indicating the local device's available buffer space for a particular VC, to guarantee that the receiving end of a link has the capacity to accept packets.
FC credits may be computed on a VC-basis by the receiving end of the link and communicated to the transmitting end of the link. Typically, packets may be transmitted only when there are enough credits available for a particular VC to carry the packet. Upon sending a packet, the transmitting end of the link may debit its available credit account by an amount of FC credits that reflects the size of the sent packet. As the receiving end of the link processes (e.g., forwards to an endpoint) the received packet, space is made available on the corresponding VC and FC credits are returned to the transmission end of the link. The transmission end of the link then adds the FC credits to its credit account.
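The debit-and-return accounting described above can be sketched as follows. This is a minimal illustrative model, not the PCI Express implementation; the class and method names (`Transmitter`, `Receiver`, `credit_return`, etc.) are hypothetical, and credits are modeled as a single undifferentiated count for simplicity.

```python
# Minimal sketch of credit-based flow control between two link partners.
# All names are illustrative; real PCI Express tracks separate credit
# types per VC and uses modular counters.

class Receiver:
    def __init__(self, buffer_credits):
        self.free = buffer_credits      # credits currently granted and unused

    def accept(self, size):
        assert size <= self.free        # transmitter must never overrun
        self.free -= size

    def process(self, size):
        # Processing (e.g., forwarding) a received packet frees buffer
        # space; the freed credits are returned to the transmitter.
        self.free += size
        return size                     # credits carried back in an FC update

class Transmitter:
    def __init__(self, advertised):
        self.available = advertised     # credits advertised by the receiver

    def can_send(self, size):
        return size <= self.available

    def send(self, size):
        assert self.can_send(size)
        self.available -= size          # debit the local credit account

    def credit_return(self, credits):
        self.available += credits       # FC update replenishes the account
```

For example, a transmitter starting with 8 advertised credits can send a 5-credit packet, leaving 3, and is restored to 8 once the receiver processes the packet and the credits are returned.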
FC credit initialization and updates are communicated through the exchange of DLLPs between link partners. InitFC1 and InitFC2 DLLPs are exchanged between link partners and provide the FC credit initialization of both unicast VCs (VCs 0-15) and multicast VCs (VCs 16-19). InitFC1 and InitFC2 DLLPs specifying a VC Index in the range of VC0-VC7 provide initial flow control credit information for any supported BVCs, providing initial values for the bypass queue and the ordered queue. OVC and MVC InitFC DLLPs (VC Indexes in the range of VC8-VC13) provide initial credit information for two VCs each.
VCs may be initialized beginning with VC number 0 and continuing until VC 19 in ascending order. PCI Express ports exchange InitFC1 and InitFC2 DLLPs for VC 0-19 even if they do not implement all twenty VCs. InitFC DLLPs for unsupported VC numbers must indicate credit values of 000h in their corresponding credit fields.
After initialization, the ports may refresh their link partner's credit information by periodically sending them FC credit update information. While FC credit accounting is typically tracked by a transmitting port between FC credit updates, an FC Update DLLP takes precedence over locally calculated credit availability information. With each FC credit update, the receiving side of the FC credit update may discard any local FC credit availability tracking information and resynchronize with the credit information provided by the FC Update DLLP.
In particular, flow control logic distinguishes three types of TLPs:
Posted Requests (P)—Messages and Memory Writes
Non-Posted Requests (NP)—All Reads, I/O, and Configuration Writes
Completions (CPL)—Associated with corresponding NP Requests
In addition, flow control logic distinguishes the following types of TLP information within each of the three types:
Headers (H)
Data (D)
Thus, there are six types of information tracked by flow control logic for each Virtual Channel:
PH (Posted Request headers)
PD (Posted Request Data payload)
NPH (Non-Posted Request headers)
NPD (Non-Posted Request Data payload)
CPLH (Completion headers)
CPLD (Completion Data payload)
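The six tracked information types lend themselves to simple tabular bookkeeping. The sketch below is illustrative only; the helper names are hypothetical. The 4-DW (16-byte) granularity of a data credit is taken from the PCI Express Base Specification, while the mapping itself follows the type breakdown listed above.

```python
# Illustrative per-VC bookkeeping for the six flow control
# information types distinguished by PCI Express.

CREDIT_TYPES = ("PH", "PD", "NPH", "NPD", "CPLH", "CPLD")

def new_credit_table(num_vcs):
    # One counter per information type, per virtual channel.
    return [{t: 0 for t in CREDIT_TYPES} for _ in range(num_vcs)]

def credits_for_tlp(tlp_kind, payload_dwords):
    # Map a TLP ("P", "NP", or "CPL") with a payload measured in
    # dwords (DW) to the credits it consumes: one header credit,
    # plus one data credit per 4 DW (16 bytes) of payload, rounded up.
    header_type = {"P": "PH", "NP": "NPH", "CPL": "CPLH"}[tlp_kind]
    data_type = {"P": "PD", "NP": "NPD", "CPL": "CPLD"}[tlp_kind]
    data_credits = (payload_dwords + 3) // 4
    return {header_type: 1, data_type: data_credits}
```

For example, a posted memory write with an 8-DW payload consumes one PH credit and two PD credits, while a memory read request (header only) consumes one NPH credit and zero NPD credits.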
For each type, the receiver maintains a value that is a count of the total number of credits granted to the transmitter since initialization (CREDITS_ALLOCATED). This value is included in the InitFC and UpdateFC DLLPs, and is incremented as additional receive buffer space is made available by processing received TLPs.
The transmitter maintains a value that is the most recent number of credits advertised by the receiver (CREDIT_LIMIT). This value represents the total number of credits made available by the receiver since flow control initialization.
For each UpdateFC DLLP received by the transmitter, if CREDIT_LIMIT is not equal to the CREDITS_ALLOCATED value in the UpdateFC DLLP, CREDIT_LIMIT is set to the CREDITS_ALLOCATED value in the UpdateFC DLLP. Thus, for example, if the transmitter somehow misses an UpdateFC DLLP, the transmitter is made fully up to date by the next UpdateFC DLLP that is received.
The transmitter has a gating function that determines whether sufficient credits have been advertised to permit the transmission of a given TLP. If the transmitter does not have enough credits to transmit the TLP, it must block the transmission of the TLP, possibly stalling other TLPs that are using the same Virtual Channel. The transmitter has enough credits if the number of credits needed does not exceed the difference between CREDIT_LIMIT and the total number of credits already consumed by the transmitter (CREDITS_CONSUMED).
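The gating check and the UpdateFC override rule can be sketched as below. One assumption worth flagging: in the PCI Express Base Specification the credit counters are maintained modulo a power of two (8-bit fields for header credits, 12-bit fields for data credits), so the subtraction is performed with wraparound; the function names here are illustrative.

```python
# Sketch of the transmitter's flow control gating function.
# Counters wrap modulo 2**field_size, so the subtraction below is
# computed with modular arithmetic (12-bit data credit field assumed).

def gate(credit_limit, credits_consumed, credits_needed, field_size=12):
    modulus = 1 << field_size
    available = (credit_limit - credits_consumed) % modulus
    return credits_needed <= available

def on_update_fc(credit_limit, credits_allocated_from_dllp):
    # An UpdateFC DLLP takes precedence over local tracking:
    # CREDIT_LIMIT is simply set to the CREDITS_ALLOCATED value
    # carried in the DLLP, resynchronizing the transmitter.
    return credits_allocated_from_dllp
```

Note that the modular subtraction makes the check correct even after the counters wrap: a CREDIT_LIMIT that has wrapped past zero still yields the right number of available credits relative to CREDITS_CONSUMED.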
A typical PCI Express device has a fixed amount of memory available to be used in connection with credits, such that credit allocation is a zero sum system: increasing the maximum number of credits allowed for one type of information requires decreasing the maximum number of credits allowed for another type of information.
An interconnect architecture may be used in a modern computer architecture that may be viewed as having three distinct subsystems which, when combined, form what most people think of when they hear the term computer. These subsystems are: 1) a processing complex; 2) an interface between the processing complex and I/O controllers or devices; and 3) the I/O (i.e., input/output) controllers or devices themselves. A processing complex may be as simple as a single microprocessor, such as a standard personal computer microprocessor, coupled to memory. Or, it might be as complex as two or more processors that share memory.
A blade server is essentially a processing complex, an interface, and I/O together on a relatively small printed circuit board that has a backplane connector. The blade is made to be inserted with other blades into a chassis that has a form factor similar to a rack server today. Many blades can be located in the same rack space previously required by just one or two rack servers. Blade servers typically provide all of the features of a pedestal or rack server, including a processing complex, an interface to I/O, and I/O. Further, blade servers typically integrate all necessary I/O because they do not have an external bus that would allow other I/O to be added to them. So, each blade typically includes such I/O as Ethernet (10/100, and/or 1 gig) and data storage control (SCSI, Fibre Channel, etc.).
The interface between the processing complex and I/O is commonly known as the Northbridge or memory control hub (MCH) chipset. On the “north” side of the chipset (i.e., between the processing complex and the chipset) is a bus referred to as the HOST bus. The HOST bus is usually a proprietary bus designed to interface to memory, to one or more microprocessors within the processing complex, and to the chipset. On the “south” side of the chipset are a number of buses which connect the chipset to I/O devices. Examples of such buses include: ISA, EISA, PCI, PCI-X, and PCI Express.