Computers and other systems may be expanded in functionality by adding peripheral devices. A wide variety of peripheral devices are available, such as printers, communications and network devices, cameras, music players, and many other devices. In addition to semi-permanently installed devices, portable devices may be temporarily connected to a host computer using a peripheral bus.
Peripheral buses that connect peripheral devices to a host such as a personal computer (PC) follow different standards. Peripheral Component Interconnect (PCI) is a widely-deployed peripheral bus standard. Peripheral Component Interconnect Express (PCIe) is a newer standard that employs a high speed serial bus. PCIe is quickly gaining acceptance.
FIG. 1 shows a typical PCIe system. Instructions in programs are executed by central processing unit (CPU) processor 10. Instructions and data may be stored in dynamic random-access memory (DRAM) memory 14, or in some other memory. Memory controller 12 buffers addresses and data from processor 10 to memory 14, and may generate control signals that are specific to the type and speed of memory chips in memory 14.
Memory controller 12 may also have bus-bridge logic that allows processor 10 to read and write data to and from peripheral devices on a peripheral bus. Root complex 16 acts as the head or root of a peripheral bus that connects to several peripheral devices at endpoints arranged in a tree-like structure. Simpler bus protocols allow only processor 10 to initiate transfers over the peripheral bus, while more advanced or extended bus protocols allow endpoint peripheral devices to initiate transfers as bus masters. For example, peripheral endpoint device 21 might initiate a transfer as a bus master, reading memory 14 directly, without using processor 10. Bus mastering is often preferred since processor 10 is not delayed by the direct transfer.
Some peripheral buses such as PCIe may allow for differing speeds on different links of the bus. PCIe switch 20 has one uplink port to root complex 16, bus link 24, which operates at a higher 8× speed or bandwidth. PCIe switch 20 has three downlink ports to downstream peripheral endpoint devices 21, 22, 23, over bus links 26, 27, 28.
Peripheral endpoint device 23 may be a high-speed peripheral, allowing bus link 28 to operate at the higher 8× speed, while peripheral endpoint device 21 is a slower peripheral, so bus link 26 operates at only a 1× speed. Peripheral endpoint device 22 may be an intermediate-speed peripheral, allowing bus link 27 to operate at a 4× speed.
When processor 10 reads or writes data to peripheral endpoint device 23, root complex 16 and PCIe switch 20 can operate at the higher 8× speed. However, when processor 10 reads or writes data to peripheral endpoint device 21, PCIe switch 20 can send data over bus link 26 only at the minimum 1× speed.
FIG. 2 shows a multi-level peripheral bus. There may be several levels of PCIe switches between root complex 16 and peripheral endpoint devices. PCIe switch 30 connects to root complex 16 over bus link 24, and also connects to peripheral endpoint devices 31, 32 over high-speed bus links 34, 35. One of the downlink ports from PCIe switch 30 connects to PCIe switch 20 over switch bus link 36, which can also operate at 8× bandwidth.
Although some peripheral endpoint devices 23, 31, 32 and both PCIe switches 20, 30 can operate at the higher 8× bandwidth, when a data transfer to a slower peripheral endpoint device 21 occurs, data transfer rates slow down to match the slower peripheral endpoint device 21. Other peripheral endpoint devices may have to wait while the slower transfer to peripheral endpoint device 21 occurs, even though PCIe switches 20, 30 could handle more data. For example, a pending transfer to 8× peripheral endpoint device 23 may have to wait until a transfer to 1× peripheral endpoint device 21 finishes, since both transfers go through PCIe switch 20. The current transfer to slower peripheral endpoint device 21 is at the head of the line, or top of the queue, and delays or blocks pending transfers to other peripheral endpoint devices. This is known as head-of-line blocking.
As the transfer of data proceeds slowly over 1× bus link 26, buffers in switch 20 become full, and switch 20 cannot accept additional data, thus reducing the effective speed on high-speed link 36. The slowdown of high-speed link 36 eventually causes the buffer in switch 30 to fill up, and eventually slows down high-speed link 24, degrading the performance of the entire system. A slow device can thus create head-of-line blocking that can paralyze the entire system in a switching environment.
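The blocking sequence just described can be illustrated with a small, hypothetical discrete-time sketch. The queue contents, per-packet drain times, and link names below are illustrative only and are not taken from the PCIe specification:

```python
from collections import deque

# Cycles needed to drain one packet over each downstream link
# (illustrative numbers: the 1x link is eight times slower than the 8x link).
CYCLES_PER_PACKET = {"slow_1x_link": 8, "fast_8x_link": 1}

def completion_times(queue):
    """Cycle at which each packet leaves a strictly-FIFO switch buffer.

    Because service is in order, a slow packet at the head of the line
    delays every packet queued behind it, even packets bound for fast links.
    """
    t, finished = 0, []
    for dest in queue:
        t += CYCLES_PER_PACKET[dest]
        finished.append((dest, t))
    return finished

# Three packets for the slow device queued ahead of one fast packet.
blocked = deque(["slow_1x_link"] * 3 + ["fast_8x_link"])
# The same four packets, but with the fast packet at the head of the line.
unblocked = deque(["fast_8x_link"] + ["slow_1x_link"] * 3)

print(completion_times(blocked)[-1])   # fast packet not done until cycle 25
print(completion_times(unblocked)[0])  # fast packet done at cycle 1
```

The 1× packets at the head of the FIFO delay the 8× packet by 24 cycles, even though the fast link itself was never busy.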
Head-of-line blocking can degrade performance of a peripheral bus such as PCIe. Buffers may be added to PCIe switches 20, 30 to allow data to be stored in the PCIe switch from the higher-speed bus link, and then transferred to the slower peripheral endpoint device. Such buffering may allow transfers to higher-speed peripheral endpoint devices to experience less delay. Ideally, buffers large enough to store an entire transfer to a slower peripheral endpoint device are provided in PCIe switch 20. However, data transfers may be quite large, and may occur over extended periods of time, causing the size of such a buffer to be prohibitively large.
Software tends to prefer larger data payloads or larger packets, since the relative overhead as a percentage of the total transfer is decreased for larger payloads. For example, a transfer header may be a fixed size, such as 128 bytes. The overhead for the header is a much larger percentage for a data payload of 256 bytes than for a payload of 4K bytes. Thus software tends to use larger packet sizes by partitioning data into fewer large packets rather than many smaller packets.
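The overhead argument can be made concrete with a quick calculation. The 128-byte header and the payload sizes are the example figures used above, not figures from any PCIe specification:

```python
HEADER_BYTES = 128  # fixed per-transfer header size used in the example above

def overhead_pct(payload_bytes):
    """Header overhead as a percentage of the total bytes transferred."""
    return 100.0 * HEADER_BYTES / (HEADER_BYTES + payload_bytes)

print(round(overhead_pct(256), 1))       # 33.3 -> a third of the packet is header
print(round(overhead_pct(4 * 1024), 1))  # 3.0  -> overhead nearly vanishes at 4K
```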
FIG. 3 is a transfer diagram showing head-of-line blocking in a peripheral bus. The PCIe switch connects to the root complex using a 16× bandwidth, while the two peripheral endpoint devices connect over 8× bus links. The PCIe switch sends a message to the root complex that indicates that there are empty buffer spaces in the PCIe switch. Both peripheral endpoint devices send requests to the PCIe switch, with the request from peripheral endpoint device A arriving first, ahead of the request from peripheral endpoint device B. These requests are to read from memory 14 through PCIe switch 20 and root complex 16 of FIG. 1. Peripheral endpoint devices 21, 23 act as bus masters.
The requests from peripheral endpoint devices A, B are passed on from the PCIe switch to the root complex. The root complex uses the memory controller to read data from the memory. Since the data is large, the root complex divides the requested data into several reply packets for each request.
In response to read request A, the root complex sends three packets A.1, A.2, and A.3 to the PCIe switch. Since the PCIe switch can only store 3 packets, the root complex can send only the first 3 reply packets before the buffer in the PCIe switch becomes full. These 3 packets are sent at the full line rate of the high-speed bus link between the root complex and the PCIe switch.
The PCIe switch passes the data packets to the requesting peripheral endpoint device A as read data A.1, A.2, and A.3. As each packet is read from the buffer in the PCIe switch and sent to the peripheral endpoint device, an entry in the buffer is made available. A buffer credit is reported back to the root complex from the PCIe switch as each packet is read and sent to the peripheral endpoint device. However, there may be some delay in reporting these buffer credits as shown.
When the message with the buffer credit is received by the root complex, the root complex sends another reply packet to the PCIe switch. For example, reply packet A.4 is sent once the first buffer credit=1 message is received. Then reply packet A.5 is sent after the second buffer credit=1 message is received. This continues until all 8 reply packets for request A are sent to the PCIe switch. Then reply packets for request B can be sent, starting with reply packets B.1, B.2, etc.
Since the B.1 reply packet must wait until all 8 read A reply packets are sent, the B packets are blocked by the pending A request. The delay is increased since the A reply packets are sent at a slower rate. While the initial 3 reply packets A.1, A.2, A.3 are sent quickly at the higher line rate, the later packets A.4, A.5, . . . A.8 are sent only after each buffer credit message is received by the root complex. These buffer credit messages are created only as each packet is read from the buffer in the PCIe switch.
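The one-credit-per-drained-packet exchange described above can be sketched as a simple loop. The buffer depth of 3 and the 8-packet reply stream are the figures from the FIG. 3 discussion, while the strictly alternating send/drain timing is a simplification of the actual delayed credit messages:

```python
def send_with_credits(total_packets, buffer_slots):
    """Trace a credit-gated reply stream from root complex to switch.

    The root complex may transmit only while it holds credits; the switch
    returns one credit each time it drains a packet over the slow link.
    """
    credits, sent, drained, trace = buffer_slots, 0, 0, []
    while drained < total_packets:
        # Burst at full line rate until the switch buffer is full.
        while credits > 0 and sent < total_packets:
            credits -= 1
            sent += 1
            trace.append(f"send A.{sent}")
        # One packet leaves the buffer at the slow link's pace,
        # freeing one slot and returning one credit.
        drained += 1
        credits += 1
        trace.append(f"drain A.{drained}, credit=1 returned")
    return trace

trace = send_with_credits(total_packets=8, buffer_slots=3)
print(trace[:4])  # initial burst of 3 sends, then the first drain
```

After the initial burst of 3 packets, every further send is gated by a returned credit, so reply packets A.4 through A.8, and everything queued behind them for request B, proceed at the slow link's pace.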
This reading of packets from the buffer in the PCIe switch is limited to the speed of the slow bus link to the peripheral endpoint device A. Thus the blocking delay is worsened by the slow bus link. The back-up extends back to the root complex, even though a high-speed bus link connects to the root complex. This blocking can block all requests, even to high-speed peripheral endpoint devices or other PCIe switches, and even when bus links can operate at higher speeds. Thus the system slows down to the speed of the slowest peripheral endpoint device when head-of-line blocking occurs. Furthermore, these delays can be cumulative—as more requests to slow links are received, the delays increase.
While increasing the buffer size in the PCIe switch is useful, very large buffer sizes may be needed. A maximum packet size may be 4K bytes. However, each peripheral endpoint device may have several levels of operation, resulting in several flows that can be active at the same time. Each flow can receive packets of up to 4K bytes each. Thus each peripheral endpoint device may require 8K, 16K, or 32K bytes or more of buffering. When a PCIe switch connects to several peripheral endpoint devices, the size of the buffer may exceed several hundred K bytes. This large buffer size is undesirable.
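This buffer-size arithmetic can be sketched directly. The 4K-byte maximum packet is the example figure from the text, and the flow and device counts below are illustrative, not requirements of any standard:

```python
MAX_PACKET_BYTES = 4 * 1024  # example maximum packet size from the text

def switch_buffer_bytes(devices, flows_per_device):
    """Worst-case buffering if every flow can hold one maximum-size packet."""
    return devices * flows_per_device * MAX_PACKET_BYTES

# A single device with 8 concurrent flows already needs 32K bytes...
print(switch_buffer_bytes(devices=1, flows_per_device=8))  # 32768
# ...and a switch serving 8 such devices needs 256K bytes of buffering.
print(switch_buffer_bytes(devices=8, flows_per_device=8))  # 262144
```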
What is desired is a PCIe switch that reduces or avoids delays from head-of-line blocking while using relatively smaller buffers. A PCIe switch that can fragment requests to allow requests from faster peripheral endpoint devices to move ahead of a pending request from a slow peripheral endpoint device is desired.