1. Field of the Invention
The present invention relates generally to computer I/O interconnects. More particularly, the present invention relates to read control in a computer I/O interconnect.
2. Description of the Related Art
In a computer architecture, a bus is a subsystem that transfers data between computer components inside a computer or between computers. Unlike a point-to-point connection, which is a different type of computer input/output (I/O) interconnect, a bus can logically connect several peripherals over the same set of wires. Each bus defines its own set of connectors for physically plugging devices, cards or cables together.
There are many different computer I/O interconnect standards available. One of the most popular over the years has been the peripheral component interconnect (PCI) standard. PCI allows the bus to act like a bridge, which isolates a local processor bus from the peripherals, allowing a Central Processing Unit (CPU) of the computer to run much faster.
Recently, a successor to PCI has been popularized, termed PCI Express (or, simply, PCIe). PCIe provides higher performance, increased flexibility and scalability for next-generation systems, while maintaining software compatibility with existing PCI applications. Compared to legacy PCI, the PCI Express protocol is considerably more complex, with three layers: the transaction, data link and physical layers.
In a PCI Express system, a root complex device connects the processor and memory subsystem to the PCI Express switch fabric comprised of one or more switch devices (embodiments are also possible without switches, however). In PCI Express, a point-to-point architecture is used. Similar to a host bridge in a PCI system, the root complex generates transaction requests on behalf of the processor, which is interconnected through a local I/O interconnect. Root complex functionality may be implemented as a discrete device, or may be integrated with the processor. A root complex may contain more than one PCI Express port and multiple switch devices can be connected to ports on the root complex or cascaded.
PCI Express also supports split read completions. This means that the completion of a read request initiated at a particular time may not be performed until a later time. Essentially, the read request must wait in a queue until it is serviced. Since a request is typically only 12-20 bytes, whereas the size of a completion response can range up to 4096 bytes, there is a natural imbalance where requests can accumulate faster than data can be returned.
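The size imbalance described above can be made concrete with a short calculation. The following is a minimal sketch using only the request and completion sizes stated in the text; the function and constant names are hypothetical and chosen for illustration.

```python
# Sizes taken from the text: a read request is typically 12-20 bytes,
# while a completion response can carry up to 4096 bytes.
REQUEST_BYTES_MIN = 12
REQUEST_BYTES_MAX = 20
MAX_COMPLETION_BYTES = 4096


def amplification(request_bytes, completion_bytes):
    """Bytes of completion data owed per byte of request traffic."""
    return completion_bytes / request_bytes


# Largest request size gives the smallest ratio, and vice versa.
low = amplification(REQUEST_BYTES_MAX, MAX_COMPLETION_BYTES)
high = amplification(REQUEST_BYTES_MIN, MAX_COMPLETION_BYTES)
print(f"each request byte can demand {low:.1f} to {high:.1f} completion bytes")
```

Under these figures, a single request can call for roughly 200 to 340 times its own size in returned data, which is why requests can be queued far faster than they can be serviced.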
This relative size imbalance between requests and completion data responses can negatively affect performance if too many requests are active at one time. This is especially true in a typical PCIe system where multiple downstream devices all try to read from a single root complex, and wherein the root complex typically services the read requests in a first-come-first-served fashion. If the requests are for large amounts of data, a long read request queue can develop in the root complex as it services the requests. This long queue can be exacerbated if the final data destination (the source of the read request) has less bandwidth than the data supplier (the request destination), which is common in host-centric PCIe systems, where the link closest to the root complex is typically the widest. Once intermediary buffers are filled, the bandwidth of the root complex effectively reduces to the bandwidth of the data sink.
If a new downstream device sends its first read request into this long queue of requests in the destination, the new read request will wait for the entire read request queue ahead of it to drain before it will get serviced. The long wait time for a response can dramatically impact performance.
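The queueing effect can be illustrated with a simplified model. The following is a minimal sketch, assuming strict first-come-first-served service and a drain rate capped by the slower of the data supplier and the data sink; the rates and queue contents are hypothetical values chosen for illustration and are not measurements from any particular system.

```python
def effective_drain_rate(supplier_bytes_per_ns, sink_bytes_per_ns):
    """Once intermediary buffers fill, data can only leave the supplier
    as fast as the slower party accepts it (simplifying assumption)."""
    return min(supplier_bytes_per_ns, sink_bytes_per_ns)


def fcfs_wait_ns(queued_read_bytes, drain_bytes_per_ns):
    """Time a newly arrived request waits before its service begins:
    all data owed to the requests ahead of it must be returned first."""
    if drain_bytes_per_ns <= 0:
        raise ValueError("drain rate must be positive")
    return sum(queued_read_bytes) / drain_bytes_per_ns


# Hypothetical rates: supplier could deliver 2 bytes/ns, sink accepts 1 byte/ns.
rate = effective_drain_rate(2.0, 1.0)

# Hypothetical queue: sixteen pending reads of 1024 bytes each.
queue_ahead = [1024] * 16

wait_idle = fcfs_wait_ns([], rate)           # nothing ahead in the queue
wait_busy = fcfs_wait_ns(queue_ahead, rate)  # full queue ahead
print(wait_idle, wait_busy)
```

In this model, the wait seen by the new request grows linearly with the total data owed to earlier requests, regardless of how small the new request itself is.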
For example, suppose a PCIe switch connects a single x8 upstream port to two x4 downstream ports. One downstream port has a FibreChannel RAM disk that is capable of sending sixteen 1024-byte memory read requests at a time. The other downstream port is a dual Gigabit Ethernet controller that can send 2 read requests at a time (1 per channel), with the read size being either 16 bytes (for a descriptor) or 1500 bytes (for an Ethernet packet). The root complex sends 64-byte completions, so a 1024-byte read request would result in 16 partial completions.
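The completion counts in this example follow from a ceiling division. The following is a minimal sketch, assuming each read is aligned so that it splits evenly on the 64-byte completion size (an unaligned read could require one additional completion); the function name is hypothetical.

```python
def partial_completions(read_bytes, completion_bytes=64):
    """Number of completions needed to return read_bytes of data,
    assuming the read is aligned to the completion size."""
    return -(-read_bytes // completion_bytes)  # ceiling division


ram_disk_read = partial_completions(1024)   # one 1024-byte read
ethernet_packet = partial_completions(1500) # one 1500-byte packet read
descriptor = partial_completions(16)        # one 16-byte descriptor read

# Completions owed when all sixteen RAM disk reads are outstanding at once.
ram_disk_outstanding = 16 * ram_disk_read
print(ram_disk_read, ethernet_packet, descriptor, ram_disk_outstanding)
```

Under these assumptions, the RAM disk can have 256 completions pending at once, while a single Ethernet read needs at most 24.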
By itself, the Ethernet controller may process 1885 Mb/s, with the memory read latency of an Ethernet channel being around 340 ns. When the FibreChannel RAM disk is plugged in, however, the FibreChannel RAM disk processes 752 MB/s of completions (the same as it normally does) while the Ethernet controller performs at only 180 Mb/s. Here, the memory read latency of the Ethernet channel is around 6200 ns. Thus, when both devices are on, the FibreChannel RAM disk interferes with the Ethernet controller even though the performance of the FibreChannel RAM disk itself is not affected. This is because the FibreChannel RAM disk initially fills the switch's buffer with completions at a x8 rate, but then the upstream bandwidth drops to a x4 rate, due to the switch's downstream link to the FibreChannel device being only x4. Due to the congestion, the Ethernet controller takes much longer to get data back, as seen from the increased latency. Since the Ethernet device can have only 2 reads outstanding, a longer response time for those reads results in a major drop in performance.
The above example illustrates how the aggressive reading behavior of one device can dramatically and negatively affect another PCIe device. There is nothing forbidden about this configuration, and by themselves the devices each seem to perform quite well, making this a problem that a cursory analysis of the system would not reveal.