When a processor-based system is turned on, instructions within the system are run to power up the various parts of the system, such as the video display, the keyboard, and the hard drive. Eventually, an operating system is loaded, which generally includes an attractive graphical user interface. The loaded operating system enables the user to perform a myriad of different actions with the system, typically by loading a piece of software onto the system.
Besides these operations, there are many other actions taking place outside of the view of the system user. Portable machine code (pcode) within the system, for example, enables different entities within the system to communicate with one another. The entities include, but are not limited to, central processing units (CPUs), memories, graphics controllers, busses, and peripheral hubs that connect to and control the various peripheral devices connected to the processor-based system.
As with the higher-level operating system, driver, and other software loaded into the system, the portable machine code running inside the system may experience latency, which diminishes the efficiency of the system. Latency is a measure of time delay and can impact virtually any communication between any devices.
Many systems today are built under the PCI Express (PCIe) standard, in which the link width, that is, the number of lanes between devices, is adjustable. One, two, four, eight, sixteen, and thirty-two lanes are possible under PCIe. Thus, a “by eight” (×8) system means there are eight lanes being used, with each lane having two differential signaling pairs, one for transmission and the other for reception. The number of lanes in use at a given moment affects the throughput of the system, and thus the speed at which operations take place.
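The effect of lane count on throughput can be sketched with simple arithmetic. The example below is illustrative only: it assumes a per-lane payload rate of 250 MB/s in each direction (the first-generation PCIe figure, 2.5 GT/s with 8b/10b encoding), and the function names and the one-megabyte payload are hypothetical choices, not values taken from the system described here.

```python
# Assumed per-lane, per-direction payload rate (first-generation PCIe:
# 2.5 GT/s with 8b/10b encoding). Later generations are faster.
ASSUMED_LANE_RATE_MB_S = 250.0

# Lane counts named in the text above.
VALID_WIDTHS = (1, 2, 4, 8, 16, 32)


def link_throughput_mb_s(lanes, lane_rate_mb_s=ASSUMED_LANE_RATE_MB_S):
    """Aggregate one-direction throughput for a link of the given lane count."""
    if lanes not in VALID_WIDTHS:
        raise ValueError(f"unsupported lane count: {lanes}")
    return lanes * lane_rate_mb_s


def transfer_time_us(payload_bytes, lanes, lane_rate_mb_s=ASSUMED_LANE_RATE_MB_S):
    """Time in microseconds to move a payload, ignoring protocol overhead."""
    # bytes divided by (MB/s) yields microseconds, since 1 MB/s == 1 byte/us.
    return payload_bytes / link_throughput_mb_s(lanes, lane_rate_mb_s)


if __name__ == "__main__":
    for lanes in VALID_WIDTHS:
        print(lanes, link_throughput_mb_s(lanes), transfer_time_us(1_000_000, lanes))
```

Under these assumptions, halving the lane count doubles the time needed to move the same payload, which is the throughput cost that any lane-reduction scheme has to weigh against its power savings.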
Also under PCIe, many processor-based systems today are designed with low power states. Particularly for laptops, cellphones, and other power-sensitive devices, the low power states occur when the system is not being used, which is intended to prolong the battery life, and thus the portability, of the device. Although low power states may be achieved by turning off parts of the system, reducing the link width is also a mechanism for reducing the power consumed by the system.
Reducing the link width to save energy may increase the latency of the system. There are solutions to mitigate the latency issue. The solutions rely on 1) “nimble” hardware, 2) deep buffers, 3) unsaturated queues, or 4) a combination of 1), 2), and 3).
If the hardware in the system is nimble enough, the hardware may re-provision the link rapidly. For example, there are specialized busses that connect between CPUs, known as QuickPath Interconnect (QPI) busses. The QPI bus is designed to speed up communication between two CPUs and has a link width designator, L0p. QPI's L0p “blackout” time during upshift from one link width to another link width is only a few tens of nanoseconds, which allows for short response delays on the order of tens of microseconds to service spurts of heavy traffic between the CPUs.
Deep buffers are provided by endpoints. For example, a network interface card (NIC) may provide 64 kilobytes of buffer storage in its LAN-to-PCIe pipeline. This provides the NIC with large amounts of data to feed through the pipeline during processing flows. Large buffers hide latency by storing incoming requests while the consumer is returning to full operation, such as when exiting a power-control state. The consumer in this context is the buffer content-consuming PCIe link, which is momentarily (e.g., a few microseconds) offline.
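The latency-hiding capacity of a buffer can be estimated by dividing its size by the rate at which data arrives. The sketch below applies that arithmetic to the 64-kilobyte NIC buffer mentioned above. The 10 Gb/s arrival rate and the 5-microsecond outage are assumptions chosen for illustration, and the function names are hypothetical.

```python
def max_hidden_outage_us(buffer_bytes, ingress_gbps):
    """Longest consumer outage (microseconds) a buffer can absorb without overflow.

    Assumes data arrives at a constant rate and the buffer starts empty.
    """
    # bits / (Gb/s * 1e9) seconds, times 1e6 us/s == bits / (Gb/s * 1e3).
    return buffer_bytes * 8 / (ingress_gbps * 1e3)


def bytes_buffered_during_outage(outage_us, ingress_gbps):
    """Bytes that accumulate while the consumer is offline for outage_us."""
    return outage_us * ingress_gbps * 1e3 / 8


if __name__ == "__main__":
    nic_buffer = 64 * 1024   # 64 kilobytes, from the text
    assumed_rate = 10.0      # Gb/s, an assumed LAN rate
    print(max_hidden_outage_us(nic_buffer, assumed_rate))
    print(bytes_buffered_during_outage(5.0, assumed_rate))
```

With these assumed numbers, the buffer could absorb an outage of roughly fifty microseconds, comfortably more than a link that is offline for a few microseconds, which is why a deep buffer can mask the exit from a power-control state.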
Transmit queues may help with the latency issue, but are expected to behave in a particular manner. For example, the QPI bus has a small packet payload, combined with the relatively random behavior of cache-misses, which leaves its relatively small (a few tens of lines) queue in an “un-saturated” state most of the time. At full load, the queue is rarely empty, and is rarely full. Hence, for the QPI bus, a queue-depth threshold works well as a proxy for latency.
Now consider the typical PCIe behavior of a front-end server whose main task is to deliver webpages. The hardware isn't “nimble”: a PCIe re-provisioning cycle incurs a link blackout on the order of several microseconds, which pushes the “checkpoint” interval for re-provisioning decisions into the millisecond range. The root complex buffer is only four kilobytes deep, while the webpage to be transmitted is many times larger. Therefore, when the webpage starts “pouring” through the PCIe transmitter pipeline, it saturates the queue, and when it stops, the queue goes empty. There is little opportunity for the queue to “bounce around” in some mid-state. Hence, a queue-depth threshold in the root complex serves as a poor proxy for latency.
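The contrast between a queue that hovers in a mid-state and one that swings between full and empty can be illustrated with a toy discrete-time model. Everything in the sketch below is hypothetical: the tick-based model, the function name, the drain rates, and both traffic patterns are assumptions made for illustration, with one queue unit loosely standing in for one kilobyte in the bursty case.

```python
def occupancy_profile(arrivals, capacity, drain_per_tick):
    """Fraction of ticks a bounded queue spends empty, in a mid-state, or full.

    arrivals[i] is the amount of data produced upstream at tick i. Data that
    does not fit in the queue waits in an upstream backlog (back-pressure).
    """
    backlog = 0  # data waiting upstream of the queue
    depth = 0    # current queue occupancy
    counts = {"empty": 0, "mid": 0, "full": 0}
    for arriving in arrivals:
        backlog += arriving
        # The link drains the queue first...
        depth -= min(depth, drain_per_tick)
        # ...then the producer refills whatever space is free.
        moved = min(backlog, capacity - depth)
        depth += moved
        backlog -= moved
        if depth == 0:
            counts["empty"] += 1
        elif depth == capacity:
            counts["full"] += 1
        else:
            counts["mid"] += 1
    total = len(arrivals)
    return {state: n / total for state, n in counts.items()}


# Bursty case: one 64-unit "webpage" hits a 4-unit queue, then the link idles.
bursty = occupancy_profile([64] + [0] * 199, capacity=4, drain_per_tick=1)

# Fine-grained case: small, irregular arrivals into a 32-unit queue.
smooth = occupancy_profile([3, 1, 4, 0, 2, 2, 1, 3] * 25, capacity=32, drain_per_tick=2)

if __name__ == "__main__":
    print("bursty:", bursty)
    print("smooth:", smooth)
```

In this model the bursty queue is almost always either full or empty, so a depth threshold reports only "saturated" or "idle" and carries little information about how long data has been waiting. The fine-grained queue stays in a mid-state, where depth does track the amount of pending work.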
Thus, there is a continuing need for a solution that overcomes the shortcomings of the prior art.