1. Field of the Invention
The field of the invention is data processing, or, more specifically, methods, apparatus, and products for performance measurement of hardware accelerators.
2. Description of Related Art
General purpose processors have been the workhorses of computation, used to build uniprocessor systems, symmetric multiprocessor systems, and chip multi-processor systems over the last several decades. While general purpose cores are designed to achieve the best “average” behavior across a collection of workloads, they are typically sub-optimal for each individual workload. The inherent inefficiencies of general purpose processing cores (introduced by the constraints of industry standards and the design goal of good average performance across a wide set of workloads) have typically been hidden over the last several decades by exponential growth in transistor density per unit area (Moore's Law) and constant power density per unit area (Dennard Scaling). The new reality is that while Moore's Law continues to provide ever-increasing transistor counts per unit area, Dennard Scaling has slowed significantly. This means that performance growth using general purpose cores is possible only with a superlinear increase in chip and system power budgets. Therefore, computing machinery of the future will be forced to move away from the energy inefficiencies of a general purpose computer toward specialized task-specific processors, or accelerators. Specializing hardware cores for specific workloads gives a significant performance advantage, as well as a performance-per-watt advantage. Offloading certain tasks from the general purpose processor to a task-specific accelerator can result in execution speed-up (by orders of magnitude for the task, in some cases) while consuming less power than performing the same task on the general purpose processor.
Measuring the performance of an accelerator is used to validate the accelerator design. It is a valuable tool for understanding design bottlenecks and for guiding chip design, system design, and software design. However, there is a significant challenge in making this measurement accurately. There are typically two ways in which accelerator performance is measured. The first involves programming the performance counters (if available) in the accelerator. The second uses software measurement tools to identify completion of work by an accelerator and then reads a timer register. The second approach is the preferred one: it is more general (it works even when specific performance counters are unavailable), more reliable (there is no need to rely on a library that can program performance counters efficiently), and simpler (there is no need to learn the intricacies of the available performance counters and what they mean). For example, during lab system bringup, the performance counters are often not readily available, or at least not available to measurement tools. Even after they become available, there are bugs to be resolved. In the meantime, the second approach continues to work. That said, the second approach relies on measurement software that runs on the general purpose core and communicates with an accelerator.
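The second approach can be illustrated with a minimal sketch. It assumes, purely for illustration, that a Python thread stands in for the accelerator, that a `threading.Event` stands in for the completion indication, and that `time.perf_counter_ns` stands in for the timer register. The names `simulated_accelerator` and `measure_by_polling` are hypothetical and do not come from any real accelerator interface.

```python
import threading
import time


def simulated_accelerator(done: threading.Event, work_seconds: float) -> None:
    """Hypothetical stand-in for an accelerator: 'works', then signals completion."""
    time.sleep(work_seconds)
    done.set()


def measure_by_polling(work_seconds: float) -> int:
    """Return the elapsed time in nanoseconds as seen by the polling software.

    The measurement software starts the task, repeatedly tests for completion,
    and reads the timer only after it observes that the task is complete.
    """
    done = threading.Event()
    worker = threading.Thread(target=simulated_accelerator, args=(done, work_seconds))

    start_ns = time.perf_counter_ns()  # assumed stand-in for reading the timer register
    worker.start()
    while not done.is_set():           # software test for completion of work
        pass
    end_ns = time.perf_counter_ns()    # timer is read only after completion is observed

    worker.join()
    return end_ns - start_ns
```

The value returned includes the time spent starting the task, testing for completion, and reading the timer, so it is an upper bound on the time the simulated accelerator actually spent on the task.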
These prior art approaches have problems. The software measurement tools must communicate with an accelerator (either directly or via memory) that is relatively far away, attached via an on-chip or off-chip interconnect. In addition, software must execute at least a handful of instructions to test completion of work by an accelerator, and those instructions take time to execute. Moreover, all post-completion measurement steps take time, leading to a best case measurement granularity, that is, the shortest time in which software can test for task completion. While these prior art techniques work when tasks complete at the accelerator more slowly than this measurement granularity, the completion rate at the accelerator is often much higher than the granularity available to a software measurement tool or application running on the general purpose core. After all, the whole point of acceleration is to go faster than any general purpose core or processor. This is especially true of accelerators working on small amounts of data, for example, encryption of small Ethernet packets. In such a case, a subject task may have completed at the accelerator a statistically significant amount of time before a performance measurement tool is even able to probe and recognize that the task is completed.
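The effect of measurement granularity can be sketched with a simple idealized model. It assumes that software can test for completion only at whole multiples of a fixed granularity after the task starts, which is a simplification. The function names and the numbers used in the usage note are hypothetical and are not measurements of any real system.

```python
def observed_latency_ns(true_latency_ns: int, granularity_ns: int) -> int:
    """Latency reported by software that can test completion every granularity_ns.

    Assumption: completion is first observed at the earliest multiple of
    granularity_ns that is not earlier than the true completion time.
    """
    if true_latency_ns <= 0 or granularity_ns <= 0:
        raise ValueError("latency and granularity must be positive")
    checks = -(-true_latency_ns // granularity_ns)  # integer ceiling division
    return checks * granularity_ns


def relative_error(true_latency_ns: int, granularity_ns: int) -> float:
    """Overestimate of the task latency, as a fraction of the true latency."""
    observed = observed_latency_ns(true_latency_ns, granularity_ns)
    return (observed - true_latency_ns) / true_latency_ns
```

Under this model, with a hypothetical granularity of 400 ns, a task that truly takes 50 ns is reported as 400 ns, an overestimate of seven times the true latency, whereas a task that truly takes 10,050 ns is reported as 10,400 ns, an overestimate of under four percent. The shorter the task, the more the measurement reflects the software rather than the accelerator.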