Modern microprocessors are very sophisticated compared to those developed two or three decades ago. Some microprocessors are packaged as individual, discrete devices for assembly onto a circuit board, while others are implemented as a “core” or “processing element” (PE) that can be combined with other functions in a single integrated circuit (e.g., multiple processors in one package, or a processor with support peripherals in another package).
A key driver for the increased complexity has been the need to increase single-threaded performance. High-performance microprocessors rely on deep execution pipelines, speculative execution and advanced prediction capabilities. In addition, in recent years multi-threading has been introduced with the aim of addressing the latency cost associated with accessing memory. Although this does not improve single-threaded performance, it offers an increase in the overall processing bandwidth of the computing system.
The complexity introduced with these advanced features makes it increasingly difficult for software designers to ensure that their software makes optimal use of the underlying hardware.
Furthermore, given the heavy reliance on prediction and speculation, being able to resolve hardware events, such as cache misses and branch mispredictions, has become increasingly important. These issues mean that developing high-performance code requires detailed analysis of how the code runs on the microprocessor.
An important part of this analysis is associating performance events with the source code, so that programmers can take steps towards optimization. This is achieved using hardware profiling mechanisms provided in modern microprocessors, which allow identification of the address(es) of the instruction(s) that cause performance hazards, such as cache misses, on the microprocessor.
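As an illustration of how sampled instruction addresses might be associated with source code, the following minimal Python sketch looks up each address in a sorted line table and tallies events per source location. The table contents, the file name `main.c`, and the function names `address_to_source` and `attribute_events` are hypothetical and chosen only for this example; a real tool would obtain the mapping from the program's debug information.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical line table: (start address, source location), sorted by address.
# Each entry covers addresses from its start up to the next entry's start;
# in this toy model the last entry is treated as open-ended.
LINE_TABLE = [
    (0x1000, "main.c:10"),
    (0x1010, "main.c:11"),
    (0x1024, "main.c:14"),
]
STARTS = [start for start, _ in LINE_TABLE]


def address_to_source(address):
    """Return the source location covering `address`, or None if below the table."""
    index = bisect_right(STARTS, address) - 1
    if index < 0:
        return None
    return LINE_TABLE[index][1]


def attribute_events(sampled_addresses):
    """Count sampled performance events per source location."""
    counts = Counter()
    for address in sampled_addresses:
        location = address_to_source(address)
        counts[location or "<unknown>"] += 1
    return counts


# Example: four sampled addresses, one of which precedes the table.
event_counts = attribute_events([0x1004, 0x1010, 0x1030, 0x0FFF])
```

Source locations with the highest counts in `event_counts` would be the first candidates for optimization.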
A typical “profiler” configures the microprocessor hardware to count an event of interest, such as cache misses, so that when the number of cache misses exceeds a threshold specified by the configuration, the hardware raises an interrupt, and additional program(s) can capture the exact effective address of the event that caused the interrupt.
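The threshold-based sampling described above can be sketched as a toy software model in Python. This is not a real hardware interface: the class `EventCounter`, its `record_event` method, and the `on_interrupt` callback are assumptions made for illustration, and resetting the count after each interrupt is likewise an assumption of the sketch.

```python
class EventCounter:
    """Toy software model of a hardware event counter with a threshold."""

    def __init__(self, threshold, on_interrupt):
        self.threshold = threshold        # configured by the profiler
        self.on_interrupt = on_interrupt  # stands in for the interrupt handler
        self.count = 0

    def record_event(self, effective_address):
        """Count one event; signal an interrupt once the threshold is exceeded."""
        self.count += 1
        if self.count > self.threshold:
            self.count = 0  # assumed: the counter restarts after each interrupt
            self.on_interrupt(effective_address)


# The handler simply records the effective address of the triggering event.
samples = []
counter = EventCounter(threshold=3, on_interrupt=samples.append)

# Eight simulated cache-miss addresses (hypothetical values).
for address in [0x2000, 0x2008, 0x2010, 0x2018, 0x2020, 0x2028, 0x2030, 0x2038]:
    counter.record_event(address)
```

With a threshold of 3, every fourth event exceeds the threshold, so `samples` ends up holding the addresses of the fourth and eighth events. A larger threshold yields fewer samples and lower overhead, at the cost of coarser attribution.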