For years, processor designers were able to fully leverage Moore's Law, which states that the density of components integrated on a single chip grows exponentially over time. Alongside increasing chip density, clock rates historically followed a trend of doubling approximately every 18 months. However, due to the increasing power requirements of processors, this clock frequency scaling is no longer possible. Instead, processor manufacturers have moved to designing multi-core processor systems, which exploit increasing chip density through spatial parallelism while clock rates remain relatively constant. It is predicted that the number of cores in multi-core systems will eventually scale to the tens and hundreds. This poses a significant challenge for the Operating System (OS), which has to determine how to schedule tasks effectively on these complex systems. How will the OS decide how to schedule threads so as to minimize cache contention, and how will it determine which processor(s) meet each task's execution requirements on heterogeneous systems?
Currently, a number of hardware counters are included as part of a processor's architecture to enable limited profiling of applications at run time. However, these do not provide sufficient flexibility or information to effectively guide the OS in task assignment. While existing counters report the symptoms of a problem (e.g., how large the cache miss rate is), they do not provide insight into why the problem has occurred or how it could be fixed.
Recent advances in integrated circuit technology have opened the door to very complex computation platforms. These platforms provide the high performance needed for both existing and many emerging applications. Many of these platforms contain multiple processors on a single chip. These modern multicore processors, also known as Multi-Processor Systems-on-Chip (MPSoC), contain multiple processing units that share caches and bus or point-to-point interconnects. This intimate sharing of resources among the cores creates many opportunities for performance optimization through co-operative sharing of resources, but may also cause performance degradation through shared resource contention. Since the introduction of multicore architectures into mainstream computing, much research effort has been dedicated to finding means of exploiting these opportunities and eliminating the problems. The challenge in these endeavours is that different workloads (programs running on the computer) have very different properties, meaning that the resource management policy must also depend on the workload. To that end, researchers have striven for improved observability into performance on multicore processors.
Existing observability tools, such as simple hardware performance counters, do not give enough information to address these issues, and as a result many proposals for more complex hardware counter architectures have emerged. These new architectures were a significant improvement over the existing state of the art in that they allowed a much deeper understanding of the properties of workloads and their interactions with multicore hardware. Unfortunately, their implementation in real processors required modifications to the processor's underlying operation and architecture, and this proved to be a barrier to bringing these fruitful ideas to real devices. Many of the proposals for new observability enhancements remain research prototypes.