Modern processors can include performance monitoring units, and software tools can also be used to monitor performance. However, existing performance analysis tools and techniques cannot accurately analyze the complex performance issues of highly threaded workloads on multi-threaded, many-core architectures. Moreover, the growing popularity of thread pool (also known as “task pool” or “work queue”) programming models increases analysis complexity, because these models rely on software “tasks” that are not directly visible to an operating system (OS), software (SW) analysis tools, or hardware (HW) performance monitoring units. In task-based threading, a software thread is created and assigned to each hardware thread, and each software thread is then presented with a work queue of tasks to be performed. Thus, although efficient, this threading model presents challenges for conventional performance analysis.
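The task-based threading model described above can be illustrated with a minimal sketch. It is not taken from any particular runtime: the names `run_task_pool`, `worker`, and `executed_by` are hypothetical. The sketch also assumes that the worker count reported by `os.cpu_count()` is an adequate stand-in for the number of hardware threads; Python's `threading` module does not itself bind a software thread to a hardware thread. The point it shows is that the operating system sees only the long-lived worker threads, while the tasks they pull from the queue exist purely as software objects.

```python
import os
import queue
import threading


def run_task_pool(tasks, num_workers=None):
    """Run callables from a shared work queue on a fixed set of worker threads.

    Hypothetical helper for illustration only. Returns two dicts keyed by
    task index: the task results and the name of the worker that ran each task.
    """
    if num_workers is None:
        # Assumption: one software thread per logical CPU reported by the OS.
        num_workers = os.cpu_count() or 1

    # Fill the work queue before any worker starts.
    work_queue = queue.Queue()
    for task_id, task in enumerate(tasks):
        work_queue.put((task_id, task))

    results = {}
    executed_by = {}
    lock = threading.Lock()

    def worker():
        # Each worker is one long-lived software thread. The tasks it runs
        # are ordinary callables that the OS never sees as separate entities.
        while True:
            try:
                task_id, task = work_queue.get_nowait()
            except queue.Empty:
                return  # No work left: the worker thread exits.
            value = task()
            with lock:
                results[task_id] = value
                executed_by[task_id] = threading.current_thread().name

    workers = [
        threading.Thread(target=worker, name=f"worker-{i}")
        for i in range(num_workers)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results, executed_by
```

In this sketch a monitor that observes only threads would see `worker-0`, `worker-1`, and so on, with no indication of which task was running at any moment. Recovering that mapping requires bookkeeping inside the pool itself, which is what the `executed_by` dictionary stands in for here.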
Developers want to target performance analysis at specific tasks running within individual software threads, so that the analysis is not obscured by the complexity of multiple hardware threads per core or by modern thread programming techniques. However, current hardware capabilities and monitoring tools do not support such targeted performance analysis. Instead, current performance monitoring software tools often work around this problem with a crude statistical technique that at best provides a rough approximation.