1. Field of the Disclosure
This disclosure relates to shared statistics counters, and more specifically to techniques for improving the performance of applications that include accesses to shared statistics counters.
2. Description of the Related Art
Current trends in multicore architecture design imply that in coming years, there will be an accelerated shift away from simple bus-based designs towards distributed non-uniform memory-access (NUMA) and cache-coherent NUMA (CC-NUMA) architectures. Under NUMA, the memory access time for any given access depends on the location of the accessed memory relative to the processor. Such architectures typically consist of collections of computing cores with fast local memory (e.g., memory that is closely coupled to the processor and/or that is located on the same single multicore chip), communicating with each other via a slower (inter-chip) communication medium. In such systems, the processor can typically access its own local memory, such as its own cache memory, faster than non-local memory. In some systems, the non-local memory may include one or more banks of memory shared between processors and/or memory that is local to another processor. Some systems, including many NUMA systems, provide a non-uniform communication architecture (NUCA) property, in which the access time to caches of other processor cores varies with their physical distance from the requesting core. In these systems, access by a core to its local memory, and in particular to a shared local cache, can be several (or many) times faster than access to a remote memory (e.g., a cache located on another chip).
Most large software systems use statistics counters for performance monitoring and diagnostics. For example, statistics counters are of practical importance for purposes such as detecting excessively high rates of various system events, or for mechanisms that adapt based on event frequency. While single-threaded statistics counters are trivial, commonly used naïve concurrent implementations quickly become problematic, especially as thread counts grow. For example, as systems grow and as statistics counters are used on machines with increasingly non-uniform memory access characteristics, commonly used naïve counters impose scalability bottlenecks and/or such inaccuracy that they are not useful. In particular, these counters (when shared between threads) can incur cache invalidation traffic on every modification of the counter, which is especially costly on NUMA machines.
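By way of illustration only, the two naïve approaches alluded to above may be sketched as follows. The class and function names (LockedCounter, RacyCounter, hammer) are hypothetical and do not appear in this disclosure. The sketch assumes a lock as the single serialization point; it does not model cache-coherence traffic, which depends on the underlying hardware. The first counter is accurate but forces every increment through one shared lock, while the second omits synchronization and may therefore lose updates.

```python
import threading


class LockedCounter:
    """Naive shared counter: accurate, but every increment serializes on one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def increment(self):
        with self._lock:
            self._value += 1

    def read(self):
        with self._lock:
            return self._value


class RacyCounter:
    """Unsynchronized counter: the read-modify-write is not atomic,
    so concurrent increments may overwrite each other and be lost."""

    def __init__(self):
        self._value = 0

    def increment(self):
        current = self._value       # read
        self._value = current + 1   # write; another thread may have run in between

    def read(self):
        return self._value


def hammer(counter, num_threads, increments_per_thread):
    """Hypothetical helper: increment `counter` from several threads, return the final value."""

    def worker():
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.read()
```

With four threads performing 1000 increments each, the locked counter always reports 4000, whereas the unsynchronized counter may report any value up to 4000, depending on how the threads interleave.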
The ability to execute transactions in parallel is a key to scalable performance. However, the use of shared counters for collecting statistics (e.g., statistics on how often a piece of code is executed, how many elements are in a hash table, etc.) can negatively impact transactional success rates when accesses to the counters occur within transactions (since any two updates to a shared counter by different transactions or threads will potentially conflict with each other). Some previous approaches to solving this problem involve moving the operations that update the counter outside of the transactions, thereby changing the semantics of the program, or implementing complicated and expensive support for “transactional boosting”, which is not applicable in all contexts.
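The conflict described above can be made concrete with a minimal sketch, given here under stated assumptions rather than as a description of any particular transactional memory implementation. It assumes a simple model in which each transaction records a read set and a write set, and in which two transactions conflict if one writes a location that the other reads or writes. The names Transaction, conflicts, and insert_into_bucket, as well as the "size_counter" location, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    """Toy transaction that only tracks which locations it reads and writes."""
    name: str
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)

    def increment(self, location):
        # An increment both reads and writes the location.
        self.read_set.add(location)
        self.write_set.add(location)


def conflicts(a, b):
    """Assumed conflict rule: one transaction writes what the other reads or writes."""
    return bool(a.write_set & (b.read_set | b.write_set)) or bool(b.write_set & a.read_set)


def insert_into_bucket(name, bucket, count_size):
    """Model a hash-table insert that updates one bucket and, optionally, a shared size counter."""
    txn = Transaction(name)
    txn.increment(("bucket", bucket))
    if count_size:
        txn.increment("size_counter")
    return txn


# Inserts into different buckets touch disjoint locations and do not conflict.
independent = conflicts(insert_into_bucket("T1", 3, count_size=False),
                        insert_into_bucket("T2", 7, count_size=False))

# Adding a shared statistics counter makes the same two inserts conflict.
with_counter = conflicts(insert_into_bucket("T3", 3, count_size=True),
                         insert_into_bucket("T4", 7, count_size=True))
```

In this model, independent evaluates to False and with_counter evaluates to True: the shared counter is the only location the two transactions have in common, yet it is sufficient to prevent them from committing in parallel.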
For these and other reasons, application designers face difficult tradeoffs involving the latency imposed on lightly contended counters, the scalability and (in some cases) accuracy of heavily contended counters, and various probe effects.