Field of the Disclosure
This disclosure relates generally to managing accesses to shared resources in a multithreaded environment, and more particularly to systems and methods for performing concurrency restriction and throttling over contended locks.
Description of the Related Art
In a multiprocessor environment with threads and preemptive scheduling, threads can participate in a mutual exclusion protocol through the use of lock or “mutex” constructs. A mutual exclusion lock can either be in a locked state or an unlocked state, and only one thread can hold or own the lock at any given time. The thread that owns the lock is permitted to enter a critical section of code protected by the lock or otherwise access a shared resource protected by the lock. If a second thread attempts to obtain ownership of a lock while the lock is held by a first thread, the second thread will not be permitted to proceed into the critical section of code (or access the shared resource) until the first thread releases the lock and the second thread successfully claims ownership of the lock.
In modern multicore environments, it can often be the case that there are a large number of active threads, all contending for access to a shared resource. As multicore applications mature, situations in which there are too many threads for the available hardware resources to accommodate are becoming more common. This can been seen in component-based applications with thread pools, for example. Often, access to such components are controlled by contended locks. As threads are added, even if the thread count remains below the number of logical CPUs, the application can reach a point at which aggregate throughput drops. In this case, if throughput for the application is plotted on the y-axis and the number of threads is plotted on the x-axis, there will be an inflection point beyond which the plot becomes a concave plot. Past inflection point, the application encounters “scaling collapse” such that, as threads are added, performance drops. In modern layered component-based environments it can difficult to determine a reasonable limit on the thread count, particularly when mutually-unaware components are assembled to form an application.
Current trends in multicore architecture design imply that in coming years, there will be an accelerated shift away from simple bus-based designs towards distributed non-uniform memory-access (NUMA) and cache-coherent NUMA (CC-NUMA) architectures. Under NUMA, the memory access time for any given access depends on the location of the accessed memory relative to the processor. Such architectures typically consist of collections of computing cores with fast local memory (as found on a single multicore chip), communicating with each other via a slower (inter-chip) communication medium. In such systems, the processor can typically access its own local memory, such as its own cache memory, faster than non-local memory. In some systems, the non-local memory may include one or more banks of memory shared between processors and/or memory that is local to another processor. Access by a core to its local memory, and in particular to a shared local cache, can be several times faster than access to a remote memory (e.g., one located on another chip). Note that in various descriptions herein, the term “NUMA” may be used fairly broadly. For example, it may be used to refer to non-uniform communication access (NUCA) machines that exhibit NUMA properties, as well as other types of NUMA and/or CC-NUMA machines.
On large cache-coherent systems with Non-Uniform Memory Access (CC-NUMA, sometimes shortened to just NUMA), if lock ownership migrates frequently between threads executing on different nodes, the executing program can suffer from excessive coherence traffic, and, in turn, poor scalability and performance. Furthermore, this behavior can degrade the performance of other unrelated programs executing in the system.