Given society's continually increasing reliance on computers, computer technology has had to advance on many fronts to keep up with demand. One particular subject of significant research and development efforts is parallelism, i.e., the performance of multiple tasks in parallel.
A number of computer software and hardware technologies have been developed to facilitate increased parallel processing. From a software standpoint, multithreaded operating systems and kernels have been developed, which permit computer programs to concurrently execute in multiple “threads” so that multiple tasks can essentially be performed at the same time. Threads generally represent independent paths of execution for a program. For example, for an e-commerce computer application, different threads might be assigned to different customers so that each customer's specific e-commerce transaction is handled in a separate thread.
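The per-customer threading model described above can be sketched as follows. This is a minimal, hypothetical illustration (the handler, tax rate, and customer names are invented for the example); a real e-commerce transaction would involve far more work, but the structure, one thread per customer with shared state protected by a lock, is the same.

```python
import threading

results = {}
lock = threading.Lock()

def handle_transaction(customer_id, amount):
    # Placeholder for real per-customer work (inventory, payment, ...).
    total = amount * 1.08  # hypothetical 8% tax, purely illustrative
    with lock:             # protect shared state across threads
        results[customer_id] = round(total, 2)

# Each customer's transaction is handled in a separate thread.
threads = [
    threading.Thread(target=handle_transaction, args=(cid, amt))
    for cid, amt in [("alice", 100.0), ("bob", 250.0), ("carol", 40.0)]
]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for all transactions to complete

print(results)
```

Because the threads are independent paths of execution, the operating system is free to run them concurrently on whatever processors are available.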
From a hardware standpoint, computers increasingly rely on multiple microprocessors to provide increased workload capacity. Furthermore, some microprocessors have been developed that support the ability to execute multiple threads in parallel, effectively providing many of the same performance gains attainable through the use of multiple microprocessors.
A significant bottleneck that can occur in a multi-processor computer, however, is associated with the transfer of data to and from each microprocessor, often referred to as communication cost. Most computers rely on a main memory that serves as the principal working storage for the computer. Retrieving data from main memory, and storing data back into main memory, however, often occurs at a significantly slower rate than the rate at which data is transferred internally within a microprocessor. Often, intermediate buffers known as caches are utilized to temporarily store data from a main memory when that data is being used by a microprocessor. These caches are often smaller in size, but significantly faster, than the main memory. Caches often take advantage of the temporal and spatial locality of data, and as a result, often significantly reduce the number of comparatively slower main memory accesses occurring in a computer and decrease the overall communication cost experienced by the computer.
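The effect of spatial locality on miss rates can be illustrated with a toy direct-mapped cache simulator. This is a deliberately simplified, hypothetical model (the line size, cache size, and access patterns are invented for illustration): sequential accesses reuse each fetched line, while accesses strided by a full line miss on every reference.

```python
LINE_SIZE = 8   # words per cache line (illustrative)
NUM_LINES = 16  # lines in the toy direct-mapped cache

def count_misses(addresses):
    cache = [None] * NUM_LINES  # tag currently held in each line slot
    misses = 0
    for addr in addresses:
        # Map the address to a line number, then split into tag/index.
        tag, index = divmod(addr // LINE_SIZE, NUM_LINES)
        if cache[index] != tag:
            misses += 1         # comparatively slow main-memory access
            cache[index] = tag  # fill the line from main memory
    return misses

sequential = list(range(128))                  # good spatial locality
strided = [i * LINE_SIZE for i in range(128)]  # one access per line

print(count_misses(sequential), count_misses(strided))  # 16 vs. 128
```

The sequential pattern touches 128 words but incurs only 16 misses (one per line), whereas the strided pattern misses on all 128 accesses, which is why locality-friendly access patterns reduce communication cost.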
Often, all of the microprocessors in a computer will share the same main memory, an architecture that is often referred to as Symmetric Multiprocessing (SMP). One limitation of such computers, however, occurs as a result of the typical requirement that all communications between the microprocessors and the main memory occur over a common bus or interconnect. As the number of microprocessors in a computer increases, the communication traffic to the main memory becomes a bottleneck on system performance, irrespective of the use of intermediate caches.
To address this potential bottleneck, a number of computer designs rely on Non-Uniform Memory Access (NUMA), whereby multiple main memories are essentially distributed across a computer and physically grouped with sets of microprocessors and caches into physical subsystems or modules. The microprocessors, caches and memory in each physical subsystem of a NUMA computer are typically mounted to the same circuit board or card to provide relatively high speed interaction between all of the components that are “local” to a physical subsystem. The physical subsystems are also coupled to one another over a network such as a system bus or a collection of point-to-point interconnects, thereby permitting microprocessors in one physical subsystem to access data stored in another physical subsystem, thus effectively extending the overall capacity of the computer. Memory access, however, is referred to as “non-uniform” since the access time for data stored in a local memory (i.e., a memory resident in the same physical subsystem as a microprocessor) is often significantly shorter than for data stored in a remote memory (i.e., a memory resident in another physical subsystem).
Therefore, from a communication cost standpoint, performance is maximized in a NUMA computer by localizing data traffic within each physical subsystem, and minimizing the number of times data needs to be passed between physical subsystems.
Efficient utilization of the hardware resources in a computer often requires a collaborative effort between software and hardware. As noted above, from a software standpoint, much of the work performed by a computer is handled by various threads. To ensure optimal performance, threads are typically assigned to subsets of available computer resources in such a manner that the workload of the computer is evenly distributed among the available computer resources.
For efficient utilization of microprocessors, for example, it is desirable to evenly distribute threads among the available microprocessors to balance the workload of each individual microprocessor, a process referred to as "symmetric" resource allocation. However, given that communication cost can have a significant effect on system performance as well, it is also desirable to logically tie a thread to the data that it will use, so that the thread's accesses to that data are localized whenever possible, either in a cache or, in a NUMA computer, at least within the same physical subsystem. Otherwise, the communication cost of accessing non-localized data may exceed the benefits of the symmetric distribution of threads. Typically, tying data to a thread requires human decisions to associate threads of a common type with physically localized memory, processors, and associated resources.
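The locality-aware placement idea can be sketched as a small model. Everything here is hypothetical and illustrative (the node layout, the relative local/remote costs, and the function names are assumptions, not a real scheduler): instead of choosing the least-loaded processor globally, the placer prefers a processor in the physical subsystem holding the thread's data.

```python
LOCAL_COST, REMOTE_COST = 1, 4  # assumed relative memory-access costs

# Two physical subsystems (nodes), each with two processors.
nodes = {0: ["cpu0", "cpu1"], 1: ["cpu2", "cpu3"]}
load = {cpu: 0 for cpus in nodes.values() for cpu in cpus}

def place_thread(data_node):
    # Prefer the least-loaded processor on the node where the data lives.
    cpu = min(nodes[data_node], key=lambda c: load[c])
    load[cpu] += 1
    return cpu

def access_cost(cpu, data_node):
    cpu_node = next(n for n, cpus in nodes.items() if cpu in cpus)
    return LOCAL_COST if cpu_node == data_node else REMOTE_COST

cpu = place_thread(data_node=1)
print(cpu, access_cost(cpu, 1))  # node-local placement -> cost 1
```

Under this toy cost model, a thread placed on its data's node pays the local cost on every access, while a purely symmetric placement on the other node would pay the remote cost, which is the trade-off the paragraph above describes.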
In a symmetric resource management scheme, threads are distributed at activation time, e.g., whenever threads are created or reactivated. Activated threads are typically assigned to the most available, or least loaded, resources or sets of resources. The non-uniform distribution of resources such as memory resources to address communication costs, however, is typically not implemented in such an automated and transparent manner. Rather, non-uniform resource management often requires substantial user analysis and custom configuration, including, for example, custom programming of computer programs to specifically address resource allocation issues.
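The symmetric, activation-time policy described above can be sketched as a simple least-loaded assignment. This is a minimal illustration, not an actual operating-system scheduler; thread "weights" and processor counts are assumed inputs.

```python
import heapq

def assign_threads(thread_weights, num_cpus):
    # Each activated thread goes to whichever processor currently
    # carries the lightest load (the "most available" resource).
    heap = [(0, cpu) for cpu in range(num_cpus)]  # (load, cpu) pairs
    heapq.heapify(heap)
    placement = {}
    for tid, weight in enumerate(thread_weights):
        cpu_load, cpu = heapq.heappop(heap)       # least-loaded CPU
        placement[tid] = cpu
        heapq.heappush(heap, (cpu_load + weight, cpu))
    return placement

# Six equal-weight threads spread evenly across three processors.
placement = assign_threads([1, 1, 1, 1, 1, 1], 3)
print(placement)
```

Note that this balances processor load but is oblivious to where each thread's data resides; as the surrounding text explains, that non-uniform dimension is typically left to manual analysis and configuration.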
Resource management is more desirably handled at the operating system or kernel level of a computer, independent of any specific programming techniques applied to the applications or other computer programs that may be installed on a computer. In particular, resource management, when embedded in an operating system or kernel, requires no specific customization of a higher-level computer program to support the optimal allocation of computer resources, and thus potentially provides performance benefits to all computer programs executing on a given computer. Particularly in NUMA computers, where performance benefits are achieved by localizing thread-utilized resources within individual physical subsystems, it would be highly desirable to implement efficient resource allocation in a more transparent manner, without requiring significant customization.