Given the continually increasing reliance on computers in contemporary society, computer technology has had to advance on many fronts to keep up with increased demand. One particular subject of significant research and development efforts is parallelism, i.e., the performance of multiple tasks in parallel.
A number of computer software and hardware technologies have been developed to facilitate increased parallel processing. From a software standpoint, multithreaded operating systems and kernels have been developed, which permit computer programs to concurrently execute in multiple “threads” so that multiple tasks can essentially be performed at the same time. Threads generally represent execution entities defining independent paths of execution for a program. For example, for an e-commerce computer application, different threads might be assigned to different customers so that each customer's specific e-commerce transaction is handled in a separate thread. It will be appreciated that threads may be referred to in other computer architectures by terms such as tasks, processes, jobs, etc. As such, it should be understood that the term “thread” as used herein should be considered to be analogous to other types of execution entities used in other computer architectures, irrespective of what those other types of execution entities are called.
From a hardware standpoint, computers increasingly rely on multiple microprocessors to provide increased workload capacity. Furthermore, some microprocessors have been developed that support the ability to execute multiple threads in parallel, effectively providing many of the same performance gains attainable through the use of multiple microprocessors.
A significant bottleneck that can occur in a multi-processor computer, however, is associated with the transfer of data to and from each microprocessor, often referred to as communication cost. Most computers rely on a main memory that serves as the principal working storage for the computer. Retrieving data from a main memory, and storing data back into a main memory, however, is often performed at a significantly slower rate than the rate at which data is transferred internally within a microprocessor. Often, intermediate buffers known as caches are utilized to temporarily store data from a main memory when that data is being used by a microprocessor. These caches are often smaller in size, but significantly faster, than the main memory. Caches often take advantage of the temporal and spatial locality of data, and as a result, often significantly reduce the number of comparatively slower main memory accesses occurring in a computer and decrease the overall communication cost experienced by the computer.
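By way of illustration only, the effect of a cache on communication cost can be estimated with a simple average access time calculation. The sketch below assumes a simplified model in which every access pays the cache latency and each miss additionally pays the main memory latency; the function name and the latency figures are hypothetical and are not taken from any particular computer.

```python
def average_access_time(hit_rate, cache_ns, memory_ns):
    """Average time per access under the assumed model:
    every access pays cache_ns, and each miss also pays memory_ns."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return cache_ns + (1.0 - hit_rate) * memory_ns


# Hypothetical latencies: 1 ns for the cache, 100 ns for main memory.
for rate in (0.0, 0.50, 0.90, 0.99):
    avg = average_access_time(rate, cache_ns=1.0, memory_ns=100.0)
    print(f"hit rate {rate:.2f}: about {avg:.1f} ns per access")
```

Under these assumed figures, the average cost per access falls as the hit rate rises, which is why data with good temporal and spatial locality benefits most from caching.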
Often, all of the microprocessors in a computer will share the same main memory, an architecture that is often referred to as Symmetric Multiprocessing (SMP). One limitation of such computers, however, occurs as a result of the typical requirement that all communications between the microprocessors and the main memory occur over a common bus or interconnect. As the number of microprocessors in a computer increases, the communication traffic to the main memory becomes a bottleneck on system performance, irrespective of the use of intermediate caches.
To address this potential bottleneck, a number of computer designs rely on Non-Uniform Memory Access (NUMA), whereby multiple main memories are essentially distributed across a computer and physically grouped with sets of microprocessors and caches into physical subsystems or modules, also referred to herein as “nodes”. The microprocessors, caches and memory in each node of a NUMA computer are typically mounted to the same circuit board or card to provide relatively high speed interaction between all of the components that are “local” to a node. The nodes are also coupled to one another over a network such as a system bus or a collection of point-to-point interconnects, thereby permitting microprocessors in one node to access data stored in another node, thus effectively extending the overall capacity of the computer. Memory access, however, is referred to as “non-uniform” since the access time for data stored in a local memory (i.e., a memory resident in the same node as a microprocessor) is often significantly shorter than for data stored in a remote memory (i.e., a memory resident in another node).
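By way of illustration only, the non-uniform nature of memory access in such a computer can be sketched as a cost that depends on whether the data resides in the same node as the requesting microprocessor. The latency constants and function names below are hypothetical; actual latencies depend on the particular hardware and interconnect.

```python
# Hypothetical latencies in nanoseconds; real values depend on the hardware.
LOCAL_NS = 100.0
REMOTE_NS = 300.0


def access_time(cpu_node, memory_node):
    """Assumed latency for a processor in cpu_node accessing memory in memory_node."""
    return LOCAL_NS if cpu_node == memory_node else REMOTE_NS


def total_time(cpu_node, memory_nodes):
    """Sum the latency of a series of accesses; each entry names the node holding the data."""
    return sum(access_time(cpu_node, m) for m in memory_nodes)


all_local = total_time(0, [0, 0, 0, 0])    # every access stays in node 0
half_remote = total_time(0, [0, 1, 0, 1])  # half of the accesses go to node 1
print(all_local, half_remote)
```

With these assumed latencies, the same number of accesses costs twice as much when half of them are remote, which illustrates why the placement of data relative to the microprocessor matters.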
Irrespective of the particular type of multi-processing architecture used, efficient utilization of the hardware resources in a computer often requires a collaborative effort between software and hardware. As noted above, from a software standpoint, much of the work performed by a computer is handled by various threads. To ensure optimal performance, threads are typically assigned (e.g., at the time they are created) to subsets of available computer resources in such a manner that the workload of the computer is evenly distributed among the available computer resources.
For efficient utilization of microprocessors, for example, it is desirable to evenly distribute threads among the available microprocessors to balance the workload of each individual microprocessor, a process referred to as “symmetric” resource allocation. However, given that communication cost can have a significant effect on system performance as well, it is also desirable to logically tie a thread to the data that it will use so that accesses to the data by the thread are localized whenever possible, either in a cache or, in a NUMA computer, at least within the same node. Otherwise, the communication cost of accessing non-localized data may exceed the benefits of the symmetric distribution of threads.
In most computer architectures, an operating system or kernel, and in particular, program code therein, which is hereinafter referred to as resource allocation manager program code, is responsible for allocating memory and processor resources to application programs and their constituent threads. In a multi-node architecture, for example, threads are typically assigned “home nodes”, and the operating system or kernel will attempt to allocate memory and processor resources from a thread's assigned home node to optimize hardware performance, minimize communication costs, and balance workload across the various nodes.
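By way of illustration only, home node assignment of this general kind might be sketched as follows. The sketch assumes that a new thread is given the node currently hosting the fewest threads, and that memory is taken from the home node when it has room and otherwise from another node. The class `Node` and the functions `assign_home_node` and `allocate_pages` are hypothetical names introduced for this example and do not describe any particular operating system or kernel.

```python
class Node:
    """A hypothetical node with a count of free memory pages and assigned threads."""

    def __init__(self, node_id, free_pages):
        self.node_id = node_id
        self.free_pages = free_pages
        self.thread_count = 0


def assign_home_node(nodes):
    """Pick the node with the fewest threads (ties go to the lowest id)
    and record the new thread against it."""
    home = min(nodes, key=lambda n: (n.thread_count, n.node_id))
    home.thread_count += 1
    return home


def allocate_pages(nodes, home, pages):
    """Try the home node first, then the remaining nodes in order.
    Returns the node that satisfied the request, or None if none could."""
    candidates = [home] + [n for n in nodes if n is not home]
    for node in candidates:
        if node.free_pages >= pages:
            node.free_pages -= pages
            return node
    return None


nodes = [Node(0, free_pages=4), Node(1, free_pages=8)]
homes = [assign_home_node(nodes).node_id for _ in range(4)]
print(homes)  # threads alternate between the two nodes
```

The least-loaded rule balances threads across nodes, while the home-first allocation keeps a thread's memory local whenever the home node has capacity.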
One drawback to conventional resource allocation management schemes, which are implemented entirely within an operating system or kernel, is that the schemes typically allocate hardware resources for application programs in the same manner every time, and irrespective of the types of application programs that are being executed on a computer. This “one size fits all” approach, however, may not result in optimal resource allocation for certain types of application programs.
For example, application programs such as engineering or scientific application programs tend to be highly processor and memory intensive, and require a substantial number of memory accesses during execution. For these types of application programs, it has been found that the number of memory accesses by the multiple threads executing in such applications necessitates that, whenever possible, all of the processor and memory resources utilized by such threads should be highly localized, i.e., for a multi-node computer, should be localized within the same node, or for a single-node computer, should be localized within a limited subset of processor and memory resources. Spreading threads out among a larger set of hardware resources may incur greater communication costs, and degrade overall system performance.
In contrast, application programs such as commercial or interactive application programs, e.g., transaction processing applications, database applications, etc., tend not to be as processor and memory intensive as engineering or scientific application programs. Often, a greater concern with such application programs is consistent response time, and as a result, if a local processor or memory resource is not available for a particular thread for an application program, it may be more desirable to allow that thread to utilize other available hardware resources, even if such resources are not local with respect to the hardware resources utilized by other threads for the application program.
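By way of illustration only, the contrast between these two types of application programs can be sketched as a single placement decision controlled by a flag. In the sketch below, a thread that requires locality waits when its home node has no idle processor, whereas a thread that favors response time is permitted to use an idle processor in another node. The function `pick_processor`, the `allow_remote` flag, and the dictionary of idle processor counts are all hypothetical and are introduced solely for this example.

```python
def pick_processor(home_node, idle_by_node, allow_remote):
    """Choose a node on which to run a thread.

    idle_by_node maps a node id to its number of idle processors.
    Returns a node id, or None when the thread should wait for its home node.
    """
    if idle_by_node.get(home_node, 0) > 0:
        return home_node  # a local processor is available
    if allow_remote:
        # Response time matters more than locality: take any idle processor.
        for node_id in sorted(idle_by_node):
            if idle_by_node[node_id] > 0:
                return node_id
    return None  # locality is required (or nothing is idle), so the thread waits


idle = {0: 0, 1: 2}  # node 0 is fully busy; node 1 has two idle processors
print(pick_processor(0, idle, allow_remote=False))  # memory-intensive thread waits
print(pick_processor(0, idle, allow_remote=True))   # interactive thread runs remotely
```

A scheme that hard-codes either value of the flag for every application program corresponds to the “one size fits all” behavior noted above.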
Moreover, some application programs may rely on data that is shared with other application programs and/or by multiple threads within the same application program. When such data sharing represents a significant component of application performance, it is often desirable to localize the hardware resources utilized for all of the application programs and/or threads that share the data, and thus maximize the performance of all such application programs. For application programs that do not share significant data, this concern is not as great.
Given the significant variances in the resource utilization characteristics of different types of application programs, it is difficult to implement a single resource allocation management scheme that optimizes the resource utilization of such application programs. Therefore, a significant need exists for a manner of improving the allocation of hardware resources in a computer that better accounts for the variations in the resource utilization characteristics of different application programs.