1. Field
Embodiments of the present invention relate generally to the field of parallel processing. More particularly, embodiments of the present invention relate to thread-data affinity in a multi-threaded environment.
2. Description of the Related Art
Parallel processing involves simultaneous execution of two or more instruction threads. Performing tasks simultaneously using multiple processors can greatly increase the performance of various applications. There are several parallel processing architectures, including the shared-memory multi-core processor, multiprocessor, and cache coherent non-uniform memory access (cc-NUMA) architectures. In the shared-memory multi-core processor and multiprocessor systems, multiple processing elements (e.g., central processing units (CPUs)) are operated in parallel by an operating system and access memory via a bus interconnect.
In contrast, the cc-NUMA multiprocessing architecture has memory separated into close and distant banks. In the shared-memory multi-core processor and multiprocessor systems, all processing elements access a common memory at the same speed. In cc-NUMA, memory on the same processor board as the processing element (local memory) is accessed faster than memory on other processor boards (shared memory), hence the “non-uniform” nomenclature. As a result, the cc-NUMA architecture scales much better to higher numbers of processing elements than the shared-memory multi-core processor and multiprocessor systems. “Cache coherent NUMA” means that caching is supported in the local system. Because, as a practical matter, most large-scale NUMA systems are cc-NUMA systems, NUMA and cc-NUMA will be used interchangeably in this description. The differences between NUMA and cc-NUMA are not of particular relevance for the understanding of the various embodiments of the invention described herein.
FIG. 1 is a block diagram of an example cc-NUMA architecture. FIG. 1 shows nodes 1-4. A larger parallel system may have many more nodes, but only four are shown for simplicity. Each node is shown as having one or more processing elements (sometimes also referred to as “cores”), shown as processing elements 5-11. Each node also has a local memory, shown as memories 13-16. This is merely an illustration; nodes may have more than two processing elements and more than one local memory connected to such processing elements via a bus.
A memory local to one processing element may not be local to another processing element. For example, for processing element 5, memory 13 is local and fast to access. However, for processing element 5, memory 15 is not local. Processing element 5 can access memory 15 via the link connecting node 1 and node 3; however, this access will have significantly higher latency than local memory access. Accessing memory 16 from processing element 5 has even higher latency, since two separate links must be traversed.
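The relationship between link count and access latency described above can be sketched as a small model. This is an illustration only, under stated assumptions: the text specifies only that node 1 links to node 3 and that memory 16 is two links away from processing element 5, so the remaining links in `LINKS`, the assignment of memories 14 and 16 to nodes 2 and 4, and the constants `LOCAL_NS` and `PER_HOP_NS` are hypothetical values chosen for the example, not measurements.

```python
from collections import deque

# Assumed link topology. Only the node 1 <-> node 3 link and the two-link
# distance from node 1 to node 4 come from the text; the rest is hypothetical.
LINKS = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}

# Assumed home node of each memory (13 -> node 1 and 15 -> node 3 follow the
# text; 14 -> node 2 and 16 -> node 4 are inferred for illustration).
MEMORY_NODE = {13: 1, 14: 2, 15: 3, 16: 4}

LOCAL_NS = 100    # hypothetical local access latency
PER_HOP_NS = 150  # hypothetical extra latency per link traversed


def hops(src, dst):
    """Return the minimum number of links between two nodes (breadth-first search)."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in LINKS[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError(f"no path from node {src} to node {dst}")


def access_latency_ns(node, memory):
    """Estimate latency for a processing element on `node` accessing `memory`."""
    return LOCAL_NS + PER_HOP_NS * hops(node, MEMORY_NODE[memory])


# Processing element 5 sits on node 1 in the example above.
for mem in (13, 15, 16):
    print(mem, access_latency_ns(1, mem))
```

Under these assumed constants, the model reproduces the ordering given in the text: memory 13 is cheapest for processing element 5, memory 15 costs one link more, and memory 16 costs two.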
It is thus desirable for data used by an execution thread to reside in memory local to the processing element executing that thread. The technical term for this is “thread-data affinity.” In a multi-threaded system, data may be used by one processing element at one time, and then by another non-local processing element at another time. Thread-data affinity refers to the problem of moving data to a memory local to the processing element executing a thread using the data.
There have been several attempts made to address the thread-data affinity problem. One type of approach extends high-level programming languages to allow the programmer to insert data distribution directives. However, this method compromises the simplicity of the programming model, and cannot handle irregular memory access patterns in a timely fashion. Furthermore, it requires additional programming to be performed.
A second approach uses a daemon (also called a service) executed in the background by the operating system to perform page migration as deemed appropriate for the applications being executed by the operating system. This approach, however, does not exploit the correlation between page migration policies and program semantics and has poor responsiveness.
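A toy model may help illustrate why a background service of this kind can respond slowly. The sketch below is not any particular operating system's policy: the class `MigrationDaemon`, its `threshold` parameter, and the rule of migrating a page to the node with the most remote accesses are all hypothetical, introduced only for illustration.

```python
from collections import Counter, defaultdict


class MigrationDaemon:
    """Toy background page-migration service (hypothetical policy)."""

    def __init__(self, page_home, threshold=3):
        self.page_home = dict(page_home)         # page id -> node holding the page
        self.threshold = threshold               # remote accesses needed to migrate
        self.remote_hits = defaultdict(Counter)  # page id -> {node: remote accesses}

    def record_access(self, page, node):
        """Count an access; only accesses from a non-home node are tracked."""
        if self.page_home[page] != node:
            self.remote_hits[page][node] += 1

    def scan(self):
        """Periodic pass: move each hot page to its most frequent remote accessor."""
        moved = []
        for page, hits in self.remote_hits.items():
            node, count = hits.most_common(1)[0]
            if count >= self.threshold:
                self.page_home[page] = node
                moved.append((page, node))
        for page, _ in moved:
            del self.remote_hits[page]  # reset counters for migrated pages
        return moved


daemon = MigrationDaemon({0: 1})   # page 0 starts on node 1
for _ in range(3):
    daemon.record_access(0, 3)     # a thread on node 3 keeps touching page 0
before = daemon.page_home[0]       # still on node 1: no scan has run yet
moved = daemon.scan()              # migration happens only at the next scan
```

In this sketch every access made before the next `scan()` still pays the remote latency, and the policy sees only access counts, not what the program intends to do with the data, which mirrors the two drawbacks noted above.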
A third approach provides a user with a set of library routines that can be inserted into programs to trigger page migration. This approach, however, is prone to introducing side-effects at compile time when compiler optimizations are performed.