The present application relates to multi-core processing.
A manycore processor provides a large number of processing cores that can support simultaneous execution of many multi-threaded processes. For example, Intel's Xeon Phi coprocessor has 60 processing cores that can execute 240 threads simultaneously. Each processing core supports 4 hardware thread contexts and has private, separate instruction and data L1 caches (32 KB each). Each core also hosts a 512 KB share of the L2 cache that is accessible to all cores. Every processing core also contains a 512-bit wide vector unit to support single-instruction-multiple-data (SIMD) operations. The on-chip interconnect linking the processing cores and the on-chip memory controllers is a ring. The coprocessor is connected to the host processor through a PCIe bus.
The Xeon Phi coprocessor runs the Linux operating system to service the processes running on the coprocessor and to manage its resources. Having an OS to manage the coprocessor eases the software development effort. For example, porting an existing application that runs on Linux and the x86 architecture to Xeon Phi usually requires only a recompilation of the source code with the proper compiler flags. Like a standard Linux OS, the Xeon Phi's OS provides a virtual memory system, a file system, process control, and scheduling, and it naturally supports simultaneous execution of multiple processes on a Xeon Phi coprocessor.
One method of programming the Intel Xeon Phi coprocessor is to use the so-called offload programming model. In the offload programming model, a programmer uses compiler pragmas to identify code sections that are to be executed on one or more Xeon Phi coprocessors. These code sections are called offload regions. In addition, the programmer specifies the input and the output data needed by the offload regions. The compiler generates Xeon Phi instructions for the offload regions. The compiler also generates code that moves the input and output data of an offload region between the host processor's main memory and the Xeon Phi's device memory through the PCIe bus. At run time, a user program that uses the offload programming model and runs on the host processor automatically launches an offload process on a Xeon Phi coprocessor. The offload process is responsible for executing the offload regions and for transferring data between the host memory and the device memory.
To facilitate the development of highly parallel applications, Intel Xeon Phi's software stack also supports various parallel programming models like the popular OpenMP [17]. A programmer can implement an OpenMP code segment in an offload region. The OpenMP threads will be created by the offload process on the Xeon Phi coprocessor to execute the OpenMP code segment.
Many OpenMP applications benefit from thread-to-core bindings on these manycore processors. A thread-to-core binding prevents the OS scheduler from migrating a thread from one processing core to another. Binding a thread to a processing core therefore allows the thread to take advantage of the cache state that it has built up on that core over time [10]. Xeon Phi's software stack provides several interfaces that allow such bindings to be applied to the threads of a user process running on the coprocessor.
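The effect of such a binding can be illustrated with a minimal sketch. The text above does not name the specific interfaces, so the sketch assumes the generic Linux affinity call as exposed by Python's os module (os.sched_getaffinity and os.sched_setaffinity); the helper names pin_calling_thread and restore_affinity are hypothetical and are introduced only for illustration.

```python
import os


def pin_calling_thread():
    """Bind the calling thread to one CPU; return (previous_mask, chosen_cpu).

    Returns None on platforms that lack the Linux affinity calls.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    previous = os.sched_getaffinity(0)  # CPUs the scheduler may currently use
    cpu = min(previous)                 # illustrative choice: lowest allowed CPU
    os.sched_setaffinity(0, {cpu})      # restrict the scheduler to that one CPU
    return previous, cpu


def restore_affinity(mask):
    """Undo the binding by restoring a previously saved CPU mask."""
    os.sched_setaffinity(0, mask)


result = pin_calling_thread()
if result is not None:
    previous, cpu = result
    print("bound to CPU", cpu, "out of", sorted(previous))
    restore_affinity(previous)
```

While the binding is in place the scheduler cannot migrate the thread, so cache lines the thread has loaded on that CPU remain useful to it. The choice of the lowest allowed CPU is arbitrary in this sketch.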
Although the Xeon Phi OS naturally supports simultaneous execution of multiple offload and native processes, the thread-to-core bindings of the concurrent processes often lead to an unbalanced workload across the processing cores. For example, two offload processes may both bind their threads to the same processing cores while the rest of the processing cores remain idle. Unfortunately, it is almost impossible for multiple, uncoordinated users to set up the affinity bindings by themselves such that the workload is spread across all processing cores as evenly as possible.
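The imbalance can be made concrete with a small model. The sketch below assumes a 60-core device with 4 hardware thread contexts per core, as described above, and a hypothetical "compact" placement rule that fills all contexts of one core before moving to the next. The function names and the placement rule are assumptions made for illustration; they do not describe any particular runtime's binding policy.

```python
from collections import Counter

NUM_CORES = 60
CONTEXTS_PER_CORE = 4


def compact_binding(num_threads, first_core=0):
    """Return the core index for each thread under a fill-one-core-first rule."""
    return [(first_core + t // CONTEXTS_PER_CORE) % NUM_CORES
            for t in range(num_threads)]


def core_load(bindings):
    """Count how many threads are bound to each core across all processes."""
    load = Counter()
    for binding in bindings:
        load.update(binding)
    return [load[c] for c in range(NUM_CORES)]


# Two processes, 60 threads each, both starting at core 0 (uncoordinated).
proc_a = compact_binding(60)
proc_b = compact_binding(60)
uncoordinated = core_load([proc_a, proc_b])

# Same processes, but the second starts at the first core the first leaves free.
proc_b_shifted = compact_binding(60, first_core=15)
coordinated = core_load([proc_a, proc_b_shifted])

for name, load in (("uncoordinated", uncoordinated), ("coordinated", coordinated)):
    print(name, "max threads per core:", max(load), "idle cores:", load.count(0))
```

In this model the uncoordinated case places 8 threads on each of cores 0 to 14, twice the number of hardware contexts, and leaves 45 cores idle. Shifting the second process by 15 cores keeps every used core at 4 threads and leaves 30 cores idle. Choosing that shift requires knowledge of the other process's bindings, which uncoordinated users do not have.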