1. Field
The present invention relates to the energy efficiency of modern processors, sometimes referred to as central processing units (CPUs), particularly but not exclusively in high performance computing (HPC) applications. Computationally intensive and other large-scale applications are usually carried out on HPC systems. Such HPC systems often provide distributed environments in which there is a plurality of processing units or “cores” on which independent sequences of events, such as processing threads or processes of an executable, can run autonomously in parallel.
2. Description of the Related Art
Many different hardware configurations and programming models are applicable to HPC. A popular approach to HPC currently is the cluster system, in which a plurality of nodes, each having one or more multicore processors (or “chips”), is interconnected by a high-speed network. Each node is assumed to have its own area of memory, which is accessible to all cores within that node. The cluster system can be programmed by a human programmer who writes source code, making use of existing code libraries to carry out generic functions. The source code is then compiled to lower-level executable code, for example code at the ISA (Instruction Set Architecture) level capable of being executed by processor types having a specific instruction set, or to assembly language dedicated to a specific processor. There is often a final stage of assembling (or, in the case of a virtual machine, interpreting) the assembly code into executable machine code. The executable form of an application (sometimes simply referred to as an “executable”) is run under supervision of an operating system (OS).
Applications for computer systems having multiple cores may be written in a conventional computer language (such as C/C++ or Fortran), augmented by libraries for allowing the programmer to take advantage of the parallel processing abilities of the multiple cores. In this regard, it is usual to refer to “processes” being run on the cores. A (multi-threaded) process may run across several cores within a multi-core CPU.
Each process is a sequence of programmed instructions, and may include one or more threads. A thread can be the smallest sequence of programmed instructions that can be managed by an operating system scheduler. Multiple threads can exist within the same process and often share resources, such as memory, whereas in general processes do not share resources.
One library used by programmers is the Message Passing Interface, MPI, which uses a distributed-memory model (each process being assumed to have its own area of memory), and facilitates communication among the processes. MPI allows groups of processes to be defined and distinguished, and includes routines for so-called “barrier synchronization”, which is an important feature for allowing multiple processes or processing elements to work together. Barrier synchronization is a technique of holding up all the processes in a synchronization group executing a program until every process has reached the same point in the program. This may be achieved by an MPI function call which has to be called by all members of the group before the execution can proceed further.
Alternatively, in shared-memory parallel programming, all processes or cores can access the same memory or area of memory. In a shared-memory model there is no need to explicitly specify the communication of data between processes (as any changes made by one process are visible to all others). As with MPI code, barrier synchronization is often required in thread-parallel code to ensure that all threads are at the same point in the program before execution proceeds further.
Energy efficiency is becoming an increasingly important factor in HPC. Today's largest HPC systems (http://www.top500.org/list/2013/11/) can draw up to 17 MW of power. As the annual cost of energy is reaching around $1 million per MW (in the USA), the operational expenditure on energy for an HPC facility is becoming a critical factor in system design. The proposed power budget for a future exascale system (30 times faster than the current fastest system) is 20 MW (less than a 1.2 times increase in power). Thus, the development of new techniques to optimize power usage within HPC systems (and in other computer systems) is vitally important.
As mentioned above, modern CPUs consist of several physical cores (processing units). For example, the Fujitsu SPARC64™ IXfx CPU contains sixteen cores. The number of logical cores (i.e. cores perceived to be available by an application) may be further increased by the use of hardware multithreading (scheduling two or more threads simultaneously on one physical core).
In order to extract maximum performance from the CPU, an application will generally run multiple threads of execution within the CPU, where different threads can be executed on different logical cores. In some applications there may be more threads than there are logical cores available, with the threads being dynamically allocated to cores by the operating system scheduler. HPC applications, on the other hand, tend to assign one thread to each logical core, and each thread may be “pinned” to a specific physical core (i.e. the thread is executed from start to finish on this core). These choices reduce the need to move data around the memory hierarchy (since the thread that is working with a given section of memory stays in the same physical location) and can reduce resource contention.
In general, all threads in a parallel code execute independently of one another within the cores that they are pinned to, acting on variables stored within memory that may be either shared (visible to all threads) or private (to one thread). However, periodically there may be a need for the threads to synchronize with one another (e.g. to agree collectively on the value of a particular variable, for one thread to communicate the value of a private variable to other threads or to avoid a “race condition”, where multiple threads write to a shared variable). These synchronization points require each thread to be in the same state. Thus, when the first (and each subsequent) thread arrives at the (barrier) point in the code at which synchronization is to be carried out it must pause execution and wait until all of the other threads also reach the same state. This process of waiting for all threads to be in the same state is known as a barrier (or barrier mechanism), which functions in the same way as the barrier synchronization mentioned above for MPI processes and other processes, but on a per-thread basis, rather than on a per-process basis.
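By way of illustration only, the thread barrier described above might be sketched in C with the POSIX barrier routines. This is a minimal sketch and not an implementation taken from the present description: the names NUM_THREADS, shared_vals, sums, worker and run_barrier_demo are hypothetical, and the use of pthread_barrier_wait() is an assumption made for the example. Each thread publishes a private value to shared memory, waits at the barrier, and only then reads the values published by the other threads, so that no thread reads a value before it has been written.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 6                     /* hypothetical thread count */

static pthread_barrier_t barrier;
static int shared_vals[NUM_THREADS];      /* shared: visible to all threads */
static int sums[NUM_THREADS];             /* one result slot per thread */

static void *worker(void *arg)
{
    int id = *(int *)arg;                 /* private to this thread */

    shared_vals[id] = id + 1;             /* publish the private value */

    /* Pause here until every thread has published its value. */
    int rc = pthread_barrier_wait(&barrier);
    assert(rc == 0 || rc == PTHREAD_BARRIER_SERIAL_THREAD);
    (void)rc;

    /* After the barrier, all entries of shared_vals are safe to read. */
    int sum = 0;
    for (int i = 0; i < NUM_THREADS; i++)
        sum += shared_vals[i];
    sums[id] = sum;
    return NULL;
}

/* Returns 1 if every thread saw all published values, otherwise 0. */
int run_barrier_demo(void)
{
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];
    int ok = 1;

    if (pthread_barrier_init(&barrier, NULL, NUM_THREADS) != 0)
        return 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        int rc = pthread_create(&threads[i], NULL, worker, &ids[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
        if (sums[i] != NUM_THREADS * (NUM_THREADS + 1) / 2)
            ok = 0;
    }
    pthread_barrier_destroy(&barrier);
    return ok;
}
```

Calling run_barrier_demo() from a main function and linking with the pthread library would exercise the sketch; without the barrier call, a thread could sum shared_vals before other threads had written their entries.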
FIG. 1 is a timeline representation showing typical thread barrier operation. The same diagram applies to process barrier operation, by a simple change of “thread” to “process” in the labeling. This is also true of all the other figures.
The case of six threads is shown. Each thread hits the barrier at the point indicated by the black diamond shape. The global barrier is said to start once the first thread reaches the barrier and end when the final thread reaches the barrier (after which normal execution continues). Threads that have reached the barrier actively poll to determine whether or not all other threads have also reached the barrier. This active polling (a “busy wait”) is the shaded region shown for threads 0, 1, 3, 4 and 5. Thread 2 is the last thread to reach the barrier and triggers the end of the barrier (and thus the end of the polling). The active polling can consume energy unnecessarily.
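The busy wait described above might be sketched as follows, purely as an illustration. The sketch assumes a single-use barrier built on a C11 atomic counter; the names arrived, wasted_polls, spin_barrier_wait and run_spin_demo are hypothetical and do not come from the figure. The wasted_polls counter is included only to make visible the work done by threads that arrive early and have nothing to do but poll.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NUM_THREADS 6                 /* hypothetical, as in the six-thread case */

static atomic_int  arrived;           /* number of threads at the barrier */
static atomic_long wasted_polls;      /* total polling iterations, all threads */

/* Single-use busy-wait barrier: announce arrival, then poll until all arrive. */
static void spin_barrier_wait(void)
{
    long polls = 0;

    atomic_fetch_add(&arrived, 1);
    while (atomic_load(&arrived) < NUM_THREADS)
        polls++;                      /* active polling: the core stays busy */
    atomic_fetch_add(&wasted_polls, polls);
}

static void *spin_worker(void *arg)
{
    int *passed = arg;

    spin_barrier_wait();
    *passed = 1;                      /* reached only after the barrier ends */
    return NULL;
}

/* Returns the number of threads that got past the barrier. */
int run_spin_demo(void)
{
    pthread_t threads[NUM_THREADS];
    int passed[NUM_THREADS] = {0};
    int count = 0;

    atomic_store(&arrived, 0);
    atomic_store(&wasted_polls, 0);
    for (int i = 0; i < NUM_THREADS; i++) {
        int rc = pthread_create(&threads[i], NULL, spin_worker, &passed[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
        count += passed[i];
    }
    return count;
}
```

In this sketch the last thread to arrive performs no polling at all, while every earlier thread spins until the counter reaches NUM_THREADS; the value left in wasted_polls varies from run to run and depends on how far apart the arrivals are.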
Various methods are used to implement thread barriers, but most rely either on a “busy-wait” as explained above (each thread repeatedly polls for the value of a variable that determines whether the barrier has ended) or on blocking algorithms that de-schedule threads to prevent their continued execution. The threads can be de-scheduled and re-scheduled by (for example) calls to pthread library routines. So, pthread_cond_wait() can be used to tell a thread to wait until another thread (the last to the barrier) makes a call to pthread_cond_broadcast(), which wakes up all the other threads.
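A blocking barrier of the kind just described might look like the following minimal sketch, which uses only standard pthread mutex and condition-variable calls. The type blocking_barrier, its fields and the functions blocking_barrier_init, blocking_barrier_wait and run_blocking_demo are hypothetical names chosen for the example; the generation counter is an assumed design detail that guards against spurious wake-ups and allows the barrier to be reused.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 6                     /* hypothetical thread count */

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  all_arrived;
    int             expected;             /* threads taking part */
    int             waiting;              /* threads currently at the barrier */
    unsigned        generation;           /* incremented each time the barrier ends */
} blocking_barrier;

static void blocking_barrier_init(blocking_barrier *b, int expected)
{
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->all_arrived, NULL);
    b->expected = expected;
    b->waiting = 0;
    b->generation = 0;
}

static void blocking_barrier_wait(blocking_barrier *b)
{
    pthread_mutex_lock(&b->lock);
    unsigned gen = b->generation;
    if (++b->waiting == b->expected) {
        /* Last thread to arrive: end the barrier and wake the others. */
        b->waiting = 0;
        b->generation++;
        pthread_cond_broadcast(&b->all_arrived);
    } else {
        /* Earlier threads are de-scheduled until the generation changes. */
        while (gen == b->generation)
            pthread_cond_wait(&b->all_arrived, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}

static blocking_barrier demo_barrier;
static int flags[NUM_THREADS];            /* set by each thread before the barrier */
static int seen[NUM_THREADS];             /* flags counted by each thread afterwards */

static void *blocking_worker(void *arg)
{
    int id = *(int *)arg;

    flags[id] = 1;
    blocking_barrier_wait(&demo_barrier);
    for (int i = 0; i < NUM_THREADS; i++)
        seen[id] += flags[i];
    return NULL;
}

/* Returns 1 if every thread observed all flags after the barrier. */
int run_blocking_demo(void)
{
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];
    int ok = 1;

    blocking_barrier_init(&demo_barrier, NUM_THREADS);
    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        flags[i] = 0;
        seen[i] = 0;
        int rc = pthread_create(&threads[i], NULL, blocking_worker, &ids[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
        if (seen[i] != NUM_THREADS)
            ok = 0;
    }
    return ok;
}
```

Compared with a busy wait, the waiting threads here consume no processor cycles while blocked, at the cost of the de-scheduling and re-scheduling overhead incurred through the operating system.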
It is desirable to reduce unnecessarily consumed energy associated with barrier operation.