1. Field
The present invention relates to high-performance computing (HPC), including programming models, distributed computing, inter-node communication and high-performance CPU (central processing unit) instruction set extensions.
2. Description of the Related Art
Increasingly, software is tested and/or improved automatically using various techniques such as auto-tuning. Auto-tuning enhances software automatically so that it performs better. For example, an auto-tuning phase of an application can analyze functionally equivalent implementations to identify the one that best meets the user's objectives. In the overlapping concept of optimization, an objective function related to an operating parameter (such as energy use, time taken for processing, or number of floating point operations per second) is maximized or minimized as appropriate. If the objective function measures the performance of a program, then optimizing the objective function “tunes” the program. Equally, there may be a search phase to find an item with specified properties among a collection of items. The search may identify possible implementations/executions, or remove less advantageous implementations. If the item found optimizes an objective function, then the search performed can be viewed as an optimization.
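By way of illustration only, the selection among functionally equivalent implementations described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive auto-tuner: the two candidate functions, the helper names (`elapsed_time`, `auto_tune`) and the problem size are all hypothetical, and elapsed processing time is assumed to be the objective function to be minimized.

```python
import time

# Two hypothetical, functionally equivalent implementations of the same task:
# summing the squares of the integers 0..n-1.
def sum_squares_loop(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

def sum_squares_formula(n):
    # Closed form: sum of i^2 for i in 0..n-1 equals (n-1) * n * (2n-1) / 6.
    return (n - 1) * n * (2 * n - 1) // 6

def elapsed_time(func, n, repeats=5):
    """Objective function (assumed): best wall-clock time over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(n)
        best = min(best, time.perf_counter() - start)
    return best

def auto_tune(candidates, n):
    """Search phase: return the name of the candidate minimizing the objective."""
    timings = {func.__name__: elapsed_time(func, n) for func in candidates}
    best_name = min(timings, key=timings.get)
    return best_name, timings

candidates = [sum_squares_loop, sum_squares_formula]

# Confirm functional equivalence before comparing performance.
assert sum_squares_loop(1000) == sum_squares_formula(1000)

best_name, timings = auto_tune(candidates, 200_000)
print("selected implementation:", best_name)
```

In this sketch the search over candidates and the minimization of the objective function are the same step, which reflects the observation above that a search can be viewed as an optimization. A different objective (for example energy use) would replace `elapsed_time` without changing `auto_tune`.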
Any of these testing techniques may be used during an execution stage or in any other stage of software development and use.
Automatic software testing is of particular value in distributed environments. In such environments, there is a plurality of processing elements or cores on which processing threads of an executable can run autonomously in parallel. The terms “processing element” and “core” may be thought of as referring to the hardware resources necessary for executing program code instructions.
In parallel distributed systems, there is the possibility of preparing and selecting distributed algorithms (which implement code using a parallelization strategy) and/or, at a lower level, kernels (compiled software code that can be executed on a node of a parallel computing system). It is desirable to provide testing (for example in the form of auto-tuning, optimization and/or search) which is efficient and quick, so as to improve software performance in these and other use cases.
Many different hardware configurations and programming models are applicable to high-performance computing. A currently popular approach to high-performance computing is the cluster system, in which a plurality of nodes, each having a multicore processor (or “chip”), are interconnected by a high-speed network. The cluster system can be programmed by a human programmer who writes source code, making use of existing code libraries to carry out generic functions. The source code is then compiled to lower-level executable code, for example code at the ISA (Instruction Set Architecture) level capable of being executed by processor types having a specific instruction set, or to assembly language dedicated to a specific processor. There is often a final stage of assembling (or, in the case of a virtual machine, interpreting) the assembly code into executable machine code. The executable form of an application (sometimes simply referred to as an “executable”) is run under supervision of an operating system (OS).
To assist understanding of the invention to be described, some relevant techniques in the field will be outlined.
Applications for computer systems having multiple cores may be written in a conventional computer language (such as C/C++ or Fortran), augmented by libraries for allowing the programmer to take advantage of the parallel processing abilities of the multiple cores. In this regard, it is usual to refer to “processes” being run on the cores.
One such library is the Message Passing Interface, MPI, which uses a distributed-memory model (each process being assumed to have its own area of memory), and facilitates communication among the processes. MPI allows groups of processes to be defined and distinguished, and includes routines for so-called “barrier synchronization”, which is an important feature for allowing multiple processes or processing elements to work together. Barrier synchronization is a technique of holding up all the processes in a synchronization group executing a program until every process has reached the same point in the program. This is achieved by an MPI function call which has to be called by all members of the group before the execution can proceed further.
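The barrier behavior described above can be sketched as follows, for illustration only. The sketch is an analogy rather than MPI itself: it assumes threads within a single process stand in for the members of the synchronization group, and it uses the Python standard library's `threading.Barrier` in place of an MPI barrier call. The group size and the `worker` function are hypothetical.

```python
import threading

NUM_WORKERS = 4  # hypothetical size of the synchronization group
barrier = threading.Barrier(NUM_WORKERS)

log = []                     # records the order in which phases are reached
log_lock = threading.Lock()  # protects the shared log

def worker(rank):
    with log_lock:
        log.append(("before", rank))
    # No thread proceeds past this call until all NUM_WORKERS threads
    # in the group have made the same call.
    barrier.wait()
    with log_lock:
        log.append(("after", rank))

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

phases = [phase for phase, _ in log]
print(phases)
```

Whatever order the threads are scheduled in, every "before" entry is logged ahead of every "after" entry, because each thread is held at the barrier until the last member of the group arrives.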
As already mentioned, MPI uses a distributed-memory model in which each process has its own local memory. Another approach to parallel programming is shared memory, where multiple processes or cores can access the same memory or area of memory in order to execute instructions in multiple, concurrent execution paths or “threads”. OpenMP is such a shared-memory processing model.
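The contrast between the two memory models can be sketched, for illustration only, as follows. This is an analogy under stated assumptions: it uses neither MPI nor OpenMP, all workers are threads in one process, and a standard-library queue merely stands in for explicit message passing. All names are hypothetical.

```python
import queue
import threading

NUM_WORKERS = 4  # hypothetical number of workers

# Shared-memory style: every worker updates the same memory location,
# so access must be coordinated (here with a lock).
shared = {"total": 0}
lock = threading.Lock()

def add_shared(value):
    with lock:
        shared["total"] += value

# Message-passing style: each worker keeps its value local and
# communicates it explicitly to a receiver.
mailbox = queue.Queue()

def send_value(value):
    mailbox.put(value)

threads = []
for rank in range(NUM_WORKERS):
    threads.append(threading.Thread(target=add_shared, args=(rank + 1,)))
    threads.append(threading.Thread(target=send_value, args=(rank + 1,)))
for t in threads:
    t.start()
for t in threads:
    t.join()

# The receiver combines the messages it was sent.
received_total = sum(mailbox.get() for _ in range(NUM_WORKERS))
print(shared["total"], received_total)  # prints "10 10"
```

Both styles arrive at the same result. They differ in where the coordination lies: in guarding access to common memory in the first case, and in explicit communication between otherwise separate workers in the second.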
A synchronization group may be constituted by all the cores of a multicore processor. Then, barrier synchronization is also possible in hardware form, such as by an on-chip logic circuit receiving outputs from the plurality of cores. This gives the advantage of higher speed in comparison with a software-based barrier.
OpenMP provides a so-called “fork-and-join” model, in which a program begins execution as a single process or thread (Master Thread). This thread executes instructions sequentially until a parallelization directive is encountered, at which point the Master Thread divides into a group of parallel Worker Threads in a Parallel region of the program. Typically, each thread is carried out by a respective core, although this is not essential. The worker threads execute instructions in parallel until they reach the end of the Parallel region. After synchronization (see above) the worker threads combine again back to a single Master Thread, which continues sequential instruction execution until the next parallelization directive is reached.
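The fork-and-join control flow described above can be sketched as follows, for illustration only. The sketch does not use OpenMP: it assumes standard-library Python threads as stand-ins for the worker threads, and the helper `parallel_region`, the data and the number of workers are hypothetical. It is intended to show the sequence of sequential execution, fork, parallel region, join and resumed sequential execution, not to demonstrate any speed-up.

```python
import threading

def parallel_region(num_workers, work):
    """Fork num_workers threads, run work(worker_id) in each, then join them."""
    results = [None] * num_workers

    def run(worker_id):
        results[worker_id] = work(worker_id)

    workers = [threading.Thread(target=run, args=(i,)) for i in range(num_workers)]
    for w in workers:  # fork: the worker threads start executing in parallel
        w.start()
    for w in workers:  # join: wait until every worker has finished the region
        w.join()
    return results

# Sequential part, executed by the single initial (master) thread.
data = list(range(100))
num_workers = 4
chunk = len(data) // num_workers  # 25 items per worker

def partial_sum(worker_id):
    start = worker_id * chunk
    return sum(data[start:start + chunk])

# Parallel region: each worker sums its own slice of the data.
partials = parallel_region(num_workers, partial_sum)

# Sequential part resumes on the master thread after the join.
total = sum(partials)
print(total)  # prints 4950
```

The join loop plays the role of the synchronization at the end of the parallel region: the master thread does not compute `total` until every worker has completed its share.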