Computer systems execute instructions of various code. Often, the code is not designed for a particular processor, and the codes performance on a given platform can suffer. Effective optimizations can improve performance and reduce power consumption. There has been a great deal of work to develop optimization techniques such as partial redundancy elimination (e.g., eliminating redundant operations), load hoisting (e.g., scheduling loads early in the execution flow), and so on. Unfortunately, these techniques have only been applied with a limited optimization scope. Complicated memory models of modern processors hinder memory operations for multi-threaded programs.
Architectural support helps mitigate the complexity of implementing speculative compiler optimizations. Atomic execution allows a group of instructions to be enclosed within a region and executed atomically (namely all or none of the instructions are executed) and in an isolated manner (in that no intermediate results of region are exposed to the rest of the system).
While eliminating much of the burden to implement speculative optimizations, existing hardware designs for atomic execution impose unnecessarily strict memory ordering constraints on underlying hardware platforms for relaxed memory models such as weak consistency and total store ordering (TSO). When applied to multi-threaded applications, atomic regions restrict reordering of memory operations among different atomic regions. Atomic regions are executed on a serializable schedule (that is, the effect of their execution has to be as if they are executed one by one). As a result, memory operations have to be totally ordered (such that all processors agree in their global order of execution). Accordingly performance optimizations are limited.
Multi-core processors are found in almost all computing segments today, including servers, desktops and SoC. The move to these multi-core processor systems necessitates the development of parallel programs to take advantage of performance. Programming a multi-core processor system, however, is a complex task because of the non-deterministic nature of the software execution on these systems. This non-determinism comes from many reasons, including the multitude ways in which the different threads of execution interleave in shared memory, making the reproduction and the understanding of a program execution difficult.