Computer systems execute instructions from a wide variety of code. Often the code is not designed for a particular processor, and its performance on a given platform can suffer. Effective compiler optimizations can improve performance and reduce power consumption. Decades of work have produced optimization techniques such as partial redundancy elimination (e.g., eliminating redundant operations) and load hoisting (e.g., scheduling loads early in the execution flow). Unfortunately, these techniques, although effective in principle, are often difficult to adopt or can be applied only within a limited optimization scope. The complex control flow found in many integer and enterprise applications demands sophisticated recovery code in case speculative compiler optimizations fail. The complicated memory models of modern processors prevent the compiler from aggressively rescheduling memory operations in multi-threaded programs.
Architectural support for atomic execution helps mitigate the complexity of implementing speculative compiler optimizations. A hardware primitive for atomic execution allows a group of instructions to be enclosed within a region and executed atomically (that is, either all or none of the instructions are executed) and in isolation (in that no intermediate results of the region are exposed to the rest of the system). Using the primitive, the compiler can avoid generating complex compensation code for speculative optimizations by simply undoing the failed speculative execution of the region and restarting it without speculation. The atomic execution of memory operations in the region also allows the compiler to reorder those operations aggressively within the region.
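The undo-and-restart behavior described above can be illustrated with a small software model. This is only a sketch under stated assumptions: `run_atomic`, `SpeculationFailed`, and the two example code paths are hypothetical names introduced here, and a dictionary copy stands in for whatever checkpoint and buffering mechanism the hardware primitive actually provides.

```python
import copy


class SpeculationFailed(Exception):
    """Raised when an assumption made by the speculative code is false."""


def run_atomic(state, speculative, fallback):
    """Software model of an atomic region (hypothetical helper).

    The region works on a private copy, so no intermediate result
    becomes visible in `state` before the commit at the end.
    """
    working = copy.deepcopy(state)        # isolation: private working copy
    try:
        speculative(working)              # optimized, speculative version
    except SpeculationFailed:
        working = copy.deepcopy(state)    # rollback: drop every speculative update
        fallback(working)                 # restart without speculation
    state.clear()
    state.update(working)                 # commit: updates appear together


def speculative_path(mem):
    # Assumed hot path: the compiler speculated that mem["p"] is nonzero.
    mem["count"] += 1
    if mem["p"] == 0:
        raise SpeculationFailed
    mem["q"] = 100 // mem["p"]


def safe_path(mem):
    # Unoptimized version of the same region; handles every case.
    mem["count"] += 1
    mem["q"] = 100 // mem["p"] if mem["p"] != 0 else 0


hit = {"p": 4, "count": 0, "q": -1}
run_atomic(hit, speculative_path, safe_path)

miss = {"p": 0, "count": 0, "q": -1}
run_atomic(miss, speculative_path, safe_path)
```

In the `miss` case the increment performed by the failed speculative attempt is discarded, so `count` ends at 1 rather than 2, and no separate compensation code is needed to repair it.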
While eliminating much of the compiler's burden in implementing speculative optimizations, existing hardware designs for atomic execution impose unnecessarily strict memory ordering constraints on underlying hardware platforms that support relaxed memory models such as weak consistency and total store ordering (TSO). When the memory operations of an atomic region are executed atomically, the boundary of the region behaves as a memory fence so that the memory operations become visible to the rest of the system when the region commits. These memory fences prevent memory operations from being executed out of program order across region boundaries, even when atomic regions are used to optimize a part of a code segment where memory operations can be executed out of order in a relaxed memory model (e.g., optimizing the code along a hot path of a single-threaded application whose memory operations do not access synchronization variables and therefore can be executed out of order under weak consistency). Moreover, when applied to multi-threaded applications, atomic regions restrict reordering of memory operations among different atomic regions as well. The regions are executed on a serializable schedule (that is, the effect of their execution has to be as if they were executed one by one). As a result, memory operations have to be totally ordered (such that all processors agree on their global order of execution). Accordingly, performance optimizations are limited.
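The cost of fence-like region boundaries can be made concrete with a toy count of permitted execution orders. The model below is an assumption made for illustration only: a single thread, no synchronization variables, and a weak-consistency-like rule that preserves program order only between operations to the same address. The names `FENCE` and `allowed_orders` are hypothetical.

```python
from itertools import permutations

FENCE = "fence"


def allowed_orders(program):
    """Count execution orders in which no memory operation crosses a fence
    and operations to the same address keep their program order."""
    # Record which fence-delimited segment each memory operation belongs to.
    segment = {}
    seg = 0
    for i, op in enumerate(program):
        if op == FENCE:
            seg += 1
        else:
            segment[i] = seg
    ops = [(i, op) for i, op in enumerate(program) if op != FENCE]

    count = 0
    for order in permutations(ops):
        pos = {i: k for k, (i, _) in enumerate(order)}
        ok = True
        for a, (_, addr_a) in ops:
            for b, (_, addr_b) in ops:
                # a precedes b in program order and the pair must stay ordered
                must_order = segment[a] < segment[b] or addr_a == addr_b
                if a < b and must_order and pos[a] > pos[b]:
                    ok = False
        if ok:
            count += 1
    return count


ld_a, ld_b, st_c, st_d = ("ld", "a"), ("ld", "b"), ("st", "c"), ("st", "d")

# Four independent operations with no atomic region.
relaxed = allowed_orders([ld_a, ld_b, st_c, st_d])

# The same operations, with the middle two enclosed in an atomic region
# whose boundaries act as fences.
fenced = allowed_orders([ld_a, FENCE, ld_b, st_c, FENCE, st_d])
```

Under these assumptions the unfenced sequence admits 24 orders, while enclosing the middle two operations in a region leaves only 2, because nothing may move across either boundary.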