Hardware/Software (HW/SW) co-designed architecture is a promising technique for modern architecture innovation, and dynamic binary optimization is an important component of an HW/SW co-designed architecture. With advances in Transactional Memory (TM) and Hardware Lock Elision (HLE) techniques, there have been proposals to leverage the atomic regions supported by TM/HLE for dynamic binary optimization. Since atomic regions are guaranteed to execute atomically, consistently, and in isolation, the code within an atomic region can be reordered without concern for interactions between different threads. However, because memory order must be strict across locked sections, the atomic regions supported by TM/HLE, which target lock elision, impose a stronger memory order than dynamic binary optimization techniques in x86 need, and the stronger memory order usually leads to an inefficient architecture implementation.
In x86, memory instructions retire from a CPU in their program order (i.e., in-order retire). However, retired store data (i.e., senior stores) may be buffered in an internal store buffer in program order and written to the cache/memory later. Execution of memory instructions in x86 may therefore be viewed as having two stages. In the first stage, the memory instructions retire from the CPU in their original program order. After the first stage, the store data stays in the store buffer waiting for the second stage. In the second stage, the load instructions do nothing, but the store instructions need to follow their original program order when writing data back from the store buffer to the cache (i.e., in-order write-back). Thus, in x86 both stages execute in order. Logically, the load instructions can be viewed as accessing memory instantly at the end of the first stage and the store instructions as accessing memory instantly at the end of the second stage. As a result, x86 allows reordering of memory accesses between an earlier store and a later load, provided they access different memory. However, x86 does not allow any reordering of memory accesses between two load instructions or between two store instructions, because of the in-order retire and in-order write-back in the two stages. x86 also prohibits reordering of memory accesses between an earlier load and a later store instruction in dynamic binary optimization.
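The two-stage view above can be illustrated with a toy model. The following Python sketch is not a description of any real x86 implementation; the `Core` class, its method names, and the forwarding of a core's own buffered stores to its later loads are assumptions made only for illustration. It shows how an earlier store and a later load to different addresses can appear reordered to another core while stores still write back in program order.

```python
from collections import deque


class Core:
    """Toy core (hypothetical): stores retire into a FIFO store buffer
    (stage 1) and reach shared memory later, in program order (stage 2)."""

    def __init__(self, memory):
        self.memory = memory          # shared dict standing in for cache/memory
        self.store_buffer = deque()   # senior stores, oldest on the left

    def store(self, addr, value):
        # Stage 1: the store retires; its data waits in the store buffer.
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # A load accesses memory at the end of stage 1. Assumption: it first
        # sees this core's own youngest buffered store to the same address.
        for buffered_addr, value in reversed(self.store_buffer):
            if buffered_addr == addr:
                return value
        return self.memory.get(addr, 0)

    def drain(self):
        # Stage 2: senior stores are written back oldest-first.
        while self.store_buffer:
            addr, value = self.store_buffer.popleft()
            self.memory[addr] = value


def store_load_litmus():
    """Core 0 runs: x = 1; r0 = y.   Core 1 runs: y = 1; r1 = x."""
    memory = {}
    core0, core1 = Core(memory), Core(memory)
    core0.store("x", 1)      # retired, but still buffered
    core1.store("y", 1)      # retired, but still buffered
    r0 = core0.load("y")     # memory does not hold y = 1 yet
    r1 = core1.load("x")     # memory does not hold x = 1 yet
    core0.drain()
    core1.drain()
    return r0, r1, dict(memory)
```

In this interleaving both loads return 0 even though both stores retired first, which is the store-to-load reordering the two stages permit. Because each buffer is a FIFO, two stores from the same core can never become visible out of order in this model.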
The two stages make x86 much more efficient than architectures implementing sequential consistency. A store instruction can retire without waiting for its data to be written back to the cache, which eliminates the expensive retirement stalls caused by store misses. To support a strict order of memory accesses when necessary, x86 provides the expensive fence instruction (including lock instructions, because in x86 lock instructions also act as fences for the memory instructions on either side of them) to enforce a strict order of memory accesses between memory instructions. The implementation of a fence instruction synchronizes the two stages by merging them into a single stage: a fence instruction cannot retire until all the senior stores have been written to the cache. In this way, a strict order of memory accesses can be enforced on the memory instructions across the fence. The cost is the overhead the fence instruction incurs while waiting for the senior stores to drain.
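As a rough illustration of how a fence merges the two stages, the sketch below models a fence as draining a core's store buffer before any later instruction proceeds. The function `run_litmus`, its helpers, and the chosen interleaving are hypothetical and only meant to make the ordering effect concrete.

```python
from collections import deque


def run_litmus(use_fence):
    """Core 0 runs: x = 1; [fence]; r0 = y.  Core 1 runs: y = 1; [fence]; r1 = x."""
    memory = {"x": 0, "y": 0}
    buffers = [deque(), deque()]   # one FIFO store buffer per toy core

    def store(core, addr, value):
        # Stage 1: the store retires into the store buffer.
        buffers[core].append((addr, value))

    def fence(core):
        # The fence cannot retire until every senior store of this core
        # has been written back; the loop stands in for that stall.
        while buffers[core]:
            addr, value = buffers[core].popleft()
            memory[addr] = value

    def load(core, addr):
        # Assumption: a core sees its own youngest buffered store first.
        for buffered_addr, value in reversed(buffers[core]):
            if buffered_addr == addr:
                return value
        return memory[addr]

    store(0, "x", 1)
    store(1, "y", 1)
    if use_fence:
        fence(0)
        fence(1)
    r0 = load(0, "y")
    r1 = load(1, "x")
    return r0, r1
```

Without the fence this interleaving yields `(0, 0)`; with it, both stores are in memory before either load executes and the result is `(1, 1)`. The drain loop in `fence` is the part that corresponds to the overhead of waiting for senior stores.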
There have been many advances in Transactional Memory (TM) and Hardware Lock Elision (HLE) techniques. The term “Transactional Memory” refers to attempts to simplify parallel programming by allowing a group of load and store instructions to execute in an atomic manner. A transaction in this context is a piece of code that executes a series of reads and writes to shared memory. These reads and writes logically occur at a single instant in time, so intermediate states are not visible to other (successful) transactions. The term “Lock Elision” refers to attempts to eliminate a lock from program code that contains a lock. Locks can only be removed from inside atomic regions.
Existing TM/HLE techniques implement atomic regions (or transactions). Besides retiring from the CPU, each memory instruction in the atomic region also needs to commit from the speculative cache. All the instructions in an atomic region either commit atomically in a single stage or are completely rolled back. Although atomic regions may enable many more binary optimizations, their implementation has some inherent inefficiency. One issue encountered when implementing an atomic region in x86 is that the atomic commit requires all of the stores in the region to be drained from the store buffer to the cache before all the memory instructions in the region can commit. Waiting for the stores to drain may stall retirement of any instructions occurring after the atomic region. Since atomic regions are guaranteed to execute atomically, consistently, and in isolation, the code within an atomic region can be reordered without concern for interactions between different threads. However, because the order of memory accesses must be strict across the boundaries of locked sections, the atomic regions supported by TM/HLE, which target lock elision, impose a stricter order of memory accesses than dynamic binary optimization in x86 needs. The stricter order usually leads to a less efficient architecture implementation.
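The single-stage commit described above can be sketched as a toy model. The `AtomicRegion` class, its fields, and the `stall_steps` counter are hypothetical and do not correspond to any particular TM/HLE implementation; the sketch only shows that a commit must first drain every store of the region, and that a rollback leaves memory untouched.

```python
from collections import deque


class AtomicRegion:
    """Toy atomic region (hypothetical) with a single-stage commit."""

    def __init__(self, memory):
        self.memory = memory            # shared, architecturally visible state
        self.store_buffer = deque()     # retired but not yet drained stores
        self.speculative_cache = {}     # drained but not yet committed data

    def store(self, addr, value):
        self.store_buffer.append((addr, value))

    def commit(self):
        # Every store must leave the store buffer before the commit point;
        # each loop iteration stands in for time spent stalled on the drain.
        stall_steps = 0
        while self.store_buffer:
            addr, value = self.store_buffer.popleft()
            self.speculative_cache[addr] = value
            stall_steps += 1
        # Single-stage commit: all stores become visible together.
        self.memory.update(self.speculative_cache)
        self.speculative_cache.clear()
        return stall_steps

    def rollback(self):
        # Complete rollback: discard all speculative state.
        self.store_buffer.clear()
        self.speculative_cache.clear()


def demo():
    memory = {"a": 0, "b": 0}
    region = AtomicRegion(memory)
    region.store("b", 2)          # stores may be reordered inside the region
    region.store("a", 1)
    before = dict(memory)         # nothing is visible before the commit
    stalls = region.commit()
    committed = dict(memory)
    region.store("a", 99)
    region.rollback()             # the aborted store never reaches memory
    return before, stalls, committed, dict(memory)
```

In this model the number of drain steps grows with the number of stores in the region, which is a simplified stand-in for the retirement stall that follows an atomic region.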
To date, there has been very little, if any, research or development work on two commit stages for regions targeting dynamic binary optimization. Existing TM/HLE techniques, which target speculative lock elision, implement atomic regions with a single-stage atomic commit.