Modern microprocessors (processors) employ many techniques to achieve high performance. For example, many modern processors are capable of simultaneously executing a plurality of threads. For instance, a processor may include multiple physical cores, each capable of executing independent threads simultaneously with the other cores. Additionally, or alternatively, a single physical processor core may be capable of simultaneously executing two or more threads. This capability is known as simultaneous multi-threading, or SMT (also referred to as hyper-threading). When SMT is used, each physical core is viewed as including two or more “logical” cores that each execute a different thread using shared execution units (e.g., multiple arithmetic logic units). In some implementations in which a processor possesses multiple physical cores, each of these physical cores is also an SMT-capable core. In most of these implementations, the processor therefore presents twice as many logical cores as physical cores.
Another technique used to achieve high performance is for a core to execute individual hardware code instructions in an order other than the order in which they were written or, more typically, than the order in which they were generated by a compiler. Such “out of order” execution enables the core to more fully utilize its internal processor resources (e.g., execution units), which are often highly parallelized. For example, if two (or more) hardware instructions are not dependent on each other, a single processor core may be able to execute these instructions in parallel, rather than idly waiting for one instruction to complete prior to beginning execution of another. Out-of-order execution can be applied to many types of hardware instructions, including instructions that perform memory operations (i.e., operations that read from or write to a memory hierarchy, typically including one or more caches and system memory). Due to out-of-order execution and/or memory hierarchy design, memory accessing operations may be perceived by another core or device as occurring in a different order than that prescribed in the original code.
In many cases, multiple threads simultaneously executing at one or more cores are related, such as being part of the same application process. When simultaneously executing threads are related, the hardware instructions executing for one thread may perform memory operations that affect one or more of the other threads, by accessing (i.e., reading from and/or writing to) a memory location in the memory hierarchy that is being used by one or more of the other threads. For example, a thread may access a shared variable (e.g., a global variable), a data structure that is shared by the threads, etc. If memory operations from different threads are executed out-of-order at their respective cores (physical or logical), and/or executed out-of-order by the memory hierarchy, this out-of-order execution could lead to problems if it is not properly dealt with.
For example, a process may include multiple threads that synchronize via one or more synchronization variables. To illustrate, suppose that code for a first thread sets two data variables and a synchronization variable, whose value starts as FALSE. For example:
Data1=A
Data2=B
SyncVariable=TRUE
Suppose further that code for a second thread reads the values of Data1 and Data2, but only when SyncVariable is TRUE. For example:
Temp=SyncVariable
WHILE Temp=FALSE
    // Do something that does not involve Data1 or Data2
    Temp=SyncVariable
END WHILE
Read Data1 & Data2
For correct execution, the ordering constraints in this scenario include both of the following: (i) the write to SyncVariable by the first thread must be ordered after the writes by the first thread to Data1 and Data2, and (ii) the read of SyncVariable by the second thread must be ordered before the subsequent reads of Data1 and Data2 by the second thread.
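As a minimal sketch of how constraints (i) and (ii) might be expressed in a higher-level language, the following C++ fragment uses a release store and an acquire load on the synchronization variable. The names data1, data2, sync_variable, producer, consumer, and run_example are assumptions chosen to mirror the pseudocode above; they do not come from any particular application.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical shared state mirroring Data1, Data2, and SyncVariable.
int data1 = 0;
int data2 = 0;
std::atomic<bool> sync_variable{false};

// First thread: write the data, then publish it.
void producer(int a, int b) {
    data1 = a;
    data2 = b;
    // Release store: the two writes above may not be reordered after
    // this store, which corresponds to constraint (i).
    sync_variable.store(true, std::memory_order_release);
}

// Second thread: wait for the flag, then read the data.
std::pair<int, int> consumer() {
    // Acquire load: the two reads below may not be reordered before
    // this load, which corresponds to constraint (ii).
    while (!sync_variable.load(std::memory_order_acquire)) {
        std::this_thread::yield();  // do something unrelated to the data
    }
    return {data1, data2};
}

// Runs both threads once and returns what the second thread observed.
std::pair<int, int> run_example(int a, int b) {
    sync_variable.store(false, std::memory_order_relaxed);
    std::pair<int, int> seen{0, 0};
    std::thread reader([&seen] { seen = consumer(); });
    std::thread writer(producer, a, b);
    writer.join();
    reader.join();
    assert(sync_variable.load(std::memory_order_relaxed));  // flag was published
    return seen;
}
```

Calling run_example(1, 2) should return the pair (1, 2) on any conforming implementation, because the release/acquire pairing guarantees that the second thread sees both data writes once it sees the flag set.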
To address memory operation re-ordering concerns, modern processors employ hardware memory models that define how memory effects are globally visible in a multi-processor (including multi-core) system. In particular, hardware memory models define how threads can interact through memory, including how they can use shared data such as synchronization variables. Programming languages can further employ software memory models to apply additional restrictions at compile time. In general, a hardware memory model defines what types of out-of-order execution of memory operations are possible when executing multiple threads.
Some processors have hardware memory models that tend to apply many restrictions to out-of-order execution of memory operations, and are thus referred to as having a generally “stronger” memory model. Other processors have hardware memory models that tend to apply fewer restrictions to out-of-order execution of memory operations, and are thus referred to as having a generally “weaker” memory model. Memory models can therefore fall on a spectrum from the strongest (e.g., a “sequentially consistent” memory model with no memory reordering) to the weakest (e.g., in which any load or store operation can effectively be reordered with any other load or store operation, as long as it would not modify the behavior of a single, isolated thread).
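The strong end of this spectrum can be illustrated with the well-known "store buffering" litmus test. The sketch below is an illustration only, and the function name count_both_zero is hypothetical. Each of two threads stores to one variable and then loads the other. Under sequentially consistent ordering, at least one thread must observe the other's store, so the outcome in which both loads return 0 is forbidden. If the operations were instead performed with std::memory_order_relaxed, that outcome would be permitted, which is the kind of reordering a weaker model allows.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the store-buffering litmus test `iterations` times with
// sequentially consistent atomics and returns how many runs ended
// with both threads reading 0.
int count_both_zero(int iterations) {
    assert(iterations >= 0);
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0};
        std::atomic<int> y{0};
        int r1 = -1;
        int r2 = -1;
        std::thread t1([&] {
            x.store(1, std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_seq_cst);
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) {
            ++both_zero;  // would indicate a store reordered after a later load
        }
    }
    return both_zero;
}
```

With sequentially consistent operations, count_both_zero returns 0 for any iteration count. A count above zero can only appear after the memory orders are weakened, and even then it depends on the hardware and on timing.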
To illustrate, the x86 family of processor instruction set architectures (ISAs) (e.g., x86 and x86-64, referred to herein collectively as x86) is known generally as having a relatively strong memory model, in which machine instructions usually come implicitly with acquire and release semantics. As a result, for most x86 instructions, when one core performs a sequence of writes, every other core generally sees those values change in the same order that they were written. In general, an instruction has “acquire semantics” if other cores will always see its memory effect before any subsequent instruction's memory effect, and an instruction has “release semantics” if other cores will see every preceding instruction's memory effect before the memory effect of the instruction itself. By contrast, the ARM-compatible family of processor ISAs is known generally as having a relatively weak or “relaxed” memory model compared to x86 ISAs, and permits many types of memory operation reordering so long as address dependencies are preserved.
Frequently, it may be desirable to execute an application that was compiled for processors having a first ISA (e.g., x86) on a processor having a second ISA (e.g., ARM). If the application's higher-level source code is available, and it was written in a portable manner, it is usually relatively straightforward to re-compile the application with the second ISA as the target. If source code is not available, however, the hardware instructions (i.e., machine code) of the application's executable binary need to be translated to instructions of the second ISA. Thus, a translator may translate the first hardware instructions of the first ISA to compatible second hardware instructions of the second ISA.
If the first ISA has a stronger memory model than the memory model of the second ISA, any of the first hardware instructions that access memory may carry implicit acquire and release semantics. Since these semantics are implicit, it is generally not known from the instructions themselves what types of required orderings exist. Thus, when emitting second instructions for the second ISA, conventional translators typically insert memory barriers into the second instructions to force the memory operations to execute with ordering restrictions that are similar to those that would have existed in the first ISA. A memory barrier may comprise one or more additional instructions that are emitted in connection with an instruction that performs a memory operation, and that apply ordering constraints to execution of the memory operation. Additionally, or alternatively, a memory barrier may comprise an instruction in the second ISA that performs the memory operation while enforcing ordering constraints itself.
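The two forms of memory barrier described above have rough analogues in C++, shown in the sketch below. All names here (payload, ready, and the four functions) are hypothetical. The first form pairs a relaxed memory operation with a separate fence, and the second form uses a single operation that carries the ordering itself. On 64-bit ARM targets, compilers typically lower these to a standalone barrier instruction (such as DMB) and to ordered load/store instructions (such as LDAR and STLR), respectively, though the exact output depends on the compiler and target.

```cpp
#include <atomic>
#include <cassert>

int payload = 0;                 // hypothetical ordinary data
std::atomic<bool> ready{false};  // hypothetical synchronization flag

// Form 1: an additional barrier emitted alongside the memory operation.
void publish_with_fence(int value) {
    payload = value;
    std::atomic_thread_fence(std::memory_order_release);  // separate barrier
    ready.store(true, std::memory_order_relaxed);
}

int consume_with_fence() {
    while (!ready.load(std::memory_order_relaxed)) {}
    std::atomic_thread_fence(std::memory_order_acquire);  // separate barrier
    return payload;
}

// Form 2: the memory operation enforces the ordering constraint itself.
void publish_with_release_store(int value) {
    payload = value;
    ready.store(true, std::memory_order_release);
}

int consume_with_acquire_load() {
    while (!ready.load(std::memory_order_acquire)) {}
    return payload;
}

// Exercises both forms on a single thread; each publish precedes its
// matching consume, so the spin loops exit immediately.
void run_both_forms() {
    publish_with_fence(7);
    assert(consume_with_fence() == 7);
    ready.store(false, std::memory_order_relaxed);
    publish_with_release_store(9);
    assert(consume_with_acquire_load() == 9);
}
```

Either form provides the ordering needed for the flag-based handoff. The choice between them is a cost trade-off on the target, which is why the number and kind of barriers a translator emits matters for performance.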
Since the translator is operating based on implicit acquire and release semantics of individual instructions, rather than a broader understanding of the application code that may be possible if higher-level source code were available, there may be many situations in which it emits barriers that are not actually necessary for correct execution of the application at the second ISA. Such barriers may not have been created had the higher-level source code been compiled with the second ISA as a direct target. As such, these barriers enforce ordering constraints that are unnecessary for correct execution at the second ISA, generally increase the number of instructions emitted in the second ISA, and harm execution performance of the translated application at the second ISA.
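The cost of this conservative approach can be seen in a toy model. The sketch below is purely illustrative and does not represent any real translator: the Op categories and the LOAD, STORE, ALU, and BARRIER mnemonics are invented placeholders rather than instructions of any actual ISA. It attaches a barrier to every memory access, as a translator that knows nothing beyond per-instruction semantics might do.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical categories of source-ISA instructions.
enum class Op { Load, Store, Alu };

// Conservative translation: every memory access gets its own barrier.
std::vector<std::string> translate_conservatively(const std::vector<Op>& source) {
    std::vector<std::string> out;
    for (Op op : source) {
        switch (op) {
        case Op::Load:
            out.push_back("LOAD");
            out.push_back("BARRIER");  // keep later accesses after this load
            break;
        case Op::Store:
            out.push_back("BARRIER");  // keep earlier accesses before this store
            out.push_back("STORE");
            break;
        case Op::Alu:
            out.push_back("ALU");      // no memory access, no barrier
            break;
        }
    }
    return out;
}

// Counts the barrier placeholders in a translated sequence.
std::size_t count_barriers(const std::vector<std::string>& code) {
    return static_cast<std::size_t>(
        std::count(code.begin(), code.end(), std::string("BARRIER")));
}

// The first thread from the earlier example performs three stores.
void check_first_thread_example() {
    const std::vector<Op> first_thread{Op::Store, Op::Store, Op::Store};
    const auto translated = translate_conservatively(first_thread);
    assert(translated.size() == 6);           // three stores plus three barriers
    assert(count_barriers(translated) == 3);
}
```

For the first thread in the earlier example, this model emits three barriers for three stores. Under the stated constraints only the store to SyncVariable needs to be ordered after the two data writes, so a single barrier before that final store would suffice and the other two are overhead.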