Advances in semiconductor processing and logic design have permitted an increase in the amount of logic that may be present on integrated circuit devices. Previously, on single-threaded processors, optimization of code, such as binary code, was allowed to be aggressive because there was no risk of interference by other threads of execution. Computer system configurations, however, have evolved from a single or multiple integrated circuits in a system to multiple cores, multiple hardware threads, and multiple logical processors present on individual integrated circuits. A processor or integrated circuit typically comprises a single physical processor die, where the processor die may include any number of cores, hardware threads, or logical processors. The ever-increasing number of processing elements (cores, hardware threads, and logical processors) on integrated circuits enables more tasks to be accomplished in parallel. This evolution from single-threaded processors to more parallel, multi-threaded execution has imposed limits on code optimization.
For example, Pseudo Code A (see FIG. 12a) illustrates an optimization of binary code in which the loads from memory at [r2] and [r2+4] are hoisted out of a loop to a header block (B3) by Partial Redundancy Load Elimination (PRLE) optimization, and the store to memory at [r2+4] is sunk out of the loop to a tail block (B4) by Partial Dead Store Elimination (PDSE) optimization. This optimization may work correctly in a single-threaded environment. However, in multi-threaded applications, other threads may write to or read from memory at [r2] or [r2+4] during the loop execution, which potentially results in invalid execution because the execution order of the memory operations has changed.
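The hazard described above can be illustrated with a minimal sketch. The loop body below is hypothetical and is not the code of FIG. 12a; it only assumes a loop that loads [r2] and [r2+4] and stores to [r2+4] on each iteration. Memory is modeled as a dictionary, and the write by another thread is simulated deterministically by a callback (`other_thread_write`, an illustrative name) rather than by real concurrency.

```python
def run_original(mem, n, interfere=None):
    """Loop as written: loads and store are performed on every iteration."""
    for i in range(n):
        if interfere is not None:
            interfere(mem, i)          # stands in for another thread's write
        t0 = mem["r2"]                 # load  [r2]
        t1 = mem["r2+4"]               # load  [r2+4]
        mem["r2+4"] = t1 + t0          # store [r2+4]
    return mem["r2+4"]


def run_optimized(mem, n, interfere=None):
    """Same loop after PRLE-style hoisting and PDSE-style sinking (assumed shape)."""
    t0 = mem["r2"]                     # load hoisted to the header block
    t1 = mem["r2+4"]                   # load hoisted to the header block
    for i in range(n):
        if interfere is not None:
            interfere(mem, i)          # the write lands, but is never re-read
        t1 = t1 + t0                   # loop body now works on registers only
    mem["r2+4"] = t1                   # store sunk to the tail block
    return mem["r2+4"]


def other_thread_write(mem, i):
    """Hypothetical interfering write to [r2] during the third iteration."""
    if i == 2:
        mem["r2"] = 10


def fresh_memory():
    return {"r2": 1, "r2+4": 0}


if __name__ == "__main__":
    # No interference: both versions agree.
    print(run_original(fresh_memory(), 4))                        # 4
    print(run_optimized(fresh_memory(), 4))                       # 4
    # With interference: the optimized loop misses the new value of [r2].
    print(run_original(fresh_memory(), 4, other_thread_write))    # 22
    print(run_optimized(fresh_memory(), 4, other_thread_write))   # 4
```

Without interference both versions return the same value, which is why the transformation is acceptable on a single thread. Once another writer changes [r2] mid-loop, the original loop observes the new value while the optimized loop continues to use the stale hoisted copy.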