1. Technical Field
The present invention relates generally to computer systems and more specifically to memory access operations on a computer system with a weakly-ordered architecture. Still more particularly, the present invention relates to memory access operations of a well-behaved application executing on a computer system with a weakly-ordered processor and memory subsystem.
2. Description of the Related Art
Memory access operations of data processing systems were traditionally executed in order (i.e., the order in which the instructions are written in the application code) using highly-ordered processors and completed in order at the memory subsystem. With highly-ordered processors, such as Intel's x86 processors, memory access instructions (e.g., store instructions) strictly follow the processing order to ensure that no conflicts occur at the memory subsystem.
Advancements in processor and cache memory technology have led to the creation of weakly-ordered processors (e.g., International Business Machines' “PowerPC” family of processors), which enable weakly-ordered (or out-of-order) processing of instructions (including memory access instructions) and are typically faster than highly-ordered processors. Thus, unlike a highly-ordered processor, a weakly-ordered processor typically processes some instructions, including memory access instructions, out-of-order relative to each other.
In order to further enhance performance, state-of-the-art data processing systems often utilize multiple processors which concurrently execute portions of a given application/task. These multiple processor (MP) data processing systems (hereinafter referred to as “MDPS”) often utilize a multi-level memory hierarchy to reduce the access time required to retrieve data from memory. An MDPS may include a number of processors, each with an associated level-one (L1) cache, a number of level-two (L2) caches, and a number of modules of system memory. Typically, the memory hierarchy is arranged such that each L2 cache and system memory module is coupled to a system bus or interconnect switch, such that an L2 cache within the MDPS may access data from any of the system memory modules coupled to the bus or interconnect switch.
Because each of the processors within an MDPS may modify data, an MDPS typically employs a protocol to maintain memory coherence. For example, an MDPS utilizing PowerPC RISC processors utilizes a MESI or similar coherency protocol. Those skilled in the art are familiar with such coherency protocols.
On-the-fly instruction translation between different processor types in a MDPS is becoming more viable as processor technology moves towards the faster, weakly-ordered processors. When one or more weakly-ordered processors are being utilized to execute instructions of a well-behaved application written for a highly ordered processor, protecting the order of memory access instructions at the memory subsystem is handled by introducing memory barrier instructions.
Sync instructions are issued after each store operation of a translated, well-behaved application to flush the updated values of the store operation from the processor cache (e.g., L1 write-back cache) to a point of coherency in the memory subsystem. The sync provides visibility to the other processors of updates to a memory location by the particular processor executing instructions of the well-behaved application. Thus, in a conventional MDPS, maintaining the order of stores while allowing visibility of the store operations to other processors executing instructions of the well-behaved application requires that each store operation executed by a weakly-ordered processor be followed by a sync instruction.
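The store-then-sync pattern described above can be illustrated with a minimal C++ sketch. This is an analogy under stated assumptions, not the hardware mechanism itself: `std::atomic_thread_fence(std::memory_order_seq_cst)` stands in for the sync instruction, relaxed atomic stores stand in for the translated store operations, and the names `store_then_sync` and `run_message_passing` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for one translated store: a relaxed store followed
// immediately by a full fence, analogous to "store; sync".
inline void store_then_sync(std::atomic<int>& location, int value) {
    location.store(value, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

// A writer publishes a data value and then a flag, each followed by a fence.
// A reader waits for the flag, issues its own fence, and then reads the data.
int run_message_passing() {
    std::atomic<int> data{0};
    std::atomic<int> flag{0};
    int observed = -1;

    std::thread writer([&] {
        store_then_sync(data, 42);  // store to data, then "sync"
        store_then_sync(flag, 1);   // store to flag, then "sync"
    });
    std::thread reader([&] {
        while (flag.load(std::memory_order_relaxed) == 0) {
            std::this_thread::yield();  // wait until the flag store is visible
        }
        // Pairs with the writer's fence so the earlier data store is visible.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        observed = data.load(std::memory_order_relaxed);
    });

    writer.join();
    reader.join();
    assert(observed != -1);  // the reader ran and recorded a value
    return observed;
}
```

Calling `run_message_passing()` returns the value the reader observed. Because the fence after the data store precedes the flag store, a reader that sees the flag set also sees the data value, which is the ordering guarantee the per-store sync is meant to preserve.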
When multiple different processors are able to update a memory block during execution of an application, serialization of these updates is provided using a lock instruction. The lock instruction is a specialized instruction that enables a processor executing an application being concurrently processed by multiple processors to atomically update the memory block before another processor is permitted to update the memory block. Locks are thus provided when executing well-behaved applications to provide some level of serialization in updating these shared memory blocks. Those skilled in the art are familiar with the use of locks to enable serialized access of multiple processors to specific blocks of memory.
In conventional systems, translation of lock instructions of a well-behaved application for execution on a weakly-ordered processor involves identifying each lock instruction, translating the lock instruction, and then providing a following sync instruction. Conventionally, when a lock is being acquired, the acquiring processor issues a sync to make the lock visible to the other processors. The other processors then do not update the memory block until the lock is released. The lock is later released using a simple store operation targeting the lock address.
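The conventional lock sequence described above (atomic acquisition followed by a sync, then release by a simple store to the lock address) may be sketched as follows. This is a hedged illustration only: the `TranslatedLock` type, `full_sync`, and `run_locked_counter` are hypothetical names, and a C++ sequentially consistent fence is assumed as an analogue of the sync instruction. The fence after the protected store reflects the conventional practice of following each store with a sync.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Assumed analogue of the sync instruction.
inline void full_sync() {
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

// Hypothetical model of a translated lock: 0 = free, 1 = held.
struct TranslatedLock {
    std::atomic<int> word{0};

    void acquire() {
        // Atomic update of the lock word; retry until it was previously free.
        while (word.exchange(1, std::memory_order_relaxed) != 0) {
            std::this_thread::yield();
        }
        full_sync();  // sync following the translated lock instruction
    }

    void release() {
        assert(word.load(std::memory_order_relaxed) == 1);  // must be held
        // Release is a simple store targeting the lock address.
        word.store(0, std::memory_order_relaxed);
    }
};

// Several threads increment a shared value under the lock.
int run_locked_counter(int threads, int iterations) {
    TranslatedLock lock;
    std::atomic<int> shared{0};

    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.acquire();
                int v = shared.load(std::memory_order_relaxed);
                shared.store(v + 1, std::memory_order_relaxed);
                full_sync();  // conventional sync after the protected store
                lock.release();
            }
        });
    }
    for (auto& th : pool) th.join();
    return shared.load(std::memory_order_relaxed);
}
```

In this sketch, the fence that follows the protected store is what makes that store visible before the releasing store to the lock word, so the plain release store is sufficient only because a sync has already been issued.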
During the lock phase, once the lock is taken, multiple intermediate store instructions may be executed by the processor, and some of these updates may not be made visible to the other processors. There is no way of knowing when the lock is released and/or determining which stores have been made visible to the other processors. Thus, with conventional systems, syncs have to be introduced after each intermediate store operation to make the store visible to the other components (at the point of coherency) as they are occurring.
Thus, when performing on-the-fly translation of the well-behaved application code into instructions for executing on the weakly-ordered processor, each store operation that affects/updates the point of coherency is immediately followed by a sync to ensure visibility. Given the substantial number of memory operations that may be scheduled during execution of the well-behaved application, a correspondingly large number of sync operations must be inserted into the execution stream for the weakly-ordered processor to make the intermediate updates visible to the other processors.
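The insertion of a sync after each translated store can be pictured as a simple pass over an instruction stream. The following is a toy sketch, assuming a hypothetical `Op` enumeration in place of real machine encodings; the function names `insert_sync_after_stores` and `count_ops` are likewise illustrative and do not describe any particular translator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified instruction kinds; a real translator operates on
// actual machine instructions.
enum class Op { Load, Store, Add, Sync };

// Copy the translated stream, emitting a Sync immediately after every Store.
std::vector<Op> insert_sync_after_stores(const std::vector<Op>& translated) {
    std::vector<Op> out;
    out.reserve(translated.size() * 2);  // worst case: every op is a Store
    for (Op op : translated) {
        out.push_back(op);
        if (op == Op::Store) {
            out.push_back(Op::Sync);
        }
    }
    assert(out.size() >= translated.size());  // the pass only adds ops
    return out;
}

// Count how many operations of a given kind appear in a stream.
std::size_t count_ops(const std::vector<Op>& stream, Op kind) {
    std::size_t n = 0;
    for (Op op : stream) {
        if (op == kind) ++n;
    }
    return n;
}
```

For an input stream of Load, Store, Add, Store, the pass yields Load, Store, Sync, Add, Store, Sync. The number of added syncs equals the number of stores, which is why the overhead scales with the proportion of store operations in the application.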
There are several performance limitations with the on-the-fly translation of application code of a well-behaved application for execution on a weakly-ordered processor (i.e., an application created for execution within a highly-ordered architecture being executed within a weakly-ordered architecture). As is clear from the above description, because of the change in processor architecture, translation of the instructions alone is not sufficient to guarantee correct operation of the application, and syncs are therefore introduced into the code after each store instruction.
Thus, when a well-behaved application is being executed, and protected shared structures are accessed within the context of locks, there is a built-in penalty attributed to the application. This penalty is caused by the overhead of ensuring synchronization for each store instruction. While issuing a sync instruction after each store provides a solution to translating between a highly-ordered architecture and a weakly-ordered architecture, testing has shown that the overhead introduced by issuing these sync instructions after each store is very significant.
In one example, an application compiled for the IBM Power platform (i.e., with no on-the-fly instruction translation) exhibited a nearly 200% degradation in performance when a sync instruction was inserted after each store. The overhead may change depending on the application; however, most applications typically have a very high percentage of store operations, and thus, for most well-behaved applications, this overhead remains very significant.