1. Technical Field
The present application relates generally to an improved data processing device. More specifically, the present application is directed to an apparatus, method, and computer program product for ensuring maximum code motion of accesses to DMA buffers.
2. Description of Related Art
International Business Machines Corporation has recently developed a next-generation microprocessor architecture known as the Cell Broadband Engine processor architecture, referred to herein as the Cell processor. This architecture provides a multiple-core system-on-a-chip microprocessor that comprises a master processor, referred to as the Power Processor Unit (PPU), and a plurality of co-processors, referred to as the Synergistic Processing Units (SPUs). Each SPU has an associated local storage device, referred to as the local store (LS), a memory flow controller (MFC), and a bus interface unit (BIU). This combination of units is referred to as a Synergistic Processing Element (SPE). The details of the Cell Broadband Engine architecture are described in “Cell Broadband Engine Architecture V1.0” available from the DeveloperWorks website at <www-128.ibm.com/developerworks/power/cell/>.
The use of Synergistic Processing Elements (SPEs) in the Cell processor presents many challenges that are not generally found in traditional processor designs. One particular challenge facing programmers of the SPE is ensuring correct ordering of data accesses to the direct memory access (DMA) buffers in the SPE's local store (LS).
With the SPEs, both the SPU and the MFC may perform transactions on the local store. The SPU is a computational engine that may perform quadword loads and stores from and to the local store. The MFC is a DMA engine that may perform block data transfers between the local store and the effective addresses of the Cell processor's system memory. Typically, DMA transfers are initiated via a series of SPU channel writes to the MFC (see the “Cell Broadband Engine Architecture V1.0” document referenced above). The SPU then waits for the DMA request to complete before accessing the data transferred (in the case of a DMA “get” transfer), or storing new data to the local store DMA buffer for the subsequent transfer (in the case of a DMA “put” transfer).
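The required ordering of initiate, wait, and access for a DMA “get” transfer can be sketched on an ordinary host as follows. This is only an illustrative model under stated assumptions: the names mock_dma_get, mock_dma_wait, fetch_and_sum, system_memory, ls_buffer, and pending_tags are hypothetical stand-ins and are not part of the Cell Broadband Engine architecture or any Cell software development kit. The mock defers the copy until the wait, so that reading the buffer before the wait would observe stale contents.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side model; none of these names are real Cell APIs. */
enum { LS_BUF_SIZE = 16 };

static unsigned char system_memory[LS_BUF_SIZE]; /* stands in for system memory */
static unsigned char ls_buffer[LS_BUF_SIZE];     /* stands in for a local store DMA buffer */
static unsigned pending_tags;                    /* bit n set => tag n still in flight */

/* One queued transfer request (the mock holds at most one). */
struct mock_request {
    const unsigned char *src;
    unsigned char *dst;
    size_t size;
    unsigned tag;
    int valid;
};
static struct mock_request queued;

/* Step 1: initiate a "get" (system memory -> local store).
   The request is only recorded here; no data moves yet. */
static void mock_dma_get(unsigned char *ls_dst, const unsigned char *ea_src,
                         size_t size, unsigned tag)
{
    assert(tag < 32);
    queued.src = ea_src;
    queued.dst = ls_dst;
    queued.size = size;
    queued.tag = tag;
    queued.valid = 1;
    pending_tags |= 1u << tag;
}

/* Step 2: wait for the tag to complete. In the mock, the queued copy is
   performed here, modeling data that arrives some time after initiation. */
static void mock_dma_wait(unsigned tag)
{
    assert(tag < 32);
    if (queued.valid && queued.tag == tag) {
        memcpy(queued.dst, queued.src, queued.size);
        queued.valid = 0;
    }
    pending_tags &= ~(1u << tag);
}

/* Step 3: access the buffer only after the wait has returned. */
static unsigned fetch_and_sum(unsigned tag)
{
    unsigned sum = 0;
    mock_dma_get(ls_buffer, system_memory, LS_BUF_SIZE, tag);
    mock_dma_wait(tag);
    for (size_t i = 0; i < LS_BUF_SIZE; i++)
        sum += ls_buffer[i];
    return sum;
}
```

In this model, moving the summation loop ahead of mock_dma_wait would read the buffer before the transferred data is present, which is the ordering hazard of concern for local store DMA buffers.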
The act of waiting for a DMA transfer to complete requires a channel write to the MFC_WrTagUpdate channel followed by a channel read of the MFC_RdTagStatus channel. The SPU C/C++ Language Extension Specification specifies that channel intrinsics (instructions) are to be treated as “volatile.” The “volatile” keyword in C/C++ specifies that an object may be read or modified in ways that the compiler cannot observe. For instance, a memory location serving as a status register may be updated by a hardware device, such as an interface card. The volatile keyword tells the compiler not to optimize accesses to such an object, because those optimizations might interfere with its external modification. Treating channel instructions as “volatile” thus instructs the compiler never to reorder them with respect to each other, for example for optimization purposes.
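The status register case mentioned above can be illustrated with a minimal sketch in standard C. The names device_status, STATUS_READY, and poll_until_ready are hypothetical and chosen only for illustration; on real hardware the status word would be a device-mapped location rather than an ordinary variable. Because the object is accessed through a volatile-qualified pointer, the compiler must perform a fresh read on every iteration instead of hoisting a single read out of the loop.

```c
#include <assert.h>

/* Hypothetical status word standing in for a hardware-updated register. */
static volatile unsigned device_status;

enum { STATUS_READY = 0x1 };

/* Poll the status word until the ready bit is set or the poll budget runs
   out. Returns the number of reads performed, or -1 if never ready.
   The volatile qualifier forces one actual load per iteration. */
static int poll_until_ready(volatile const unsigned *status, int max_polls)
{
    assert(status != 0 && max_polls > 0);
    for (int polls = 1; polls <= max_polls; polls++) {
        if (*status & STATUS_READY)
            return polls;
    }
    return -1;
}
```

Without the qualifier, an optimizing compiler would be free to read the status word once and treat the loop condition as invariant.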
However, this constraint on channel instructions does not prevent SPU local store accesses to the transfer buffers from being reordered, for example by the optimizing scheduler, with respect to the channel commands that wait for DMA completion. As a result, the SPU may load data from a DMA buffer before the transfer has placed that data in the buffer, or may store new data over buffer contents that have not yet been written out to other storage.
The standard C language solution to this problem is to declare all DMA transfer buffers as “volatile.” Under the language rules, this ensures that loads and stores to DMA buffers declared “volatile” are not reordered with respect to the channel instructions. The problem with this solution is that it over-constrains compiler scheduling and optimization: it orders every load and store to these DMA buffers, even though some of those loads and stores could be optimized, for example by caching or reordering them, without compromising the integrity of the DMA buffers.
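The cost of this approach can be seen in a small sketch in standard C. The names dma_buffer, BUF_WORDS, and double_first_word are hypothetical and used only for illustration. Because the buffer is declared “volatile,” the two reads of the same element below must each be performed as a separate load, even though no transfer can intervene between them and a single cached value would be equally correct.

```c
#include <assert.h>

enum { BUF_WORDS = 4 };

/* Hypothetical DMA transfer buffer, declared volatile as the conventional
   C approach prescribes. */
static volatile unsigned dma_buffer[BUF_WORDS];

/* Reads element 0 twice. With a volatile buffer the compiler must emit two
   loads; with a non-volatile buffer it could load once and reuse the value. */
static unsigned double_first_word(void)
{
    unsigned first = dma_buffer[0];  /* load 1: cannot be elided */
    unsigned second = dma_buffer[0]; /* load 2: cannot reuse load 1 */
    return first + second;
}
```

The result is the same either way; only the freedom of the compiler to cache or reorder the accesses is lost, and it is lost for every access to the buffer rather than only for those adjacent to a transfer.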