1. Technical Field
The present invention relates generally to data processing systems and in particular to data operations within data processing systems. Still more particularly, the present invention relates to operations that move memory data during processing on a data processing system.
2. Description of the Related Art
Standard operation of data processing systems requires access to, and movement and/or manipulation of, data by the processing components. Application data are typically stored in memory and are read/retrieved, manipulated, and stored/written from one memory location to another. The processor may also perform a simple move (relocation) of data using a series of load and store commands issued by the processor when executing the application code.
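Such a processor-driven move can be sketched as a simple load/store loop. The function name and the word-sized granularity below are illustrative assumptions, not part of the original description:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: a data move expressed as the sequence of load
   and store operations the processor executes when running the
   application code. Each iteration performs one load from the source
   location and one store to the destination location. */
void move_words(uint64_t *dst, const uint64_t *src, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++) {
        uint64_t tmp = src[i];  /* load from source memory location */
        dst[i] = tmp;           /* store to destination memory location */
    }
}
```

Each load and store in this loop is issued with effective addresses, which the address-translation hardware must resolve to real addresses, as described below.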
With conventional data move operations, the processor triggers the operating system to transfer data from one memory location having a first physical (real) address to another location with a different physical (real) address. Completing the data move operation typically involves a number of steps, including: (1) the processor issues a particular sequence of load and store instructions, which result in: (a) a TLB performing an address translation to translate the effective addresses of the processor-issued operations into the corresponding real addresses associated with the real/physical memory; and (b) a memory or cache controller performing a cache line read or memory read of the data; (2) the TLB passes the real address of the processor store instruction to the memory controller (via a switch/interconnect when the controller is off-chip); (3) the memory controller acquires a lock on the destination memory location (identified with a real address) to prevent overwrite of the data during the data move operation; (4) the memory controller assigns the lock to the processor; (5) the processor receives the data from the source memory location (identified with a real address); (6) the processor sends the data to the memory controller; (7) the memory controller writes the data to the destination location; (8) the memory controller releases the lock on the destination memory location; and (9) a SYNC completes on the system fabric to inform the processor that the data move has finally completed and to ensure that the memory subsystem retains data coherency among the various processing units.
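The locking portion of the steps above (roughly steps (3) through (8)) can be sketched as follows. The structure and function names are hypothetical, and the lock is reduced to a single flag purely for illustration; a real memory controller enforces this in hardware:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a memory controller that serializes a data
   move by locking the destination location for the duration of the
   copy. Names and layout are illustrative only. */
struct mem_controller {
    bool dst_locked;   /* lock on the destination real address */
};

static bool acquire_lock(struct mem_controller *mc)
{
    if (mc->dst_locked)
        return false;       /* lock already held: the move must wait */
    mc->dst_locked = true;  /* steps (3)/(4): lock acquired/assigned */
    return true;
}

static void release_lock(struct mem_controller *mc)
{
    mc->dst_locked = false; /* step (8): lock released */
}

/* Steps (5)-(7): with the lock held, data flows from the source
   location to the destination location; here modeled as a memcpy. */
bool controller_move(struct mem_controller *mc,
                     void *dst_real, const void *src_real, size_t len)
{
    if (!acquire_lock(mc))
        return false;
    memcpy(dst_real, src_real, len);
    release_lock(mc);
    return true;            /* step (9): a SYNC would then complete */
}
```

Note that in this sketch the data passes through a single copy call, whereas in the conventional system it is routed from the source, through the processor chip, and back out to the destination, which is one of the latency sources discussed next.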
Inherent in the above process are several built-in latencies, which force the processor to wait until the end of most of the above processes before the processor may resume processing subsequently received instructions. Examples of these built-in latencies include: (a) the TLB (or ERAT) having to convert the effective address (EA) of the operation to the corresponding real address in order to determine the physical memory location to which that EA is pinned; (b) the memory controller retrieving the data from the source memory location, directing the sourced data to the processor chip, and then forwarding the data from the processor chip to the destination memory location; and (c) the lock acquisition process.
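The EA-to-RA translation named in latency (a) can be modeled as a simple table lookup. The page size, entry layout, and function name below are illustrative assumptions and do not reflect any particular TLB or ERAT organization:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical TLB model: translate an effective address (EA) to a
   real address (RA) by matching the EA's page number against cached
   page-table entries. A 4 KB page size is assumed for illustration. */
#define PAGE_SHIFT 12

struct tlb_entry {
    uint64_t ea_page;  /* effective page number */
    uint64_t ra_page;  /* real page number it maps to */
    bool     valid;
};

bool tlb_translate(const struct tlb_entry *tlb, size_t nentries,
                   uint64_t ea, uint64_t *ra_out)
{
    uint64_t page   = ea >> PAGE_SHIFT;
    uint64_t offset = ea & ((1u << PAGE_SHIFT) - 1);
    for (size_t i = 0; i < nentries; i++) {
        if (tlb[i].valid && tlb[i].ea_page == page) {
            /* TLB hit: splice the real page number onto the offset */
            *ra_out = (tlb[i].ra_page << PAGE_SHIFT) | offset;
            return true;
        }
    }
    return false;  /* TLB miss: a page-table walk would be needed */
}
```

On a miss, the much slower page-table walk must complete before the operation can proceed, which is why this translation step contributes to the latency described above.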
Generally, data operations are first completed at the user level and then at the operating system (OS) level. For example, the actual movement and modification of physical data within the distributed memory is provided at the operating system level with real addresses corresponding to the real address space (of the distributed memory) at which the data physically resides. However, similar operations are first provided at the application or user level (via application code executing on the processor node) with virtual addresses (or effective addresses) utilized by the processor within a representative virtual address space. At the OS level, the actual movement (copying) of physical data is performed by one or more mechanisms associated with the interconnect.
In distributed data processing systems, in which a single job may have multiple tasks spread among multiple different nodes, each node may support a separate memory with a separate mapping of a subset of effective addresses to the real address space for that task/node. With these distributed systems, a call to move data is passed to the OS, which initiates a series of time-intensive and processor-intensive processes to determine the physical location of the real addresses needed to complete the data move. OS-level processing in a distributed system having multiple processing nodes requires a large number of operations at the node interconnect (or switch), and thus the data move incurs substantial latencies.
Additionally, in most conventional systems, a large portion of the latency in performing data operations, such as memory moves, involves the actual movement of the data from the first real address location (the source location) to the second real address location (the destination location). During such movement, the data is pinned to a specific real address to prevent the occurrence of an exception. The processor has to wait on completion of the address translation by the TLB and acquisition of the lock before proceeding with completing the operation and subsequent operations. Developers are continually seeking ways to improve the speed (i.e., reduce the latency) of such memory access data operations.