1. Technical Field
The present invention relates generally to data processing systems and in particular to coherency operations within a data processing system. Still more particularly, the present invention relates to a method and system for maintaining coherency during a data cloning operation that involves “naked” write operations on the fabric of a data processing system.
2. Description of the Related Art
The need for faster and less hardware-intensive processing of data and data operations has been the driving force behind the improvements seen in the field of data processing systems. Recent trends have seen the development of faster, smaller, and more complex processors, as well as the implementation of a multiprocessor configuration, which enables multiple interconnected processors to concurrently execute portions of a given task. In addition to the implementation of the multiprocessor configuration, systems were developed with distributed memory systems for more efficient memory access. Also, a switch-based interconnect (or switch) was implemented to replace the traditional bus interconnect.
The distributed memory enabled data to be stored in a plurality of separate memory modules and enhanced memory access in the multiprocessor configuration. The switch-based interconnect enabled the various components of the processing system to connect directly to each other and thus provide faster/more direct communication and data transmission among components.
FIG. 1 is a block diagram illustration of a conventional multiprocessor system with distributed memory and a switch-based interconnect (switch). As shown, multiprocessor data processing system 100 comprises multiple processor chips 101A-101D, which are interconnected to each other and to other system components via switch 103. The other system components include distributed memory 105, 107 (with associated memory controllers 106, 108) and input/output (I/O) components 104. Additional components (not shown) may also be interconnected to the illustrated components via switch 103. Processor chips 101A-101D each comprise two processor cores (processors), labeled sequentially P1-PN. In addition to processors P1-PN, processor chips 101A-101D comprise additional components/logic that, together with processors P1-PN, control processing operations within data processing system 100. FIG. 1 illustrates one such component, hardware engine 111, the function of which is described below.
In a multiprocessor data processing system as illustrated in FIG. 1, one or more memories/memory modules are typically accessible to multiple processors (or processor operations), and memory is typically shared by the processing resources. Since each of the processing resources may act independently, contention for the shared memory resources may arise within the system. For example, a second processor may attempt to write to (or read from) a particular memory address while the memory address is being accessed by a first processor. If a later request for access occurs while a prior access is in progress, the later request must be delayed or prevented until the prior request is completed. Thus, in order to read or write data from/to a particular memory location (or address), it is necessary for the processor to obtain a lock on that particular memory address until the read/write operation is fully completed. This eliminates the errors that may occur when the system unknowingly processes incorrect (e.g., stale) data.
Additionally, with faster, more complex, multiprocessor systems, multiple data requests may be issued simultaneously and exist in varying stages of completion. Besides coherency concerns, the processors have to ensure that a particular data block is not changed out of sequence of operation. For example, if processor P1 requires data block at address A to be written and processor P2 has to read the same data block, and if the read occurs in program sequence prior to the write, it is important that the order of the two operations be maintained for correct results.
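The contention and ordering concerns above can be sketched as follows. This is a purely illustrative Python model (the names `memory`, `locks`, `read`, and `write` are hypothetical); the patent describes hardware locks managed by memory controllers, not software locks, but the behavior is analogous: a later request on a locked address is delayed until the prior access releases the lock, and a read that precedes a write in program order must observe the pre-write value.

```python
import threading

# Hypothetical model: one lock per memory address serializes access, so a
# second processor's request is delayed until the first one completes.
memory = {"A": 10}
locks = {"A": threading.Lock()}

def write(addr, value):
    with locks[addr]:          # later requests block here until release
        memory[addr] = value

def read(addr):
    with locks[addr]:
        return memory[addr]

# Program order: P2's read precedes P1's write, so for correct results the
# read must see the old value and the write must land afterwards.
old = read("A")      # P2 reads first (program sequence)
write("A", 42)       # P1 writes afterwards
```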
Standard operation of data processing systems requires access to and movement or manipulation of data by the processing (and other) components. The data are typically stored in memory and are accessed/read, retrieved, manipulated, stored/written and/or simply moved using commands issued by the particular processor executing the program code.
A data move operation does not involve changes/modification to the value/content of the data. Rather, a data move operation transfers data from one memory location having a first physical address to another location with a different physical address. In distributed memory systems, data may be moved from one memory module to another memory module, although movement within a single memory/memory module is also possible.
In order to complete either type of move in current systems, the following steps are completed: (1) the processor engine issues load and store instructions, which result in cache line (“CL”) reads being transmitted from the processor chip to the memory controller via the switch/interconnect; (2) the memory controller acquires a lock on the destination memory location; (3) the processor is assigned the lock on the destination memory location (by the memory controller); (4) data are sent to the processor chip (engine) from memory (source address) via the switch/interconnect; (5) data are sent from the processor engine to the memory controller of the destination location via the switch/interconnect; (6) data are written to the destination location; and (7) the lock on the destination is released for other processors. Inherent in this process is a built-in latency of transferring the data from the source memory location to the processor chip and then from the processor chip to the destination memory location, even when a switch is being utilized.
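The seven-step sequence above can be simulated in miniature. In this sketch, Python dictionaries stand in for the two distributed memory modules and a lock stands in for the memory controller's lock on the destination; all names (`memory_105`, `memory_107`, `processor_move`) are hypothetical labels chosen to mirror the figure, not anything defined by the source.

```python
import threading

memory_105 = {0xA: b"payload"}   # source memory module
memory_107 = {}                  # destination memory module
dest_lock = threading.Lock()     # memory controller's lock on destination

def processor_move(src_addr, dst_addr):
    dest_lock.acquire()                 # steps 2-3: lock acquired/assigned
    staged = memory_105[src_addr]       # step 4: data up to processor chip
    memory_107[dst_addr] = staged       # steps 5-6: data to destination
    dest_lock.release()                 # step 7: lock released for others

processor_move(0xA, 0xB)
```

Note that the data pass through the "processor" (the `staged` variable) on the way to the destination, which is the built-in latency the passage describes.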
Typically, each load and store operation moves an 8-byte data block. Completing this move requires rolling of caches, utilization of translation look-aside buffers (TLBs) to perform effective-to-real address translations, and further requires utilization of the processor and other hardware resources to receive and forward data. At least one processor system manufacturer has introduced hardware-accelerated load lines and store lines along with TLBs to enable a synchronous operation on a cache line at the byte level.
FIG. 1 is now utilized to illustrate the movement of data by processor P1 from one region/location (i.e., physical address) in memory to another. As illustrated in FIG. 1 and the directional arrows identifying paths 1 and 2, during the data move operation, data are moved from address location A in memory 105 by placing the data on a bus (or switch 103) along data path 1 to processor chip 101A. The data are then sent from processor chip 101A to the desired address location B within memory 107 along a data path 2, through switch 103.
To complete the data move operations described above, current (and prior) systems utilize hardware engines (i.e., a hardware model), software programming models (or interfaces), or both.
In the hardware engine implementation, virtual addresses are utilized, and the hardware engine 111 controls the data move operation and receives the data being moved. The hardware engine 111 (also referred to as a hardware accelerator) initiates a lock acquisition process, which acquires locks on the source and destination memory addresses before commencing movement of the data, to avoid multiple processors simultaneously accessing the data at those memory addresses. Instead of being sent up to the processor, the data are sent to the hardware engine 111. The hardware engine 111 makes use of cache line reads and enables the write to be completed in a pipelined manner. The net result is a much quicker move operation.
With software programming models, the software informs the processor hardware of location A and location B, and the processor hardware then completes the move. In this process, real addresses may be utilized (i.e., not virtual addresses). Accordingly, the additional time required for virtual-to-real address translation (or historical pattern matching) required by the above hardware model can be eliminated. Also, in this software model, the addresses may include offsets (e.g., address B may be offset by several bytes).
A typical pseudocode sequence executed by processor P1 to perform this data move operation is as follows:
LOCK DST       ; lock destination
LOCK SRC       ; lock source
LD A (Byte 0)  ; AB0 (4B or 8B quantities)
ST B (Byte 0)  ; BB0 (4B/8B)
INC            ; increment byte number
CMP            ; compare to see if done
BC             ; branch if not done
SYNC           ; perform synchronization
RL LOCK        ; release locks
The byte number (B0, B1, B2), etc., is incremented until all the data stored within the memory region identified by address A are moved to the memory region identified by address B. The lock and release operations are carried out by the memory controller and bus arbiters, which assign temporary access and control over the particular address to the requesting processor that is awarded the locks.
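The incrementing copy loop described by the pseudocode can be rendered roughly as follows. This is an illustrative sketch only: the byte-quantum copy, increment, compare, and branch are modeled directly, while the LOCK/SYNC/RELEASE steps (which belong to the memory controller and bus arbiters) are omitted. The region names are hypothetical.

```python
# Two memory regions: region A holds the source data; region B is the
# same-sized destination, initially zeroed.
region_a = bytearray(b"example data to move")
region_b = bytearray(len(region_a))

byte_num = 0                                   # byte number B0, B1, B2, ...
while True:
    region_b[byte_num] = region_a[byte_num]    # LD A (Byte n) / ST B (Byte n)
    byte_num += 1                              # INC
    if byte_num == len(region_a):              # CMP / BC: done when all
        break                                  # bytes have been moved
```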
Following a data move operation, processor P1 must receive a completion response (or signal) indicating that all the data have been physically moved to memory location B before the processor is able to resume processing other subsequent operations. This ensures that coherency exists among the processing units and the data coherency is maintained. The completion signal is a response to a SYNC operation, which is issued on the fabric by processor P1 after the data move operation to ensure that all processors receive notification of (and acknowledge) the data move operation.
Thus, in FIG. 1, instructions issued by processor P1 initiate the movement of the data from location A to location B. A SYNC is issued by processor P1, and when the last data block has been moved to location B, a signal indicating the physical move has completed is sent to processor P1. In response, processor P1 releases the lock on address B, and processor P1 is able to resume processing other instructions.
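The stall-until-completion behavior can be sketched with a completion event. In this hypothetical Python model, `move_complete` stands in for the completion response to the SYNC, and `p1` stands in for processor P1, which cannot resume subsequent work until the last data block has physically reached location B; none of these names come from the source.

```python
import threading

move_complete = threading.Event()   # completion response to the SYNC
results = []

def p1():
    move_complete.wait()            # P1 stalls until the move completes
    results.append("P1 resumed")    # only then may P1 resume processing

def memory_subsystem():
    # ... last data block is physically written to location B ...
    move_complete.set()             # signal completion back to P1

t = threading.Thread(target=p1)
t.start()
memory_subsystem()
t.join()
```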
Notably, since processor P1 has to acquire the lock on memory location B and then A before the move operation can begin, the completed signal also signals the release of the lock and enables the other processors attempting to access the memory locations A and B to acquire the lock for either address.
Although each of the hardware and software models provides different functional benefits, both possess several limitations. For example, both the hardware and software models have a built-in latency of loading data from memory (the source) up to the processor chip and then from the processor chip back to memory (the destination). Further, with both models, the processor must wait until the entire move is completed and a completion response from the memory controller is generated before the processor can resume processing subsequent instructions/operations.
The present invention therefore realizes that it would be desirable to provide a method and system for more efficient data move operations. A method and data processing system that enables coherent data clone operations with memory clone-specific responses to “naked” write operations would be a welcomed improvement. These and several other benefits are provided by the present invention.