1. Technical Field
The present invention relates in general to data processing systems and in particular to managing access to shared data in a data processing system. Still more particularly, the present invention relates to a system, method and computer program product for enhancing store conditional behavior to improve subsequent load efficiency.
2. Description of the Related Art
In shared memory multiprocessor (MP) data processing systems, each of the multiple processors in the system may access and modify data stored in the shared memory. In order to synchronize access to a particular granule (e.g., cache line) of memory between multiple processing units and threads of execution, load-reserve and store-conditional instruction pairs are often employed. For example, load-reserve and store-conditional instructions have been implemented in the PowerPC® instruction set architecture with operation codes (opcodes) associated with the LWARX and STWCX mnemonics, respectively (referred to hereafter as LARX and STCX). The goal of load-reserve and store-conditional instruction pairs is to load and modify data and then to commit the modified data to coherent memory only if no other thread of execution has modified the data in the interval between the load-reserve and store-conditional instructions. Thus, updates to shared memory can be synchronized without the use of an atomic update primitive that strictly enforces atomicity.
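By way of illustration only, the load/modify/conditionally-commit pattern described above may be sketched in software as follows. The sketch is a toy model and not the PowerPC® implementation; the names ToyMemory, load_reserve, store_conditional, plain_store and atomic_increment are hypothetical and are introduced solely for this example.

```python
class ToyMemory:
    """Hypothetical one-word memory that tracks reservations by thread id."""

    def __init__(self, value=0):
        self.value = value
        self.reservations = set()  # thread ids that currently hold a reservation

    def load_reserve(self, tid):
        # Load the value and record a reservation for the calling thread.
        self.reservations.add(tid)
        return self.value

    def store_conditional(self, tid, new_value):
        # Commit only if the caller's reservation has survived.
        if tid not in self.reservations:
            return False
        self.value = new_value
        # A committed store cancels every outstanding reservation.
        self.reservations.clear()
        return True

    def plain_store(self, new_value):
        # A store by any other thread cancels all outstanding reservations.
        self.value = new_value
        self.reservations.clear()


def atomic_increment(mem, tid, interfere=None):
    """Retry until the conditional store succeeds; return the attempt count."""
    attempts = 0
    while True:
        attempts += 1
        old = mem.load_reserve(tid)
        if interfere is not None and attempts == 1:
            interfere(mem)  # simulate another thread writing in the interval
        if mem.store_conditional(tid, old + 1):
            return attempts


mem = ToyMemory(10)
print(atomic_increment(mem, tid=0))  # succeeds on the first attempt
print(atomic_increment(mem, tid=0, interfere=lambda m: m.plain_store(100)))
print(mem.value)
```

In the second call the simulated interfering store cancels the reservation, so the first conditional store fails and the sequence is retried against the newly written value.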
With reference now to FIG. 1, a block diagram of a conventional MP computer system supporting the use of load-reserve and store-conditional instructions to synchronize updates to shared memory is illustrated. As shown, computer system 100 includes multiple processing units 102a-102b for processing data and instructions. Processing units 102 are coupled for communication to a system bus 104 for conveying address, data and control information between attached devices. In the depicted embodiment, the attached devices include a memory controller 106 providing an interface to a system memory 108 and one or more host bridges 110, each providing an interface to a respective mezzanine bus 112. Each mezzanine bus 112 in turn provides slots for the attachment of additional devices, which may include network interface cards, I/O adapters, non-volatile storage device adapters, additional bus bridges, etc.
As further illustrated in FIG. 1, each processing unit 102 includes a processor core 120 containing an instruction sequencing unit 122 for fetching and ordering instructions for execution by one or more execution units 124. The instructions and associated data operands and data results are stored in a multi-level memory hierarchy having at its lowest level system memory 108, and at its upper levels L1 cache 126 and L2 cache 130. The data within the memory hierarchy may generally be accessed and modified by multiple processing units 102a, 102b. 
L1 cache 126 is a store-through cache, meaning that the point of cache coherency with respect to other processing units 102 is below the L1 cache (e.g., at L2 cache 130). L1 cache 126 therefore does not maintain true cache coherency states (e.g., Modified, Exclusive, Shared, Invalid) for its cache lines, but only maintains valid/invalid bits. Store operations first complete relative to the associated processor core 120 in the L1 cache and then complete relative to other processing units 102 in L2 cache 130.
As depicted, in addition to the L2 cache array 140, L2 cache 130 includes read-claim (RC) logic 142 for managing memory access requests by the associated processor core 120, snoop logic 144 for managing memory access requests by other processing units 102, and reservation logic 146 for recording reservations of the associated processor core 120. Reservation logic 146 includes at least one reservation register comprising a reservation address field 148 and a reservation flag 150.
FIG. 2A depicts the manner in which a load-reserve (e.g., LARX) instruction is processed in data processing system 100 of FIG. 1. As shown, the process begins at block 200, which represents the execution of a LARX instruction by execution units 124 of processing unit 102a in order to determine the target address from which data is to be loaded. Following execution of the LARX instruction, the process passes to block 202, which illustrates processor core 120 issuing a LARX operation corresponding to the LARX instruction to RC logic 142 within L2 cache 130. As depicted at block 204, RC logic 142 stores the address of the reservation granule (e.g., cache line) containing the target address in reservation address field 148 and sets reservation flag 150. Reservation logic 146 then begins monitoring for an indication by snoop logic 144 that another processing unit 102 has updated the cache line containing the target address. The process then passes to block 206, which depicts L1 cache 126 invalidating the cache line containing the target address. The cache line is invalidated in L1 cache 126 to prevent the LARX instruction from binding to a potentially stale value in L1 cache 126. The value is potentially stale because another processing unit 102 may have gained ownership of the target cache line in order to modify it.
Following block 206, the process passes to block 208. As illustrated at block 208, RC logic 142 obtains the load data from L2 cache array 140, system memory 108 or another processing unit 102 and then returns the requested load data to processor core 120. In response to receipt of the load data, processor core 120 stores the load data in an internal register, but not in L1 cache 126.
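By way of illustration only, the load-reserve flow of FIG. 2A may be modeled as the following minimal sketch. The class name Core, the dictionary standing in for L2 cache array 140 and system memory 108, and the 128-byte reservation granule are assumptions made for this example and are not taken from the description above.

```python
LINE_SIZE = 128  # assumed reservation granule size in bytes (hypothetical)


def line_address(addr):
    """Return the base address of the cache line containing addr."""
    return addr - (addr % LINE_SIZE)


class Core:
    """Hypothetical model of one core with its L1 valid bits and reservation."""

    def __init__(self, backing):
        self.backing = backing            # stands in for the L2 array and memory
        self.l1_valid = set()             # L1 keeps only valid bits, per line
        self.reservation_address = None   # models reservation address field 148
        self.reservation_flag = False     # models reservation flag 150

    def larx(self, addr):
        line = line_address(addr)
        # Block 204: record the reservation granule and set the flag.
        self.reservation_address = line
        self.reservation_flag = True
        # Block 206: invalidate the line in L1 so no stale value is bound.
        self.l1_valid.discard(line)
        # Block 208: the data goes to a register; L1 is not refilled.
        return self.backing.get(addr, 0)


core = Core({0x1008: 7})
core.l1_valid.add(0x1000)       # the line happens to be valid in L1 beforehand
value = core.larx(0x1008)
print(value, hex(core.reservation_address), core.reservation_flag)
print(0x1000 in core.l1_valid)  # the line is no longer valid in L1
```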
Processor core 120 thereafter attempts to perform an atomic update to the load data through the execution of a store-conditional (e.g., STCX) instruction in accordance with the process depicted in FIG. 2B. As shown, the process begins at block 220, which represents execution units 124 executing the store-conditional instruction to determine the target address of the store-conditional operation. Next, as depicted at block 222, the cache line containing the target address is invalidated in L1 cache 126, if valid. Although the cache line was invalidated earlier at block 206, the invalidation is still performed at block 222 because an intervening load operation to another address in the cache line may have caused the cache line to be loaded back into L1 cache 126.
Following block 222, processor core 120 issues a store-conditional (e.g., STCX) operation corresponding to the store-conditional instruction to RC logic 142 within L2 cache 130, as shown at block 224. RC logic 142 obtains owner permission for the target cache line and then determines at block 226 whether or not reservation flag 150 is still set (i.e., whether or not any other processing unit 102 has modified the reservation granule). If reservation flag 150 is still set, indicating that no other processing unit 102 has modified the reservation granule, RC logic 142 updates L2 cache array 140 with the store data and resets reservation flag 150, as shown at block 228. Reservation logic 146 then sends a pass indication to processor core 120, as shown at block 230. Thereafter, the process ends at block 234.
Returning to block 226, in response to a determination that reservation flag 150 is reset, indicating that another processing unit 102 has modified the reservation granule in the interval between execution of the load-reserve and store-conditional instructions, the store-conditional operation fails in L2 cache 130, and reservation logic 146 transmits a fail indication to processor core 120, as depicted at block 232. Thereafter, processing of the store-conditional operation terminates at block 234.
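By way of illustration only, the pass and fail paths of FIG. 2B may be modeled as the following minimal sketch. The class name L2Model, the dictionary used for the cache array and the 128-byte line size are assumptions made for this example. Consistent with block 226 as described, the sketch tests only the reservation flag; it does not model the acquisition of owner permission.

```python
LINE_SIZE = 128  # assumed cache line size in bytes (hypothetical)


class L2Model:
    """Hypothetical model of the state touched by a store-conditional."""

    def __init__(self):
        self.array = {}                  # stands in for L2 cache array 140
        self.l1_valid = set()            # valid bits of the associated L1
        self.reservation_address = None  # models reservation address field 148
        self.reservation_flag = False    # models reservation flag 150

    def stcx(self, addr, value):
        line = addr - (addr % LINE_SIZE)
        # Block 222: invalidate the target line in L1, if valid.
        self.l1_valid.discard(line)
        # Block 226: is the reservation flag still set?
        if self.reservation_flag:
            # Block 228: update the L2 array and reset the flag.
            self.array[addr] = value
            self.reservation_flag = False
            return True   # block 230: pass indication
        return False      # block 232: fail indication


l2 = L2Model()
l2.reservation_address = 0x1000
l2.reservation_flag = True       # state left behind by an earlier LARX
l2.l1_valid.add(0x1000)          # an intervening load refilled the L1 line
print(l2.stcx(0x1008, 1))        # passes: the reservation is intact
print(l2.stcx(0x1008, 2))        # fails: the flag was reset by the first STCX
print(l2.array[0x1008])
```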
FIG. 2C illustrates the conventional operation of snoop logic 144 in support of shared memory updates utilizing load-reserve and store-conditional instructions. As depicted, the process begins at block 240 and thereafter proceeds to block 242, which illustrates the process iterating until snoop logic 144 snoops an operation on system bus 104. When snoop logic 144 snoops an operation on system bus 104, snoop logic 144 allocates a snooper to handle the operation at block 244. The snooper determines at block 246 whether or not the snooped operation is a storage-modifying operation. If not, the process passes to block 252 for other processing and thereafter terminates at block 254. If, however, the snooper determines that the snooped operation is a storage-modifying operation, the snooper makes a further determination at block 248 whether the address of the modifying operation matches the contents of reservation address field 148. If so, the snooper resets reservation flag 150 to cause any subsequent store-conditional operation to the address specified in reservation address field 148 to fail, as shown at block 250. Following block 250 or following a determination at block 248 that the address of the snooped modifying operation does not match the contents of reservation address field 148, the snooper performs other processing at block 252 (e.g., updating the directory of L2 cache array 140). The process thereafter terminates at block 254.
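By way of illustration only, the reservation handling performed by a snooper at blocks 246 through 250 may be sketched as the following function. The function name snoop and the representation of a reservation as an (address, flag) pair are assumptions made for this example; the other processing of block 252 is not modeled.

```python
def snoop(reservation, op_is_storage_modifying, op_line_address):
    """Return the reservation state after one snooped bus operation.

    reservation is a hypothetical (reservation_address, reservation_flag) pair.
    """
    address, flag = reservation
    # Block 246: only storage-modifying operations can cancel a reservation.
    # Block 248: the operation must target the reserved granule.
    if op_is_storage_modifying and op_line_address == address:
        flag = False  # block 250: reset the reservation flag
    return (address, flag)


held = (0x1000, True)
print(snoop(held, False, 0x1000))  # a read of the reserved line: flag stays set
print(snoop(held, True, 0x2000))   # a store to another line: flag stays set
print(snoop(held, True, 0x1000))   # a store to the reserved line: flag is reset
```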
LARX and STCX operations are often used to implement multiprocessor locking mechanisms. A lock is acquired using a LARX/STCX pair and is generally considered acquired if the STCX succeeds. A lock is often stored in the same cache line as the data it protects, because this placement reduces the memory latency of accessing the data after the lock is acquired. However, as described above, processing of the STCX invalidates that cache line in L1 cache 126, so the first load of the protected data following lock acquisition must be serviced by L2 cache 130. What is needed is a method to reduce L2 access in cases in which a lock is stored within the same cache line as the data it protects.
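By way of illustration only, the cost addressed here may be seen in the following toy model, in which a lock and the data it protects share one cache line. The class name Hierarchy, the function try_acquire, the access counter and the 128-byte line size are assumptions made for this example, and the counting of L2 accesses is a simplification of the conventional behavior described above.

```python
LINE_SIZE = 128  # assumed cache line size in bytes (hypothetical)


class Hierarchy:
    """Hypothetical model of a store-through L1 above an L2 with one reservation."""

    def __init__(self):
        self.l2 = {}            # addr -> value, stands in for the L2 array
        self.l1_valid = set()   # L1 valid bits, per line
        self.reserved = False   # models the reservation flag
        self.l2_accesses = 0    # counts requests that must go to L2

    def _line(self, addr):
        return addr - (addr % LINE_SIZE)

    def larx(self, addr):
        self.l1_valid.discard(self._line(addr))  # L1 line invalidated
        self.reserved = True
        self.l2_accesses += 1
        return self.l2.get(addr, 0)

    def stcx(self, addr, value):
        self.l1_valid.discard(self._line(addr))  # L1 line invalidated again
        self.l2_accesses += 1
        if not self.reserved:
            return False
        self.l2[addr] = value
        self.reserved = False
        return True

    def load(self, addr):
        line = self._line(addr)
        if line not in self.l1_valid:
            self.l2_accesses += 1    # L1 miss: the line is fetched from L2
            self.l1_valid.add(line)
        return self.l2.get(addr, 0)


def try_acquire(h, lock_addr):
    """One acquisition attempt: the lock is free when it holds zero."""
    return h.larx(lock_addr) == 0 and h.stcx(lock_addr, 1)


h = Hierarchy()
h.l2[0x1008] = 42                 # protected data in the same line as the lock
print(try_acquire(h, 0x1000))     # lock word at 0x1000
print(h.l2_accesses)              # one access for LARX, one for STCX
print(h.load(0x1008), h.l2_accesses)  # first data load misses L1
print(h.load(0x1010), h.l2_accesses)  # later loads of the line hit L1
```

In this model the first load of protected data after a successful acquisition always incurs an additional L2 access, because the STCX leaves the shared line invalid in the L1.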