1. Technical Field
The present invention relates in general to data processing and more particularly to handling the processing of barriers in a data processing system.
2. Description of the Related Art
A conventional multiprocessor (MP) computer system, such as a server computer system, includes multiple processing units all coupled to a system interconnect, which typically comprises one or more address, data and control buses. Coupled to the system interconnect is a system memory, which represents the lowest level of volatile memory in the multiprocessor computer system and generally is accessible for read and write access by all processing units. In order to reduce access latency to instructions and data residing in the system memory, each processing unit is typically further supported by a respective multi-level cache hierarchy, with each lower level generally having a successively longer access latency. Thus, a level one (L1) cache generally has a lower access latency than a level two (L2) cache, which in turn has a lower access latency than a level three (L3) cache.
Because multiple processor cores may request write access to a same cache line of data and because modified cache lines are not immediately synchronized with system memory, the cache hierarchies of multiprocessor computer systems typically implement a cache coherency protocol to ensure at least a minimum level of coherence among the various processor cores' “views” of the contents of system memory. In particular, cache coherency requires, at a minimum, that after a processing unit accesses a copy of a memory block and subsequently accesses an updated copy of the memory block, the processing unit cannot again access the old copy of the memory block.
A cache coherency protocol typically defines a set of cache states stored in association with the cache lines stored at each level of the cache hierarchy, as well as a set of coherency messages utilized to communicate the cache state information between cache hierarchies. In a typical implementation, the cache state information takes the form of the well-known MESI (Modified, Exclusive, Shared, Invalid) protocol or a variant thereof, and the coherency messages indicate a protocol-defined coherency state transition in the cache hierarchy of the requestor and/or the recipients of a memory access request. The MESI protocol allows a cache line of data to be tagged with one of four states: “M” (Modified), “E” (Exclusive), “S” (Shared), or “I” (Invalid). The Modified state indicates that a memory block is valid only in the cache holding the Modified memory block and that the memory block is not consistent with system memory. When a coherency granule is indicated as Exclusive, only that cache, of all caches at that level of the memory hierarchy, holds the memory block; unlike a Modified memory block, however, the data of the Exclusive memory block is consistent with that of the corresponding location in system memory. If a memory block is marked as Shared in a cache directory, the memory block is resident in the associated cache and in at least one other cache at the same level of the memory hierarchy, and all of the copies of the coherency granule are consistent with system memory. Finally, the Invalid state indicates that the data and address tag associated with a coherency granule are both invalid.
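The four states described above can be summarized in a small table. The following Python sketch is illustrative only: the names `MESI`, `PROPERTIES` and `is_dirty` are hypothetical and do not correspond to any particular hardware implementation.

```python
from enum import Enum


class MESI(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


# Properties of each state as described in the text above.
# None means "not applicable" because the cached copy is invalid.
PROPERTIES = {
    MESI.MODIFIED:  {"valid": True,  "consistent_with_memory": False, "other_copies": False},
    MESI.EXCLUSIVE: {"valid": True,  "consistent_with_memory": True,  "other_copies": False},
    MESI.SHARED:    {"valid": True,  "consistent_with_memory": True,  "other_copies": True},
    MESI.INVALID:   {"valid": False, "consistent_with_memory": None,  "other_copies": None},
}


def is_dirty(state):
    """Hypothetical helper: True if the cached copy is valid but differs from system memory."""
    props = PROPERTIES[state]
    return props["valid"] and props["consistent_with_memory"] is False
```

Under this summary, only the Modified state is "dirty": it is the only state in which a valid cached copy is not consistent with system memory.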
The state to which each memory block (e.g., cache line or sector) is set is dependent upon both a previous state of the data within the cache line and the type of memory access request received from a requesting device (e.g., the processor). Accordingly, maintaining memory coherency in the system requires that the processors communicate messages via the system interconnect indicating their intention to read or write memory locations. For example, when a processor desires to write data to a memory location, the processor may first inform all other processing elements of its intention to write data to the memory location and receive permission from all other processing elements to carry out the write operation. The permission messages received by the requesting processor indicate that all other cached copies of the contents of the memory location have been invalidated, thereby guaranteeing that the other processors will not access their stale local data.
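The write-permission exchange described above can be sketched as a toy simulation. This is a simplified model under stated assumptions: the class `Cache` and the functions `snoop_write_intent` and `write` are hypothetical names, every snooper always grants permission, and interconnect messages are modeled as ordinary function calls.

```python
class Cache:
    """Toy cache: maps a memory address to a one-letter MESI state."""

    def __init__(self, name):
        self.name = name
        self.lines = {}  # address -> "M", "E", "S" or "I"

    def snoop_write_intent(self, addr):
        """Invalidate any local copy of addr, then grant permission."""
        if self.lines.get(addr, "I") != "I":
            self.lines[addr] = "I"
        return True  # assumption: permission is always granted in this model


def write(requester, others, addr):
    """Announce intent to write; proceed only once all other caches grant permission."""
    grants = [cache.snoop_write_intent(addr) for cache in others]
    if all(grants):
        # All other cached copies are now invalid, so the requester may modify the block.
        requester.lines[addr] = "M"
        return True
    return False
```

For example, if caches B and C hold address `0x80` in the Shared state and cache A calls `write(a, [b, c], 0x80)`, B and C end in the Invalid state and A ends in the Modified state.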
In the MP system, the memory subsystem and associated access logic implement a selected memory model, that is, a set of rules regarding the ordering that must be observed between memory modifying operations (e.g., store operations) executed within the same processing unit and within different processing units. For example, some architectures enforce so-called “strong” ordering between stores, meaning that the store operations of each processor core must be performed by the memory subsystem according to the program order of the associated store instructions executed by the processor core. Other architectures permit so-called “weak” ordering between stores, meaning that the store operations of each processor core are permitted to be performed out-of-order with respect to the program order of the associated store instructions executed by the processor core.
In a computer architecture permitting weak ordering between stores, barrier instructions are placed into the program code to constrain the order in which updates to memory are performed. For example, if the program code includes four stores, a barrier instruction can be placed between the first two stores and the last two stores to separate the four stores into two sets of memory updates. With this barrier instruction in the code, the first two stores can be performed (and the effects observed) in any relative order, and the last two stores can be performed (and the effects observed) in any relative order. However, the barrier instruction ensures that any processing unit in the MP system will have observed all the memory updates from the two stores preceding the barrier instruction by the time it observes any of the stores that follow the barrier instruction.
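The four-store example above can be made concrete by enumerating the orders in which the stores may be observed. The sketch below is a simplified model that assumes a single global observation order; the store labels `A` through `D` and the function `respects_barrier` are hypothetical names chosen for illustration.

```python
from itertools import permutations

# Four stores in program order; the barrier sits between B and C.
STORES = ["A", "B", "C", "D"]
BEFORE = {"A", "B"}
AFTER = {"C", "D"}


def respects_barrier(order):
    """True if every pre-barrier store is observed before every post-barrier store."""
    last_before = max(order.index(s) for s in BEFORE)
    first_after = min(order.index(s) for s in AFTER)
    return last_before < first_after


# Weak ordering with no barrier: any of the 4! = 24 orders is permitted.
allowed_without_barrier = list(permutations(STORES))

# With the barrier: only 2! * 2! = 4 orders remain.
allowed_with_barrier = [p for p in allowed_without_barrier if respects_barrier(p)]
```

In this model the barrier leaves the order within each pair of stores free, so `("B", "A", "D", "C")` is permitted, while an order such as `("A", "C", "B", "D")`, in which a post-barrier store is observed before a pre-barrier store, is excluded.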
In the prior art, the observation rule for weakly ordered stores was enforced by having the initiating processing unit that processed the barrier instruction transmit a SYNC operation on the system bus to all other processing units of the MP system. Any other observing processing unit having a pending kill for the initiating processor (i.e., the observed effect of the initiating processor's stores) retries the SYNC operation until the observing processing unit completes the cache line invalidations indicated by the pending kills. Only after all such invalidations are performed is the SYNC operation permitted to complete without retry. Thus, barrier instructions (which occur on average approximately every 200-300 instructions) lead to significant consumption of interconnect and memory subsystem bandwidth, particularly as system scale grows and retries of SYNC operations become more frequent.
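The bandwidth cost of this retry behavior can be illustrated with a toy model. The sketch below rests on simplifying assumptions that are not taken from the description above: each observer completes one pending invalidation between successive SYNC broadcasts, and the names `Observer`, `snoop_sync`, `step` and `issue_sync` are hypothetical.

```python
from collections import deque


class Observer:
    """Toy observing processing unit with a queue of pending kills (invalidations)."""

    def __init__(self, pending_kills):
        self.pending_kills = deque(pending_kills)

    def snoop_sync(self):
        """Respond 'retry' while any kill is still pending, otherwise 'ack'."""
        return "retry" if self.pending_kills else "ack"

    def step(self):
        """Complete one pending invalidation, if any (one per broadcast is an assumption)."""
        if self.pending_kills:
            self.pending_kills.popleft()


def issue_sync(observers):
    """Re-broadcast SYNC until no observer retries; return the number of broadcasts."""
    broadcasts = 0
    while True:
        broadcasts += 1
        responses = [o.snoop_sync() for o in observers]
        if all(r == "ack" for r in responses):
            return broadcasts
        # At least one observer retried: let every observer make progress, then try again.
        for o in observers:
            o.step()
```

In this model a SYNC with no pending kills anywhere completes in one broadcast, while observers holding two, zero and one pending kills force three broadcasts. The count is set by the slowest observer, which is consistent with the observation that retries become more frequent as system scale grows.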