1. Technical Field
The present invention relates in general to synchronization of processing in multiprocessor systems and in particular to presentation of synchronization bus operations on a multiprocessor system bus. Still more particularly, the present invention relates to selective synchronization by filtering out unnecessary synchronization bus operations prior to presentation on the system bus based on historical instruction execution information.
2. Description of the Related Art
Programmers writing software for execution on multiprocessor data processing systems often need or desire to provide points within the flow of instruction execution serving as processing boundaries, ensuring that all instructions within a first code segment are fully executed before any instructions within a subsequent code segment are executed. This is particularly true when the multiprocessor system includes superscalar processors supporting out-of-order instruction execution and weak memory consistency. The instruction sets supported by most popular commercial processors include an instruction for setting such a processing boundary. In the PowerPC.TM. family of processors, for example, the instruction which may be employed by a programmer to establish a processing boundary is the "sync" instruction. The sync instruction orders the effects of instruction execution. All instructions initiated prior to the sync instruction appear to have completed before the sync instruction completes, and no subsequent instructions appear to be initiated until the sync instruction completes. Thus, the sync instruction creates a boundary having two significant effects: first, instructions which follow the sync instruction within the instruction stream will not be executed until all instructions which precede the sync instruction in the instruction stream have completed; second, instructions following a sync instruction within the instruction stream will not be reordered for out-of-order execution with instructions preceding the sync instruction.
In the PowerPC.TM. family of devices, an architected logic queue is employed to hold "architected" instructions which have been issued by a corresponding processor but which have not been executed. As used herein, architected instructions are those instructions which might affect the storage hierarchy as perceived by other devices (other processors, caches, and the like) within the system. These include essentially any instruction which affects the storage hierarchy, except loads and stores to cacheable memory space. Examples for the PowerPC.TM. family of devices include: tlbi (translation lookaside buffer invalidate); tlbsync (translation lookaside buffer synchronize); dcbf (data cache block flush); dcbst (data cache block store); icbi (instruction cache block invalidate); and loads and stores to noncacheable memory space (e.g., memory-mapped devices).
The synchronization instruction affects or is affected by both cacheable operations (normal loads and stores) and architected operations. A processor with cacheable operations pending will not issue a synchronization instruction until those operations are complete, which the processor may ascertain from the return of the appropriate data. The processor essentially stalls the synchronization instruction until all pending cacheable operations complete.
Architected operations received from a local processor may be queued in the architected logic queue until the resources necessary to perform them become available. When a sync instruction is received while the architected logic queue is not empty, the sync instruction is retried until the queue is drained. Once the local architected logic queue is drained, the sync instruction is presented on the system bus for the benefit of other devices which may not have completed their operations. Thus, in current architectures, sync instructions are always presented on the system bus. The sync operation must be made visible on the system bus because the initiator device receiving the sync instruction from a local processor has no historical information regarding its own past operations from which to determine whether it initiated an architected operation, and no information regarding the status of architected operations within devices snooping such operations from the system bus. A snooping device may, upon receipt of an architected operation, return an indication that the operation is complete when the operation was actually merely queued ("posted"). Moreover, architected operations generally do not return data, but are "address-only" operations. The initiator device thus lacks any basis for filtering out unnecessary sync operations: even if the initiator device's own architected queue is drained when the sync instruction is received from a local processor, the initiator device has no means for determining whether another device in the memory hierarchy has a snooped architected operation pending. Therefore, despite the fact that the architected queue remains relatively empty most of the time (since architected operations occur relatively infrequently), many sync operations appear on the system bus. Under current architectures, as many as one in every 100 system bus cycles may be consumed by a sync-type operation.
The need to filter unnecessary sync operations is significant because sync instructions do not scale with technology. As technology progresses and device sizes shrink, many aspects of data processing system performance scale accordingly. For example, the number of execution units within a processor may increase to allow more instructions to be executed in parallel. Larger caches may be implemented, resulting in more cache hits and fewer misses. Sync operations, on the other hand, do not scale; instead, the penalty associated with sync operations worsens as technology progresses. Even if sync instructions remain a fixed percentage of all runtime instructions, because more instructions are being executed in parallel, the sync instructions consume a larger portion of available processor cycles and bandwidth. Furthermore, as memory hierarchies--all levels of which are affected by a sync instruction--become deeper, the performance penalty associated with a single sync instruction increases.
It would be desirable, therefore, to provide a mechanism for filtering unnecessary synchronization operations from presentation on a multiprocessor system bus. It would further be advantageous if the mechanism permitted selective synchronization based on types of instructions and/or operations historically executed by the device receiving the synchronization instruction.