1. Field of the Invention
The present invention generally relates to computer systems and, more particularly, to a method of synchronizing instructions executed by different processors in a multi-processor computer system.
2. Description of the Related Art
Modern computing systems are often constructed from a number of processing elements and a main memory, connected by a generalized interconnect. The basic structure of a conventional multi-processor computer system 10 is shown in FIG. 1. Computer system 10 has several processing units 12a, 12b, and 12c which are connected to various peripheral devices, including input/output (I/O) devices 14 (such as a display monitor, keyboard, and permanent storage device), memory device 16 (such as dynamic random-access memory or DRAM) that is used by the processing units to carry out program instructions, and firmware 18 whose primary purpose is to seek out and load an operating system from one of the peripherals (usually the permanent storage device) whenever the computer is first turned on.
Processing units 12a-12c communicate with the peripheral devices by various means, including a bus 20. Computer system 10 may have many additional components which are not shown, such as serial and parallel ports for connection to, e.g., modems or printers. Those skilled in the art will further appreciate that there are other components that might be used in conjunction with those shown in the block diagram of FIG. 1; for example, a display adapter might be used to control a video-display monitor, a memory controller can be used to access memory 16, etc. The computer can also have more than three processing units. In a symmetric multi-processor (SMP) computer, all of the processing units 12a-12c are generally identical, that is, they all use a common set or subset of instructions and protocols to operate, and generally have the same architecture.
The processing units can themselves consist of: a processor (having a plurality of registers and execution units, which carry out program instructions in order to operate the computer); other elements (such as caches) that form a "memory hierarchy" for the processing unit; or even multiple processor nests in a single processing unit. Many configurations are possible, and this list is not meant to be a complete taxonomy. In these systems, the processors communicate with main memory and with one another by passing operations over the interconnect. To coordinate their operations, most processing units employ some form of "retry" protocol. In this protocol, a processing unit that wishes to initiate an operation places the operation on the interconnect. All of the other units attached to the interconnect monitor ("snoop") the operation and determine whether it can be allowed to proceed. If one or more of the snoop participants cannot allow the operation to proceed, the initiating processor is signaled through some mechanism that the operation needs to be retried. The initiating processor then typically places the operation on the interconnect again at a later time. To support snooping and to facilitate the determination of a "retry" or "no retry" response for each snooped operation, most processing units employ some form of "snoop queue." This queue is a list of entries, maintained by each processing unit, that holds the state of the snoop operations currently in progress on that processing unit.
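By way of illustration only, the retry protocol and snoop queue described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not a description of any particular bus protocol: the class `ProcessingUnit`, the function `issue`, the fixed `queue_depth`, and the rule that a snooper answers "retry" only when its snoop queue is full are all hypothetical choices made for the example.

```python
from collections import deque


class ProcessingUnit:
    """Hypothetical snooper with a bounded snoop queue (illustrative only)."""

    def __init__(self, name, queue_depth=2):
        self.name = name
        self.queue_depth = queue_depth  # assumed capacity, not from the source
        self.snoop_queue = deque()      # snoop operations still in progress

    def can_accept(self):
        # Assumed rule: answer "no retry" only if a queue entry is free.
        return len(self.snoop_queue) < self.queue_depth

    def accept(self, op):
        self.snoop_queue.append(op)

    def drain_one(self):
        # Model the unit finishing its oldest in-progress snoop operation.
        if self.snoop_queue:
            self.snoop_queue.popleft()


def issue(op, initiator, units, max_attempts=10):
    """Place op on the interconnect until no snooper answers "retry".

    Returns the number of attempts that were needed.
    """
    snoopers = [u for u in units if u is not initiator]
    for attempt in range(1, max_attempts + 1):
        if all(u.can_accept() for u in snoopers):
            # Every snooper answered "no retry": the operation proceeds.
            for u in snoopers:
                u.accept(op)
            return attempt
        # At least one snooper answered "retry". Time passes, the snoopers
        # make progress, and the initiator places the operation out again.
        for u in snoopers:
            u.drain_one()
    raise RuntimeError("operation was still being retried after max_attempts")
```

In this model, an operation issued while one snooper's queue is full is retried once and then accepted on the second attempt, after that snooper has retired one of its in-progress entries.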
In certain cases, a processing unit may withhold a "retry" response to an operation, thereby signaling that earlier operations have been completed when, in fact, the effects of the earlier operations have not been entirely propagated throughout the processing unit. This approach allows for potentially higher performance by pipelining operations onto the interconnect. It is often the case, however, that an initiating processing unit needs some mechanism to ensure that all of the effects of operations previously initiated have, in fact, been completely propagated through all the other processing units in the system. To allow this, most architectures support the concept of a synchronization instruction, such as the sync instruction used in IBM PowerPC™ processors. This instruction is executed on a given processor and ensures that all the effects of the operations initiated by that processor have been completely propagated within all other processing elements in the system. The exact semantics of this instruction vary in detail, but not in general concept, between architectures. The sync instruction is, for example, often used following a flush instruction. The flush instruction is used to ensure that no block corresponding to the address of the flush instruction is present in any cache in the overall system. To achieve this, the flush instruction causes all caches to abandon any unmodified copies of the cache block or, if the block is modified, to write the block back to main memory.
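The effect of the flush instruction described above can be illustrated with a minimal Python sketch. The representation is an assumption made for the example only: each cache is modeled as a dictionary mapping an address to a `(state, value)` pair, the state names `"modified"` and `"unmodified"` are hypothetical labels, and the function `flush` is not taken from any real instruction set or library.

```python
def flush(address, caches, memory):
    """Remove the block at `address` from every cache (illustrative model).

    Unmodified copies are simply abandoned. A modified copy is written
    back to main memory before it is dropped.
    """
    for cache in caches:
        line = cache.pop(address, None)  # the block leaves this cache
        if line is None:
            continue                     # this cache never held the block
        state, value = line
        if state == "modified":
            memory[address] = value      # write the block back to memory
```

After `flush` returns in this model, no cache holds the block, and main memory contains the most recent value. In a real system the write-back may still be in flight when the flush is acknowledged, which is why a sync typically follows.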
Unlike the operations whose completion it is used to ensure, the sync instruction must be retried until all of the previous operations have, in fact, been completely finished on all processing units. To accomplish this, a processing unit, upon snooping a sync, must retry it if any operations are outstanding from the processor that issued the sync. However, in many current interconnect protocols (IBM's 60X bus protocol, for example), it is not possible to determine from which processing unit an operation originated. In such a system, a processing unit must retry every sync instruction if any operations are outstanding in its snoop queue, regardless of the origin of those operations, because it is impossible to determine whether the outstanding operations in the snoop queue are from the processing unit issuing the sync. Furthermore, in the prior art, it is also impossible to determine which processing element is issuing the sync instruction in the first place. This limitation is inherently inefficient and can lead to starvation and livelocks. It would, therefore, be desirable and advantageous to devise a method of synchronizing processor operation in a multi-processor computer system that eliminates unnecessary delays or degradation of performance due to a processing unit waiting for completion of operations which are irrelevant to the synchronization.
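The difference between the two snoop responses to a sync discussed above can be made concrete with a short Python sketch. It is illustrative only and rests on assumptions: the snoop queue is modeled as a list of `(origin, operation)` pairs, and both function names are hypothetical. The origin tag in particular is exactly the information that the prior-art protocols described above do not carry, so the first function shows the response that would be sufficient in principle, not one that such a protocol can compute.

```python
def sync_response_with_origin(snoop_queue, sync_issuer):
    """Response that would suffice if origins were known (hypothetical).

    Retry the sync only while operations from the issuing unit are
    still outstanding in this unit's snoop queue.
    """
    if any(origin == sync_issuer for origin, _op in snoop_queue):
        return "retry"
    return "no retry"


def sync_response_without_origin(snoop_queue):
    """Response forced on a prior-art snooper that cannot see origins.

    Any outstanding operation at all forces a retry, because the entry
    might belong to the unit that issued the sync.
    """
    return "retry" if snoop_queue else "no retry"
```

For a snoop queue holding only an operation from unit 12b, a sync from unit 12a would receive "no retry" under the first function but "retry" under the second. That extra retry is the unnecessary delay described above.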