1. Technical Field
The present invention relates in general to data processing and in particular to communication between processors in a data processing system. Still more particularly, the present invention relates to a method, processing unit, and system for providing data updates during processor communication and coordination within a multi-processor data processing system.
2. Description of the Related Art
It is well known in the computer arts that greater computer system performance can be achieved by harnessing the processing power of multiple individual processors in tandem. Multi-processor (MP) computer systems can be designed with a number of different architectures, of which various ones may be better suited for particular applications depending upon the intended design point, the system's performance requirements, and the software environment of each application. Known MP architectures include, for example, the symmetric multi-processor (SMP) and non-uniform memory access (NUMA) architectures.
In modern data processing systems, processes on different processors must often communicate and synchronize their activities. This communication is usually accomplished by means of reads and writes to shared memory variables that are controlled by a number of fencing and synchronization instructions. The various processes read (load) and write (store) to these locations to communicate their status on parallel tasks and coordinate their overall operations.
In shared-memory multi-processor data processing systems, each of the multiple processors in the system may access and modify data stored in the shared memory. In order to synchronize access to these shared data, programming models often require a processor to acquire a so-called “lock” variable associated with the particular shared memory region prior to modifying the shared memory region and to then “release” the lock variable following the modification of the shared data.
A lock variable is itself typically a shared memory location. Processors obtain a lock by first examining the lock variable with a load operation to determine whether the lock is already taken. If the lock is not taken, the processor can then attempt to obtain the lock by executing a store operation to update the lock to a “taken” value and, if that update succeeds, can then access the shared data. If, however, the lock is already taken by another processor, the current processor often continues to read the lock variable until it reads a value indicating that the lock is available and that the current processor can attempt to obtain the lock again. Once a processor obtains the lock and has completed accessing the shared data, the processor releases the lock by executing a store to the lock variable, setting the lock variable to an “unlocked” state. Such a store will cause, through an appropriate coherency protocol, any other processors to see the new value of the lock variable.
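The load, store, and release sequence described above can be sketched as a small software model. This is an illustrative sketch only, not the mechanism of any particular processor: the names `LockVariable`, `acquire`, `release`, and `run_contention` are hypothetical, threads stand in for processors, and an internal `threading.Lock` stands in for whatever hardware guarantee makes the update to the “taken” value atomic.

```python
import threading
import time


class LockVariable:
    """Hypothetical model of a lock variable held in a shared memory location."""

    UNLOCKED = 0
    TAKEN = 1

    def __init__(self):
        self.value = self.UNLOCKED
        # Assumption: stands in for the hardware atomicity of the update.
        self._atomic = threading.Lock()

    def load(self):
        """Examine the lock variable with a load operation."""
        return self.value

    def try_store_taken(self):
        """Attempt the store to TAKEN; True only if this caller won the lock."""
        with self._atomic:
            if self.value == self.UNLOCKED:
                self.value = self.TAKEN
                return True
            return False

    def store_unlocked(self):
        """Release: store the UNLOCKED value to the lock variable."""
        self.value = self.UNLOCKED


def acquire(lock):
    while True:
        # Keep reading the lock variable until it appears available.
        while lock.load() == LockVariable.TAKEN:
            time.sleep(0)  # yield so the current holder can make progress
        # The lock looked free; the update may still fail if another
        # contender stored TAKEN first, in which case we spin again.
        if lock.try_store_taken():
            return


def release(lock):
    lock.store_unlocked()


def run_contention(num_workers=4, increments=200):
    """Have several workers update shared data under the lock."""
    lock = LockVariable()
    shared = {"counter": 0}

    def worker():
        for _ in range(increments):
            acquire(lock)
            shared["counter"] += 1  # access to the shared data
            release(lock)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["counter"]
```

Every spin on `load` in this model corresponds to a read of the lock variable, which is where the coherency and data traffic discussed in the background originates once many processors contend for the same lock.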
Such techniques are well known to those skilled in the art. However, in circumstances of high contention in modern cache-based multiprocessors, these shared memory lock techniques can lead to a great deal of coherency and data traffic on system interconnection buses to manage the lock variables.
For example, when a lock is released, the coherency protocol must invalidate the current copy, if present, of the cache line holding the lock variable at each processor. Furthermore, each of these processors will then typically attempt to read the lock variable again in an attempt to obtain the lock. This will result in a separate cache intervention or sourcing of the cache line from memory to each participant. As can be seen, when using shared memory for lock variables, a large number of data interventions and a large amount of cache coherency traffic are required to transfer the data for the lock variables, usually one-by-one, to all the participants in the process. The transactions related to lock variables consume large amounts of interconnect bandwidth and incur high communication latency relative to the small percentage and size of information transmitted through the lock variables.
Current processor communication and sharing of data are made possible through the utilization of buses or wire interconnects running between the various components. These buses communicate operations between the various components (e.g., processors and caches) in the system. Typically, a set of internal buses is utilized to communicate operations and data between internal caches and the processor within a processor chip. For example, most processor chips are designed with integrated L1 and L2 caches. These internal processor buses are utilized to communicate operations and data to and from the processor and the internal L1 and L2 caches.
In addition to these internal processor buses, multiprocessor systems typically also include a system level interconnect consisting of one or more buses that are utilized to interconnect the various processor chips to form the multiprocessor data processing system. A number of differing topologies may be employed to interconnect these various processor chips. The system interconnect allows for the communication of data and operations between the various processor chips in the multiprocessor data processing system. The system interconnect is typically divided into two distinct paths: one for the transfer of address and control information (i.e., the “address bus”) and another for the transfer of data (i.e., the “data bus”). These two distinct paths are often collectively referred to as a single “bus”.
Operations are typically placed onto the system interconnect in phases. The first of these phases is typically referred to as an “address tenure” and begins with a processing element placing address and control information for the operation on the address bus portion of the system interconnect. Once placed on the address bus, the operation's address tenure is then typically broadcast to all the other processor chips within the multiprocessing system. At each processor chip, the address tenure is “snooped” and the current processor chip determines whether it can honor and/or allow the operation to proceed. Based on this determination, the snooping processor forms a partial snoop response indication that is placed onto the system interconnect for the address operation. This partial response indicates, among other things, whether the operation can be allowed to proceed or not. If the operation cannot be allowed to proceed, a “retry” indication is given, stating that the snooping processor chip cannot allow the snooped operation to proceed and that the operation should be retried at a later time.
These individual snoop responses are collected up from the various snooping processor chips and combined by the system interconnect to form an overall “combined response” that indicates, at an overall system level, whether the operation can proceed or not. For example, a given processing element may be able to allow an operation to proceed and would not respond with a retry indication. However, another processing element may not be able to allow the operation to proceed and would produce a “retry” snoop response. In this circumstance, the overall combined response will indicate that at least one participant cannot permit the operation to proceed and therefore it must be retried by the initiating processor chip again at a later time. This combined response is communicated back to the initiating processing element concluding the address tenure.
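The combining rule described above can be illustrated with a short sketch. It is a simplification under stated assumptions: a partial response is reduced to a two-valued `SnoopResponse` (the text notes that real partial responses carry other information as well), and the names `SnoopResponse`, `combine_responses`, and `address_tenure` are hypothetical.

```python
from enum import Enum


class SnoopResponse(Enum):
    """Hypothetical two-valued partial snoop response."""
    PROCEED = "proceed"
    RETRY = "retry"


def combine_responses(partial_responses):
    """Form the combined response: a single RETRY forces a retry."""
    if any(r is SnoopResponse.RETRY for r in partial_responses):
        return SnoopResponse.RETRY
    return SnoopResponse.PROCEED


def address_tenure(operation, snoopers):
    """Broadcast an operation to every snooper and combine their responses.

    Each snooper is modeled as a callable that takes the operation and
    returns its partial SnoopResponse (an assumption of this sketch).
    """
    partial = [snoop(operation) for snoop in snoopers]
    return combine_responses(partial)
```

For example, if three snoopers answer `PROCEED` and one answers `RETRY`, `address_tenure` returns `SnoopResponse.RETRY`, and the initiating processor chip would have to reissue the operation at a later time.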
It should be noted that the address bus portion of the system interconnect is typically designed in such a way as to broadcast address tenures to all participants in a multiprocessor as efficiently as possible. That is to say, the address bus topology is designed to make the address and control information associated with a given address tenure visible to all the processing elements within a multiprocessor as quickly and efficiently as possible.
One class of operations within a multiprocessor system consists of those operations that comprise solely an address tenure. These operations typically do not involve the explicit transfer of data over the data bus portion of the system interconnect and are typically referred to as “address-only” operations. An address-only operation is completed with the successful completion of the address tenure for that operation as indicated by a good combined response.
Another class of operations within a multiprocessor system consists of those operations that involve the explicit transfer of data. Such operations are called address/data operations. Each consists of an address tenure specifying the type of operation and the address of the operation and, once the address tenure has completed successfully, a data tenure on the data bus portion of the system interconnect to transfer the data. The data tenure consists of arbitrating to obtain ownership of the data bus portion of the system interconnect and then transferring the requested data, typically in a series of data “beats”, over the data bus portion of the system interconnect.
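The difference between the two classes of operations can be made concrete with a small model that lists the phases each one goes through. This is a sketch under assumptions not stated in the text: the `Operation` type, the `phases` function, the 16-byte data bus width, and the rule that the number of beats equals the payload size divided by the bus width (rounded up) are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Operation:
    """Hypothetical interconnect operation."""
    name: str
    address: int
    payload_bytes: int = 0  # 0 models an address-only operation


def phases(op, data_bus_width_bytes=16):
    """List the phases of an operation whose address tenure succeeds.

    The bus width and the beat calculation are illustrative assumptions.
    """
    steps = ["address tenure"]
    if op.payload_bytes > 0:
        # Address/data operation: a data tenure follows the address tenure.
        beats = math.ceil(op.payload_bytes / data_bus_width_bytes)
        steps.append("data bus arbitration")
        steps.extend(f"data beat {i}" for i in range(1, beats + 1))
    return steps
```

Under these assumptions, an address-only operation yields just `["address tenure"]`, while a read of a 128-byte cache line yields one address tenure, one arbitration step, and eight data beats.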
Unlike the address bus portion of the system interconnect, the data bus portion of the system interconnect is typically not designed to broadcast the transferred data to all participants within the multiprocessor system. Rather, a point-to-point mechanism is employed to more directly route data from the source providing the data (typically a memory controller or an intervening cache controller) to the destination (typically the processing element requesting the data, although on writes the requesting processor typically sources the data to a memory controller). Due to the multi-beat nature of data transfers and the bandwidth required, it is impractical and unnecessary to make any given data transfer visible to all processor chips within the system. Hence, data transfers are typically visible to only the source and destination participants and at most some subset of the other participants that may be topologically “in-between” these participants on the data bus portion of the system interconnect.
In general, once the address tenure portion of an address/data tenure operation is completed, the address bus portion of the system interconnect can continue working on subsequent address tenure operations independent of the completion of the current data tenure on the data bus portion of the system interconnect. A system interconnect that is split into distinct address and data buses and that transmits operations in distinct and potentially independently overlapped phases is known as a “split transaction” bus.
As can be seen from the foregoing, a potentially large number of address tenure and address/data tenure operations may be involved in the management of shared memory lock variables. When many processors are contending to obtain a lock, many address/data tenure transfers will be required to transfer the value of the lock cache line to the participating processor chips, which will then likely cache the lock variable value. When a lock is released due to a store, these cached copies will be invalidated, causing the participating processors to attempt to re-acquire the lock. This action will typically require a separate new address/data tenure operation to again transfer the cache line containing the lock variable to each participant as said participants attempt to acquire the lock. This leads to long interconnect latencies and significant use of interconnect bandwidth and resources.
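A rough count shows how quickly this traffic grows with the number of contenders. The sketch below is a deliberately simplified model, not a measurement: it assumes every waiting processor holds a cached copy of the lock line, that each release invalidates all of those copies, and that each waiter then issues exactly one address/data operation to re-read the line. It ignores the further invalidations caused when the winner stores the “taken” value. The function names are hypothetical.

```python
def release_traffic(num_waiters):
    """Count operations triggered by one lock release in this model."""
    return {
        "invalidated_copies": num_waiters,       # cached copies invalidated
        "address_data_operations": num_waiters,  # one re-read per waiter
    }


def total_rereads(num_processors):
    """Total re-reads when each processor takes the lock exactly once.

    After each release one waiter wins the lock, so the number of
    waiters shrinks by one per handoff.
    """
    total = 0
    for waiters in range(num_processors - 1, 0, -1):
        total += release_traffic(waiters)["address_data_operations"]
    return total
```

In this model the number of re-reads is P(P-1)/2 for P contending processors, so it grows quadratically, even though each transfer conveys only the small amount of information held in the lock variable.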
Therefore, the present invention recognizes that a need exists for an efficient mechanism to complete data transfer between processors without the latencies involved with current processor-to-processor data communication protocols. A system in which each processor maintains a separate communication register for direct inter-processor communication, and in which these communication registers are updated concurrently in a broadcast rather than point-to-point fashion, would be a welcome improvement. These and other benefits are provided by the invention described herein.