1. Technical Field
The present invention relates in general to data processing and in particular to communication between processors in a data processing system. Still more particularly, the present invention relates to a method, processing unit and system for processor communication and coordination within a multi-processor data processing system.
2. Description of the Related Art
It is well known in the computer arts that greater computer system performance can be achieved by harnessing the processing power of multiple individual processors in tandem. Multi-processor (MP) computer systems can be designed with a number of different architectures, some of which may be better suited than others to particular applications, depending upon the intended design point, the system's performance requirements, and the software environment of each application. Known MP architectures include, for example, the symmetric multi-processor (SMP) and non-uniform memory access (NUMA) architectures.
In shared-memory, multi-processor data processing systems, each of the multiple processors in the system may access and modify data stored in the shared memory. In order to synchronize access to a particular granule (e.g., cache line) of memory between multiple processors, programming models often require a processor to acquire a lock associated with the granule prior to modifying the granule and release the lock following the modification.
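By way of illustration only, the acquire, modify, release sequence described above can be sketched in software as follows. This is a minimal sketch using Python's standard `threading.Lock`; the names `Granule`, `locked_increment`, and `run_workers` are hypothetical and are not drawn from any particular system, and an actual multi-processor system would typically implement the lock with hardware synchronization primitives rather than a library object.

```python
import threading


class Granule:
    """Hypothetical stand-in for a shared memory granule (e.g., a cache line)."""

    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()  # the lock associated with this granule


def locked_increment(granule, count):
    """Modify the granule `count` times, holding its lock for each modification."""
    for _ in range(count):
        granule.lock.acquire()      # acquire the lock prior to modifying
        try:
            granule.value += 1      # the modification itself
        finally:
            granule.lock.release()  # release the lock following the modification


def run_workers(num_workers, count):
    """Run several threads (standing in for processors) against one granule."""
    granule = Granule()
    threads = [
        threading.Thread(target=locked_increment, args=(granule, count))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return granule.value
```

Because every modification is made while the lock is held, no update is lost, and the final value equals the number of workers multiplied by the number of modifications each performs.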
In an SMP architecture, the multiple processors communicate with each other over an interconnect bus utilizing “loads” and “stores” to and from cacheable memory elements within the shared memory. When synchronizing the multi-processor system to perform pipelined or parallel processing, communication information is constantly transferred between the processors to allow each processor to coordinate with the other processors executing the process. The processors communicate specific processor information, such as the state of a processor or the status of a process, via loads and stores within the cache subsystem. When a processor reaches a state where its status information needs to be updated and communicated to the other processors, that processor takes exclusive control over the information by acquiring a lock over the data in order to change it. This causes the other processors holding this information to invalidate their copies and then load the status information again from memory after the first processor has stored its update. This processor communication mechanism is inefficient for three reasons: it requires the processors to constantly contend for control over the information; it requires flushing that information from the other processors, only for it to be reloaded after the change has occurred; and it slows pipelined or parallel processes whenever one processor stores to the information and the other processors stall awaiting the update.
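The invalidate-and-reload behavior described above can be made concrete with a toy model. The following sketch is an assumption-laden simplification, not a description of any actual cache coherency protocol: the class `SharedStatusLine` and the function `simulate` are hypothetical names, a single status line is modeled, and processor 0 is arbitrarily chosen as the only writer.

```python
class SharedStatusLine:
    """Toy model of one cacheable status line shared by several processors."""

    def __init__(self):
        self.memory_value = 0
        self.valid_copies = {}   # processor id -> locally cached value
        self.invalidations = 0   # cached copies discarded because of a store
        self.memory_fetches = 0  # loads that had to go back to memory

    def load(self, cpu):
        """Return the status; fetch from memory if this cpu has no valid copy."""
        if cpu not in self.valid_copies:
            self.valid_copies[cpu] = self.memory_value
            self.memory_fetches += 1
        return self.valid_copies[cpu]

    def store(self, cpu, value):
        """The writer takes exclusive control: all other copies are invalidated."""
        for other in list(self.valid_copies):
            if other != cpu:
                del self.valid_copies[other]
                self.invalidations += 1
        self.memory_value = value
        self.valid_copies[cpu] = value


def simulate(num_processors, num_updates):
    """Processor 0 updates the status; every other processor re-reads it."""
    line = SharedStatusLine()
    for cpu in range(num_processors):
        line.load(cpu)                 # every processor initially caches the status
    for update in range(1, num_updates + 1):
        line.store(0, update)          # invalidates the other processors' copies
        for cpu in range(1, num_processors):
            line.load(cpu)             # each reader must fetch the line again
    return line.invalidations, line.memory_fetches
```

In this model, every status update costs one invalidation and one subsequent memory fetch for each processor other than the writer, so the traffic grows with both the number of processors and the frequency of updates.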
The present invention recognizes that these inefficiencies consume large amounts of interconnect bandwidth and incur extremely high communication latency relative to the small percentage and small size of inter-processor communications and other transactions that are communicated between processors coupled by the interconnects. For example, even for the relatively simple case of an 8-way SMP system in which the four processors present in each of two nodes are coupled by an upper level bus and the two nodes are themselves coupled by a lower level bus, communication of a data request between processors in different nodes will incur bus acquisition and other transaction-related latency at each of three buses. Even inter-processor communications between processors in the same node must consume upper-level bus bandwidth and incur bus latency. Because such latencies are only compounded by increasing the depth of the interconnect hierarchy, the present invention recognizes that it would be desirable and advantageous to provide an improved data processing system architecture having reduced latency for communications between physically remote processors and having reduced bus bandwidth consumption, thereby freeing bus bandwidth for general data transfer between the processors and the hierarchical memory system.
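The bus counts in the 8-way example above can be expressed as a short calculation. The sketch below is illustrative only and rests on stated assumptions: processors are numbered 0 through 7 with four consecutive processors per node, and each bus is assumed to contribute the same hypothetical latency `per_bus_latency`. The function names are likewise hypothetical.

```python
PROCESSORS_PER_NODE = 4  # from the example: four processors in each of two nodes


def node_of(cpu):
    """Node index of a processor, assuming processors are numbered contiguously."""
    return cpu // PROCESSORS_PER_NODE


def buses_traversed(src, dst):
    """Number of buses a request crosses between two processors."""
    if src == dst:
        return 0  # no inter-processor communication is needed
    if node_of(src) == node_of(dst):
        return 1  # same node: only that node's upper level bus
    # different nodes: source upper level bus, the lower level bus,
    # and the destination upper level bus
    return 3


def request_latency(src, dst, per_bus_latency):
    """Total bus latency, assuming a uniform (hypothetical) cost per bus."""
    return buses_traversed(src, dst) * per_bus_latency
```

Under these assumptions, a request between processors in different nodes incurs three times the bus latency of a request between processors in the same node, and each additional level in the interconnect hierarchy would add further buses to the path.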