1. Technical Field
The present invention relates in general to data processing and, in particular, to allocating and accessing resources within a data processing system. In at least one embodiment, the present invention relates still more particularly to a method and system for efficiently allocating and accessing promotion facilities, such as locks, in a data processing system.
2. Description of the Related Art
In shared memory multiprocessor (MP) data processing systems, each of the multiple processors in the system may access and modify data stored in the shared memory. In order to synchronize access to a particular granule (e.g., cache line) of memory between multiple processors, programming models often require a processor to acquire a lock associated with the granule prior to modifying the granule and release the lock following the modification.
In a multiprocessor computer system, multiple processors may be independently attempting to acquire the same lock. In the event that a processor contending for a lock successfully acquires the lock, the cache line containing the lock is transmitted via the system bus from system memory or the cache hierarchy of another processor and loaded into the processor""s cache hierarchy. Thus, the acquisition and release of locks in conventional data processing systems can be characterized as the movement of exclusively held cache lines between the data caches of various processors.
Lock acquisition and release is commonly facilitated utilizing special memory access instructions referred to as load-reserve and store-conditional instructions. In shared memory MP data processing systems that support load-reserve and store-conditional instructions, each processor within the system is equipped with a reservation register. When a processor executes a load-reserve to a memory granule, the processor loads some or all of the contents of the memory granule into one of the processor""s internal registers and the address of the memory granule into the processor""s reservation register. The requesting processor is then said to have a reservation with respect to the memory granule. The processor may then perform an atomic update to the reserved memory granule utilizing a store-conditional instruction.
When a processor executes a store-conditional to a memory granule for which the processor holds a reservation, the processor stores the contents of a designated register to the memory granule and then clears the reservation. If the processor does not have a reservation for the memory granule, the store-conditional instruction fails and the memory update is not performed. In general, the processor""s reservation is cleared if a remote processor requests exclusive access to the memory granule for purposes of modifying it (the request is made visible to all processors on a shared bus) or the reserving processor executes a store-conditional instruction. If only one reservation is permitted per processor, a processor""s current reservation will also be cleared if the processor executes a load-reserve to another memory granule.
A typical instruction sequence for lock acquisition and release utilizing load-reserve (lwarx) and store-conditional (stwcx) instructions is as fallows:
As indicated, the typical instruction sequence includes at least two separate branch xe2x80x9cloopsxe2x80x9dxe2x80x94one (identified by xe2x80x9cBxe2x80x9d) that is conditioned upon the processor obtaining a valid reservation for the lock through successful execution of the load-reserve instruction, and another (identified by xe2x80x9cCxe2x80x9d) conditioned upon the processor successfully updating the lock to a xe2x80x9clockedxe2x80x9dstate through execution of the store-conditional instruction while the processor has a valid reservation. The lock acquisition sequence may optionally include a third branch loop (identified by xe2x80x9cAxe2x80x9d) in which the processor determines whether the lock is available prior to seeking a reservation for the lock.
This conventional lock acquisition sequence incurs high overhead not only because of its length but also because of the conditional nature of reservations. That is, a first processor may lose a reservation for a lock before successfully acquiring the lock (through execution of a store-conditional instruction) if a second processor stores to (or acquires ownership of) the lock first. Consequently, if a lock is highly contended, a processor may make a reservation for a lock and lose the reservation many times prior to successfully acquiring the lock through execution of a store-conditional instruction.
At least one processor manufacturer has tried to address this problem by implementing a xe2x80x9cbrute forcexe2x80x9dsolution in which a processor executing a load-reserve instruction is granted exclusive access to the interconnect. That is, while the reservation is held by the processor, only the processor executing the load-reserve instruction is permitted to master operations on the interconnect, and all other processors are xe2x80x9clocked out,xe2x80x9d not just from accessing a particular data granule, but from initiating any operation on the interconnect. Consequently, the processors locked out of the interconnect may stall for lack of data while the reservation is held. Obviously, this solution does not scale well, particularly for systems running code in which locks are highly contended.
The present invention recognizes that the conventional lock acquisition and release methodologies described above, although effective at synchronizing access by multiple processors to shared data, have a number of attendant shortcomings. First, conventional lock acquisition and release sequences that employ load-reserve and store-conditional instructions require the inclusion of special purpose reservation registers and reservation management circuitry within each processor, undesirably increasing processor size and complexity.
Second, as noted above, the typical lock acquisition and release sequence is inherently inefficient because of the conditional nature of reservations. If a lock is highly contended, multiple processors may gain and lose reservations for a lock many times before any processor is permitted to obtain the lock, update the lock to a xe2x80x9clocked state,xe2x80x9d and do work on the data protected by the lock. As a result, overall system performance degrades.
Third, the lock acquisition and release methodologies outlined above do not scale well. For example, in the conventional lock acquisition instruction sequence, the overhead incurred in acquiring a lock increases with the scale of the data processing system. Thus, although it is more desirable in large-scale data processing systems having numerous processors to employ fine grain locks (i.e., a large number of locks that each protect a relatively small data granule) to enhance parallelism, the increasingly high lock acquisition overhead can force the adoption of coarser grain locks as system scale increases in order to reduce the percentage of processing time consumed by lock acquisition overhead. Such design compromises, though viewed as necessary, significantly diminish the amount of useful work that can be effectively distributed over multiple processors.
Fourth, because lock variables are conventionally treated as cacheable operand data, each load-type and store-type operation within the lock acquisition sequence triggers data cache directory snoops, coherency message traffic on the system bus, and other conventional operations dictated by the cache coherency protocol implemented by the data processing system. The present invention recognizes that these data-centric cache coherency operations, which consume limited system resources such as data cache snoop queues, bus bandwidth, etc., are not necessary because the data value of the lock itself is not required for or useful in performing the work on the data granule protected by the lock.
In view of the foregoing and other shortcomings of conventional techniques for acquiring and releasing locks in a data processing system, and more generally, of techniques for inter-component coordination and accessing memory-mapped resources, the present invention introduces, interalia, new methods and apparatus for allocating and accessing memory-mapped resources such as a global promotion facility that is not limited to, but can be advantageously employed as, as a lock facility.
In accordance with the present invention, a multiprocessor data processing system includes a plurality of processors coupled to an interconnect and to a memory including an promotion facility containing at least one promotion bit field. A first processor among the plurality of processors executes a load-type instruction to acquire a promotion bit field within the global promotion facility exclusive of at least a second processor among the plurality of processors. In response to execution of the load-type instruction, a register of the first processor receives a register bit field indicating whether or not the promotion bit field was acquired by execution of the load-type instruction. While the first processor holds the promotion bit field exclusive of the second processor, the second processor is permitted to initiate a request on the interconnect.
All objects, features, and advantages of the present invention will become apparent in the following detailed written description.