This invention is related to computers and computer systems, and in particular to mechanisms for resolving address contention and prioritization of access to resources within a shared memory system.
Multiprocessor systems can take many forms and individual designs may contain many unique features. Common among multiprocessor systems is the requirement to resolve shared address conflicts. Shared address conflicts occur when one or more processors attempt to update shared data. Since resolving this type of conflict necessitates a serialized access, system designers avoid scenarios where this type of activity occurs. For example, a processing unit may be assigned a private address space by the operating system, enabling that processor to work uninhibited by conflicts. Even in this environment, an idle processor will commonly obtain new work from a queue stored in a shared address space. As the speed and number of processors increase, this coordination of workload becomes more critical. Some workloads, however, require interaction among many processors, so efficient resolution of conflicts is needed even with relatively slow processors. For example, large databases are maintained for many businesses, and these databases can be updated by several applications running simultaneously. Conflict resolution often becomes the limitation of the system. It is desired to have a multiprocessor system that minimizes these shared conflicts, but also minimizes the performance impact when they do occur.
Technological advances have created faster processors while also giving denser, but relatively slower memory. Cache hierarchies, layers of faster but smaller capacity memories, have been added to offset some of this impact and reduce access delay. Since the cache is a subset of the total memory a processor can access, a directory is required to keep track of which blocks of main memory are held in the cache. Since all updates to memory must be visible to all processors in a shared memory multiprocessor system, changes to data in the caches must be made available to all processors and devices in this system. A common method in the art has been to add tags to the directories of the caches that indicate the ownership state of each block in the cache (directory-based cache coherence). This ownership state indicates a processor's write authority over the block. If a processor wishes to update a block of data in a cache, it must first obtain exclusive rights via some interprocessor communication. Once it has exclusive authority the processor may change the directory ownership state of the contested block and proceed with its updates. What is important is that interprocessor communication is required to pass ownership of shared blocks between processors. This interprocessor communication can add significant delay to the overall delay associated with accessing data. Access to the interprocessor communication is usually serialized in order to ensure one processor can update the contested block. This usually means that processors must request priority in some manner to use the required resources. A good priority design, one that ensures fair access, is critical to ensure proper work distribution and avoid starvation of requesters. As the number of memory requesters increases, it becomes more difficult to maintain equal access to memory resources, which can impede the scalability of the multiprocessor system.
A priority system that can reduce the negative impact of the processor interconnect and associated traffic is desired. Priority designs have been used that enter requests in a centralized queue or similar ordering mechanism to ensure requests are presented in the same order they are received. The queue maintains the order while the memory system completes each request presented by this queuing system. This solution guarantees the order, but requires the order to be set before any resource is evaluated for availability or conflict. An example would be the availability of cache interleaves. With this solution, no request could bypass a request that was stalled due to its target cache interleave being unavailable. This means that the additional latency for that one request is now added latency to all requests queued behind it. Also, the requests in the queue may not have an address conflict with the stalled request and therefore do not benefit from the forced serialization. Additional patches to avoid this queuing effect could be employed at the input of the stack. For example, creating multiple stacks based on an address range would require checking the address before entry into the stack. The effectiveness of this solution would be limited by how much hardware, in the form of physical arrays or other memory device, could be available for this purpose. Also, all improvements of this kind would negatively impact the nominal latency by adding additional checking and categorization of requests before priority. Some other priority schemes use sampling in an attempt to reduce some of the complex interactions that can cause request starvation. The sample, or snapshot, tags the requests outstanding at a given time and ensures that all of these requests are satisfied before a new sample is taken. Since a satisfied request in the current snapshot cannot create a visible request until the snapshot is emptied, some starvation scenarios may be avoided. 
However, snapshot designs depend on the requests not having dependencies between them which, in some implementations, may not be true and can lead to a deadlock condition: a request in the snapshot waiting for a request not in the snapshot. This class of solution does not attempt to improve access among contentious requests. It only limits the scope of the problem to a scale presumed to be manageable, and is therefore likely to add to the nominal latency without a guarantee of success.
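The queuing effect described above for a centralized queue can be illustrated with a small model. The following Python sketch is a simplification under stated assumptions, not a description of any actual hardware: it assumes strict first-in, first-out service, one request serviced per cycle, and a hypothetical table `busy_until` giving the first cycle at which each cache interleave is free. It does not model an interleave becoming busy again after it is used. The function name `service_order_fifo` is likewise hypothetical.

```python
def service_order_fifo(requests, busy_until):
    """Return the cycle at which each request is serviced under strict FIFO.

    requests   -- list of (name, interleave) pairs in arrival order
    busy_until -- dict mapping interleave -> first cycle it is available
                  (hypothetical values; missing entries mean "free now")
    """
    cycle = 0
    served = {}
    for name, interleave in requests:
        # The head of the queue stalls until its interleave is free;
        # nothing queued behind it may bypass it.
        cycle = max(cycle, busy_until.get(interleave, 0))
        served[name] = cycle
        cycle += 1  # assumption: one request serviced per cycle
    return served


# R1 and R2 target interleave 1, which is free, yet both wait behind R0.
order = service_order_fifo([("R0", 0), ("R1", 1), ("R2", 1)], {0: 5})
```

In this model the stall of R0 on interleave 0 is inherited by R1 and R2, even though neither has an address or interleave conflict with R0.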
A Least Recently Used (LRU) priority algorithm may be used to ensure that all processors have fair access. In order to limit the latency of the priority request, a partial-LRU is used. This partial-LRU uses fewer bits and allows for quicker calculation of priority. In this system, requests are arbitrated and presented to a pipelined structure. The request moves through this pipeline and initiates a cache-access and associated directory lookup, checks resource availability and checks if any other request has the same address locked. If there is no owner, the current requester assumes ownership by setting a lock. This lock remains active until the request has been satisfied. Once a lock is set, all subsequent requests to the same address block their memory access and set a resource-need for the owning requester to complete. This resource-need prevents further pipeline accesses until the owning request completes. The owning request is then free to change the ownership status of the line, if necessary, and return the requested data to the processor. Such a system works well until heavy activity on a single address occurs, as in the interprocessor synchronization of the kind described earlier. In those cases, many requests are attempting to access the same address. They will all enter the pipeline and set their resource-need for the owning processor. The owning processor will complete and the remaining requests will all vie for priority again; a new owner will set its lock and all subsequent requests will then set a resource-need for the new owner. Each request will busy the pipe, and other resources, only to set its resource-need for the newly designated owner. Once the new owner completes, the process starts again. With each completion, the priority mechanism is tested again and resources are busied, causing increased traffic and latency. In addition, a completed processor may issue another request to the same address before all processors have accessed the data.
Since the priority logic has been optimized for best-case, and due to inherent latency with the request generation after a lock is cleared, the new request can beat those waiting. The combination of the partial-LRU rather than full LRU, latency of transferring ownership, the additional traffic and the optimization of new requests can cause lockout scenarios. Prior systems exhibited this type of processor starvation and attempts were made that correct some special case scenarios. Hang avoidance hardware, added to avoid deadlock situations, has also been used to avoid processor initiated recovery.
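The cost of the repeated re-arbitration described above can be illustrated with a small model. The following Python sketch is a simplification under stated assumptions, not the hardware itself: it assumes that every waiting request makes exactly one pipeline pass per round, that arbitration always selects the first waiter (the partial-LRU is not modeled), and that no new requests arrive. The function name `passes_with_shared_need` is hypothetical.

```python
def passes_with_shared_need(num_requesters: int) -> int:
    """Count pipeline passes when every waiter sets its need for the owner."""
    waiting = list(range(num_requesters))
    passes = 0
    while waiting:
        # Every waiting request enters the pipeline again: one pass each.
        passes += len(waiting)
        # Assumption: arbitration grants the lock to the first waiter.
        waiting.pop(0)
        # The remaining requests set a resource-need for that owner and
        # are all released together when it completes.
    return passes


total_for_four = passes_with_shared_need(4)
```

Under these assumptions, n requesters contending for one address consume n(n+1)/2 pipeline passes in total, for example 10 passes for 4 requesters, and most of those passes do nothing but set a resource-need for a new owner.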
As more processor requesters are added, traffic and latency are added and an improved arbitration device is necessary.
Requests made by processors, in a multi-processor system, to the same address space of shared memory, are satisfied in the order that the requests are received. In a computer system that contains a plurality of processors attached to a common memory sub-system, multiple requesters are often contesting for the same address space at the same time. Memory controller resource availability and access thereto can force undesirable ordering among the requests. However, the same complex resource interactions dictate a solution that does not serialize all requests, i.e., requester B should not wait for requester A unless there is a contention between A and B. This invention provides that all requests have equal access to the memory resources unless the requests are attempting to access the same location in memory at the same time. Once this contention has been identified, access to this location is ordered.
As the shared memory controller processes each request, address contention is checked. If there is no current owner designated to access the specified address range, that request is granted ownership of that address space until its request is satisfied. Subsequent requests for the same memory location set their need for the last requester that saw the same conflict, rather than for the first. As each master completes, only one requester resets its need and is processed. There can be any number of these ordered lists and any number of requesters on each list. Heretofore, all subsequent requests for that same address space would see this owner and set a resource need latch for it to complete. Once this address-owner completed, all remaining requests were processed again.
A method of serializing access to an address space without a negative impact to memory access to different address spaces is accomplished by dynamically creating ordered lists of requests for each contested address. A new request is added to the list only after a conflict is recognized. Since the address conflict does not always exist, there is no impact to a request for an uncontested address. Hardware is added that will generate a resource-need corresponding to the last requester that encountered the same address contention as opposed to setting a resource-need for the owning requester. Any number of these ordered lists may exist. For example, in a system with twenty requesters, there can be twenty one-requester ordered 'lists', or one twenty-requester ordered list, or any combination in between. There is no physical limitation added by the ordering device. The creation of the lists depends on a certain lock bit. As mentioned earlier, a regular lock is set if no address conflict is recognized and that requester is given ownership. This lock moves and is always held by the last requester in each address differentiated ordered list. New requesters will recognize a conflict against the last requester rather than the first and set their resource-need accordingly. In this way the first requester is free to update the status of the cache block unencumbered by all other contestants and the list ensures fair, orderly access to memory. At any time, a processor in the ordered list may be forced into recovery and temporarily taken out of the system. Care must be taken to ensure that this does not cause a deadlock condition. Other cases occur with setting and resetting the moving lock, especially in a pipelined environment where contention can occur each cycle for many cycles.
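The ordered lists and the moving lock described above can be sketched in software. The following Python model is illustrative only and rests on several assumptions: the class name `OrderedAddressLists` and its fields are hypothetical, requests are handled one at a time rather than in a pipeline, each requester has at most one outstanding request, and recovery of a requester in the middle of a list is not modeled.

```python
class OrderedAddressLists:
    """Software model of per-address ordered lists with a moving lock."""

    def __init__(self):
        self.lock_holder = {}  # address -> last requester in that list
        self.need = {}         # requester -> requester it waits for, or None
        self.address_of = {}   # requester -> contested address

    def request(self, requester, address):
        """Enter a request; return True if it owns the address immediately."""
        # The conflict is recognized against the LAST requester in the list.
        predecessor = self.lock_holder.get(address)
        self.need[requester] = predecessor
        self.address_of[requester] = address
        # The lock moves so that the newest requester always holds it.
        self.lock_holder[address] = requester
        return predecessor is None

    def complete(self, requester):
        """Finish the owning request; return the one requester released."""
        if self.need[requester] is not None:
            raise ValueError("only the current owner may complete")
        address = self.address_of.pop(requester)
        del self.need[requester]
        if self.lock_holder[address] == requester:
            # The list is empty, so the address is uncontested again.
            del self.lock_holder[address]
            return None
        for waiter, awaited in self.need.items():
            if awaited == requester:
                self.need[waiter] = None  # exactly one need is reset
                return waiter
        return None


lists = OrderedAddressLists()
first = lists.request("A", 0x100)   # A owns address 0x100
second = lists.request("B", 0x100)  # B waits for A
third = lists.request("C", 0x100)   # C waits for B, not for A
other = lists.request("D", 0x200)   # a different address is unaffected
```

In this model a completion releases exactly one waiter, so requesters on the same address are served in arrival order, while a request to an uncontested address (D above) proceeds without delay.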
Other contested resources may also be included. For example, if multiple cache-block spill/fill resources are available, these limited resources may become contested. The same resolution can occur. Several processor requests may miss the cache and attempt to load the cache-spill/fill resource only to find that there are none available. This requester will set its resource-need for the next one to become available. Once this happens, it will make another pipeline pass to load the resource only to find that a requester one cycle ahead in the pipeline takes the last one. In this case ordered lists can be created for the spill/fill resource in the same manner as the address contention. The same benefits are also realized. For example, only requests that actually need the spill/fill resource are forced into the list, and only when the resource is unavailable.
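The same ordering applied to a limited pool of spill/fill resources can be sketched as follows. This Python model is an assumption-laden illustration: the class name `SpillFillPool` is hypothetical, a simple first-in, first-out queue stands in for the chained resource-needs of the hardware, and a freed resource is handed directly to the oldest waiter so that a later arrival cannot take it first.

```python
from collections import deque


class SpillFillPool:
    """Model of a fixed pool of spill/fill resources with an ordered list."""

    def __init__(self, count):
        self.free = count       # number of resources currently available
        self.waiting = deque()  # requesters in arrival order

    def acquire(self, requester):
        """Return True if a resource is granted at once, else join the list."""
        if self.free > 0:
            self.free -= 1
            return True
        # Only requests that find no resource available enter the list.
        self.waiting.append(requester)
        return False

    def release(self):
        """Free one resource; return the waiter that receives it, if any."""
        if self.waiting:
            # Hand the resource straight to the oldest waiter.
            return self.waiting.popleft()
        self.free += 1
        return None


pool = SpillFillPool(2)
granted = [pool.acquire(p) for p in ("P0", "P1", "P2", "P3")]
next_owner = pool.release()  # a resource frees up and goes to P2
```

Because the resource never returns to the free count while requesters are waiting, a request arriving one cycle ahead in the pipeline cannot take it away from the head of the list.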