In multiprocessor systems, situations arise in which the instruction stream executing on one or more processors depends on the successful completion of a memory store operation issued by another processor. Under some circumstances, the store is unable to successfully place its data into memory because the operations initiated by other system participants totally consume the available interconnection bandwidth or have the equivalent effect of blocking the store due to side effects of hazard detection hardware. This store "starvation" may result in a failure to make forward progress in the program which ultimately causes the program to fail. An example of this can be seen in the following pseudo-code:
P1:                                        P2:
loop: load word Rx, A                      store word Rz, A
      load word Ry, B
      compare word immediate A, value
      branch if not equal loop
Two processors P1, P2 are involved in a spin loop in which one is waiting for a specific value to be stored by the other processor. Rx, Ry, and Rz refer to processor registers, and A and B are memory addresses. The "compare word immediate" uses a literal value, but comparison to any other source (such as the contents of another register) could also be used.
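The logical dependency between the two instruction streams can be sketched in Python. This is only an illustration under stated assumptions: the `memory` dictionary, the `VALUE` constant, and the function names are hypothetical stand-ins, the comparison is made against the value loaded into Rx, and Python threads do not model interconnect bandwidth or hazard detection hardware, so the store always completes here.

```python
import threading

VALUE = 42                     # hypothetical value that P1 waits for
memory = {"A": 0, "B": 0}      # stand-in for shared memory locations A and B


def p1_spin(result):
    """P1: spin until the value stored by P2 is observed at location A."""
    iterations = 0
    while True:
        rx = memory["A"]       # load word Rx, A
        ry = memory["B"]       # load word Ry, B (adds read traffic only)
        iterations += 1
        if rx == VALUE:        # compare with the literal, fall out of the loop
            break
    result["rx"] = rx
    result["iterations"] = iterations


def p2_store():
    """P2: store the value that P1 depends on."""
    rz = VALUE
    memory["A"] = rz           # store word Rz, A


result = {}
p1 = threading.Thread(target=p1_spin, args=(result,))
p2 = threading.Thread(target=p2_store)
p1.start()
p2.start()
p1.join()
p2.join()
print(result["rx"], result["iterations"])
```

P1 leaves the loop only after P2's store becomes visible. If that store were blocked indefinitely, P1 would never terminate.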
The store word to A executing in processor P2 updates a location with a value that the code executing on processor P1 requires in order to make forward progress. The loop continues until the expected value is obtained. The second load word instruction (from location B) executing in processor P1 is not strictly needed to create the starvation scenario if location A is not placed into processor P1's cache memory. It is shown in this example to describe a more common situation in which locations A and B are cacheable. If processor P1's cache is direct mapped and the addresses of locations A and B cause them to occupy the same slot in that cache, each load evicts the line brought in by the other, so the load of word A and the load of word B always miss, creating repetitive reads from memory external to the processor. If one assumes a more associative cache in processor P1, more load word instructions requiring the same slot can be added to the code sequence to produce the same effect. The resulting read traffic can block the completion of processor P2's store word to A. The likelihood of this blockage increases in a system with a large number of processors if many of the processors are waiting for the value, each executing the sequence shown for processor P1.
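The cache behavior described above can be checked with a small simulation. This is a minimal sketch, not a model of any particular processor: the cache geometry (256 sets, 64-byte lines), the addresses, the third location `C`, and the LRU replacement policy are all assumptions chosen for illustration, and `count_misses` is a hypothetical helper.

```python
from collections import OrderedDict


def count_misses(addresses, num_sets, ways, line_size=64, rounds=1000):
    """Replay `addresses` in a loop against an LRU set-associative cache.

    Returns (misses, accesses). ways=1 models a direct-mapped cache.
    """
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = accesses = 0
    for _ in range(rounds):
        for addr in addresses:
            line = addr // line_size
            index = line % num_sets        # which slot (set) the line maps to
            tag = line // num_sets         # distinguishes lines sharing a slot
            cache_set = sets[index]
            accesses += 1
            if tag in cache_set:
                cache_set.move_to_end(tag)         # hit: mark most recently used
            else:
                misses += 1                        # miss: read from external memory
                if len(cache_set) >= ways:
                    cache_set.popitem(last=False)  # evict least recently used line
                cache_set[tag] = True
    return misses, accesses


NUM_SETS, LINE = 256, 64                   # assumed cache geometry
A = 0x10000
B = A + NUM_SETS * LINE                    # same slot as A, different tag
C = B + NUM_SETS * LINE                    # a third location in the same slot

direct_mapped = count_misses([A, B], NUM_SETS, ways=1)
two_way_two_loads = count_misses([A, B], NUM_SETS, ways=2)
two_way_three_loads = count_misses([A, B, C], NUM_SETS, ways=2)
print(direct_mapped, two_way_two_loads, two_way_three_loads)
```

Under these assumptions, the direct-mapped case misses on every load. A two-way cache absorbs the two conflicting loads after the initial cold misses, and adding a third conflicting load makes every access miss again.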
As a result, there is a need in the art for a solution that permits a stalled store operation to progress.