As processors run at faster speeds, memory latency looms as a large problem. Commercially available microprocessors have addressed this problem by decoupling the address computation of a memory reference from the memory reference itself. In addition, these processors decouple memory references from the execution that depends on those references.
The memory latency problem is even more critical when it comes to vector processing. Vector processors often transfer large amounts of data between memory and processor. In addition, each vector processing node typically has two or more processing units. One of the units is typically a scalar unit. Another unit is a vector execution unit. In the past, the scalar, vector load/store and vector execution units were coupled together in order to avoid memory conflicts between the units. It has therefore been difficult to extend the decoupling mechanisms of the commercially available microprocessors to vector processing computers.
As multiple parallel processors are used to work simultaneously on a single problem, there is a need to communicate status between processors. One example is status indicating that each processor has reached a certain point in its processing (generally called a barrier synchronization, since no processor is allowed to proceed beyond the synchronization point until all processors have reached the synchronization point). See, for example, U.S. Pat. No. 5,721,921, entitled BARRIER AND EUREKA SYNCHRONIZATION ARCHITECTURE FOR MULTIPROCESSORS, which issued Feb. 24, 1998 and which is incorporated in its entirety by reference. Another example is status indicating that various memory operations specified before that point have completed, so that memory operations after that point can rely on the fact that they have completed (generally called a memory synchronization).
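By way of illustration only, the barrier synchronization described above can be sketched in software. The following minimal Python sketch uses threads to stand in for processors; it is not the hardware mechanism of the referenced patent, and the names `run_with_barrier`, `worker`, and `events` are hypothetical names chosen for this example.

```python
import threading


def run_with_barrier(num_workers):
    """Run num_workers threads that all meet at one barrier.

    Returns the ordered log of events recorded by the workers.
    Threads stand in for processors here (an assumption made
    purely for illustration).
    """
    barrier = threading.Barrier(num_workers)
    events = []                 # shared log of what each worker did
    lock = threading.Lock()     # protects the shared log

    def worker(worker_id):
        # Work performed before the synchronization point.
        with lock:
            events.append(f"before:{worker_id}")
        # No worker proceeds past this call until all have reached it.
        barrier.wait()
        # Work performed after the synchronization point.
        with lock:
            events.append(f"after:{worker_id}")

    threads = [
        threading.Thread(target=worker, args=(i,))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events
```

Because each worker records its "before" entry prior to calling `barrier.wait()`, and the barrier releases only after every worker has arrived, every "before" entry in the returned log precedes every "after" entry, regardless of how the threads are scheduled.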
What is needed is a system and method for hiding memory latency in a vector processor that limits the coupling between the scalar, vector load/store and vector execution units. What is further needed is a fast, repeatable, and accurate way of synchronizing operations within a processor and across processors.