Computer architecture generally defines the functional operation, including the flow of information and control, among individual hardware units of a computer. One such hardware unit is the processor or processing engine, which contains arithmetic and logic processing circuits organized as a set of data paths. In some implementations, the data path circuits may be configured as a central processing unit (CPU) having operations that are defined by a set of instructions. The instructions are typically stored in an instruction memory and specify a set of hardware functions that are available on the CPU.
A high-performance computer may be realized by using a number of identical CPUs or processors to perform certain tasks in parallel. For a purely parallel multiprocessor architecture, each processor may have shared or private access to non-transient data, such as program instructions (e.g., algorithms) stored in a memory coupled to the processor. Access to an external memory is generally inefficient because the execution capability of each processor is substantially faster than its external interface capability; as a result, the processor often idles while waiting for the accessed data. Moreover, scheduling of external accesses to a shared memory is cumbersome because the processors may be executing different portions of the program. On the other hand, providing each processor with private access to the entire program results in inefficient use of its internal instruction memory.
One area in which a parallel, multiprocessor architecture can be advantageously employed is data communications and, in particular, the processing engine for an intermediate network station. The intermediate station interconnects communication links and subnetworks of a computer network to enable the exchange of data between two or more software entities executing on hardware platforms, such as end stations. The stations typically communicate by exchanging discrete packets or frames of data according to predefined protocols, such as the Transmission Control Protocol/Internet Protocol (TCP/IP), the Internetwork Packet Exchange (IPX) protocol, the AppleTalk protocol or the DECnet protocol.
In order to operate efficiently, the individual processors in a parallel, multiprocessor system must have a mechanism to synchronize. A barrier is a primitive that provides for the synchronization of two or more processors (or processes). When processors need to synchronize, each enters the barrier state. Only when all processors have reached the barrier state may any processor proceed with the execution of subsequent instructions. Barrier synchronization has traditionally been performed by software using, e.g., a semaphore. In a typical implementation of a semaphore, a counter located in a memory stores the number of units of a resource that are free. When a processor accesses the resource, the counter is decremented, and when a processor finishes with the resource, the counter is incremented. While the counter is at zero, processors simply “busy-wait” until the resource becomes free. A problem with this prior approach is that polling the counter consumes memory bandwidth merely to achieve processor synchronization.
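By way of illustration only, the counter-based, busy-waiting style of software synchronization described above may be sketched as follows. The sketch assumes Python threads standing in for processors; the names `CountingBarrier`, `run_barrier_demo` and `polls` are hypothetical and do not appear in the description. Each thread increments a shared counter on arrival and then repeatedly re-reads that counter until every thread has arrived, and the `polls` field counts those re-reads to make the memory traffic visible.

```python
import threading
import time


class CountingBarrier:
    """One-shot software barrier built on a shared counter (illustrative only)."""

    def __init__(self, parties):
        self.parties = parties          # number of threads that must arrive
        self.arrived = 0                # shared counter held in memory
        self.polls = 0                  # reads of the counter while waiting
        self._guard = threading.Lock()  # makes each counter access atomic

    def wait(self):
        with self._guard:
            self.arrived += 1           # announce arrival at the barrier
        while True:                     # busy-wait: re-read the shared counter
            with self._guard:
                self.polls += 1
                if self.arrived == self.parties:
                    return
            time.sleep(0)               # yield so other threads can run


def run_barrier_demo(parties=4):
    """Return the event log and the barrier after all workers finish."""
    barrier = CountingBarrier(parties)
    log = []                            # appends are safe under CPython's GIL

    def worker(ident):
        log.append(("before", ident))   # work done ahead of the barrier
        barrier.wait()
        log.append(("after", ident))    # work that must follow the barrier

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(parties)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log, barrier
```

In this sketch no thread records its "after" entry until every thread has recorded its "before" entry, and every waiting thread reads the shared counter at least once, which is the bandwidth cost noted above.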
When two processors or processes vie for access to a single shared memory resource, a more specialized implementation of a semaphore may be utilized. A single binary semaphore, such as a spin lock, may regulate resource access. A spin lock is a mechanism that allows for orderly access to a shared resource such as a memory. For example, a spin lock may ensure that only one processor or process accesses a segment of the memory at any given time. Each segment of the memory may have a spin lock associated with it and, whenever a processor requires access to the segment, it determines whether the spin lock is “locked” or “unlocked”. A locked status indicates that another processor is currently accessing that segment of the memory. Conversely, an unlocked status indicates that the segment is available for access. Thus, when a processor attempts to access a memory segment, it simply tests the spin lock associated with the segment to determine whether that segment is currently being accessed. If not, the testing processor acquires and locks the spin lock to exclude other processors or processes from accessing the segment.
Generally, processors that attempt to access a particular segment at the same time compete for acquisition of the spin lock. Processors that fail to gain access typically wait for a period of time before reattempting access. These processors typically enter a tight loop that causes the processors to “spin” (hence the term “spin lock”). A waiting processor continually tests the spin lock until it gains access. A problem associated with this approach is that, as the number of processors competing for the spin lock increases, memory contention also increases which, in turn, degrades overall system performance.
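The test-and-acquire behavior of a spin lock described above may be sketched, under stated assumptions, as follows. Python threads stand in for processors, and a non-blocking `threading.Lock.acquire` call stands in for a hardware test-and-set operation; the names `SpinLock`, `run_spin_demo` and `failed_tests` are hypothetical. A waiting thread repeatedly tests the lock until it is found unlocked, and the count of failed tests gives a rough measure of contention.

```python
import threading
import time


class SpinLock:
    """Spin lock sketch; a non-blocking Lock.acquire stands in for test-and-set."""

    def __init__(self):
        self._flag = threading.Lock()

    def acquire(self):
        """Spin until the lock is obtained; return the number of failed tests."""
        failed = 0
        while not self._flag.acquire(blocking=False):  # test the lock
            failed += 1                                # locked: keep spinning
            time.sleep(0)                              # yield to other threads
        return failed

    def release(self):
        self._flag.release()                           # mark as unlocked


def run_spin_demo(workers=4, rounds=200):
    """Have several threads update one shared 'segment' under a spin lock."""
    lock = SpinLock()
    segment = {"value": 0, "failed_tests": 0}

    def worker():
        for _ in range(rounds):
            failed = lock.acquire()
            try:
                current = segment["value"]      # read-modify-write that
                segment["value"] = current + 1  # must not be interleaved
                segment["failed_tests"] += failed
            finally:
                lock.release()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return segment
```

Because only the lock holder touches the segment, the final value equals the number of workers multiplied by the number of rounds. The `failed_tests` total varies from run to run and tends to grow as more threads compete, which mirrors the contention problem noted above.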
In a parallel, multiprocessor system consisting of a systolic array (i.e., a plurality of processors arrayed as rows and columns), a synchronization mechanism may be provided that synchronizes each processor on a “phase” basis. All processors of a column execute substantially the same program instruction code; moreover, all processors of the column access a single shared resource (a column memory). Accordingly, a synchronization mechanism is desirable to reduce collisions directed to that shared, column memory resource. For example, assume a phase is 128 cycles in duration and, within each phase, there is a plurality of quarter-phase (i.e., 32-cycle) boundaries or “phaselets”. The processors of row 1 start their phase 32 cycles after the start of a row 0 phase, the processors of row 2 start their phase 64 cycles after the start of row 0 and the processors of row 3 start their phase 96 cycles after the start of row 0.
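The staggered phase starts in the example above follow from simple arithmetic, sketched here for illustration. The sketch assumes four rows (rows 0 through 3), a 128-cycle phase and 0-based cycle counts measured from the start of a row 0 phase; the function names `phase_start_offset` and `cycle_within_phase` are hypothetical.

```python
PHASE_CYCLES = 128                       # duration of one phase
ROWS = 4                                 # rows 0 through 3 in the example
PHASELET_CYCLES = PHASE_CYCLES // ROWS   # quarter phase: 32 cycles


def phase_start_offset(row):
    """Cycles by which a row's phase start trails the start of a row 0 phase."""
    return row * PHASELET_CYCLES


def cycle_within_phase(row, global_cycle):
    """Position of a row inside its own phase at a given global cycle.

    Both values are 0-based, and steady-state operation is assumed, so a row
    whose current phase has not yet started is still in its previous phase.
    """
    return (global_cycle - phase_start_offset(row)) % PHASE_CYCLES


offsets = [phase_start_offset(row) for row in range(ROWS)]   # [0, 32, 64, 96]
```

At any given cycle the four rows of a column therefore sit at four different points of the phase, 32 cycles apart, which is what spreads their references to the shared column memory.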
In such an arrangement, problems occur because, within a phase, the processors can drift out of synchronization by several clock cycles. Assume a first processor of row 0 “stalls” because of a collision with a second processor of row 1 within its column on a memory reference to the shared memory resource. As a result, the operation to be performed by the first processor at cycle 1 actually occurs at, e.g., cycle 33. As noted, each processor of a column performs substantially the same work on packets entering the systolic array. Moreover, the work performed by each processor may vary among phaselets of a phase. Because the first processor was “pushed out” 32 cycles from cycle 1 of phaselet 1 to cycle 33 of phaselet 2, row 1 of the systolic array has “gotten ahead” of row 0, thereby implying that certain work (actions) has completed that has not yet been performed. This adversely impacts the performance of the systolic array.
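The effect of the stall in the example above can be checked with a short calculation. The sketch assumes 1-based cycle and phaselet numbering, consistent with “cycle 1 of phaselet 1” and “cycle 33 of phaselet 2” in the example; the function names `phaselet_of` and `apply_stall` are hypothetical.

```python
PHASELET_CYCLES = 32   # quarter of a 128-cycle phase


def phaselet_of(cycle):
    """Return the 1-based phaselet that contains a 1-based cycle number."""
    return (cycle - 1) // PHASELET_CYCLES + 1


def apply_stall(scheduled_cycle, stall_cycles):
    """Return (actual cycle, scheduled phaselet, actual phaselet) after a stall."""
    actual_cycle = scheduled_cycle + stall_cycles
    return actual_cycle, phaselet_of(scheduled_cycle), phaselet_of(actual_cycle)


# An operation scheduled for cycle 1 that is delayed by 32 cycles.
actual_cycle, scheduled_phaselet, actual_phaselet = apply_stall(1, 32)
```

With these inputs the operation lands on cycle 33, moving from phaselet 1 to phaselet 2, so the stalled processor performs its work one phaselet later than the schedule assumed.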
The present invention is thus generally directed to a system for synchronizing processors of a systolic array. The invention is further directed to an efficient and accurate means for scheduling resources within the systolic array. In addition, the present invention is directed to a mechanism that guarantees, within a phase, that all processors of a column of the systolic array are at the same relative point within the instruction stream code.