1. Field of the Invention
The present invention generally relates to multiprocessor systems and, more specifically, to the order by which multiple processors execute instructions.
2. Background Information
High-performance computer systems often utilize multiple processors or central processing units (CPUs). Each processor may have access to shared and/or private data, such as program instructions, e.g., algorithms, that are stored in a memory coupled to the processors. In addition, each processor may support one or many threads, where each thread corresponds to a separate instruction or execution sequence. One of the more common multiprocessor architectures is called a systolic array in which each processor is coupled to its nearest neighbors in a mesh-like topology, and the processors perform a sequence of operations on the data that flows between them. Typically, the processors of a systolic array operate in “lock-step” with each processor alternating between a compute phase and a communicate phase.
Systolic arrays are often used when the problem being solved can be partitioned into discrete units of work. In the case of a one-dimensional systolic array comprising a single “row” of processors, each processor is responsible for executing a distinct set of instructions on input data so as to generate output data which is then passed (possibly with additional input data) to a next processor of the array. To maximize throughput, the problem is divided such that each processor requires approximately the same amount of time to complete its portion of the work. In this way, new input data can be “pipelined” into the array at a rate equivalent to the processing time of each processor, with as many units of input data being processed in parallel as there are processors in the array. Performance can be improved by adding more processors to the array as long as the problem can continue to be divided into smaller units of work. Once this dividing limit has been reached, processing capacity may be further increased by configuring multiple rows in parallel, with new input data allocated to the first processor of a next row of the array in sequence.
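By way of illustration only, the pipelining behavior described above can be sketched as a simple timing model. The function name and its parameters are hypothetical and do not appear in any actual implementation; the sketch assumes that every processor takes exactly the same time per unit of work and that all rows accept a new unit of input data at the same phase boundary.

```python
def completion_times(num_units, num_stages, stage_time, num_rows=1):
    """Return the time at which each unit of input data leaves the array.

    Assumptions (illustrative only): every processor needs exactly
    stage_time per unit, units are allocated to rows in round-robin
    order, and all rows accept a new unit at the same phase boundary.
    """
    times = []
    for i in range(num_units):
        # Phase at which unit i enters the first processor of its row.
        entry_phase = i // num_rows
        # The unit then passes through num_stages processors in sequence.
        times.append((entry_phase + num_stages) * stage_time)
    return times


# One row of 4 processors, 2 time units per processor: after the
# pipeline fills, one unit completes every 2 time units.
single_row = completion_times(3, 4, 2)

# Two rows of 2 processors, 3 time units per processor: two units
# complete at each phase boundary.
two_rows = completion_times(4, 2, 3, num_rows=2)
```

Under these assumptions, adding a second row doubles the number of units completed per phase without changing the time any single unit spends in the array.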
One place where multiprocessor architectures, such as systolic arrays, can be advantageously employed is in the area of data communications. In particular, systolic arrays have been used in the forwarding engines of intermediate network stations or nodes, such as routers. An intermediate node interconnects communication links and subnetworks of a computer network through a series of ports to enable the exchange of data between two or more end nodes of the computer network. The end nodes typically communicate by exchanging discrete packets or frames according to predefined protocols, such as the Transmission Control Protocol/Internet Protocol (TCP/IP) or the Internetwork Packet eXchange (IPX) protocol. The forwarding engine is often used by the intermediate node to process packets received on the various ports. This processing may include determining the destination of a packet, such as an output port, and placing the packet on an output queue associated with the destination.
Intermediate nodes often employ output queues to control the flow of packets placed into the computer network. In a typical arrangement, the output queues are configured as first-in-first-out (FIFO) queues where packets are placed (enqueued) at the end (tail) of a queue and removed (dequeued) from the beginning (head) of the queue. Placement and removal often entail accessing the queue, which includes writing and reading the packet or information related to the packet, such as a packet header, to and from the queue.
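By way of illustration only, the enqueue and dequeue operations described above can be sketched as follows. The class name OutputQueue and its methods are hypothetical; the sketch assumes a single accessor and does not address access by multiple processors.

```python
from collections import deque


class OutputQueue:
    """Minimal FIFO output queue (illustrative sketch, single accessor)."""

    def __init__(self):
        self._items = deque()

    def enqueue(self, packet):
        # Place the packet (or its header) at the tail of the queue.
        self._items.append(packet)

    def dequeue(self):
        # Remove and return the packet at the head of the queue.
        # Raises IndexError if the queue is empty.
        return self._items.popleft()

    def __len__(self):
        return len(self._items)


queue = OutputQueue()
for name in ("pkt-a", "pkt-b", "pkt-c"):
    queue.enqueue(name)

first_out = queue.dequeue()  # the packet enqueued first
```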
In some intermediate nodes, packets are enqueued and dequeued by the forwarding engine. In intermediate nodes that employ forwarding engines containing multiple processors, the output queues may be treated as shared resources, meaning that more than one processor can access a given queue at a given time. One problem with shared resources, however, is that packets received by the intermediate node in a given order may be processed and forwarded in a different order.
To resolve this problem, a systolic array can be configured to guarantee first-in-first-out (FIFO) ordering of context data processing. As used herein, context data or “context” is defined as an entire packet or, more preferably, a header of a packet. According to FIFO ordering, the contexts processed by the processors of the rows of the array must complete in the order received by the processors before the rows of the array advance. Each processor is allocated a predetermined time interval or “phase” within which to complete its processing of a context. When each processor completes its context processing within the phase, this control mechanism is sufficient. However, if a processor stalls or otherwise cannot complete its processing within the phase interval, all processors of the array stall in order to maintain FIFO ordering. Here, the FIFO ordering control mechanism penalizes both the processors of the row of the stalled processor and the processors of the remaining rows of the multiprocessor array.
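By way of illustration only, the cost of this lock-step control mechanism can be sketched with a simple timing model. The function name and the input layout are hypothetical; the sketch assumes that the array never advances before the nominal phase interval has elapsed and that it otherwise advances as soon as the slowest processor has finished.

```python
def lockstep_elapsed(work_per_phase, phase_len):
    """Return (total elapsed time, total stall time) for the whole array.

    work_per_phase is a list of phases; each phase is a list holding the
    processing time needed by each processor in that phase. Assumption
    (illustrative only): every processor waits until the slowest
    processor has finished before the array advances.
    """
    elapsed = 0
    stalled = 0
    for work in work_per_phase:
        # The array advances only when the slowest processor is done,
        # and never sooner than the nominal phase interval.
        duration = max(phase_len, max(work))
        elapsed += duration
        stalled += duration - phase_len
    return elapsed, stalled


# Three processors, nominal phase of 4 time units. In the second phase
# one processor needs 9 time units, so every processor waits 5 extra.
total, stall = lockstep_elapsed([[3, 4, 2], [4, 9, 3]], phase_len=4)
```

In this model a single slow processor delays every other processor by the same amount, which is the penalty described above.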
For most applications executed by the systolic array, FIFO ordering is not necessary. However, FIFO ordering may be needed to maintain an order of contexts having a dependency among one another. Packets that correspond to the same “application flow” or more simply “flow” often need to be treated as having a dependency on each other. A flow is defined as a sequence of packets having the same layer 3 (e.g., Internet Protocol) source and destination addresses, the same layer 4 (e.g., Transmission Control Protocol) port numbers, and the same layer 4 protocol type.
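By way of illustration only, the flow definition above can be sketched as a key built from the five header fields it names. The names FlowKey, flow_key, and group_by_flow, as well as the dictionary field names, are hypothetical and assume that each packet header has already been parsed into a dictionary.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowKey(NamedTuple):
    src_addr: str   # layer 3 source address
    dst_addr: str   # layer 3 destination address
    src_port: int   # layer 4 source port
    dst_port: int   # layer 4 destination port
    protocol: int   # layer 4 protocol type (e.g., 6 for TCP)


def flow_key(header):
    """Build the flow key from a parsed header (hypothetical field names)."""
    return FlowKey(
        header["src_addr"],
        header["dst_addr"],
        header["src_port"],
        header["dst_port"],
        header["protocol"],
    )


def group_by_flow(headers):
    """Group headers by flow, preserving arrival order within each flow."""
    flows = defaultdict(list)
    for header in headers:
        flows[flow_key(header)].append(header)
    return flows


base = {"src_addr": "10.0.0.1", "dst_addr": "10.0.0.2",
        "src_port": 1234, "dst_port": 80, "protocol": 6}
pkt_a = dict(base, seq=1)
pkt_b = dict(base, seq=2)                 # same flow as pkt_a
pkt_c = dict(base, dst_port=443, seq=1)   # different destination port

flows = group_by_flow([pkt_a, pkt_c, pkt_b])
```

Only packets that share a key, such as pkt_a and pkt_b here, would need to be kept in their received order relative to one another.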