A systolic array provides a common approach for increasing the processing capacity of a computer system when a problem can be partitioned into discrete units of work. In the case of a one-dimensional systolic array comprising a single “row” of processing elements or processors, each processor in the array is responsible for executing a distinct set of instructions on input data before passing it to a next element of the array. To maximize throughput, the problem is divided such that each processor requires approximately the same amount of time to complete its portion of the work. In this way, new input data can be “pipelined” into the array at a rate equivalent to the processing time of each processor, with as many units of input data being processed in parallel as there are processors in the array. Performance can be improved by adding more elements to the array as long as the problem can continue to be divided into smaller units of work. Once this dividing limit has been reached, processing capacity may be further increased by configuring multiple rows in parallel, with new input data allocated to the first processor of a next row of the array in sequence.
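The throughput behavior described above may be illustrated by a minimal sketch. The function name `pipeline_finish_times` and the stage times are hypothetical and chosen only for illustration. The sketch further assumes that an item waits between stages until the next processor is free, which is a simplification of a true systolic array.

```python
def pipeline_finish_times(num_items, stage_times):
    """Return the completion time of each item fed into a single-row pipeline.

    stage_times[i] is the (hypothetical) processing time of processor i.
    An item enters a stage once it has left the previous stage and that
    stage has finished with the preceding item.
    """
    free_at = [0] * len(stage_times)  # time at which each stage becomes free
    finish = []
    for _ in range(num_items):
        t = 0  # all items are assumed to be available at time 0
        for i, cost in enumerate(stage_times):
            start = max(t, free_at[i])
            t = start + cost
            free_at[i] = t
        finish.append(t)
    return finish


# Balanced stages: after the pipeline fills, one item completes per stage time.
balanced = pipeline_finish_times(5, [2, 2, 2, 2])

# Same total work, unbalanced: the slowest stage sets the completion interval.
unbalanced = pipeline_finish_times(3, [1, 4, 1, 2])
```

In this model, the interval between successive completions equals the time of the slowest stage, which is why the work is divided as evenly as possible among the processors.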
Typically, such a parallel processor systolic array lacks the buffering capability to handle “large” amounts of data, such as Internet protocol (IP) packets, despite having the required processing power. Accordingly, only portions of the packets are forwarded to the array for processing, while the remaining portions are buffered external to the array. This arrangement relegates the systolic array to the role of an “out-of-band” processor. An example of such an out-of-band systolic array is the processing engine disclosed in U.S. patent application Ser. No. 09/106,478 titled Programmable Arrayed Processing Engine Architecture for a Network Switch, by Darren Kerr et al., which application is hereby incorporated by reference as though fully set forth herein. The processing engine generally comprises an array of processors embedded between an input header buffer (IHB) and an output header buffer (OHB) of a network switch. The processors are symmetrically arrayed as rows and columns, wherein the processors of each row are configured as stages of a pipeline that sequentially execute operations on data passed serially among the processors.
A buffer and queuing unit (BQU) is coupled between the processing engine and a plurality of line cards comprising physical interface ports of the switch. The BQU contains buffers for temporarily storing data, such as IP packets, received from the line cards; thereafter, the BQU delivers portions of those packets to the IHB and stores the remaining portions in a packet memory. The IHB receives the packet portions and distributes them among the parallel pipeline rows for processing by the constituent processors. The OHB receives the processed portions from the pipeline rows and forwards them off the processing engine to the BQU, where they are rejoined with the remaining packet portions. The packets are then forwarded from the switch over appropriate physical interface ports of the line cards.
When receiving packets from the line cards, the BQU may extract a header from each packet and construct a “context” comprising control information and, e.g., the extracted header. Each context is then forwarded to the IHB for distribution to the processors of the engine. Each context comprises a fixed amount of information that is typically less than that of a packet and that represents the maximum size that each processor is optimally configured to process. Since contexts are generally smaller than packets, the BQU requires relatively large amounts of storage capacity to buffer the remaining “payloads” of the packets. These buffering capabilities are external to the processing engine and, as noted, function to relegate the processing engine to the role of an out-of-band processor.
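The division of a packet into a fixed-size context and an externally buffered payload may be sketched as follows. The context size, the layout of the control information, and the names `CONTEXT_SIZE` and `split_packet` are assumptions made solely for illustration; the actual format used by the BQU is not specified here.

```python
CONTEXT_SIZE = 128  # hypothetical fixed context size, in bytes
CONTROL_SIZE = 8    # hypothetical: 4-byte packet id + 4-byte packet length


def split_packet(packet: bytes, header_len: int, packet_id: int):
    """Split a packet into a fixed-size context and the remaining payload.

    The context holds control information followed by the extracted header,
    zero-padded to CONTEXT_SIZE. The payload is what would be buffered
    external to the processing engine.
    """
    header = packet[:header_len]
    payload = packet[header_len:]
    control = packet_id.to_bytes(4, "big") + len(packet).to_bytes(4, "big")
    body = control + header
    if len(body) > CONTEXT_SIZE:
        raise ValueError("header does not fit in a fixed-size context")
    context = body.ljust(CONTEXT_SIZE, b"\x00")
    return context, payload


# Usage: a 64-byte packet with an assumed 20-byte header.
packet = bytes(range(64))
context, payload = split_packet(packet, header_len=20, packet_id=7)
```

Because every context has the same size regardless of packet length, each processor can be provisioned for a known maximum amount of work, while the variable-length payload remains outside the array.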
However, the processing engine may be further configured to perform “light” processing on the entire packet, rather than just the packet header. Light processing denotes that the time (i.e., the number of cycles) needed by a processor to process an entire packet is short enough to keep pace with the rate at which the contexts are provided to the processors of the rows. In other words, the processor can process the context associated with an entire packet (both the packet header and payload) at “line rate”. For this configuration, performance of the processing engine may be enhanced by eliminating external buffering of the packets and, accordingly, the latencies associated with such buffering. Elimination of the external buffering, in turn, may obviate the need for the BQU and the memory used to store the packet payloads.
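The light-processing condition may be expressed as a simple cycle budget. The following sketch is illustrative only; the function names, the clock rate, and the context arrival rate are hypothetical values, and the sketch assumes that contexts are allocated to the rows in sequence so that each row receives every `num_rows`-th context.

```python
def cycles_per_context(clock_hz, contexts_per_second, num_rows=1):
    """Cycles available to a processor before its row receives the next context."""
    return clock_hz * num_rows / contexts_per_second


def is_light_processing(cycles_needed, clock_hz, contexts_per_second, num_rows=1):
    """True if a processor finishes within its cycle budget, i.e., at line rate."""
    budget = cycles_per_context(clock_hz, contexts_per_second, num_rows)
    return cycles_needed <= budget


# Hypothetical figures: 100 MHz processors, one million contexts per second,
# four parallel rows, giving each processor a budget of 400 cycles.
budget = cycles_per_context(100_000_000, 1_000_000, num_rows=4)
fits = is_light_processing(300, 100_000_000, 1_000_000, num_rows=4)
too_heavy = is_light_processing(500, 100_000_000, 1_000_000, num_rows=4)
```

When the condition holds for a context that carries the entire packet, no portion of the packet needs to wait outside the array, which is the basis for eliminating the external buffering.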