1. Field of the Invention
The present invention relates to a method and apparatus for distributing serial instruction stream data to parallel processors, and more particularly, to a method and apparatus for distributing a serial stream of command/data packets among a parallel array of processors for processing so that the data processing outputs of the processors can be readily recombined into a serial stream with the same ordering as the original stream.
2. Description of the Prior Art
It is well known to use a plurality of processing units in parallel to improve processing speed. For example, for applications such as interactive 3-D graphics which require significant arithmetic processing in order to handle complex models, parallel processing units have been used for several years in order to improve processing efficiency. For such processing efficiency to be realized, a mechanism has been necessary for optimally allocating the input sequential data to the parallel processing units. Also, when the original data ordering is to be maintained, a mechanism has been necessary for reading the outputs from each of the parallel processing units and resequencing the processed data so that it has the same order as the original data stream. However, such prior art mechanisms have had only limited success.
For example, one technique in the prior art for allocating sequential input data to a plurality of parallel processing units has been to assign one parallel processing unit as a master while the other processing units operate as slaves. The master passes out commands to the slaves generally in accordance with an adaptive load balancing algorithm in which the master selects the slave which has the least amount of work buffered up to become the next processing unit to receive input. If all slaves are completely out of input buffer space, then the master becomes the active processing unit. The master will remain the active processing unit until a slave has available input buffer space and a minimum number of commands have been executed. The minimum size of the master's block of commands, along with the size of the blocks of commands given to the slaves, may be adjusted to improve processing efficiency. In addition, in order to ensure that the processed data may be resequenced after processing in the assigned processing unit, the master may write to a RAM FIFO a value identifying the slave processing unit to which a particular command has been assigned. The order of the writes into the RAM FIFO enables the master to maintain the same rendering order as the order in which the commands were originally received at the master's input buffer.
However, such a technique has the disadvantage that the processing units cannot all be identical and programmed identically, thereby increasing the cost and complexity of the system. Moreover, if one or more of the slave processing units are kept busy with a complex data processing command, the benefits of parallel processing may soon be lost.
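By way of illustration only, the master/slave allocation and RAM FIFO resequencing described above may be sketched as follows. The sketch is a simplification under stated assumptions: the slaves do not drain their input buffers while commands are being distributed, the rule requiring a minimum number of commands to be executed by the master is omitted, and the names `distribute`, `resequence`, and `MASTER` are hypothetical and do not appear in the prior art system.

```python
from collections import deque

MASTER = -1  # hypothetical identifier for the master processing unit


def distribute(commands, num_slaves, buffer_size):
    """Give each command to the slave with the least buffered work.

    Assumption: slaves do not consume their buffers during distribution,
    so once every buffer is full the master takes all remaining commands.
    Returns (assignments, order_fifo). order_fifo plays the role of the
    RAM FIFO: it records, in input order, which unit got each command.
    """
    buffers = {s: deque() for s in range(num_slaves)}
    assignments = {s: [] for s in range(num_slaves)}
    assignments[MASTER] = []
    order_fifo = deque()

    for cmd in commands:
        # Pick the slave with the fewest buffered commands.
        slave = min(buffers, key=lambda s: len(buffers[s]))
        if len(buffers[slave]) < buffer_size:
            buffers[slave].append(cmd)
            unit = slave
        else:
            # All slaves are out of input buffer space.
            unit = MASTER
        assignments[unit].append(cmd)
        order_fifo.append(unit)
    return assignments, order_fifo


def resequence(outputs, order_fifo):
    """Merge per-unit outputs back into the original command order."""
    queues = {unit: deque(results) for unit, results in outputs.items()}
    return [queues[unit].popleft() for unit in order_fifo]
```

Because each unit emits its own results in the order it received its commands, reading the unit identifiers back out of the FIFO is sufficient to restore the original ordering.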
Another technique in the prior art for allocating sequential input data to a plurality of parallel processing units has been described, for example, by Torborg in "A Parallel Processor Architecture for Graphics and Arithmetic Operations", Proceedings of SIGGRAPH, Volume 21, Number 4, July 1987. Therein, Torborg describes a graphics processor architecture having an arbitrary number of identical processors operating in parallel which are programmed identically as if each processor were a single processor system. In particular, Torborg discloses that parallel processing in up to eight arithmetic processors may be used for front end geometric and arithmetic operations to improve processing time for an interactive 3-D graphics system. Torborg also observes that the graphics commands must be adequately distributed among the processors for efficient processor utilization and that the multiple parallel processors must produce the same apparent results as a single processor performing the same operations.
For implementing his system, Torborg observes that many graphics commands are order-independent and hence their processing and rendering order may be changed without affecting the display. However, for those graphics commands which are not order-independent, Torborg proposes to delay the processors which are processing other commands until all processors are synchronized before processing the sequential command. Torborg indicates that, due to the buffering of the processor output, this synchronization has a minimal effect on processing efficiency.
Torborg further proposes to transfer pipelined data to the parallel arithmetic processors whenever data is available and the appropriate arithmetic processor is ready for data. The graphics commands are distributed to the arithmetic processors depending upon whether the input command is a global command which is to be sent to all arithmetic processors, a command which is sent to the arithmetic processor most ready to accept the command as determined by a command arbitration mechanism, or a command which is to be sent to a specific arithmetic processor as specified within the graphics command. Command arbitration is used to determine which arithmetic processor should receive the command if it can be processed in parallel. The arbitration mechanism attempts to fairly distribute commands between processors in order to improve processing efficiency by giving priority to processors which will be ready to accept the command soonest. For this purpose, Torborg discloses that each processor may have a command input buffer which is used to buffer commands from a display list manager. The buffer is deep enough that it can contain several commands simultaneously, and each processor may request commands on different priority levels depending upon the amount of data in its input buffer. Distribution priority is then given to the processors executing the commands which take the least amount of time to process.
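By way of illustration only, the three distribution cases described above (global, arbitrated, and processor-specific) may be sketched as follows. The sketch assumes that a processor's readiness can be approximated by the free space in its command input buffer; the actual arbitration priorities of Torborg are not reproduced, buffer overflow is not handled, and the names `Processor` and `route` and the string values used for `kind` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Processor:
    ident: int
    capacity: int  # depth of the command input buffer
    buffer: List[str] = field(default_factory=list)

    def free_space(self) -> int:
        return self.capacity - len(self.buffer)


def route(command: str, kind: str, processors: List[Processor],
          target: Optional[int] = None) -> List[int]:
    """Buffer the command and return the ids of the receiving processors."""
    if kind == "global":
        # Global commands go to every arithmetic processor.
        chosen = list(processors)
    elif kind == "specific":
        # The command itself names the processor that must receive it.
        chosen = [p for p in processors if p.ident == target]
    else:
        # Arbitrated: assumed here to favor the emptiest input buffer.
        chosen = [max(processors, key=lambda p: p.free_space())]
    for p in chosen:
        p.buffer.append(command)
    return [p.ident for p in chosen]
```

For three processors with empty buffers of depth four, a global command is buffered by all three, and a subsequent arbitrated command goes to whichever processor then has the most free buffer space.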
Sequencing is generally maintained in the system of Torborg by maintaining a small tag FIFO in each processor for keeping track of the ordering of all sequential commands being processed by all arithmetic processors and all commands being processed by the particular processor containing the tag FIFO. A two-bit entry in the tag FIFO is used by Torborg to indicate whether the command is being processed by the particular arithmetic processor containing the tag FIFO and whether the command is a sequential command. The output of the tag FIFO is used to ensure sequential processing of all order-dependent commands and to control the order in which the processed data is transferred to an image memory unit for subsequent display. In particular, the output of the tag FIFO is used to control arbitration on the bus by which the parallel graphics processing units are connected to the image memory unit. For example, if the two control bits of the tag FIFO indicate that the command is not sequential, then the output controller will request the bus as soon as a complete command block is available. In this case, the order in which commands are transferred to the image memory unit will depend on the processor load and command distribution of all processors. The tag FIFO output will be clocked after each command group associated with a single graphics input command is transferred over the bus to the image memory unit.
However, if the tag FIFO indicates a sequential command, the output controller will wait until all other arithmetic processor output controllers have reached the same point in the original command stream for purposes of synchronization. In other words, the output controller will wait until all arithmetic processors reach the entry in their tag FIFOs corresponding to a particular sequential command, thereby synchronizing the arithmetic processors, before the sequential command is output for processing. Since every processor in the system of Torborg places an entry into its tag FIFO for every sequential command (even if the command is not executed by that processor), all processors' tag FIFOs will indicate a sequential command but only one will indicate that the command was processed by that processor. The processor which has processed the sequential command will then request the bus and send the group of commands associated with the graphics input command to the image memory unit. Once this command transfer has completed, the tag FIFO output on all arithmetic processors will be clocked. As noted above, Torborg indicates that processing efficiency is maintained in this system since the processor core can still continue to transfer commands into the output FIFO of the processors while the output controller synchronizes all of the processors to maintain sequentiality.
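By way of illustration only, the tag FIFO scheme described above may be modeled as follows. The sketch assumes that every processor has already finished its commands, so that only the ordering constraint on bus transfers is shown and not the timing; each two-bit tag is represented as a pair of booleans alongside a command name; and the names `build_tag_fifos` and `drain` are hypothetical.

```python
from collections import deque


def build_tag_fifos(stream, num_procs):
    """Build one tag FIFO per processor.

    stream is a list of (name, owner, sequential) in original order.
    A processor records an entry for every sequential command and for
    every command it processes itself.
    """
    fifos = [deque() for _ in range(num_procs)]
    for name, owner, sequential in stream:
        for p in range(num_procs):
            mine = (p == owner)
            if mine or sequential:
                fifos[p].append((name, mine, sequential))
    return fifos


def drain(fifos):
    """Return one legal order of bus transfers for the given tag FIFOs."""
    bus = []
    while any(fifos):
        progressed = False
        # Non-sequential entries may be sent as soon as they reach the head.
        for fifo in fifos:
            while fifo and not fifo[0][2]:
                bus.append(fifo.popleft()[0])
                progressed = True
        # A sequential entry waits until every FIFO has it at the head;
        # only the owning processor sends it, then all FIFOs are clocked.
        if all(f and f[0][2] for f in fifos):
            for fifo in fifos:
                name, mine, _ = fifo.popleft()
                if mine:
                    bus.append(name)
            progressed = True
        if not progressed:
            raise RuntimeError("inconsistent tag FIFOs")
    return bus
```

In this model every sequential command acts as a barrier across all processors: no processor can send later output until every processor has reached that command in its tag FIFO.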
Thus, in the system of Torborg, a mechanism is provided for assigning sequential commands to a plurality of parallel processors and then recombining the processed outputs into the original ordering using values stored in a tag FIFO for output bus arbitration. However, the tag values must be maintained for each processor independently, and sequentiality must be maintained by synchronizing the outputs of all processors when a sequential instruction is received. As a result, processing efficiency can become quite low, especially when several sequential instructions are received in succession. Moreover, the command stream cannot be reordered to allow maximum use of parallel processing capabilities. Thus, when complex input graphics primitives such as B-spline patches, which have a potentially long execution time compared with their input buffer size, are received, processing inefficiency may be compounded.
Accordingly, there is a long-felt need for a command distributor which can distribute serial instruction stream data commands to identical parallel processing units for processing so that the processed data can be recombined so as to maintain sequentiality without synchronizing all processors for each sequential command, with the resultant loss in processing efficiency. Moreover, there is a long-felt need for a command distributor which may split complicated commands for distribution to a plurality of parallel processors for processing without destroying the sequentiality of the output data stream. The present invention has been designed for these purposes.