1. Field of the Invention
The present invention relates to a concentrator which merges command/data streams from a plurality of parallel processors into a command/data pipelined stream having the same ordering in which the commands were received by the processors, and more particularly, to a data stream concentrator which also provides attribute switching and direct user access to the command/data pipeline. Also, to facilitate diagnostics testing, the concentrator of the invention allows different data sources and data destinations to be specified by the user via a single command before and during data transfer.
2. Description of the Prior Art
As defined by Black in Data Communications and Distributed Networks, 2nd edition, Prentice-Hall, 1987, pp. 107-108, a concentrator is a device having n input lines, which, if all input devices are active, would exceed the capacity of the output line. The concentrator thus manages the n input lines such that in the event excessive input traffic is beginning to saturate the concentrator, some devices are ordered to reduce transmission or are not allowed to transmit at all. Such a concentrator is often confused with statistical multiplexers, which are often used as a combination concentrator/front end to enable a device to communicate with multiple channels. A statistical multiplexer is often referred to as a port concentrator, which is responsible for control of a line so as to provide buffering, error detection and line synchronization with remote components and to switch the devices to the available ports as required.
Thus, both the concentrator and the statistical multiplexer determine the order in which a plurality of input lines are to be connected to one of several outputs. In general, the order is based on the time order in which data arrives at the input. In addition, the ordering is characterized by the data lines having no relationship with each other. However, it is often desirable, as when performing parallel processing, for the input command sequence to be maintained when the outputs of the parallel processors are recombined into sequential order, rather than automatically basing the sequencing on a first-come, first-served priority scheme. Moreover, as during parallel processing, there can be and generally is some relationship between the data sources. It is desirable that the aforementioned features of concentrators and statistical multiplexers be available in this environment as well. However, no concentrator has been previously disclosed which provides the aforementioned data management functions of a concentrator while also considering the relationship of the data sources, and this remains a problem in the parallel processing art, particularly in the graphics processing environment.
Parallel processing units (or transform engines) have been used for several years to improve processing speed in the interactive 3-D graphics environment since significant arithmetic processing is required to handle the complex models of 3-D graphics. For the necessary processing efficiency to be realized, a mechanism has been necessary for optimally allocating the input sequential data to one or more of the parallel processing units. Also, when the original data ordering is to be maintained, a mechanism has been necessary for reading the outputs from each of the parallel processing units and resequencing the processed data so that it has the same order as the original data stream. However, such prior art mechanisms have had only limited success.
For example, one technique in the prior art for allocating sequential input data to a plurality of parallel processing units has been to assign one parallel processing unit as a master while the other processing units operate as slaves. The master passes out commands to the slaves generally in accordance with an adaptive load balancing algorithm whereby the master selects a slave which has the least amount of work buffered up to become the next processing unit to receive input. If all slaves are completely out of input buffer space, then the master becomes the active processing unit. The master will remain the active processing unit until a slave has available input buffer space and a minimum number of commands have been executed. The minimum size of the master's block of commands, along with the size of the blocks of commands given to the slaves, may be adjusted to improve processing efficiency. In addition, in order to ensure that the processed data may be resequenced after processing in the assigned processing unit, the master may write to a RAM FIFO a value identifying the slave processing unit to which a particular command has been assigned. The order of the writes into the RAM FIFO enables the master to maintain the same rendering order as the order that the commands were originally received at the master's input buffer.
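The master/slave allocation and RAM FIFO resequencing described above may be illustrated by the following minimal sketch. It is not taken from any cited system: the names dispatch, resequence, MASTER and slave_capacity are hypothetical, slave load is approximated by input queue length, and the rule that the master remains active for a minimum number of commands is omitted for brevity.

```python
from collections import deque

MASTER = -1  # hypothetical identifier for the master processing unit


def dispatch(commands, num_slaves, slave_capacity):
    """Assign each command to the least-loaded slave, or to the master
    when every slave input buffer is full. The unit chosen for each
    command is written, in input order, to order_fifo (the RAM FIFO)."""
    queues = {i: deque() for i in range(num_slaves)}
    queues[MASTER] = deque()
    order_fifo = deque()
    for cmd in commands:
        # Slaves that still have input buffer space.
        free = [i for i in range(num_slaves) if len(queues[i]) < slave_capacity]
        unit = min(free, key=lambda i: len(queues[i])) if free else MASTER
        queues[unit].append(cmd)
        order_fifo.append(unit)
    return queues, order_fifo


def resequence(outputs, order_fifo):
    """Read one result from the unit named by each FIFO entry, which
    restores the order in which the commands were originally received."""
    return [outputs[unit].popleft() for unit in order_fifo]


commands = ["c0", "c1", "c2", "c3", "c4", "c5", "c6"]
queues, order_fifo = dispatch(commands, num_slaves=2, slave_capacity=2)
# Each unit processes its own queue independently (here, upper-casing).
outputs = {unit: deque(c.upper() for c in q) for unit, q in queues.items()}
restored = resequence(outputs, order_fifo)
```

Because the FIFO records only which unit received each command, resequencing requires no knowledge of how long any unit took to process its share.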
However, such a technique has the disadvantage that all of the processing units cannot be identical and programmed identically, thereby increasing the cost and complexity of the system. Moreover, if one or more of the slave processing units are kept busy with a complex data processing command, the benefits of parallel processing may soon be lost.
Another technique in the prior art for allocating sequential input data to a plurality of parallel processing units has been described, for example, by Torborg in "A Parallel Processor Architecture for Graphics and Arithmetic Operations", Proceedings of SIGGRAPH, Volume 21, Number 4, July 1987. Torborg therein describes a graphics processor architecture having an arbitrary number of identical processors operating in parallel which are programmed identically as if each processor were a single processor system. In particular, Torborg discloses that parallel processing in up to eight arithmetic processors may be used for front end geometric and arithmetic operations to improve processing time for an interactive 3-D graphics system. Torborg also observes that the graphics commands must be adequately distributed among the processors for efficient processor utilization and that the multiple parallel processors must produce the same apparent results as a single processor performing the same operations.
For implementing his system, Torborg observes that many graphics commands are order-independent and hence their processing and rendering order may be changed without affecting the display. However, for those graphics commands which are not order-independent, Torborg proposes to delay the processors which are processing other commands until all processors are synchronized before processing the sequential command. Torborg indicates that, due to the buffering of the processor output, this synchronization has a minimal effect on processing efficiency.
Torborg further proposes to transfer pipelined data to the parallel arithmetic processors whenever data is available and the appropriate arithmetic processor is ready for data. The graphics commands are distributed to the arithmetic processors depending upon whether the inputted command is a global command which is to be sent to all arithmetic processors, a command which is sent to the arithmetic processor most ready to accept the command as determined by a command arbitration mechanism, or a command which is to be sent to a specific arithmetic processor as specified within the graphics command. Command arbitration is used to determine which arithmetic processor should receive the command if it can be processed in parallel. The arbitration mechanism attempts to fairly distribute commands between processors in order to improve processing efficiency by giving priority to processors which are most ready to accept the command. For this purpose, Torborg discloses that each processor may have a command input buffer which is used to buffer commands from a display list manager. The buffer is deep enough that it can contain several commands simultaneously, and each processor may request commands on different priority levels depending upon the amount of data in its input buffer. Distribution priority is then given to the processors executing the commands which take the least amount of time to process, as indicated by the status of the input buffers.
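The three distribution cases described above (global, arbitrated and processor-specific) may be sketched as follows. This is an illustrative model only: the Command and Processor types and the distribute function are hypothetical, and input buffer occupancy is assumed to stand in for the priority levels on which the processors request commands.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Command:
    name: str
    kind: str                     # "global", "arbitrated" or "specific"
    target: Optional[int] = None  # used only when kind == "specific"


@dataclass
class Processor:
    ident: int
    input_buffer: deque = field(default_factory=deque)


def distribute(cmd, processors):
    """Route one command and return the ids of the receiving processors."""
    if cmd.kind == "global":
        chosen = list(processors)            # sent to every processor
    elif cmd.kind == "specific":
        chosen = [processors[cmd.target]]    # named within the command
    else:
        # Arbitration: assume the emptiest input buffer is "most ready".
        chosen = [min(processors, key=lambda p: len(p.input_buffer))]
    for p in chosen:
        p.input_buffer.append(cmd.name)
    return [p.ident for p in chosen]


procs = [Processor(i) for i in range(3)]
r_global = distribute(Command("set_color", "global"), procs)
r_specific = distribute(Command("load_matrix", "specific", target=2), procs)
r_first = distribute(Command("triangle_a", "arbitrated"), procs)
r_second = distribute(Command("triangle_b", "arbitrated"), procs)
```

In this model a processor that drains its buffer quickly is offered more work, which is the effect attributed to the arbitration mechanism.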
Sequencing is generally maintained in the system of Torborg by maintaining a small tag FIFO in each processor for keeping track of the ordering of all sequential commands being processed by all arithmetic processors and all commands being processed by the particular processor containing the tag FIFO. A two-bit entry in the tag FIFO is used by Torborg to indicate whether the command is being processed by the particular arithmetic processor containing the tag FIFO and whether the command is a sequential command. The output of the tag FIFO is used to ensure sequential processing of all order-dependent commands and to control the order in which the processed data is transferred to an image memory unit for subsequent display. In particular, the output of the tag FIFO is used to control arbitration on the bus by which the parallel graphics processing units are connected to the image memory unit. For example, if the two control bits of the tag FIFO indicate that the command is not sequential, then the output controller will request the bus as soon as a complete command block is available. In this case, the order in which commands are transferred to the image memory unit will depend on the processor load and command distribution of all processors. The tag FIFO output will be clocked after each command group associated with a single graphics input command is transferred over the bus to the image memory unit.
However, if the tag FIFO indicates a sequential command, the output controller will wait until all other arithmetic processor output controllers have reached the same point in the original command stream for purposes of synchronization. In other words, the output controller will wait until all arithmetic processors reach the entry in their tag FIFOs corresponding to a particular sequential command, thereby synchronizing the arithmetic processors, before the sequential command is output for processing. Since every processor in the system of Torborg places an entry into its tag FIFO for every sequential command (even if the command is not executed by that processor), all processors' tag FIFOs will indicate a sequential command but only one will indicate that the command was processed by that processor. The processor which has processed the sequential command will then request the bus and send the group of commands associated with the graphics input command to the image memory unit. Once this command transfer has completed, the tag FIFO output on all arithmetic processors will be clocked. As noted above, Torborg indicates that processing efficiency is maintained in this system since the processor core can still continue to transfer commands into the output FIFO of the processors while the output controller synchronizes all of the processors to maintain sequentiality.
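The tag FIFO behavior described above may be modeled by the following simplified sketch. It rests on several assumptions: the function names build_tag_fifos and drain are hypothetical, each tag carries the command name purely so that the simulation can report output order (the tag described above holds only two bits), and bus arbitration among non-sequential commands is reduced to a fixed scan over the processors.

```python
from collections import deque


def build_tag_fifos(stream, num_procs):
    """stream holds (command, owner, sequential) tuples in input order.
    A sequential command places a tag in every processor's FIFO; any
    other command places a tag only in the owning processor's FIFO.
    Each tag is (processed_here, sequential, command)."""
    fifos = [deque() for _ in range(num_procs)]
    for cmd, owner, sequential in stream:
        for p in range(num_procs):
            if sequential:
                fifos[p].append((p == owner, True, cmd))
            elif p == owner:
                fifos[p].append((True, False, cmd))
    return fifos


def drain(fifos):
    """Return the order in which commands reach the image memory unit."""
    out = []
    while any(fifos):
        progressed = False
        # Non-sequential heads request the bus as soon as they are ready.
        for f in fifos:
            while f and not f[0][1]:
                out.append(f.popleft()[2])
                progressed = True
        # A sequential head waits until every FIFO shows a sequential head;
        # only the owning processor transfers, then all FIFOs are clocked.
        if all(f and f[0][1] for f in fifos):
            heads = [f.popleft() for f in fifos]
            out.extend(cmd for mine, _, cmd in heads if mine)
            progressed = True
        if not progressed:
            raise RuntimeError("tag FIFOs are inconsistent")
    return out


stream = [("a", 0, False), ("b", 1, False), ("S", 0, True),
          ("c", 1, False), ("d", 0, False)]
order = drain(build_tag_fifos(stream, num_procs=2))
```

In this model the sequential command S keeps its position relative to the commands before and after it, while the non-sequential commands on either side may emerge in an order different from the input order.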
Thus, in the system of Torborg, a mechanism is provided for assigning sequential commands to a plurality of parallel processors and then recombining the processed outputs into the original ordering using values stored in a tag FIFO for output bus arbitration. However, the tag values must be maintained for each processor independently, and sequentiality must be maintained by synchronizing the outputs of all processors when a sequential instruction is received. As a result, processing efficiency can become quite low, especially when several global (sequential) instructions are received in succession. Moreover, the command stream cannot be reordered to allow maximum use of parallel processing capabilities. Thus, when complex input graphics primitives such as B-spline patches, which have a potentially long execution time compared with their input buffer size, are received, processing inefficiency may be compounded.
Prior art data processing systems are also inefficient in that it is quite difficult to insert a command/data packet into the data stream after it has been resequenced since, as noted above, prior art concentrators are typically time based and must be resynchronized to allow command/data packet insertion. To avoid this problem, prior art concentrators instead pass repetitious data into the data stream without providing direct pipeline access. Alternatively, prior art concentrators allow the various inputs to communicate with each other to determine whether downstream data attributes need to be changed. If so, an additional command/data packet containing the desired attributes is input into the system, processed by one of the parallel processors and sent downstream. Such prior art techniques are ineffective and quite complicated.
As noted above, concentrators enable the processing system to send data from one or more sources to one of several outputs. Typically, a plurality of data stages are provided for the switching. However, such prior art concentrators generally do not allow the input data streams to be routed by a single command specifying one source and one or more destinations, for the sources and destinations are generally chosen for their availability. Moreover, such prior art systems typically do not concurrently evaluate whether desired data source devices and data destination devices are ready for data transfer. As a result, further processing efficiency is lost.
Accordingly, there is a long-felt need for a concentrator which can merge a plurality of data streams from a plurality of inputs so as to form a command/data pipeline as well as permit command/data packet insertion and source/destination designation without adversely affecting the efficiency of the processing system. The concentrator of the invention has been designed to meet these needs.