1. Field of the Invention
The present invention relates generally to microprocessors and more particularly relates to data driven processors.
2. Description of the Relevant Art
Parallel computer architectures and programs are the subject of intensive research and development. One particularly important system under investigation is the data driven processor. In this architecture, a program is represented as a directed graph including a collection of points, called nodes, connected by arcs. The nodes represent program operations and the arcs depict the flow of data through the program. This architecture eliminates some of the most significant dataflow bottlenecks plaguing parallel architectures.
Investigations by researchers have shown that data flow bottlenecks caused by the data transfer rate of certain datapaths can degrade the performance of a data driven processor. These bottlenecks are described in an article by Yoshida et al. entitled "A Study of System Composition of Data Driven Microprocessor(2)" at pp. 239 and 240 of the papers of the 34th National Convention of the Information Processing Society of Japan, March 1987.
Before discussing a particular data flow problem, a brief overview of the operation of a data driven processor will be presented with reference to FIGS. 1-4.
FIG. 1 is a schematic diagram of a data flow program showing the nodes 10 and arcs 12 of the program. In FIG. 1, data packet P1 is received at node 40, copied, and transferred to nodes 42 and 43, and data packet P2 is received at node 41, copied, and transferred to nodes 43 and 44. At node 43 a processing operation is performed that utilizes P1 and P2 as operands and generates P6 as a result. Data packet P6 is received at node 45, copied, and transferred to nodes 49 and 50.
In this example, each data packet is copied twice and transferred to two nodes. In a program having a higher degree of parallelism, each data packet would be copied and transferred to more than two nodes.
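As an illustration only, the portion of the FIG. 1 program described above can be written down as a small adjacency map. The sketch below is in Python; the dictionary layout and the helper names fan_out and operand_sources are assumptions made for this example, and only the arcs explicitly described in the text are included.

```python
# Arcs of the FIG. 1 program that the text describes explicitly.
# Keys are source nodes; values are the nodes that receive the packet.
ARCS = {
    40: [42, 43],  # P1 is copied and sent to nodes 42 and 43
    41: [43, 44],  # P2 is copied and sent to nodes 43 and 44
    43: [45],      # node 43 combines P1 and P2 into P6, received at node 45
    45: [49, 50],  # P6 is copied and sent to nodes 49 and 50
}


def fan_out(node):
    """Number of packets leaving a node (hypothetical helper)."""
    return len(ARCS.get(node, []))


def operand_sources(node):
    """Nodes whose packets arrive at the given node (hypothetical helper)."""
    return sorted(src for src, dsts in ARCS.items() if node in dsts)


# Nodes that copy a packet to more than one destination.
copy_nodes = [n for n in ARCS if fan_out(n) > 1]
```

In this representation, a program with a higher degree of parallelism is simply one in which some entries of ARCS list more than two destination nodes.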
FIG. 2 is a block diagram of a prior art embodiment of a data driven processor. In FIG. 2, first and second program storage units 30 and 32 are coupled to the inputs of a pair detecting unit 34 by first and second buses 36 and 38. Two outputs of the pair detecting unit 34 are coupled to the input of a processing unit 40 by a first datapath 42. The output of the processing unit 40 is coupled to the inputs of the first and second program storage units 30 and 32 by a second datapath 44.
FIG. 3 is a diagram of the structure of the data packets received and processed by the various structural elements of FIG. 2 during the processing at nodes 40, 41 and 43 of the program depicted in FIG. 1.
In FIG. 3, each of the processed packets PP1 50 and PP2 52 includes a data field holding one data word, W1 and W2 respectively, a destination field designating the nodes requiring the included data field for processing, and an instruction field encoding the processing to be executed at the designated nodes. In the program of FIG. 1, the destination field of PP1 50 designates nodes 42 and 43 and the destination field of PP2 52 designates nodes 43 and 44.
In this example, the copying operation at node 40 is executed in the first program storage unit 30 and the copying operation at node 41 is executed in the second program storage unit 32. For PP1, first and second PSU packets 54 and 56 are generated, one for each destination node designated in the destination field of PP1. The data word stored in PP1 is copied to each of these PSU packets 54 and 56. Similarly, the data word stored in PP2 is copied to third and fourth PSU packets 58 and 60 in the second program storage unit 32.
These PSU data packets 54 through 60 are transferred to the pair detecting unit 34 on buses 36 and 38. The pair detecting unit 34 queues the received PSU packets and identifies pairs of PSU packets having the same node designated in their destination fields. A paired packet 62 is then generated including the words in the data fields of the identified pair. For example, in FIG. 3, the first and fourth PSU packets 54 and 60, designating node 43, are paired and W1 and W2 are stored in the paired packet 62. Note that the width of the paired packet 62 is greater than the width of the processed packet 50 and PSU packet 54 because two data words are stored in its data field.
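The copying operation described above can be sketched in a few lines of Python. This is a minimal illustration, not the circuit of FIG. 2: the class names, the field names, the placeholder instruction string "op", and the data word value 7 are all assumptions introduced for the example, and a single instruction is assumed to apply to every destination.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ProcessedPacket:
    destinations: Tuple[int, ...]  # nodes that need the data word
    instruction: str               # processing to run at those nodes (placeholder)
    word: int                      # the single data word


@dataclass(frozen=True)
class PSUPacket:
    destination: int               # exactly one destination node
    instruction: str
    word: int


def copy_in_psu(pp):
    """Emit one PSU packet per designated destination, each carrying the same word."""
    return [PSUPacket(d, pp.instruction, pp.word) for d in pp.destinations]


# PP1 designates nodes 42 and 43, so two PSU packets are produced.
pp1 = ProcessedPacket(destinations=(42, 43), instruction="op", word=7)
psu_packets = copy_in_psu(pp1)
```

The number of PSU packets produced equals the number of destinations in the processed packet, which is the quantity that matters for the packet rates discussed later in this section.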
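The queue-and-match behavior of the pair detecting unit can be sketched as follows. This is a simplified software model under stated assumptions: packets arrive as one sequential stream rather than on two buses, every destination node takes exactly two operands, and the names PSUPacket, PairedPacket, and detect_pairs are hypothetical.

```python
from collections import namedtuple

PSUPacket = namedtuple("PSUPacket", "destination word")
PairedPacket = namedtuple("PairedPacket", "destination words")


def detect_pairs(psu_packets):
    """Hold each packet until a second packet for the same node arrives."""
    waiting = {}  # destination node -> packet still waiting for a partner
    paired = []
    for pkt in psu_packets:
        partner = waiting.pop(pkt.destination, None)
        if partner is None:
            waiting[pkt.destination] = pkt
        else:
            # The paired packet carries both data words, so it is wider
            # than either PSU packet.
            paired.append(PairedPacket(pkt.destination, (partner.word, pkt.word)))
    return paired, waiting


stream = [
    PSUPacket(42, "W1"), PSUPacket(43, "W1"),  # copies of the word from PP1
    PSUPacket(43, "W2"), PSUPacket(44, "W2"),  # copies of the word from PP2
]
paired, waiting = detect_pairs(stream)
```

In this model only the two packets designating node 43 are paired; the packets for nodes 42 and 44 remain queued until their partners arrive.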
The paired packet 62 is transferred to the processing unit 40 on datapath 42. The processing unit 40 then performs the processing of node 43 as encoded by the instruction field of the paired packet 62 on the operands W1 and W2 stored in the data field of the paired packet 62. A resulting data field W6 is generated as the result of the processing at node 43, with W6 destined for processing at nodes 49 and 50 after copying at node 45. Accordingly, destination nodes 49 and 50 are designated in the destination field of a processed packet 64 generated in the processing unit 40 upon completion of the processing at node 43. Further, the processing to be executed at nodes 49 and 50 is encoded in the instruction field and W6 is included in the data field of the processed packet 64. This processed packet is then transferred to one of the program storage units 30 or 32 on datapath 44. The copying operation at node 45 is then executed in the program storage unit as described above.
The flow of packets on the first and second datapaths 42 and 44 will now be described. The processing unit 40 is designed to process packets at the optimal rate Fmax. Accordingly, paired packets should arrive on the first datapath 42 and exit on the second datapath 44 at the rate Fmax. Because the second datapath 44 supplies two program storage units, the rate of receipt of packets at each storage unit is Fmax/2. However, in the present example, two PSU packets are generated for each received processed packet. Accordingly, the output rate from each of the program storage units 30 and 32 is Fmax. When the pair detecting unit 34 receives PSU packets at the rate Fmax from each program storage unit, it outputs paired packets at about Fmax. Thus, the dataflow throughout the system is balanced.
However, for a program having a higher degree of parallelism, i.e., where a processed packet designates more than two destination nodes, the copying operation in the program storage unit would generate more than two PSU packets for a received processed packet. Thus, the output rate of the program storage unit 30 would exceed Fmax and cause the rate of pair generation to exceed Fmax. As stated above, the processing rate of the processing unit cannot exceed Fmax; therefore, the processing rate at the program storage unit must be slowed or stopped. However, if this rate is slowed, then the output from the processing unit 40 must be slowed to keep from overloading the program storage unit. This circular effect results in a bottleneck at the first datapath 42.
In view of the above, it is apparent that the described dataflow bottleneck limits the degree of parallelism that can be effectively achieved in a data driven processor. A solution to this problem is urgently needed. Further, to achieve large scale integration, it is desired to solve such bottleneck problems in a manner that does not add excessive hardware to the system.
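The step performed in the processing unit can be illustrated with a short Python sketch. The text does not state which operation node 43 performs or how the next destinations are looked up, so the instruction names "add" and "mul", the operand values, and the parameters next_destinations and next_instruction are all assumptions made purely for illustration.

```python
import operator
from collections import namedtuple

PairedPacket = namedtuple("PairedPacket", "instruction words")
ProcessedPacket = namedtuple("ProcessedPacket", "destinations instruction word")

# Hypothetical instruction set; the source does not say what node 43 computes.
OPS = {"add": operator.add, "mul": operator.mul}


def process(paired, next_destinations, next_instruction):
    """Apply the encoded instruction to both operands and build the result packet."""
    result = OPS[paired.instruction](*paired.words)
    # The result packet names where the result goes next and what to do there.
    return ProcessedPacket(tuple(next_destinations), next_instruction, result)


# Node 43 with placeholder operands W1 = 3 and W2 = 4; the result W6 is
# destined for nodes 49 and 50.
pp_out = process(PairedPacket("add", (3, 4)), (49, 50), "mul")
```

The output packet has the same shape as the processed packets PP1 and PP2, which is what allows it to be fed back to a program storage unit for copying.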
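The rate argument above reduces to simple arithmetic, sketched below in Python. The function name, its parameters, and the value chosen for Fmax are assumptions for this example; the model further assumes a uniform fan-out for every processed packet and that every PSU packet eventually finds a partner.

```python
def steady_state_rates(f_max, fan_out, num_psus=2):
    """Packet rates implied by the text for a uniform fan-out per processed packet."""
    psu_in = f_max / num_psus            # processed packets arriving at each PSU
    psu_out = psu_in * fan_out           # PSU packets leaving each PSU
    pair_rate = psu_out * num_psus / 2   # two PSU packets form one paired packet
    return {
        "psu_in": psu_in,
        "psu_out": psu_out,
        "pair_rate": pair_rate,
        "overload": pair_rate > f_max,   # pairs arrive faster than they can be processed
    }


F_MAX = 100.0  # arbitrary units, chosen only for the example

balanced = steady_state_rates(F_MAX, fan_out=2)
overloaded = steady_state_rates(F_MAX, fan_out=3)
```

With a fan-out of two, the pair rate equals Fmax and the system is balanced. With a fan-out of three, each program storage unit would have to emit 1.5 Fmax and pairs would form at 1.5 Fmax, which exceeds what the processing unit can accept.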