The present invention relates to the field of special purpose processing units for use in computer systems and more specifically to communication processors for communicating between individual computers in a multicomputer data processing system.
There have been numerous attempts to improve on the basic von Neumann computer architecture. The von Neumann design consists of a central processing unit which is coupled to a memory. The central processing unit is responsible for carrying out the various calculations specified by a program which is stored in the memory. The data used in these calculations is also stored in the memory. The memory consists of a plurality of storage slots, referred to as words. The central processing unit itself has a very small storage capacity. Typically, the central processing unit fetches the next instruction to be executed from the memory, then fetches any data required which is not already in the central processing unit, executes the instruction in question, and stores the result back in the memory. The basic von Neumann system is limited in speed by the speed of the central processing unit.
One prior art solution to the speed limitations of the basic von Neumann design involves connecting multiple processing units to the same memory unit. Each of the processing units is connected to a common bus which links the processing units to the memory. Each processing unit runs independently of the others. Some form of bus arbitration is used to resolve conflicts between two processing units which seek to control the bus at the same time for the purpose of accessing the memory. The program to be executed by the system is broken into a number of subprograms, each of which is executed by one of the processing units. The ability to improve the speed of the system through this form of concurrent processing is limited by the need to use a common bus to link the memory to each processing unit. If a given processing unit must have access to the memory for 1 clock cycle to obtain an instruction and data needed to keep it busy for 10 clock cycles, then no more than 10 processing units can be productively connected to the bus. Since the individual processing units have very little internal storage capacity, the ratio of memory access cycles to computational cycles is quite large in this type of system.
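The bus limit described above can be illustrated with a minimal sketch. It follows the simplified accounting of the example, in which each processing unit occupies the bus for a fixed number of clock cycles in order to obtain the work that keeps it busy for a fixed number of clock cycles. The function names max_productive_units and bus_utilization are hypothetical and are introduced here only for illustration.

```python
def max_productive_units(access_cycles: int, busy_cycles: int) -> int:
    """Upper bound on the processing units a shared bus can keep busy.

    Assumption (taken from the example in the text): each unit holds the
    bus for `access_cycles` cycles to obtain work lasting `busy_cycles`.
    """
    if access_cycles <= 0 or busy_cycles <= 0:
        raise ValueError("cycle counts must be positive")
    return busy_cycles // access_cycles


def bus_utilization(units: int, access_cycles: int, busy_cycles: int) -> float:
    """Fraction of bus time demanded by `units` processing units.

    A value of 1.0 or more means the bus is saturated, so adding further
    units cannot increase throughput.
    """
    return units * access_cycles / busy_cycles


# The case in the text: 1 access cycle per 10 cycles of computation.
print(max_productive_units(1, 10))   # 10
print(bus_utilization(10, 1, 10))    # 1.0, i.e. the bus is saturated
```

Under this accounting, a larger ratio of access cycles to computational cycles lowers the bound, which is why processing units with little internal storage make poor use of a shared bus.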
One prior art solution to the speed limitations of the above-described shared bus system is to provide each processing unit with its own memory. In this design, the various processing units communicate with each other over a separate communication link and with their individual memory units over an internal bus. Again, the program is divided into a number of subprograms which are executed by individual processors. Since the individual processing units include a significant amount of memory, the ratio of the time needed to communicate on the communication link to the time spent in computation without such communication is much smaller than in the shared bus system described above. This can be seen from the following example.
Consider a simple program of 98 instructions which operates on a single data word stored in memory and produces as its result a second data word which must be stored back in the memory. In the exemplary situation where 1000 such data words are to be processed by this program, the memory must be accessed 100 times each time a word is processed. The data word must be fetched from memory. Then the 98 instructions of the program must be fetched, which requires the memory to be accessed 98 more times. Finally, the result must be stored.
In a shared bus system, the number of times memory must be accessed to apply this program to 1000 data words is 100,000. If the processing unit has its own memory, then the communication link need only be used to send the 98 instructions once, in addition to sending the 1000 data words to the processing unit and the 1000 resulting data words back from the processing unit. Hence, the communication link need only be accessed 2098 times for the same computation which required 100,000 accesses in the shared bus system. This allows a larger number of processing units to share the same communication link. However, the number cannot be made arbitrarily large, since sooner or later the ability of the communication link to service all the processing units will become limiting.
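The arithmetic of the foregoing example may be checked with a short sketch. The function names shared_bus_accesses and private_memory_link_transfers are hypothetical; the counts follow the accounting of the example, in which every instruction fetch, data fetch, and result store is one access.

```python
def shared_bus_accesses(instructions: int, words: int) -> int:
    """Memory accesses over the shared bus.

    For each data word: one fetch of the word, one fetch per instruction,
    and one store of the result.
    """
    return words * (1 + instructions + 1)


def private_memory_link_transfers(instructions: int, words: int) -> int:
    """Transfers over the communication link when each unit has its own memory.

    The program is sent once; each data word is sent once and each result
    word is returned once.
    """
    return instructions + words + words


shared = shared_bus_accesses(98, 1000)
linked = private_memory_link_transfers(98, 1000)
print(shared)                     # 100000
print(linked)                     # 2098
print(round(shared / linked, 1))  # 47.7
```

In this example the communication link carries roughly one forty-eighth of the traffic of the shared bus, and the saving grows with the number of data words processed by the same program.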
When the communication link becomes limiting, another level of communication link must be established to form a pyramid-like architecture. Two or more systems of processing units, referred to as clusters, can be combined by providing a "super communication link processor" which is used to communicate tasks to each cluster which in turn communicates the tasks over its internal communication link to the individual processing units. This solution to the communication link overload has several problems. First, the super communication link processor can only handle a few clusters. This can be seen as follows: the super communication link has no greater capacity than one of the individual buses in a cluster, since if it were possible to make a super communication link with a greater capacity, that design could also be used in each cluster. Consider the case in which each processor in a cluster receives its work only through the super communication link. The number of processors in the cluster is chosen such that the cluster bus is working at capacity; that is, it is saturated by the communication tasks needed to receive the work for each processor from the super communication link and to return the results of those tasks through the super communication link. But each piece of this data had to come from the super communication link; hence it also must be saturated by the load needed to service this one cluster. Thus, in this case, the super communication link can service only one cluster. This results from the assumption that each processor in the cluster receives its work only from the super communication link. Hence, for a super communication link to service more than one cluster, each cluster must generate and "consume" most of the communication traffic on its internal bus. This requirement limits the use of such pyramid architectures. 
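The saturation argument above can be sketched numerically. The sketch assumes, as the argument does, that the super communication link has the same capacity as one cluster bus and that each cluster bus is saturated. It further assumes, as an illustrative generalization not stated in the text, that a fixed fraction of each cluster's bus traffic passes through the super communication link. The function name max_clusters and the parameter external_fraction are hypothetical.

```python
from fractions import Fraction
from math import floor


def max_clusters(external_fraction: Fraction) -> int:
    """Upper bound on the clusters one super communication link can serve.

    Assumptions: the super link has the capacity of a single cluster bus,
    every cluster bus is saturated, and `external_fraction` of each
    cluster's bus traffic must cross the super link.
    """
    if not 0 < external_fraction <= 1:
        raise ValueError("external_fraction must be in the interval (0, 1]")
    return floor(1 / external_fraction)


# All work arrives through the super link: only one cluster can be served.
print(max_clusters(Fraction(1, 1)))   # 1
# Three quarters of the traffic stays inside each cluster.
print(max_clusters(Fraction(1, 4)))   # 4
```

Exact fractions are used rather than floating point values so that the bound is not disturbed by rounding.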
The large improvement obtained by including a memory with each processing unit has no analogous improvement at the super communication link processor level. This earlier described improvement was the result of removing the need to repetitively transfer the programs to each processing unit over a common memory bus shared by other processing units. Once each of the individual processing units has sufficient memory to store the program and any data which would be repetitively transferred, no further large improvements in the density of communication on the communication link are possible using this type of pyramid architecture.
A second problem inherent in the pyramided cluster approach is the need to introduce a new type of processor, the super communication link processor, as the system is expanded. VLSI fabrication techniques have greatly reduced the cost of highly repetitive functional elements such as those used to construct the individual processing units and memories within a cluster. However, the cost of low volume parts used in the super communication link processor can be quite high. The additional level of hardware also leads to an additional level of complexity in the software needed to drive the system. The software must now manage the division of the problem being solved into large pieces to be sent to each cluster as well as into smaller pieces which are to be allocated to each processing unit within a cluster.
Third, each super communication link processor is a potential communication bottleneck. Consider a situation in which two clusters attached to the same super communication link processor must exchange a large volume of information, referred to as messages. This exchange can occupy so much super communication link processor time that there is no time remaining for transmitting messages between other clusters attached to that super communication link processor. This can result in the other clusters running out of work and standing idle, which reduces the system throughput. To avoid this type of problem, a means of rerouting messages through alternate super communication link processors which are not saturated is needed. It is difficult to construct a convenient structure for providing such alternate routing in this type of pyramid architecture.
Finally, this type of pyramid structure is not sufficiently fault tolerant. As the number of processing units in a system is increased, the probability that at least one processing unit must be placed off-line due to malfunction increases. If the processing unit in question is a super communication link processor, all of the clusters it services must also be taken out of service.
Consequently, it is an object of the present invention to provide an improved communication processor and architecture for communicating between processing units in a multiprocessor system.
It is a further object of the present invention to provide a communication processor and communication architecture which may be used to construct multiprocessor systems of arbitrarily large size without the addition of new communication components.
It is a still further object of the present invention to provide a communication network which automatically routes messages around bottlenecks.
It is yet another object of the present invention to provide a communication processor which is fault tolerant.
These and other objects of the present invention will become apparent from the following detailed description of the present invention and the accompanying drawings.