1. Field of the Invention
The present invention relates to the field of parallel processing computer systems.
2. Prior Art
A number of parallel processing computer systems are well known in the prior art. Generally, in such systems a large number of processors are interconnected in a network. In such networks each of the processors may execute instructions in parallel. In general, such parallel processing computer systems may be divided into two categories: (1) a single instruction stream, multiple data stream (SIMD) system and (2) a multiple instruction stream, multiple data stream (MIMD) system. In a SIMD system, each of the plurality of processors simultaneously executes the same instruction on different data. In a MIMD system, each of the plurality of processors may simultaneously execute a different instruction on different data.
In either a SIMD or a MIMD system, some means is required to allow communication between processors in the computer system. In such systems it is known to logically organize processors in an n-cube. A discussion of such n-cube systems may be found in Herbert Sullivan and T. R. Bashkow, A Large Scale Homogeneous, Fully Distributed Parallel Machine, Proceedings of the 4th Annual Symposium on Computer Architecture, pp. 105-117, 1977. Sullivan et al. discusses a number of interconnection structures, including connection of processors on a boolean n-cube. The described boolean n-cube is an interconnection of N (N=2^n) processors which may be thought of as being placed at the corners of an n-dimensional cube. Sullivan et al. discloses that the location of a processor may be described by designating one processor as the origin with a binary address of (0,0, . . . 0) of n bits. Other processors may then have their locations expressed as an n bit binary number in which each bit position is regarded as a coordinate along one of the n dimensions. In such a system, when one processor is directly linked to another, their addresses will differ in just one bit. The position of this bit indicates the direction in n-space along which communication between the processors takes place. Thus, the address of one processor with respect to a neighboring processor differs by only one bit.
Sullivan et al. describes that in such a system a relative address may be computed by taking the bit-by-bit sum (modulo 2) of the addresses of two processors. This bit-by-bit summation is the equivalent of taking an exclusive OR of the two addresses. The number of non-zero bits in the resulting relative address represents the number of links which must be traversed to get from one processor to another.
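The relative-address computation described above may be illustrated by the following minimal sketch. The function names and the 3-cube example are illustrative assumptions and do not appear in Sullivan et al.; node addresses are assumed to be held as non-negative integers whose n low-order bits are the cube coordinates.

```python
def relative_address(src: int, dst: int) -> int:
    """Bit-by-bit sum (modulo 2) of two node addresses, i.e. their exclusive OR."""
    return src ^ dst


def hop_count(src: int, dst: int) -> int:
    """Number of non-zero bits in the relative address (links to traverse)."""
    return bin(relative_address(src, dst)).count("1")


def neighbors(node: int, n: int) -> list[int]:
    """Addresses differing from `node` in exactly one of the n bit positions."""
    return [node ^ (1 << dim) for dim in range(n)]


if __name__ == "__main__":
    n = 3  # hypothetical 3-cube with 2**3 = 8 processors
    src, dst = 0b000, 0b101
    print(format(relative_address(src, dst), f"0{n}b"))
    print(hop_count(src, dst))
    print([format(a, f"0{n}b") for a in neighbors(src, n)])
```

In this sketch, two directly linked processors always have a hop count of one, consistent with their addresses differing in a single bit.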
U.S. Pat. No. 4,598,400 to Hillis describes a similar n-cube parallel processing computer system in which an array of nodes is interconnected in a pattern of two or more dimensions and communication between the nodes is directed by addresses indicating displacement of the nodes. Hillis specifically discloses a system in which a message packet may be routed from one node to another in an n-cube network. The message packet comprises relative address information and information to be communicated between the nodes.
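By way of illustration only, routing by relative address in an n-cube may be sketched as follows. The order in which dimensions are traversed (lowest-order non-zero bit first) and the function name `route` are assumptions made for this example; the text above does not specify the order used in the Hillis system.

```python
def route(src: int, dst: int) -> list[int]:
    """Return the node addresses visited when a message travels from src to dst.

    Assumption: at each node the message crosses the dimension given by the
    lowest-order non-zero bit of its remaining relative address.
    """
    path = [src]
    node = src
    rel = src ^ dst          # relative address carried with the message
    while rel:
        lowest = rel & -rel  # isolate the lowest-order non-zero bit
        node ^= lowest       # move to the neighbor along that dimension
        rel ^= lowest        # that displacement has now been consumed
        path.append(node)
    return path


if __name__ == "__main__":
    print([format(a, "03b") for a in route(0b000, 0b101)])
```

The message reaches its destination when the remaining relative address is all zeros, and the number of links traversed equals the number of non-zero bits in the initial relative address.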
Many known parallel processing computer systems utilize a store-and-forward mechanism for communicating messages from one node to another. The Hillis system describes such a store-and-forward mechanism. Such store-and-forward mechanisms are more clearly described in Parviz Kermani and Leonard Kleinrock, Virtual Cut-Through: A New Computer Communication Switching Technique, Computer Networks, Vol. 3, 1979, pp. 267-286. Kermani et al. distinguishes store-and-forward systems from circuit switching systems. Specifically, a circuit switching system is described as a system in which a complete route for communication between two nodes is set up before communication begins. The communication route is then tied up during the entire period of communication between the two nodes. In store-and-forward (or message) switching systems, messages are routed to a destination node without establishing a route beforehand. In such systems, the route is established dynamically during communication of the message, generally based on address information in the message. Generally, messages are stored at intermediate nodes before being forwarded to a selected next node. Kermani et al. further discusses the idea of packet switching systems. Packet switching recognizes that improved utilization of resources and reduced network delay may be realized in some network systems by dividing a message into smaller units termed packets. In such systems, each packet (rather than each message) carries its own addressing information.
Kermani et al. observes that extra delay is incurred in known systems because a message (or packet) is not permitted to be transmitted from one node to the next before the message is completely received. Therefore, Kermani et al. discloses an idea termed "virtual cut-through" for establishing a communication route. The virtual cut-through system is a hybrid of circuit switching and packet switching techniques in which a message may begin transmission on an outgoing channel upon receipt of routing information in the message packet and selection of an outgoing channel. In the case of all intermediate channels being busy, this system leads to throughput times exactly the same as in a store-and-forward system. In the case of all intermediate nodes being idle, this system leads to throughput times similar to those of a circuit switched system. However, the system disclosed by Kermani et al. still requires sufficient buffering to allow an entire message to be stored at each node in the case of channels being busy.
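The two limiting cases described above may be illustrated by a simplified delay model. This is a sketch under stated assumptions rather than the analysis of Kermani et al.: queueing time while waiting for a busy channel is ignored, every channel has the same speed, and the names `message_time` and `header_time` are hypothetical parameters introduced for this example.

```python
def store_and_forward_delay(intermediate_nodes: int, message_time: float) -> float:
    """Each of the (intermediate_nodes + 1) channels carries the whole message in turn."""
    return (intermediate_nodes + 1) * message_time


def cut_through_delay(channel_busy: list[bool], message_time: float,
                      header_time: float) -> float:
    """Simplified virtual cut-through delay.

    channel_busy holds one flag per intermediate node: True if the selected
    outgoing channel is busy when the routing information arrives.
    Assumption: a busy node buffers the entire message (costing message_time),
    while an idle node adds only the time needed to receive the routing
    information (header_time).
    """
    delay = message_time  # the message must still be fully delivered once
    for busy in channel_busy:
        delay += message_time if busy else header_time
    return delay


if __name__ == "__main__":
    print(cut_through_delay([True, True, True], 100, 4))     # all channels busy
    print(cut_through_delay([False, False, False], 100, 4))  # all nodes idle
    print(store_and_forward_delay(3, 100))
```

Under these assumptions, the all-busy case reproduces the store-and-forward delay, and the all-idle case adds only a small per-node header cost to a single message transmission time.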
W. J. Dally, A VLSI Architecture for Concurrent Data Structures, Ph.D. Thesis, Department of Computer Science, California Institute of Technology, Technical Report 5209, March 1986, discusses a message-passing concurrent architecture to achieve a reduced message passing latency. In Chapter 3, Dally discusses a balanced binary n-cube architecture.
In Chapter 5, Dally discusses an application for reducing message latency. In general, Dally discloses use of a wormhole routing method, rather than a store-and-forward method. A wormhole routing method is characterized by a node beginning to forward each byte of a message to the next node as the bytes of the message arrive, rather than waiting for the arrival of the entire packet before beginning transmission to the next node. Wormhole routing thus results in message latency which is the sum of two terms, one of which depends on the message length L and the other of which depends on the number of communications channels traversed D. Store-and-forward routing yields latency which depends on the product of L and D. (See Dally at page 153).
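The difference between the sum and product forms of latency may be illustrated numerically. The following sketch assumes unit cost per byte per channel and unit delay per channel traversed; these constants, and the function names, are assumptions chosen for illustration and are not taken from Dally.

```python
def store_and_forward_latency(length: int, hops: int, byte_time: float = 1.0) -> float:
    """Latency proportional to the product of message length L and channels traversed D."""
    return length * hops * byte_time


def wormhole_latency(length: int, hops: int, byte_time: float = 1.0,
                     hop_delay: float = 1.0) -> float:
    """Latency as the sum of a term in L and a term in D."""
    return length * byte_time + hops * hop_delay


if __name__ == "__main__":
    L, D = 64, 6  # hypothetical 64-byte message crossing 6 channels
    print(store_and_forward_latency(L, D))
    print(wormhole_latency(L, D))
```

With these illustrative constants, the store-and-forward figure grows with every additional channel by a full message time, whereas the wormhole figure grows only by the per-channel delay.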
A further advantage of a wormhole routing method is that communications do not use up the memory bandwidth of intermediate nodes. In the Dally system, packets do not interact with the processor or memory of intermediate nodes along the route, but rather remain strictly within a routing chip network until they reach their destination.
However, Dally discloses a self-timed system, permitting each processing node to operate at its own rate with no global synchronization. (See Dally at page 153).
Dally at pages 154-157 further discloses a message packet comprising relative X and Y address fields, a variable size data field comprising a plurality of non-zero data bytes, and a tail byte.
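A packet of the general form described above may be sketched as follows. The field widths (one byte each for the relative X and Y addresses) and the use of a zero byte as the tail marker are assumptions made for this example; the description above states only that the data bytes are non-zero and that the packet ends with a tail byte.

```python
TAIL = 0x00  # assumption: the tail byte is zero, so no data byte may be zero


def encode_packet(rel_x: int, rel_y: int, data: bytes) -> bytes:
    """Build a packet: relative X, relative Y, non-zero data bytes, tail byte."""
    if not (0 <= rel_x <= 255 and 0 <= rel_y <= 255):
        raise ValueError("address fields are assumed to be one byte each")
    if TAIL in data:
        raise ValueError("data bytes must be non-zero")
    return bytes([rel_x, rel_y]) + data + bytes([TAIL])


def decode_packet(packet: bytes) -> tuple[int, int, bytes]:
    """Split a packet into (rel_x, rel_y, data); the tail byte ends the data field."""
    if len(packet) < 3:
        raise ValueError("packet too short")
    end = packet.index(TAIL, 2)  # first tail byte after the two address fields
    return packet[0], packet[1], packet[2:end]


if __name__ == "__main__":
    pkt = encode_packet(3, 1, b"\x10\x20\x30")
    print(pkt.hex())
    print(decode_packet(pkt))
```

Because the data field is delimited by the tail byte rather than by a length count, a node may forward data bytes as they arrive without knowing the packet length in advance.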
It is desired to develop an improved method of communication between nodes in a parallel processing computer system.
As another object of the present invention, it is desired to develop a parallel processing computer system having reduced message passing latency and increased node-to-node channel bandwidth.
As another object of the present invention, it is desired to develop a system which efficiently passes messages without requiring buffering for message packets at each node.
As another object of the present invention, it is desired to develop a system in which data communicated within a system is controlled by a clock communicated with the data.