This invention relates to MIMD ("multiple instruction multiple data") parallel processing systems. More particularly, this invention relates to (1) an enhanced hypercube multiprocessor system architecture, and (2) an SPMD ("single program multiple data"), distributed-memory, image multiprocessor system. The image multiprocessor system has (i) a circuit-switched, reconfigurable connection network, and (ii) fault tolerant network synchronization.
The background is organized as follows:
A. Multiprocessor System Architectures
B. Multiprocessor System Connection Techniques
   1. Packet Switching
   2. Circuit Switching
C. Multiprocessor System Fault Tolerant Synchronization
D. Multiprocessor System Cache
E. Multiprocessor System Applications--Image Processing
A. Multiprocessor System Architectures
Multiprocessor systems capable of performing parallel processing are desirable for improving the speed of executing complex tasks. Different architectures are suited for different types of processing. For example, the mesh architecture is suitable for breaking up complex mathematical algorithms into small sub-algorithms. Other architectures implement a pipeline in which each processor is one element in the pipeline. Each processor executes its sub-task over and over, passing its result on to the next processor in the pipeline. The algorithm result is obtained from the output of the last processor or from a combination of outputs from multiple processors.
The n-cube or hypercube architecture has high potential for parallel execution of processing tasks. FIG. 1 shows hypercube architectures for dimensions n=0 through n=3. A hypercube is characterized as having N=2^n "nodes", where n is the number of dimensions to the cube. FIG. 1a shows a hypercube 10 having dimension n=0, and thus, N=2^0=1 node 12. FIG. 1b shows a hypercube 14 having dimension n=1, and thus, N=2^1=2 nodes 16, 18. The nodes 16, 18 are connected by a communication "link" 20. FIG. 1c shows a hypercube 22 having dimension n=2, and thus, N=2^2=4 nodes 24, 26, 28, 30. In general, an n-cube is formed by connecting 2 cubes of dimension n-1. Thus, the 2-cube 22 of FIG. 1c is equal to two 1-cubes 14', 14" connected together. The two 1-cubes 14', 14" are connected by placing a link between corresponding nodes of each cube 14', 14". Thus, a link 32 couples node 24 (16') with node 26 (16"). A link 34 couples node 28 (18') with node 30 (18").
FIG. 1d shows a hypercube 36 having dimension n=3, and thus 2^3=8 nodes. The 3-cube 36 is formed by two 2-cubes 22', 22" with links 38, 40, 42, 44 connecting respective nodes of the two 2-cubes.
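The recursive construction described above can be illustrated with a short sketch. It assumes, for illustration only, that the nodes of an n-cube are labeled with the binary addresses 0 through N-1 (the figures use reference numerals instead), so that two nodes share a link exactly when their addresses differ in one bit. The function names are hypothetical.

```python
def hypercube_links(n):
    """Return the links of an n-cube as pairs (a, b) with a < b.

    Assumes binary node addresses: two nodes are linked when their
    addresses differ in exactly one bit position.
    """
    links = set()
    for node in range(2 ** n):
        for d in range(n):
            neighbor = node ^ (1 << d)  # flip bit d to cross dimension d
            links.add((min(node, neighbor), max(node, neighbor)))
    return links


def build_recursively(n):
    """Build the same links by joining two (n-1)-cubes node for node."""
    if n == 0:
        return set()  # a 0-cube is a single node with no links
    lower = build_recursively(n - 1)
    offset = 2 ** (n - 1)
    # Second copy of the (n-1)-cube, with addresses shifted by the offset.
    upper = {(a + offset, b + offset) for a, b in lower}
    # One new link between each pair of corresponding nodes.
    joining = {(v, v + offset) for v in range(offset)}
    return lower | upper | joining
```

Under these assumptions both functions yield the same link set, and each node of the n-cube has exactly n links, which is the regularity property discussed next.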
The hypercube is well suited for parallel processing tasks due to its regularity, symmetry and strong connectivity. It is regular in that the topology is the same for each node. As shown in FIG. 1c each node has the same number of links. It is symmetrical as apparent from visual inspection, making it suitable for highly concurrent multiprocessing where many processors work simultaneously on one problem. It has strong connectivity in that a direct communication route can be established among nodes.
One of the inventive subject matters described in this application relates to an "enhanced" hypercube architecture. Examples of prior hypercube multiprocessing systems include: (1) The Connection Machine by W. D. Hillis, Cambridge Mass.; MIT Press 1985 and corresponding U.S. Pat. No. 5,008,815; (2) Caltech's The Cosmic Cube by C. L. Seitz, Commun. ACM, vol. 28, pp. 22-33, January 1985; (3) Floating Point Systems T series system, Journal of Parallel Distributed Computing, vol. 3, pp. 297-304, September 1986. Several other hypercube multiprocessor systems also are known including the NCUBE, the Ametek, Intel's iPSC, and Thinking Machines Corporation's CM series of computers.
FIG. 2 shows a generalized folding cube (hypercube) conceived by the above-named inventors A. K. Somani and S. B. Choi and first described at the August 1990 International Conference on Parallel Processing. A 3-dimensional generalized folding cube ("GFC") 50 is shown. Each node 52i of the GFC has 2^p processing elements 54 coupled to the hypercube 50 through a switch module 56i. In addition, instead of one link between each pair of connected nodes 52, there are 2^p links. FIG. 2 depicts a 3-cube GFC with p=1. Thus, there are N=2^n=2^3=8 nodes, 2^p=2^1=2 processing elements per node, and 2^p=2^1=2 links between nodes. The enhanced hypercube of this invention is an improvement over the GFC, offering an efficient network for embedding routing permutations.
B. Multiprocessor System Connection Techniques
The two primary routing techniques employed by multiprocessor connection networks for communication between nodes are packet switching and circuit switching. An inventive subject matter of this application relates to a reconfigurable circuit switching connection network.
Another inventive subject matter relates to a process for determining communication routes (i.e., permutations) in the enhanced hypercube. A routing process implements communication requests between nodes in a multiprocessor system network establishing an appropriate path from every given source node to a respective destination node. The objective of a routing process is to maximize network throughput with minimum cost in terms of path length. Congestion on a communication link is undesirable. To maximize throughput, a routing process provides as many communication paths as possible. To minimize cost, the shortest paths are preferred. A discussion of the two primary routing techniques follows.
In networks implementing packet switching, "packets" of data are queued at intermediate nodes along a route between a source node and a destination node. The packet travels from node to node, releasing links and switching elements immediately after using them. Such "store and forward" operations cause a time delay which can result in significant performance degradation, particularly in the execution of an I/O bound operation. Packet switching typically is a recursive routing process in which each node redirects a message to a next node on a link determined by the exclusive OR (relative address) of a current node address and a destination node address. The routing is very regular and can be used in a centralized or decentralized fashion. This is a shortest-path type process. However, such routing is unsatisfactory because of high blocking probability and the expense of queues and/or buffers assigned to each outgoing link. The high blocking probability is due to the limited queue space available in practice, which eventually causes significant data loss or time delay. A packet switching variation for conventional hypercube architectures is referred to as wormhole routing.
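The exclusive-OR routing step described above can be sketched as follows. The sketch assumes binary node addresses and, as one possible convention not stated in the text, resolves the lowest-order differing dimension first. The function name xor_route is hypothetical.

```python
def xor_route(source, destination):
    """Return the list of nodes visited from source to destination.

    At each node the relative address (current XOR destination) marks the
    dimensions still to be crossed. Assumption: the lowest set bit of the
    relative address is resolved first.
    """
    path = [source]
    current = source
    while current != destination:
        relative = current ^ destination   # bits that still differ
        lowest = relative & -relative      # isolate the lowest set bit
        current ^= lowest                  # cross that dimension
        path.append(current)
    return path
```

For example, routing from node 000 to node 101 in a 3-cube visits nodes 000, 001 and 101. The number of links traversed equals the number of bits in which the two addresses differ, which is why this is a shortest-path process.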
Point to point and broadcast packet switching processes under fault-free and fault-tolerant conditions are described in (1) Optimum broadcasting and personalized communication in hypercubes by S. L. Johnsson and C. T. Ho, IEEE Transactions on Comput., vol. 38, pp. 1249-1268, September 1989; (2) Reliable Broadcast in Hypercube Multicomputers by P. Ramanathan and K. G. Shin, IEEE Transactions on Comput., vol. C-37, pp. 1654-1657, December 1988; (3) A Large Scale, Homogeneous, Fully Distributed Parallel Machine by H. Sullivan and T. R. Bashkow, IEEE Proceedings 4th Annual Symposium on Computer Architecture, pp. 105-117, 1977; (4) A Scheme for Fast Parallel Communication by L. G. Valiant, SIAM Journal of Comput., vol. 11, pp. 350-361, May 1982.
In networks implementing circuit switching, a complete, dedicated path from a source to a destination is established through the network before communication begins. A path is formed by one or more communication links. Alternative paths are called routes. Thus, the path is chosen from the pool of routes. If a free route does not exist from end-to-end, communication traffic is blocked and must wait for later transmission. Previously, the free route requirement has been a significant barrier to the use of circuit switching networks.
The process of selecting paths is called routing. At any given time, the aggregate of selected paths implementing inter-node communication is called a connection network "state" or "configuration". The different states of a connection network are called permutations. Given a system of N nodes, a connection network has N! possible permutations to satisfy in the worst case (i.e., where one-to-one connection requests occur, and each source node requests connection to a unique destination node). The joining of (i.e., communication between) two links is called a "connection". A network is said to be rearrangeable if its permitted states can realize every possible permutation by rearranging existing connections. Rearrangeability is acceptable as long as communications are not interrupted (i.e., there is always a free route).
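The size of the permutation space, and the reason a single fixed path per request is not enough, can be illustrated with a short sketch. It is not the routing process of this application. It assumes binary node addresses, directed links, and one fixed dimension-order path per request (lowest differing bit first); all function names are hypothetical.

```python
from math import factorial


def dimension_order_path(source, destination):
    """Fixed path that crosses differing dimensions lowest bit first."""
    path = [source]
    current = source
    while current != destination:
        relative = current ^ destination
        current ^= relative & -relative  # cross the lowest differing dimension
        path.append(current)
    return path


def is_conflict_free(permutation):
    """True if no directed link is needed by two circuits at once.

    permutation[s] is the destination requested by source node s.
    """
    used = set()
    for source, destination in enumerate(permutation):
        path = dimension_order_path(source, destination)
        for link in zip(path, path[1:]):  # consecutive nodes form a link
            if link in used:
                return False
            used.add(link)
    return True


# An 8-node system has 8! = 40320 one-to-one permutations to satisfy.
permutations_for_8_nodes = factorial(8)
```

Under these assumptions, the permutation [6, 2, 0, 1, 3, 4, 5, 7] on a 3-cube is blocked, because the circuits 0 to 6 and 1 to 2 both need the link from node 0 to node 2. A rearrangeable network avoids such blocking by choosing among alternative routes.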
Rearrangeability for a hypercube having three or fewer dimensions is conceived by T. Szymanski in On the Permutation Capability of a Circuit-Switched Hypercube, Proceedings, 1989 International Conference on Parallel Processing, vol. I, pp. 103-110, August 1989. Szymanski presents a simulation in which the permutation capability of a circuit-switched hypercube is examined. His routing process implements an exhaustive search algorithm (i.e., global backtracking depth first search). If any connection in a permutation cannot be realized after exhaustively searching through every possible routing option, all connections are erased and the searching continues. A problem with this routing process is that it is too slow for high performance parallel processing systems where routing overhead time is to be minimized. In particular, as larger systems are considered, more overhead time is required to complete the search. For hypercubes having n greater than or equal to 5, the blocking probability approaches one under this type of algorithm.
Rearrangeability for a generalized folding cube is conceived by named inventors S. B. Choi and A. K. Somani. See The Generalized Folding-Cube by Sang Bang Choi and Arun K. Somani, 1990 International Conference on Parallel Processing.
C. Multiprocessor System Fault Tolerant Synchronization
Synchronizing a network of multiple processor nodes allows nodes to share resources efficiently, run synchronous programs and, for fault tolerant systems, vote on redundant results. Previously, algorithms for synchronizing binary hypercubes have been developed for conditions in which no faults are present. See An optimal synchronizer for the hypercube by D. Peleg and J. D. Ullman, Society for Industrial and Applied Mathematics, 18(4) pp. 740-747, 1989. According to such an algorithm, when a node finishes all of its tasks for a cycle, the node sends a status indication to the other nodes. A node will only start the tasks of a new cycle when it has received enough messages to ensure that all of the tasks for the current cycle have been finished by the other nodes. Difficulty arises in synchronizing a multiprocessor system whose nodes finish tasks asynchronously when the system is to continue operating in the presence of faults. Specifically, fault tolerant synchronization in such multiprocessor systems has been a long-standing problem in the multiprocessor system field.
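The difficulty can be illustrated with a simplified simulation. It is not the algorithm of the cited reference; it is a generic dimension-by-dimension exchange, offered as a sketch under the assumptions that nodes have binary addresses, that each node exchanges its "finished" information with one neighbor per round, and that a faulty node sends nothing. The function names are hypothetical.

```python
def exchange_rounds(n, faulty=frozenset()):
    """Simulate n rounds of status exchange on an n-cube.

    Returns, for each node, the set of nodes it knows have finished.
    Assumption: a faulty node never reports and never forwards status.
    """
    size = 2 ** n
    known = [set() if node in faulty else {node} for node in range(size)]
    for d in range(n):
        updated = []
        for node in range(size):
            partner = node ^ (1 << d)  # neighbor across dimension d
            if node in faulty:
                updated.append(set())
            else:
                received = set() if partner in faulty else known[partner]
                updated.append(known[node] | received)
        known = updated
    return known


def ready_nodes(n, faulty=frozenset()):
    """Nodes that have heard from every node and may start the next cycle."""
    size = 2 ** n
    status = exchange_rounds(n, faulty)
    return [node for node, seen in enumerate(status) if len(seen) == size]
```

In this model all eight nodes of a fault-free 3-cube become ready after three rounds, whereas a single silent node leaves every node waiting. A scheme that waits for all status messages therefore stalls when a fault is present.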
Conventional ways of synchronizing a network include establishing a global time base using clock synchronization techniques. Multistage synchronizers and phase-locked loops (PLLs) are two prior hardware approaches to establishing such a time base. These approaches require fully connected networks. Such solutions keep the internal clocks of the system in step, even in the presence of faults. However, the cost overhead for large multiprocessor systems is high. Even though PLLs have been used in systems with less than fully connected networks, they do not work for systems with connection networks as sparse as the hypercube. See Synchronization and Matching in Redundant Systems, by D. Davies and J. F. Wakerly, IEEE Transactions on Computers, vol. 27 no. 6, pp. 752-756, 1978; and Ensuring Fault-tolerance of Phase-locked Clocks, by C. M. Krishna, K. G. Shin and R. W. Butler, IEEE Transactions on Computers vol. 34 no. 8, pp. 752-756, 1985; and Clock Synchronization of a Large Multiprocessor System in the Presence of Malicious Faults, by K. G. Shin and P. Ramanathan, IEEE Transactions on Computers, vol. 36 no. 1, pp. 2-12, 1987.
Another prior way of synchronizing global clocks in the presence of faults is by executing interactive convergence or interactive consistency algorithms. While such approaches may work in some hypercubes, the worst case skew between nodes is large and the communication overhead is too expensive for many applications. Even with time stamping of messages to reduce skew, the overhead is still large. See Synchronizing Clocks in the Presence of Faults, L. Lamport and P. M. Melliar-Smith, Journal of the ACM, vol. 32 no. 1, pp. 57-58, 1985; and Hardware-assisted Software Clock Synchronization for Homogeneous Distributed Systems, IEEE Transactions on Computers, vol. 39 no. 4, pp. 514-524, 1990.
Accordingly, a fault tolerant synchronization scheme for a multiprocessor system is needed which adds little or no processing overhead. One inventive subject matter of this application is a hardware embodied method of synchronizing sparsely connected multiprocessor systems such as hypercubes.
D. Multiprocessor System Cache
A problem with implementing caches in multiprocessor imaging systems is that excessive memory accesses limit the bandwidth of the memory bus, thereby degrading performance. A high cache hit ratio avoids the bus bottleneck. For morphological processing where the structuring element is large enough, small caches map input data to the same locations. Because the data are stored in the same locations, they replace each other. Such repeated replacement, known as thrashing, is inefficient. Write through also is inefficient. Algorithms that do limited processing per pixel generate results quickly. If the results are not cached, but are written through to main memory, then bus traffic is higher.
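The thrashing effect can be illustrated with a small simulation. The sketch assumes a direct-mapped cache with one word per line and an index equal to the address modulo the number of lines; the cache organization, the function name hit_ratio and the addresses used are assumptions made for illustration.

```python
def hit_ratio(addresses, num_lines):
    """Fraction of accesses that hit in a direct-mapped cache.

    Assumption: one word per line, index = address mod num_lines.
    """
    lines = [None] * num_lines  # address currently held by each line
    hits = 0
    for address in addresses:
        index = address % num_lines
        if lines[index] == address:
            hits += 1
        else:
            lines[index] = address  # miss: replace whatever was there
    return hits / len(addresses)


# Two pixels one 512-word image row apart, read alternately four times.
accesses = [0, 512] * 4
small_cache = hit_ratio(accesses, 512)    # both map to line 0 and evict each other
larger_cache = hit_ratio(accesses, 1024)  # the two addresses use different lines
```

In this model the 512-line cache never hits, while the 1024-line cache hits on six of the eight accesses. The same pattern arises when a structuring element spans image rows whose addresses differ by a multiple of the cache size.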
Previous cache write schemes include write back, write through, write around and write allocate. See Computer Architecture, A Quantitative Approach by J. L. Hennessy and D. A. Patterson, San Mateo, Calif., Morgan Kaufmann, 1990; RP3 Processor-Memory Element by W. C. Brantley, K. P. McAuliffe and J. Weiss, IBM Watson Research Center, IEEE Publ. No. 0190-3918/85/0000/0782, 1985; The IBM Research Parallel Processor Prototype (RP3) Introduction and Architecture, by G. F. Pfister et al., IBM Watson Research Center, IEEE Publ. No. 0190-3918/85/0000/0764, 1985; The 801 Minicomputer, by G. Radin, IBM Watson Research Center, ACM Publ. No. 0-89791-066-4, 1982.
E. Multiprocessor System Applications--Image Processing
One scheme for categorizing data processing systems is among: (1) single instruction, single data machines (SISD) (e.g., a conventional personal computer); (2) single instruction, multiple data machines (SIMD) (e.g., a MASPAR or Connection Machine 1 (CM1)); and (3) multiple instruction, multiple data (MIMD) machines. Most parallel processing systems are MIMD machines.
Effective multiprocessing performance depends on the ease of network synchronization, the efficiency in partitioning a problem, the amount of communication among multiple processors, and the computation and communication overlap. Fine-grained partitioning of an application results in the execution of smaller tasks which run in parallel. Such systems, however, often have a maximum overhead cost (i.e., added processing time delays) in terms of data sharing and system synchronization. Thus, a system which implements fine granularity does not necessarily achieve the fastest solution to a problem. At an increased granularity (i.e., low granularity), communication overhead such as message preparation, transmission, handling and receiving is still large for most applications. At the coarse granularity level, multiprocessor systems implement large individual tasks which communicate less frequently. Efficiency is improved because the communication and synchronization overhead, amortized over a computation cycle, is much smaller per time unit than in fine-grained or low-grained systems.
One aspect of the multiprocessor system of this invention is its large granularity. The system is a single program, multiple data (SPMD) machine. Thus, the instruction granularity is at the program unit. Each processor executes a complete program, instead of dividing it into sub-program units in which each processor executes 3 or 4 instructions.
An example of a prior SPMD multiprocessor machine is the PASM (a research MIMD packet-switching machine which emulates SIMD machines). Previously, SPMD machines have all been packet switched machines. There have been no SPMD circuit-switched multiprocessor systems, and no SPMD circuit-switched image multiprocessor systems.
Accordingly, to effectively perform large image processing tasks, there is a need for a multiprocessor system having: (1) high speed processing elements; (2) coarsely-grained application partitions; (3) low message communication overhead; (4) effective partitioning in which computations absorb data movement latencies; and (5) effective network routing of large messages.