The invention pertains to electronic computers. More particularly, the invention pertains to multiprocessor computers utilizing parallel processing techniques.
Parallel and concurrent processing techniques have been the subject of much theoretical and practical research and development. One reason is that for many classes of problems, e.g. those with large numbers of interacting variables, parallel processing can offer significantly improved performance efficiencies in comparison with traditional sequential processing techniques. Although most, if not all, complex computational problems susceptible of solution could eventually be solved using conventional sequential processing machines, the time ordinarily required to solve these complex problems can be prohibitive.
A survey of several known parallel and concurrent processing techniques may be found in the June 16, 1983 edition of Electronics (McGraw-Hill) at pages 105-114. The article first describes, at page 105, the "classic von Neumann" sequential processing architecture wherein a single processor is fed a single stream of instructions with the order of their execution controlled by a program counter. This classic architecture is commonly discussed as being a major bottleneck for high-speed processing.
A variety of approaches are discussed in the Electronics article for departing from the von Neumann architecture. These approaches include those which use a few very fast or specialized processors and then enhance the control-flow architecture with well-known techniques such as pipelining or vectorizing.
Another approach discussed in the Electronics article, at page 106, is to take a large number of fast or medium-speed processors and arrange them in parallel--perhaps putting several hundred on a single wafer. One known parallel processing architecture is called data-flow. According to the Electronics report, data-flow is a concept for controlling the execution of computer instructions such that they are executed as soon as the input data they require is available. No program counter need be used. Data-flow machines reportedly automatically exploit the parallelism inherent in many problems because all instructions for which data is available can be executed simultaneously, assuming the availability of sufficient numbers of processors.
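The data-flow firing rule described above may be illustrated by the following minimal sketch. It is not taken from the Electronics article; the function name `run_dataflow`, the instruction table format, and the notion of a "wave" of simultaneously ready instructions are assumptions introduced only for illustration. Each instruction fires as soon as all of its inputs are available, and no program counter is used.

```python
import operator


def run_dataflow(instructions, initial_values):
    """Fire every instruction whose inputs are available (hypothetical sketch).

    instructions: dict mapping a result name to (function, list of input names).
    initial_values: dict of values available before execution starts.
    Returns (values, waves); each wave lists the instructions that became
    ready together and could therefore run on separate processors.
    """
    values = dict(initial_values)
    pending = dict(instructions)
    waves = []
    while pending:
        # An instruction is ready when every one of its inputs has a value.
        ready = sorted(
            name for name, (_, inputs) in pending.items()
            if all(i in values for i in inputs)
        )
        if not ready:
            raise ValueError("no instruction has all of its inputs available")
        results = {}
        for name in ready:
            func, inputs = pending.pop(name)
            results[name] = func(*(values[i] for i in inputs))
        # Results become visible only after the whole wave has fired.
        values.update(results)
        waves.append(ready)
    return values, waves


# Example: (a + b) * (c - d). The sum and the difference are independent
# and fire in the same wave; the product must wait for both.
program = {
    "sum": (operator.add, ["a", "b"]),
    "diff": (operator.sub, ["c", "d"]),
    "prod": (operator.mul, ["sum", "diff"]),
}
values, waves = run_dataflow(program, {"a": 1, "b": 2, "c": 7, "d": 3})
```

In this example the first wave contains two instructions, so two processors could be kept busy; with only one processor the same wave would simply be executed in two steps.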
Several data-flow-type research projects are discussed at pages 107-110 of the Electronics article. One such project is called "Cedar" which utilizes a two-level multiprocessor design. The top level reportedly comprises processor clusters interconnected by a global switching network wherein control is handled by a global control unit using data-flow techniques. At the second level, each processor cluster has local memories and processors interconnected through a local network and controlled, in a conventional von Neumann fashion, with a cluster control unit.
The Illiac IV, which comprised a parallel array of 64 processing units and a Burroughs 6700 control computer, is another well-known, albeit commercially unsuccessful, parallel processing project. The machine was designed to process 64 words in parallel but suffered in that it could not get the desired operands in and out of the processing elements (PEs) fast enough.
Researchers at the University of Texas in Austin reportedly have produced a prototype TRAC (Texas Reconfigurable Array Computer) machine based on dynamically coupling processors, input/output units, and memories in a variety of configurations using an intelligent switching network.
Other known parallel processing projects reported in the Electronics magazine article, at page 108, include the Blue CHiP (Configurable Highly Parallel Computer) project at Purdue University in Indiana. In the Blue CHiP project, a collection of homogeneous processing elements (PEs) is placed at regular intervals in a lattice of programmable switches. Each PE is reportedly a computer with its own local memory.
Another Purdue University project is known as PASM and focuses on a partitionable array single-instruction-multiple-data and multiple-instruction-multiple-data (SIMD-MIMD) computer. The PASM machine reportedly can be dynamically reconfigured into one or more machines.
Mago, at the University of North Carolina, has designed a binary tree computer with processors at the leaves and resource controllers at the interior nodes and root. The processor cells are reportedly connected directly to their immediate neighbors to facilitate data movement within the linear array. (See page 109 of the Electronics article.)
Denelcor, Inc. of Aurora, Colo., reportedly has developed an AGP system whose architecture sits between von Neumann and data-flow. (Electronics, Feb. 24, 1982, page 161; Electronics, June 16, 1983, page 110.) The AGP system is reportedly a multiprocessor with up to 16 process/execution modules (PEMs) and a shared memory within each PEM. Cooperating programs are pipelined to allow many of them to execute concurrently.
Although many previous attempts have been made to construct multiprocessor systems in order to achieve high computational throughput, these efforts have in general been applied to the designs of general purpose machines (GPMs), with their attendant problems of interprocessor interferences and consequent failure to achieve the expected speed. See, V. Zakharov, IEEE Transactions on Computers, 33, (1984), p. 45. With such GPM systems, processor-processor protocols are very complex since the problem to be solved is (from the designer's point of view) unknown.
Machines to handle large Monte Carlo lattice gas systems in the field of materials analysis have been built. (See, H. J. Hilhorst, A. F. Bakker, C. Bruin, A. Compagner and A. Hoogland, J. Stat. Phys., In Press.) Further, at least two molecular dynamics (MD) machines have been designed, one at Delft and the other in England. Two design concepts have been employed. The Dutch machine can be viewed as a single-processor but simultaneous-tasking machine; data is pipelined through it very rapidly. The British machines have used 4,096 processors (the ICL DAP), each element of which is slow and has narrow communication paths. The sheer number of elements provides the speed. These machines have a very high performance/cost ratio. They are cheap but achieve speed similar to that of a CRAY for the particular algorithm for which they are built. Since they are dedicated machines, they are used 24 hours a day, giving them effectively 24 times the throughput of a CRAY (assuming a lucky CRAY user can get 1 hour of CPU time a day). On the other hand, although these machines have proved the potential effectiveness of algorithm-oriented machines (AOMs), the MD processors have design limitations. For example, in the case of the ICL DAP, not all algorithms (the MD and Monte Carlo simulations included) may be completely parallelized, and the bottleneck in the speed becomes the non-parallel part, which has to be performed in a conventional sequential manner. In the case of the Delft machine, the design flaw is that the power of the machine is fixed. Unless many of the machines are put in parallel, the power does not scale with system size; and it is by no means obvious how to do this, since the problem then becomes memory-fetch limited. The only way to increase speed in such systems is to spend more money on faster components within the pipeline, where the price will increase rapidly.
The other disadvantage of this architecture is that since the algorithm is hardwired, the machine is computationally and algorithmically inflexible. For example, running a three-body force calculation would require a major redesign of the system.
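The bottleneck noted above for the ICL DAP, in which the non-parallel part of an algorithm limits the overall speed, can be quantified with the well-known relation commonly called Amdahl's law. That relation is not cited in the references discussed here; it is introduced below only as an illustrative aid, and the function name `amdahl_speedup` and the 95 percent figure are assumptions.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Ideal speedup when only part of the work can be parallelized.

    parallel_fraction: share of the sequential run time that parallelizes
    perfectly (between 0 and 1).
    processors: number of processing elements sharing the parallel part.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Assumed example: an algorithm that is 95 percent parallel, run on the
# 4,096 elements mentioned above, speeds up by less than a factor of 20.
speedup = amdahl_speedup(0.95, 4096)
```

Under this idealized model, adding processing elements beyond a few hundred yields almost no further gain unless the sequential part of the algorithm is itself reduced.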
There has also been some patent activity in the multiprocessing field.
U.S. Pat. No. 4,092,728 to Baltzer (May 30, 1978) entitled "Parallel Access Memory System" describes a memory system having contiguous storage locations which are partitioned into discrete areas such that several independent processors have exclusive control of their respective memory locations. The processors, in one embodiment, are connected to their respective memory locations through a switching system, or gateway. Baltzer does not disclose a problem solving computer, but rather an information processor for use, e.g., in a television.
U.S. Pat. No. 4,344,134 to Barnes (Aug. 10, 1982) entitled "Partitionable Parallel Processor" describes a system wherein a network of processors operate more or less simultaneously in order to reduce overall program execution time. Specifically, the Barnes patent discloses a hierarchy of processors which operate on discrete units of extended memory based upon preassigned allocations. The mapping scheme is stated by Barnes to be "relatively unimportant to the basic operation" of the U.S. Pat. No. 4,344,134 disclosure.
U.S. Pat. No. 4,074,072 to Christensen et al (Feb. 14, 1978) entitled "Multiprocessor Control of a Partitioned Switching Network By Control Communication Through the Network" discloses a partitioned switching network which is divided into plural edge-to-edge partitions, each partition being controlled by a separate processor coupled to a discrete block of the network. The processors communicate with one another through the network for controlling interpartition calls.
U.S. Pat. No. 4,251,861 to Mago (Feb. 17, 1981) entitled "Cellular Network of Processors" describes an information handling system for parallel evaluation of applicative expressions formed from groups of subexpressions. A plurality of interconnected cells, each containing at least one processor, is established in a tree structure. Logic means are provided for connecting the processors within the cells to form disjoint assemblies of the processors, the logic means being responsive to the applicative expression to partition the plurality of interconnected cells into separate disjoint assemblies of processors in which subexpressions can be evaluated. Input/output means are also provided for entering applicative expressions into the cells and for removing results from the cells after evaluation of the applicative expressions. The system disclosed is stated to accommodate unbounded parallelism which permits execution of many user programs simultaneously.
U.S. Pat. No. 4,276,594 to Morley (June 30, 1981) entitled "Digital Computer with Multi-Processor Capability Utilizing Intelligent Composite Memory and Input/Output Modules and Method for Performing the Same" discloses a digital computer with the capability of incorporating multiple central processing units utilizing an address and data bus between each CPU and from one to fifteen intelligent composite memory and input/output modules. The disclosure is concerned with data transfer between input/output devices and the CPUs or external devices.
U.S. Pat. No. 4,281,391 to Huang (July 28, 1981) entitled "Number Theoretic Processor" discloses modular arithmetic processors constructed from networks of nodes. The nodes perform various processes such as encoding, modular computation, and radix encoding/conversion. Nodal functions are performed in a parallel manner. The Huang system utilizes table look-up to perform the modular arithmetic. The tables may be stored in memory and the nodes may comprise a microprocessor.
U.S. Pat. No. 4,101,960 to Stokes et al (July 18, 1978) entitled "Scientific Processor" discloses a single-instruction-multiple-data (SIMD) processor which comprises a front end processor and a parallel task processor. The front end processor sets up a parallel task for the parallel task processor and causes the task to be stored in memory. The parallel processor then executes its task independently of the front end processor.
U.S. Pat. No. 4,051,551 to Lawrie et al (Sept. 27, 1977) entitled "Multidimensional Parallel Access Computer Memory System" discloses a parallel access computer memory system comprising a plurality of memory modules, a plurality of processing units, an alignment means for aligning the individual processing units to the individual memory modules in a non-conflicting manner, and means for associating the individual processing units with respective memory modules.
No known prior art parallel processing system takes full advantage of the topology of memory space. However, in many classes of problems, e.g. materials analysis, artificial intelligence, image analysis, solution of differential equations and many defense applications, data processing techniques seek to simulate interactions between variables whose values depend quite closely on the attributes and values of their near neighbors in the simulated system. It is thus desirable, in a concurrent or parallel processing environment, to assign related variables to common processors in order to speed processing. Particularly, it would be advantageous, although no known prior system has done so, to be able to map multi-dimensional simulated systems into partitioned two-dimensional memory space in such a way that the dynamic variables are partitioned based on their dependency on what happens in nearby partitions.
Once data has thus been mapped into "contiguous" partitions, parallel or concurrent processing may then be carried out upon the individual memory partitions using a plurality of processors, each associated with given partitions.
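One way such a mapping might be arranged is sketched below. The sketch is illustrative only and is not the mapping disclosed herein: the names `partition_index` and `assign_to_partitions`, the square cells, the non-negative coordinates, and the choice to project a three-dimensional position onto the x-y plane by ignoring z are all assumptions. Variables that are near neighbors in the projected plane fall into the same partition and could then be handled by the processor associated with that partition.

```python
def partition_index(position, cell_size, cells_per_row):
    """Map a 3-D position to a partition of a 2-D grid (hypothetical rule).

    The z coordinate is ignored, so an entire column of the simulated
    volume lands in one partition. Coordinates are assumed non-negative.
    """
    x, y, _z = position
    column = int(x // cell_size)
    row = int(y // cell_size)
    return row * cells_per_row + column


def assign_to_partitions(positions, cell_size, cells_per_row):
    """Group positions by partition so neighbors share a memory area."""
    partitions = {}
    for position in positions:
        index = partition_index(position, cell_size, cells_per_row)
        partitions.setdefault(index, []).append(position)
    return partitions


# Four assumed particle positions on a 2 x 2 grid of unit cells.
particles = [(0.5, 0.5, 9.0), (0.9, 0.1, 0.0), (1.5, 0.5, 3.0), (0.2, 1.7, 0.0)]
layout = assign_to_partitions(particles, cell_size=1.0, cells_per_row=2)
```

The first two particles differ widely in z but are close in the projected plane, so they share partition 0; the remaining two fall into partitions 1 and 2 respectively.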
Even though memory may be partitioned in such a way that related variables are stored in the same partition for processing by a dedicated processor, there will undoubtedly be a need to communicate the results of such intra-partition processing to either other dedicated processors or to a master-controller processor for, e.g., sequential processing of the data and for eventual transmission to the user. Although it is possible to conceive routines whereby the dedicated processors communicate with one another and monitor the processing, such interprocessor communications can degrade system performance and result in lost data processing time as the dedicated processors communicate with one another.
Furthermore, the mapping of variables into individual memory partitions to take account of the dependencies therebetween should not ignore the overall dependencies of the system variables upon one another. For example, in a materials analysis simulation, it may be envisioned that a three-dimensional space can be broken down into various memory partitions such that the dynamic variables within each partition depend closely on the values of other such variables within the partition. However, the effect of the individual variables upon the memory space as a whole--particularly upon variables stored within other partitions--is critical to the overall completion of the simulation. Efficiencies in handling such inter-partition dependencies can result in extraordinary savings in processing overhead and hence result in tremendous savings in the most important of all computer processing factors--time.
It is therefore an object of the invention to provide a superfast multiprocessor parallel processing computer.
It is another object of the invention to provide a superfast multiprocessor computer which can operate in concurrent processing or parallel processing modes.
It is another object of the invention to provide a multiprocessor parallel processing computer which takes advantage of the parallelism inherent in many classes of computing problems.
It is another object of the invention to provide a parallel processing computer which takes full advantage of the topology of memory space.
It is a further object of the invention to provide a parallel processing computer in which it is possible to usefully partition dynamic variables so that they depend on what happens in nearby partitions.
It is a further object of the invention to provide a parallel processing computer which allows access by a plurality of processors to a plurality of memory partitions in parallel and in such a way that no data-flow conflicts occur.
It is a further object of the invention to provide a parallel processing computer which is relatively easy to program.
It is a still further object of the invention to provide a parallel processing computer which is modular in design and easily upgradeable, allowing the power of the machine to scale with the size of the problem.
It is a still further object of the invention to provide a multiprocessor parallel processing system wherein a three-dimensional problem may be projected into a two-dimensional space which is in turn mapped into memory/processor space.
It is a still further object of the invention to provide a computer which meets all of the above criteria while remaining low in cost.
SUMMARY OF THE INVENTION
These and other objects of the invention are met by providing a modular, synchronized, topologically-distributed-memory multiprocessor computer comprising non-directly communicating slave processors under the control of a synchronizer and a master processor and a partitionable memory space wherein each slave processor is connected in a topologically well-defined way through a dynamic bi-directional switching system (gateway) to different respective memory areas. The topology of memory space may be planned to take advantage of symmetries which occur in many problems. Access by the slave processors to partitioned, topologically similar memory cells occurs in parallel and in such a way that no data-flow conflicts occur. The invention provides particular advantage in processing information and handling problems in which it is possible to partition dynamic variables so that they depend on what happens in nearby partitions. The system may be tied to a host machine used for data storage and analysis of data not efficiently allowed by the parallel multiprocessing architecture. The architecture is modular, easily upgradeable, and may be implemented at a relatively low cost.
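The conflict-free access property stated above may be illustrated by the following minimal sketch. The cyclic-shift connection rule, the function names `gateway_connections` and `is_conflict_free`, and the notion of a numbered switching phase are assumptions made purely for illustration and are not the switching scheme of the invention. In each phase every slave processor is connected to exactly one memory cell, and no memory cell is connected to more than one slave processor.

```python
def gateway_connections(num_processors, phase):
    """Return a processor -> memory cell map for one switching phase.

    Hypothetical rule: processor p is connected to cell (p + phase) modulo
    the number of processors, which is a permutation of the cells.
    """
    return {p: (p + phase) % num_processors for p in range(num_processors)}


def is_conflict_free(connections):
    """True when no two processors are connected to the same memory cell."""
    return len(set(connections.values())) == len(connections)


# With four processors, phase 1 connects each processor to its neighbor's
# cell; every phase gives each processor a distinct cell.
phase_one = gateway_connections(4, 1)
all_phases_ok = all(is_conflict_free(gateway_connections(4, k)) for k in range(4))
```

Because each phase is a permutation, all processors may access memory simultaneously, and stepping through the phases under the control of a synchronizer would give each processor access to each cell in turn.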