Many of the known types of apparatus for performing parallel processing are reviewed and discussed in Parallel Computers 2, Architecture, Programming and Algorithms, by R. W. Hockney and C. R. Jesshope, published in 1988 by Adam Hilger, Bristol, England, and Philadelphia, U.S.A., and a number of experimental computers are compared in an article entitled "A Survey of Proposed Architectures for the Execution of Functional Languages" by Steven R. Vegdahl, published in IEEE Transactions on Computers, Vol. C-33, No. 12, Dec. 1984, pages 1050 to 1071.
In a classical von Neumann computer architecture, processing is carried out in a strictly sequential manner, the architecture having a single control unit, a single arithmetic and logic unit, and a memory in which program instructions and other data are stored in a sequence of addressable locations. During execution of a program, one instruction is called up at a time and executed. The address of the next instruction must be provided either by an instruction counter that simply counts through a regular numerical sequence of addresses, or by data supplied from the memory in the execution of the current program step. Such strictly sequential processing is a disadvantage in many circumstances, and attempts have been made to develop architectures which are not so limited. Attempts to avoid the restrictions imposed by sequential programs have resulted in new memory structures which are to be operated on by one or more control units each having its own arithmetic and logic unit. Two examples of the latter development are described in U.S. Pat. Nos. 3,646,523 and 4,075,689 issued to Klaus Berkling. It is sometimes implied that designing a central control unit that operates without an instruction counter will lead to the elimination of the so-called von Neumann bottleneck. In fact, the bottleneck still exists in processing apparatus which has a central control unit without an instruction counter, as can be seen from pages 34 and 35 of Automatic Digital Calculators, by A. D. and K. H. V. Booth, published in 1956 by Butterworths Scientific Publications, London, where it is pointed out that if each instruction contains the memory location of the next, the effect on the design of the central control unit is to eliminate the instruction counter; instructions are still called up and executed one at a time.
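The point drawn from Booth and Booth can be illustrated with a minimal sketch. The two toy accumulator machines below are hypothetical (the instruction names `ADD` and `HALT` and both function names are assumptions introduced for illustration, not taken from any cited reference). One uses an instruction counter, the other takes the address of the next instruction from the current instruction. In both, exactly one instruction is fetched per step.

```python
def run_with_counter(program):
    """Instructions are (op, operand); a counter steps through the addresses."""
    acc, counter, fetches = 0, 0, 0
    while True:
        op, operand = program[counter]  # one instruction fetched per step
        fetches += 1
        if op == "HALT":
            return acc, fetches
        if op == "ADD":
            acc += operand
        counter += 1  # regular numerical sequence of addresses


def run_with_next_field(program, start=0):
    """Instructions are (op, operand, next_address); no counter is needed."""
    acc, address, fetches = 0, start, 0
    while True:
        op, operand, next_address = program[address]  # still one fetch per step
        fetches += 1
        if op == "HALT":
            return acc, fetches
        if op == "ADD":
            acc += operand
        address = next_address  # successor supplied by the instruction itself
```

Running the same computation on both machines gives the same result with the same number of sequential fetches, which suggests that removing the counter alone does not remove the sequential restriction.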
The two United States patents mentioned hereinbefore, U.S. Pat. Nos. 3,646,523 and 4,075,689, both issued to Klaus Berkling, describe early examples of reduction machines. A more recent example of a reduction machine architecture in which processing and memory are separate is described in U.S. Pat. No. 4,591,971, issued to John Darlington et al., and in an article entitled "Declarative languages and program transformation for programming parallel systems: a case study" by J. Darlington, M. Reeve, and S. Wright, in Concurrency: Practice and Experience, Vol. 2(3), pages 149 to 169, Sep. 1990.
A further attempt to avoid the disadvantages of strictly sequential processing has been the development of systems which have a plurality of von Neumann processors, each with its own central processing unit (CPU) and local memory, interconnected by a specially designed bus or network. Since each processor is inherently an independent processing entity, considerable effort is required in designing interfacing between the individual processor and the network and in the control and organisation of data transfer between the processors. Also, because of the so-called contention problem, the design of the interconnecting network has an effect on the efficiency of cooperation between the processors and hence on the extent to which the processing capabilities of the individual processors can be utilized. An example of such a system is described in an article entitled "Hierarchical Routing Bus" by T. Sueyoshi and I. Arita, in Systems and Computers in Japan, Vol. 16, No. 6, 1985, at pages 10 to 19, and in an article entitled "Performance Evaluation of the Binary Tree Access Mechanism in MIMD Type Parallel Computers" by T. Sueyoshi, K. Saisho, and I. Arita in Systems and Computers in Japan, Vol. 17, No. 9, 1986, at pages 47 to 57. The latter articles describe a shared-memory parallel processing system in which processor modules, each comprising a processor unit and a memory unit, are interconnected by a binary tree access mechanism. Each module has a system address. The address space of the system is represented by a two-dimensional address composed of the system address and a location in the module having that system address, so that a single address space is formed. Each processor unit can access any memory unit via the binary tree access mechanism. However, an instruction fetch can be made only from the memory unit within the module that contains the processor unit carrying out the instruction fetch. 
Thus each memory unit is the local memory for its own processor unit, and global memory for the other processor units. Another tree-type routing network for parallel processing is described in an article entitled "Fat-Trees: Universal Networks for Hardware Efficient Supercomputing" by C. E. Leiserson, at pages 393 to 402 of the Proceedings of the 1985 International Conference on Parallel Processing, published by IEEE Computer Society Press, and a tree-type local network is described in IBM Technical Disclosure Bulletin, Vol. 25, No. 11B, Apr. 1983, at pages 5974 to 5977, by P. A. Franaszek.
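The two-dimensional addressing described above for the system of Sueyoshi et al. can be sketched as follows. This is only an illustration under stated assumptions: the names `Address`, `is_local`, and `access_kind` are hypothetical, and the sketch models only the rules stated in the text (any processor unit may access any memory unit, but instructions may be fetched only from the processor's own module).

```python
from typing import NamedTuple


class Address(NamedTuple):
    system_address: int  # identifies the module
    location: int        # location within that module's memory unit


def is_local(processor_module: int, address: Address) -> bool:
    """A memory unit is local only to the processor unit in the same module."""
    return address.system_address == processor_module


def access_kind(processor_module: int, address: Address,
                instruction_fetch: bool) -> str:
    """Classify an access, rejecting instruction fetches from other modules."""
    if is_local(processor_module, address):
        return "local"
    if instruction_fetch:
        raise ValueError("instruction fetch is allowed only from the "
                         "processor's own module")
    return "global via binary tree"
```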
Several parallel processing architectures are outlined in Byte, Nov. 1988, at pages 275 to 349. Amongst those mentioned there is a hypercube architecture known as the connection machine, which is also described in "The Connection Machine" by W. D. Hillis at pages 86 to 93 in Scientific American, Vol. 256, No. 6, Jun. 1987, and in U.S. Pat. Nos. 4,598,400 and 4,814,973 issued to W. D. Hillis. In the connection machine, hypercube architecture is employed in the structure of an array of 32768 identical integrated circuits each containing 32 identical processor/memories, so that there are 1,048,576 identical processor/memories. Each processor/memory is connected to its four nearest neighbours. The direction of data flow through the array is controlled by a microcontroller of conventional design. Also, each integrated circuit is provided with logic circuitry to control the routing of messages through a Boolean n-cube of fifteen dimensions into which the integrated circuits are organised. Within each integrated circuit, bus connections are provided to the thirty-two processor/memories so that each processor/memory can send a message to every other processor/memory in that integrated circuit. To permit communication through the Boolean 15-cube, the connection machine is operated so that it has both processing cycles and routing cycles. Computations are performed during the processing cycles. During the routing cycles, the results of the computation are organised in the form of message packets, and these packets are routed from one integrated circuit to the next by routing circuitry in each integrated circuit in accordance with address information that is part of the packet. In the packet, the integrated circuit address information is relative to the address of the destination integrated circuit. The routing circuitry in all the integrated circuits is identical and operates in synchronism using the same routing cycle. 
Passage of a message packet from a source integrated circuit to a destination integrated circuit is effected by the routing circuits of the integrated circuits. Each routing circuit comprises a line assigner, a message detector, a buffer and address restorer, and a message injector. The line assigner comprises a fifteen by fifteen array of substantially identical routing logic cells. Each column of the array of routing logic cells controls the output of message packets in one dimension of the Boolean 15-cube. Each row of this array controls the storage of one message packet in the routing circuit. The message detector, buffer and address restorer, and message injector of each routing circuit comprise fifteen sets of processing and storage means corresponding to the fifteen rows of routing logic cells. Thus the connection machine, although having a large plurality of processor/memories instead of separate areas of processing and memory, relies on complex auxiliary routing control arrangements. A further aspect of routing in such a machine is described in international patent application publication no. WO89/07299 of Thinking Machines Corporation (inventor W. D. Hillis) which describes an array of processors and an interconnection network controlled by a control unit in the form of a Symbolics 3600 Series LISP machine and a microcontroller. Another example of a processor array with interconnection controlled by a separate control unit is described in international patent application publication no. WO87/01485 of The University of Southampton (inventors C. R. Jesshope, P. S. Pope, A. J. Hey, and D. A. Nicole) and uses transputers as processors. Cube networks for MIMD and SIMD processing in distributed systems are discussed generally in an article entitled "The Multistage Cube: A Versatile Interconnection Network" by H. J. Siegel and R. J. McMillen, at pages 65 to 76, Computer, Dec. 1981.
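The relative addressing of message packets in the Boolean 15-cube of the connection machine, described above, can be sketched in simplified form. The sketch assumes that the relative address is the bitwise exclusive-OR of the current and destination integrated circuit addresses, and that dimensions are corrected in ascending order; neither assumption is stated in the text, and the sketch ignores contention, buffering, and the line assigner entirely. The names `relative_address` and `route` are hypothetical.

```python
from typing import List

DIMENSIONS = 15  # Boolean 15-cube: 2**15 = 32768 integrated circuits


def relative_address(source: int, destination: int) -> int:
    """Bits set in each dimension where the two chip addresses differ (assumed XOR)."""
    return source ^ destination


def route(source: int, destination: int) -> List[int]:
    """Return the chips visited, crossing one differing dimension per hop."""
    path = [source]
    current = source
    remaining = relative_address(source, destination)
    for dim in range(DIMENSIONS):
        if remaining & (1 << dim):
            current ^= 1 << dim        # traverse the link in this dimension
            remaining &= ~(1 << dim)   # this bit of the relative address is cleared
            path.append(current)
    return path  # remaining is now zero, so current == destination
```

Under these assumptions a packet needs at most fifteen hops, one per dimension in which the source and destination addresses differ.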
Another approach to parallel processing has been that of providing an interconnected array of processors where the interconnection is designed to correspond to a distribution of tasks into which a computation is to be resolved. Such an approach has as its background the development of programming languages known as applicative or functional programming languages, which was in particular stimulated by an article entitled "Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs" by J. Backus at pages 613 to 641 in Communications of the ACM, Vol. 21, No. 8, 1978. The functional programming languages are closely based on a formal notation known as the lambda calculus. Lambda calculus was originally described in The Calculi of Lambda-Conversion by Alonzo Church, first published in 1941 by Princeton University Press, with second printing in 1951. The pure Church lambda calculus is described in Introduction to Combinators and λ-Calculus by J. Roger Hindley and Jonathan P. Seldin, published in 1986 by Cambridge University Press, Cambridge, England, and New York, U.S.A. The significance of the lambda calculus in relation to functional programming is discussed in Functional Programming by Anthony J. Field and Peter G. Harrison, published in 1988 by Addison-Wesley Publishing Company, Wokingham, England, Reading, Massachusetts, U.S.A., and Tokyo, Japan. A particular feature of the lambda calculus is a form of reduction known as beta reduction, which is explained in section 1C of Introduction to Combinators and λ-Calculus, and section 6.2 of Functional Programming. A functional program for a computation can be resolved recursively into a tree structure of sub-tasks, and the final result of the program is independent of the order in which these sub-tasks are evaluated.
One example of the design of an array of processors corresponding to a distribution of tasks into which a functional program can be resolved is described in an article entitled "A Reduction Architecture for the Optimal Scheduling of Binary Trees" by K. Ravikanth, P. S. Sastry, K. R. Ramakrishnan, and Y. V. Venkatesh, at pages 225 to 233 in Future Generation Computer Systems, No. 4, 1988, published by Elsevier Science Publishers B. V. (North Holland). In the latter article there is described an array of eight processors so interconnected that a binary tree of computing tasks can be mapped onto the array. The interconnections conform to the relationships expressed by
$$L(P_i) = P_{2i \bmod N} \quad \text{and} \quad R(P_i) = P_{(2i+1) \bmod N} \quad \text{for } i = 0, 1, \ldots, N-1,$$
where N=8, $P_i$ is the (i+1)th processor of N identical processors, L means left-hand child, and R means right-hand child. It is assumed that the computation decomposes itself recursively into identical subproblems (tasks), and that every task downloads the two subtasks it spawns onto its immediate neighbours. Each processor in the network has four neighbours, two connected to paths coming into the processor, and two connected to paths going out from the processor. The memory of each processor is divided into three banks: a left-memory; a right-memory; and a local-memory. The local-memory is local to its own processor and contains all programs, relevant tables, etc. Each processor communicates with its left child through its own left-memory, and with its right child through its own right-memory. Thus a rigid system of communication between processors, which moreover is limited to communication with immediate neighbours, is imposed. Other tree arrays of processors with rigid systems of communication are also described in "A Network of Microprocessors to Expedite Reduction Languages", by G. A. Mago, at pages 349 to 385 and 435 to 471, in International Journal of Computer and Information Sciences, Vol. 8, 1979, "A Cellular Computer Architecture for Functional Programming", by G. A. Mago, at pages 179 to 187, 1980, IEEE, "Making Parallel Computation Simple: The FFP Machine", by G. Mago, 1985, IEEE, and U.S. Pat. Nos. 4,251,861 (issued to G. A. Mago) and 4,583,164 (issued to D. M. Tolle). Also, in "Comparing Production System Architectures" by M. Lease and M. Lively, of Computer Science Department, Texas A&M University, College Station, Texas 77843, reference is made to an array of 1023 processors connected to form a complete binary tree designed and built at Columbia University in the City of New York and known as DADO2. Such an array of processors is described in U.S. Pat. No. 4,843,540 issued to S. J.
Stolfo and again relies on communication between nearest neighbours in the binary tree.
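The interconnection rule quoted above from the article of Ravikanth et al., in which the left-hand child of processor i is processor 2i mod N and the right-hand child is processor (2i+1) mod N, can be checked with a short sketch. The function names `left_child`, `right_child`, and `parents` are hypothetical and are introduced only for illustration.

```python
N = 8  # number of identical processors in the array described in the article


def left_child(i: int, n: int = N) -> int:
    """Index of the processor that receives the left-hand subtask of processor i."""
    return (2 * i) % n


def right_child(i: int, n: int = N) -> int:
    """Index of the processor that receives the right-hand subtask of processor i."""
    return (2 * i + 1) % n


def parents(j: int, n: int = N) -> list:
    """Indices of the processors whose outgoing paths lead to processor j."""
    return sorted(i for i in range(n)
                  if j in (left_child(i, n), right_child(i, n)))
```

For N=8 every processor has two outgoing paths and is reached by exactly two incoming paths, in agreement with the description above. For example, processor 3 hands its subtasks to processors 6 and 7 and is reached from processors 1 and 5.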