1. Field of Invention
The present invention relates to electronic computers, and, more particularly, to computers with interconnected parallel processors.
2. Prior Art
Major trends shaping real time computation include parallel processing and symbolic processing. Many real time applications require rapid logical decisions using stored knowledge and the processing of large quantities of data at high speed. Moreover, close coupling between the symbolic and numeric computations is often desirable in fields such as speech and image understanding and recognition, robotics, weapon systems, and industrial plant control. Indeed, the widespread use of smaller computers in offices and homes and the emerging disciplines of artificial intelligence and robotics have drawn attention to the fact that an increasing amount of computing effort is spent in non-numeric or symbolic computing: many software tools used with computers, such as editors, compilers, and debuggers, make extensive use of symbolic processing. Symbolic computing leads to new methods of solving problems over and above numerical and statistical approaches because qualitative information or a priori knowledge may be made available in the form of data bases and procedures.
Attempts to solve real world problems requiring human-like intelligence, for example in robotics, speech, and vision, demand enormous amounts of symbolic and numeric computing power because of the vast amount of a priori information required for what are considered to be simple operations and because of the high data rates from sensors. Indeed, the signal processing of sensor data arises in fields such as acoustics, sonar, seismology, speech communication, and biomedical engineering, and the typical purposes of such processing include estimation of characteristic parameters, removal of noise, and transformation into a more desirable form. In the past, most signal processors have been tailored for speed and efficiency on a few specific algorithms. Future signal processors will need increased speed and algorithm flexibility, so that algorithms such as high resolution eigensystem beam-forming and optimal Wiener filtering may be computed with the same processor and so that new algorithms may be efficiently implemented as they are developed. The ability to handle a wide range of algorithms in military systems permits different algorithms to be used during a mission and permits field equipment to be upgraded with new algorithms. Conventional vector approaches cannot satisfy the increasing demand for computer performance, and it is necessary that future designs be capable of efficiently utilizing extensive parallelism; see McAulay, Parallel Arrays or Vector Machines: Which Direction in VLSI?, IEEE Publn. 83CH1879-6, IEEE International Workshop on Computer Systems Organization, IEEE Computer Society, New Orleans, March 1983; L. S. Haynes, R. L. Lau, D. P. Siewiorek, and D. W. Mizell, Computer 15(1), 9 (1982); J. Allen, IEEE Proc. 73(5), 852 (1985); and A. D. McAulay, in IEEE Region 5 Conf. Proc., 85CH2123-8 (1985). These references, along with all others herein, are hereby incorporated by reference.
Very large scale integration in semiconductor devices is also leading towards greater use of parallelism. Parallelism requires some form of interconnection between the processing elements, and this introduces a trade-off between speed and the ability to handle a wide range of algorithms. For example, a complex interconnection network provides some flexibility at the expense of speed, while high speed may be achieved by means of fixed interconnections for a specific algorithm. The problem is to achieve very high speed by efficiently using a large number of processing elements while at the same time retaining extremely high algorithm flexibility. Efficiency for parallel processing is the gain in speed over a single processor of the same type divided by the number of processors. Also, the complexity of the processing elements relates to the degree of parallelism obtainable; sophisticated computations tend to have parts that are not parallelizable at a coarse level, and the overall speed is dominated by those parts. A large number of fast elementary processors also places a considerable communication burden on the interconnection between processors. There is a need for parallel processor interconnections that possess simple reconfigurability.
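The efficiency measure and the dominance of non-parallelizable work described above may be illustrated with a short sketch (Python is used here purely for illustration; the serial fraction is a hypothetical parameter, not a figure from the prior art):

```python
# Illustrative sketch of the efficiency measure described above:
# efficiency = (gain in speed over one processor) / (number of processors).
# The serial_fraction parameter is a hypothetical value chosen to show
# how the non-parallelizable part dominates overall speed (Amdahl's law).

def speedup(n_processors, serial_fraction):
    """Speedup over a single processor of the same type when a fixed
    fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

def efficiency(n_processors, serial_fraction):
    """Speedup divided by the number of processors."""
    return speedup(n_processors, serial_fraction) / n_processors

# Even a 5% non-parallelizable part caps the speedup near 20,
# so efficiency collapses as processors are added.
for n in (10, 100, 1000):
    print(n, round(speedup(n, 0.05), 2), round(efficiency(n, 0.05), 3))
```

With a 5% serial fraction, one thousand processors yield a speedup of less than twenty, which is the sense in which the coarse-level non-parallelizable parts dominate.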
Currently, most experimental systems have demonstrated the difficulty of achieving parallelism for a range of algorithms with even modest numbers of processors (McAulay, Parallel Arrays or Vector Machines: Which Direction in VLSI?, IEEE Publn. 83CH1879-6, IEEE International Workshop on Computer Systems Organization, IEEE Computer Society, New Orleans, March 1983). The number of parallel processors (and hence the speed) which may be used efficiently is limited in today's prototype and proposed systems by communication delay and interconnection complexity. The constraints imposed by the interconnections on algorithm design are a serious problem because they reduce opportunities to achieve performance through new algorithm design and raise cost by limiting the range of applications and the lifetime of the equipment.
Fixed interconnections limit the range of algorithms which may be efficiently implemented. For example, the limits of the bus structure in parallel computing with the NuMachine have been considered (McAulay, Finite Element Computation on Nearest Neighbor Connected Machines, NASA Symposium on Advances and Trends in Structures and Dynamics, NASA Langley Research Center, Oct. 22, 1984). Systolic configurations, such as those in development at Carnegie-Mellon University (Kung H. T., Why Systolic Architectures?, IEEE Computer, January 1982, pp. 37-46), use algorithm structure to reduce memory and instruction fetches. This reduces communication time and permits large numbers of processors to be used efficiently in parallel. However, the algorithm constraints are significant because of the fixed interconnections.
Algorithm flexibility may be achieved by complex reconfigurable interconnection networks (Siegel H. J., Interconnection Networks for Large Scale Parallel Processing: Theory and Case Studies, Lexington Books, 1984), and a prototype system having 8 processors and using a Banyan switch is in operation at the University of Texas at Austin (Browne J. C., Parallel Architectures for Computer Systems, Physics Today, Vol. 37, No. 5, May 1984). A Banyan is a multichannel switch composed of levels of 2×2 switches. However, this type of reconfigurability introduces large delays and high control overhead in most proposed systems, and this restricts the number of processors and the speed of the system.
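The routing idea behind such a multistage network of 2×2 switches may be sketched as follows. This is a generic destination-tag illustration of banyan-style networks, not the specific switch of the Texas prototype; the function name and port counts are hypothetical:

```python
# Sketch of destination-tag routing through log2(N) levels of 2x2
# switches, the structure of banyan-style multistage networks. Each
# level's switch setting is one bit of the destination address, so a
# message self-routes without central control of the whole path.

import math

def route_tags(n_ports, dest):
    """Return the sequence of 2x2 switch settings (0 = upper output,
    1 = lower output) that steer a message to output port dest.
    Level k examines the k-th most significant bit of dest."""
    levels = int(math.log2(n_ports))
    return [(dest >> (levels - 1 - k)) & 1 for k in range(levels)]

# An 8-port network needs 3 levels; destination 5 (binary 101)
# sets the switches lower, upper, lower.
print(route_tags(8, 5))  # [1, 0, 1]
```

The delay grows with the number of levels, log2 of the port count, which is one source of the large delays and control overhead noted above when many processors share such a network.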
The distribution of effort amongst a number of processors does not remove the need for some minimum level of central control, although, for fault tolerance purposes, this may not always be the same physical part of the system. The idea of a single program which alone determines the complete operation of the computer is replaced by numerous such programs running concurrently in different processors. The communication channel to the central control must be sufficient to prevent it from becoming a bottleneck. Common memory is frequently used in the process of communicating information from one processor to another. A potential difficulty, memory contention, arises when two or more processors request the same piece of information from a common memory at the same time. Some arbitration is then required, and one processor will have to remain idle or make the memory request again later. This increases complexity, cost, and inefficiency. A simple example arises in matrix-matrix multiplication, where a single row of a first matrix is required in all processors for simultaneous multiplication with each column of a second matrix. Memory contention for such well-defined operations should be taken care of in the computer design.
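The matrix-multiplication example above may be sketched to show the access pattern that invites contention: every processor needs the same row of the first matrix at once, so broadcasting a private copy of that row to each processor, rather than having all processors request one shared memory location, avoids the arbitration. The processors are simulated serially here purely for illustration:

```python
# Sketch of the contention-prone pattern described above. Each row of A
# is needed simultaneously by every (conceptual) processor j, which
# multiplies it with its own column j of B. Copying the row once per
# step models a broadcast that sidesteps common-memory contention.

def matmul_row_broadcast(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        row = list(A[i])           # broadcast: each processor's own copy
        for j in range(n):         # conceptually, processor j in parallel
            C[i][j] = sum(row[k] * B[k][j] for k in range(n))
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_row_broadcast(A, B))  # [[19, 22], [43, 50]]
```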
Great skill is required to partition problems so that the various processors complete their tasks at the appropriate time to provide information for the next stage. Synchronization forces everything to wait for the slowest link, with resulting inefficiency. A parallel algorithm may involve more steps than a commonly used serial algorithm even though it is more efficient on a specific parallel machine. The overhead reduces the efficiency of the algorithm, where efficiency is measured as the speed on the multi-processor divided by the speed of the fastest algorithm on a single processor. The stability and accuracy of the parallel algorithm relative to the serial algorithm must also be considered in the comparison.
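The efficiency measure just described, which compares against the fastest serial algorithm rather than the same algorithm run serially, may be sketched numerically. The step counts below are hypothetical values chosen only to show how extra parallel steps depress this measure:

```python
# Sketch of efficiency measured against the fastest serial algorithm,
# as described above. All step counts are hypothetical illustrations.

def efficiency_vs_best_serial(t_best_serial, t_parallel, n_processors):
    """Multi-processor speed relative to the fastest algorithm on a
    single processor, divided by the number of processors."""
    return (t_best_serial / t_parallel) / n_processors

# A parallel algorithm with 160 total steps spread over 16 processors
# (10 steps of elapsed time) versus a 100-step best serial algorithm:
# the speedup is 10x, but the per-processor efficiency is only 0.625.
print(efficiency_vs_best_serial(100, 10, 16))
```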
The communications industry makes widespread use of optical fibers and is developing optical switching devices to avoid conversion to electronics and back for switching purposes. Optics has been suggested for communication with VLSI to overcome pin bandwidth limitations and edge connection constraints; see Goodman J. W., Leonberger F. J., Kung S. Y., and Athale R. A., Optical Interconnections for VLSI Systems, Proc. IEEE, Vol. 72, No. 7, July 1984, pp. 850-866, and Neff J. A., Electro-optic Techniques for VLSI Interconnect, AGARD-NATO Avionics Panel Specialists' Meeting on Digital Optical Circuit Technology, September 1984.
Digital optical computers are expected eventually to become dominant, and a design has been proposed for solving a major class of problems, finite elements (see McAulay, Deformable Mirror Nearest Neighbor Optical Computer, to appear in Optical Engineering (1985), and applicant's copending U.S. Appl. Ser. No. 777,660, now abandoned). This design uses deformable mirrors or other spatial light modulators (see Pape D. R. and Hornbeck L. J., Characteristics of the Deformable Mirror Device for Optical Information Processing, Opt. Eng., Vol. 22, No. 6, December 1983, pp. 675-681). Machines using acousto-optics for matrix algebra operations are in research. These computers, although significant for numerical computation, have limited algorithm flexibility because of the interconnection systems used. They are also not aimed at signal processing applications.
Data flow has been studied extensively at MIT, SRI, and in Japan; see Arvind and Iannucci R. A., Two Fundamental Issues in Multiprocessing: the Dataflow Solution, MIT Report MIT/LCS/TM-241, September 1983; Hiraki K., Shimada T., and Nishida K., A Hardware Design of the Sigma-1, a Dataflow Computer for Scientific Computations, Proc. IEEE International Conf. on Parallel Processing, August 1984; Jaganathan R. and Ashcroft E. A., Eazyflow: A Hybrid Model for Parallel Processing, Proc. IEEE International Conf. on Parallel Processing, August 1984; Omandi A. and Klappholtz D., Data Driven Computation on Process Based MIMD Machines, Proc. IEEE International Conf. on Parallel Processing, August 1984; and Rong G. G., Pipelining of Homogeneous Dataflow Programs, Proc. IEEE International Conf. on Parallel Processing, August 1984. Permitting operations to occur as soon as the necessary inputs are present is generally seen as a possible means of exploiting parallelism because it avoids the use of a single program counter as in a von Neumann machine. However, there are many proposed forms of data flow machine, and there are no major systems in operation today. Texas Instruments has previously developed software and hardware for dataflow systems (Oxley D., Sauber B., and Cornish M., "Software Development for Data-Flow Machines", in Handbook of Software Engineering, C. R. Vick and C. V. Ramamoorthy (Editors), 1984, and U.S. Pat. No. 4,197,589). Problems associated with interconnection and the matching of algorithm and processor are not automatically resolved by the dataflow concept.