1. Field of the Invention
This invention relates in general to systems and methods for high speed floating point computations involving long vectors or large matrices of data, and in particular to a parallel processing system and method employing an SIMD architecture for performing such computations.
2. Discussion of Related Art
In the 1950's and 1960's computers normally had a single central processing unit (CPU) and Von Neumann architecture. These computers were also typically constructed of discrete components and later simple integrated circuit components. Many of these components were very large and slow by today's standards and required many discrete or "off-chip" interconnections, which tended to reduce reliability significantly. On account of these and other factors, a premium was placed on minimizing the number of components required for the circuitry of the CPU and the arithmetic logic unit (ALU) within the CPU which performed the elementary mathematical computations such as addition and multiplication. Bit serial techniques for addition and multiplication were often used in preference to parallel addition and parallel multiplication techniques because, although much slower, bit serial circuitry used far fewer components than the parallel adders and multipliers. Since the bit serial circuitry in the ALU was usually sufficiently fast to keep up with other speed-limiting circuitry and other operations in the CPU, this did not normally present a problem.
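By way of illustration only, the bit serial addition mentioned above can be sketched in software. The following is a minimal model, not a description of any particular historical circuit: it reuses a single one-bit full adder over successive steps, least significant bit first, which is why such circuitry needs few components but takes one step per bit. The function name `bit_serial_add` and the `width` parameter are hypothetical names chosen for this sketch.

```python
def bit_serial_add(a: int, b: int, width: int = 8) -> int:
    """Add two unsigned integers one bit per step with a single full adder."""
    carry = 0
    result = 0
    for i in range(width):              # one step ("clock cycle") per bit, LSB first
        x = (a >> i) & 1                # current bit of the first operand
        y = (b >> i) & 1                # current bit of the second operand
        s = x ^ y ^ carry               # sum output of a one-bit full adder
        carry = (x & y) | (carry & (x ^ y))  # carry output, held for the next step
        result |= s << i                # shift the sum bit into the result
    return result                       # carry out of the top bit is discarded


print(bit_serial_add(5, 3))             # 8-bit addition, takes 8 steps
```

A parallel adder would instead provide one full adder per bit position and produce the sum in a single pass, at the cost of roughly `width` times as much adder circuitry.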
In the early 1970's advances in integrated circuit fabrication techniques and design enabled the size and cost of fairly large circuits such as parallel adders and parallel multipliers to be reduced. These advances also allowed computers to operate with faster cycle times or instruction speeds. As a result, parallel adders and parallel multipliers virtually completely supplanted bit serial designs in Von Neumann machines. The use of the parallel adders and multipliers has continued with respect to supercomputers as
The continuing demand for faster and more powerful computers is driven in large part by many scientific and engineering applications which require very extensive computations, particularly computations involving arrays or matrices of floating point numbers. To meet this demand, a number of supercomputers, such as the CRAY 2S and IBM 3090, have been developed and are commercially available for solving problems involving large amounts of floating point computations. In these systems, numerical operations are performed by floating point processing hardware. Numbers are transferred to the processing hardware via wide busses where the number of wires in the bus is greater than or equal to the number of bits in the floating point number, which is generally 64 bits. Floating point operations are performed by the processing hardware in one or two clock cycles and the results are transferred out on wide busses. In such systems, which operate at hundreds of millions of floating point operations per second (Mflops), there is a critical bandwidth problem in coupling the processor with main computer memory. The problem is severe since each floating point operation can require three memory accesses: two for input operands, and one for the output result. To sustain such high speed operations, a complex system of registers, high speed cache memory, crossbar switches, and bussing to larger but slower main memory is required. All of the interconnections to these devices must be done with busses which are as wide as the data word. The most expensive part of a supercomputer is generally in the data communication pathways between the processor and main memory.
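The scale of the bandwidth problem can be estimated with simple arithmetic. The sketch below assumes the worst case stated above, namely that every floating point operation makes three memory accesses of one 64 bit word each, with no reuse of operands held in registers or cache. The function name and its parameters are hypothetical, and the 100 Mflops rate is chosen only as an example.

```python
def required_memory_bandwidth(mflops: float,
                              word_bits: int = 64,
                              accesses_per_op: int = 3) -> float:
    """Return the memory traffic in bytes per second for a given Mflops rate.

    Assumes every operation reads two operands and writes one result,
    each a full word, and that nothing is reused from registers or cache.
    """
    ops_per_second = mflops * 1e6
    bytes_per_word = word_bits / 8
    return ops_per_second * accesses_per_op * bytes_per_word


# Example rate of 100 Mflops (an illustrative figure, not a measured one).
print(required_memory_bandwidth(100) / 1e9, "gigabytes per second")
```

Under these assumptions a rate of 100 Mflops already calls for 2.4 gigabytes per second of memory traffic, and the figure grows linearly with the operation rate.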
Another difficult problem in the design of supercomputers is the software controlled structure which supports the data communication. The microcode software must control the registers, cache memory, and crossbar switches so that algorithms can proceed most efficiently. The controller needed to support the simultaneous multiple tasks is very complex.
Numerical computations have been divided into two classes: vector and scalar. Vector operations involve performing the same operation on a long string of numbers, such as multiplying each element of one string by the corresponding element of another. Scalar operations are individual operations which cannot be vectorized. Supercomputers are generally designed to handle both types of operations as efficiently as possible, although the vector operations are generally an order of magnitude faster in Mflops. The programs found in scientific or engineering problems involving, for example, fluid flow, temperature flow, and stress in mechanical structures largely use vector oriented operations. Thus, a system optimized for solving vector problems is very useful.
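The distinction between the two classes can be illustrated with a short sketch. It assumes the element-by-element reading of the vector multiplication described above, and it uses a simple recurrence as one example of a computation that resists straightforward vectorization because each step needs the result of the previous one. Both function names are hypothetical.

```python
def vector_multiply(xs, ys):
    """Vector operation: the same multiply applied independently to each pair."""
    if len(xs) != len(ys):
        raise ValueError("operand vectors must have the same length")
    # No product depends on any other, so all could be computed at once.
    return [x * y for x, y in zip(xs, ys)]


def running_recurrence(xs, seed=1.0):
    """Scalar-style computation: each step depends on the previous result."""
    acc = seed
    out = []
    for x in xs:
        acc = acc * x + 1.0     # needs the prior value of acc before proceeding
        out.append(acc)
    return out


print(vector_multiply([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
print(running_recurrence([2.0, 3.0]))
```

In the first function the work can be spread over as many processing elements as there are pairs; in the second the steps must be taken in order as written.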
Current supercomputers are scaled to operate at an even higher speed by adding several processors which operate in parallel, thus compounding the complexity of the control system and the bandwidth problems from main memory to the floating point processors.
It has long been recognized in the computer industry that parallel processing architectures have the possibility of providing increased computation speeds and power at lower cost. Because of this, numerous parallel computer architectures have been proposed and studied and a number of different types of parallel processing computer systems have been built or are under construction.
There has been some success in parallel processing systems to date, but most, if not all, of the truly notable success has been achieved in specialized applications such as image processing. A good portion of this success is directly attributable to specialized parallel processing systems which are custom designed to meet the specific requirements of the intended application. For example, the assignee of the present invention has enjoyed success in designing parallel processing systems for image processing applications, and is continuing to work in this area. Recent advances made by the assignee of the present invention have been realized by building a parallel processing system along the lines described in my copending U.S. patent application Ser. Nos. 057,128 and 057,182 respectively entitled "Linear Chain of Parallel Processors and Method of Using Same" and "Neighborhood Processing System and Method", which were both filed on Jun. 1, 1987. Both of these copending applications are hereby incorporated by reference, as they are illustrative of the generally advanced state of the art, and of the parallel processing systems now being marketed by the assignee of the present invention, and as they have a number of attributes in common with the basic architecture of the processing system of the present invention. The attributes in common principally are: (1) the use of a linear chain of identical individual interconnected processing units, each of which has a processing element or cell; (2) the use of a local memory with each unit, the memory being arranged in a "multiple row-single column" format. However, as just mentioned, the systems described therein are directed to image processing applications and are not well suited for performing floating point computations at extremely high rates of speed, as is required for supercomputer applications.
The results achieved thus far in the field of vector operations using true parallel processing computer architectures have been mixed at best. Typically, such systems have not lived up to the expectations of providing greater processing speeds and power at a lower cost. Moreover, the costs and complexity of both the hardware and software for such systems have proven to be significant. One summary of the many on-going activities in parallel computer architecture is provided in the article by J. Bond, "Parallel-processing concepts finally come together in real systems", Computer Design, pp. 51-74 (Jul. 1, 1987). The article indicates that a number of such systems utilize pipelined vector processors or floating point coprocessors at each node or processing cell, and are still rather complex and expensive. A number of these systems thus rely upon conventional floating point solutions implemented upon either conventional or custom-designed integrated circuit chips.
None of the many parallel processing systems discussed in the aforementioned article appear to use bit serial techniques for performing mathematical operations such as addition, multiplication or floating point computations. Moreover, none of the systems appear to provide a parallel computer architecture on a single printed circuit (PC) board which is capable of performing at 100 Mflops or better. Finally, most of the computer architectures discussed as being suitable for vector operations such as floating point computations seem to have a complexity approaching that of the more conventional supercomputers discussed above. Thus, it appears that a new approach to vector operations and floating point computations may be required if parallel processing systems are to live up to their expected potential of delivering more processing speed and power at lower cost.
Computational requirements in scientific problems such as chemical and nuclear modeling, and engineering problems such as turbulent airflow in aircraft design, will lead to computer requirements exceeding 10^12 floating point operations per second (Teraflops). The present method of computer design will not be able to meet these requirements. The complex designs will have to give way to massively parallel SIMD designs because the instruction cycle times for even moderately parallel architectures will exceed the capabilities of picosecond logic, and even exceed limitations of the speed of light.
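The speed of light limitation can be made concrete with a rough calculation. The sketch below assumes, purely for illustration, a single processor that completes one floating point operation per cycle, and it uses the speed of light in vacuum, which signals in real conductors do not reach. The function name is hypothetical.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458    # defined value, in vacuum


def max_signal_distance_m(ops_per_second: float) -> float:
    """Distance light can travel during one cycle at one operation per cycle."""
    cycle_time_s = 1.0 / ops_per_second
    return SPEED_OF_LIGHT_M_PER_S * cycle_time_s


# One processor sustaining 10^12 operations per second has a 1 picosecond cycle.
print(max_signal_distance_m(1e12) * 1000, "millimeters per cycle")
```

Under these assumptions a signal could cover only about 0.3 millimeters per cycle, far less than the dimensions of a processor and its memory, whereas spreading the same total rate over many slower processing elements relaxes the cycle time of each in proportion.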
Therefore, a primary object of the present invention is to provide a low cost system capable of handling vector oriented problems at a minimum speed of hundreds of Mflops, with a potential of hundreds of thousands of Mflops.
Another object of the invention is to eliminate the costly hardware needed to support the high bandwidth of data between the floating point processors and main memory by using a parallel processing system which employs local memories connected to the individual floating point processing cells.
A further object of the invention is to simplify the type of controller structure which supports the processor and memory system, to provide for simpler methods of controlling floating point operations carried out thereby.
A still further object of the invention is to obtain higher speeds by adding more floating point processors without compounding the complexity of the system.
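The general idea behind these objects, a single controller issuing one instruction to many cells that each operate on their own local memory, can be modeled in a few lines. The following is a toy software sketch of a generic SIMD arrangement and is not a description of the system of the present invention; the class `Cell`, the function `broadcast`, the instruction names, and the memory layout are all hypothetical.

```python
class Cell:
    """A processing cell with its own small local memory."""

    def __init__(self, a: float, b: float):
        self.mem = {"a": a, "b": b, "out": 0.0}

    def execute(self, op: str, dst: str, src1: str, src2: str) -> None:
        # Operands come only from this cell's local memory, so no
        # shared bus to a central main memory is involved.
        if op == "add":
            self.mem[dst] = self.mem[src1] + self.mem[src2]
        elif op == "mul":
            self.mem[dst] = self.mem[src1] * self.mem[src2]
        else:
            raise ValueError(f"unknown instruction: {op}")


def broadcast(cells, op, dst, src1, src2):
    """Controller: send the same instruction to every cell."""
    for cell in cells:      # sequential here; in hardware the cells step together
        cell.execute(op, dst, src1, src2)


cells = [Cell(1.0, 2.0), Cell(3.0, 4.0), Cell(5.0, 6.0)]
broadcast(cells, "mul", "out", "a", "b")
print([cell.mem["out"] for cell in cells])
```

In this model the controller logic is the same whether there are three cells or three thousand, which is the sense in which adding cells need not add control complexity.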