The present invention relates to a computer system, and particularly to a parallel processor system adapted for carrying out computationally intensive procedures.
A number of computer applications involve executing scientific algorithms on large arrays of data. Such algorithms are commonly referred to as matrix algorithms and share several significant characteristics: they typically operate on multi-dimensional data arrays, they can be naturally parallelized and broken down into blocks, and they involve many computations per data point. Since most general-purpose computer systems are adapted to single scalar operations and do not perform array computations efficiently, special computer architectures have been developed to reduce the time necessary to process large arrays of data. However, array processing for complex algorithms remains comparatively time consuming.
Traditional vector supercomputers perform matrix algorithms by decomposing them into a series of vector instructions: reading a vector from memory, performing an arithmetic operation such as a multiply-add, and storing the result back into memory. Operating in this manner often yields only two floating-point operations for every three memory accesses. The speed of such a series of vector operations can be increased by employing a faster memory system, but faster memory systems can be prohibitively expensive. Memory bandwidth, i.e., the ability to move data between memory and processors, is typically the greatest expense in building a computer system.
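The two-to-three ratio noted above can be seen in a multiply-add of the "axpy" form, a common building block of matrix algorithms. The following sketch is illustrative only (it is not from the specification): each element requires two floating-point operations (one multiply, one add) but three memory accesses (load x[i], load y[i], store y[i]).

```python
def axpy(a, x, y):
    """y <- a*x + y, element by element, counting work as it goes."""
    flops = 0
    mem_accesses = 0
    for i in range(len(x)):
        xi = x[i]           # memory access 1: load x[i]
        yi = y[i]           # memory access 2: load y[i]
        y[i] = a * xi + yi  # 2 flops; memory access 3: store y[i]
        flops += 2
        mem_accesses += 3
    return flops, mem_accesses

y = [4.0, 5.0, 6.0]
flops, accesses = axpy(2.0, [1.0, 2.0, 3.0], y)
# flops/accesses == 2/3 regardless of vector length, which is why
# memory bandwidth, not arithmetic, dominates the cost.
```

However fast the arithmetic units run, each result still costs three bus transactions, so throughput is capped by the memory system.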
There are many other factors that limit the speed at which a computer system can process an array of data. However, eliminating speed roadblocks in certain areas does not always yield the most cost-effective means for increasing processing power. Sophisticated board technologies can minimize capacitance, propagation delays and noise, but the goal in developing computer systems is often to achieve high performance at low cost while providing the simplest means for implementing application software.
Another alternative for increasing overall computer processing capacity is increasing the number of processing elements used in the system. Employing twice the number of processors has the potential of reducing processing time by a factor of two over a single processor system. However, depending on the application, the time required to perform an algorithm may not decrease in proportion to an increase in the number of processing elements. Since many computer applications require completion of a first operation before a second operation can begin, a second processor that would normally perform the second operation often must remain idle until the first operation is finished. Such dependencies can result in a dual processor computer system requiring approximately the same time to complete a procedure as a single processor system.
Computer architectures can also limit processing capacity in that bottlenecks during data transfer can cause processing elements to idle. For example, in a single bus computer system, only one processor may access system memory at a time, while other processors connected to the bus idle until the processor presently using the bus completes its transfer. One method of reducing system bottlenecks is to provide computer architectures that perform a few specific algorithms quickly, but their application is limited. Computer bottlenecks may also be reduced by adding buses and allowing multiple accesses to memory at the same time. These methods add cost and complexity to the system and are still restricted by the type of algorithms performed.
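The single-bus bottleneck and the cost of adding buses can be illustrated with a simple counting model (hypothetical, not from the specification): transfers sharing a bus serialize, while each added bus lets one more transfer proceed concurrently.

```python
import math

def total_transfer_time(n_processors, transfer_time, n_buses=1):
    """Time for every processor to complete one memory transfer.
    Transfers on the same bus serialize; with n_buses, that many
    transfers proceed in parallel per round."""
    rounds = math.ceil(n_processors / n_buses)
    return rounds * transfer_time

# Four processors on one bus take four transfer slots; three of the
# four processors idle at any given moment:
assert total_transfer_time(4, 10) == 40
# A second bus (at added cost and complexity) halves the total:
assert total_transfer_time(4, 10, n_buses=2) == 20
```

The model also shows why adding buses has diminishing value: the benefit depends on how evenly the algorithm's memory traffic spreads across them.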
Bus conflicts may be reduced with software employing machine code that will coordinate memory accesses for plural processors in a system. However, this method can become too complex to perform effectively as the number of processors increases and, in addition, the software must be rewritten for each system configuration change. Using semaphores for system coordination is counterproductive because accesses to system memory for reading and setting semaphores consume precious memory bandwidth.
A substantial problem encountered with computers employing multiple processors involves tracking the processors and the points at which a data bus is available for subsequent data transmission to global memory. Software compilers for sorting the data and processing commands to each processor have been employed in an attempt to maximize processing capacity by reducing data bus conflicts. However, as the number of processors increases, the multiprocessor compilers become less efficient in allocating bus time between processing elements. Since the bus use for each processor depends on the algorithm that is presently being performed, software apportionment is complex and it is difficult to attain maximum system processing capacity.
FIG. 1A illustrates a multiprocessor bi-directional bus architecture system with a common global memory as found in the prior art. I/O processor 10, multiple data processing elements 12, and global memory 14 are all connected to bus 16. Instructions typically enter the system on bus 16 from mass storage through I/O processor 10, which transfers the incoming code to global memory 14. Data for each processor is likewise transmitted over bus 16 to global memory 14. Each processor may perform a portion of the entire algorithm or, depending on the algorithm and the amount of data, may perform the same algorithm on a different section of the data.
The system of FIG. 1A illustrates a typical architecture for a multiprocessor workstation. Such a workstation employs an inexpensive bus architecture, thereby providing an economical system. However, the single bus between the central processing units and memory impedes serious supercomputing.
Referring now to FIG. 1B, a block diagram of a vector supercomputer in accordance with the prior art, a plurality of processors 12 are each coupled to a separate I/O interface 10 and are also connected to crossbar 18 via multiple crossbar/processor ports 20, each processor coupling to one or more crossbar/processor ports 20. The crossbar connects the processors to memory 14 through multiple crossbar/memory ports 22. Crossbar 18 employs a complex multi-layered crossbar scheme for connecting the multiple processors 12 to memory bank 14. This complex crossbar, and the memory interconnections required for such a system configuration, while effective in enhancing system performance, can be prohibitively expensive.
FIG. 1C is a block diagram of a relatively simple computer system illustrating a prior art architecture for increasing bandwidth by allowing concurrent access to multiple ports of the same global memory array. Crossbar 18 connects multiple data processing elements 12 to global memory 14 by means of crossbar/processor ports 20 and crossbar/memory ports 22, each processor having its own dedicated crossbar/processor port. Memory 14 is provided with multiple input ports, each memory port being coupled to a single crossbar/memory port. Crossbar 18 decodes address values from each processor, connecting the data bus of the processing element that asserted a value on an address bus to the associated memory port in memory 14. The data on the processor data bus is transferred to or from the memory location through the memory port associated with the address value. An I/O interface 10 is also provided with a crossbar port. While such a system provides increased processing speed, the cost of supporting memory transfer bandwidth for each processor on the multiple memory ports is high relative to the computing speed gained. The systems of FIG. 1B and FIG. 1C illustrate typical structures wherein one or more crossbar ports are dedicated to each processor.
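The address-decode routing described for FIG. 1C can be sketched as follows. All names, the modulo banking scheme, and the port count are illustrative assumptions; the specification does not state how addresses map to memory ports.

```python
NUM_MEMORY_PORTS = 4  # assumed port count, for illustration only

def decode_memory_port(address):
    """Interleave addresses across memory ports by low-order bits,
    one common banking scheme (the text does not specify one)."""
    return address % NUM_MEMORY_PORTS

class Crossbar:
    def __init__(self, num_ports):
        # Which processor, if any, currently holds each memory port.
        self.busy = [None] * num_ports

    def connect(self, processor_id, address):
        """Decode the asserted address and route that processor's
        data bus to the corresponding memory port, or stall the
        processor if another transfer holds the port."""
        port = decode_memory_port(address)
        if self.busy[port] is None:
            self.busy[port] = processor_id
            return port   # transfer proceeds, concurrent with other ports
        return None       # port conflict: processor must wait

xbar = Crossbar(NUM_MEMORY_PORTS)
assert xbar.connect(0, 0x1000) == 0      # 0x1000 % 4 == port 0
assert xbar.connect(1, 0x1001) == 1      # different port: concurrent access
assert xbar.connect(2, 0x2000) is None   # same port as processor 0: stalls
```

The last line shows the residual bottleneck: concurrency is gained only when processors address different memory ports, yet every processor still pays for a dedicated crossbar port.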