To the extent that throughput measures computer performance and total chip area measures computer cost, the ratio of throughput to area expresses a computer's performance-to-cost ratio. Maximum throughput is often the main objective of computer architecture. Maximum throughput-to-area ratio is a related objective in a world of finite resources: for a given total chip area used in a computer, a higher throughput-to-area ratio implies higher throughput. Equivalently, for a given required throughput, a higher throughput-to-area ratio implies lower total chip area. Throughput-to-area ratio is especially important in designing computers for problems that demand the highest possible throughput within a limited budget of total chip area.
This background discussion presents a sequence of improvements to computer architecture leading from uni-processors to maximum-throughput programmable VLSI-based multiprocessors. Each step in the sequence increases the throughput and/or decreases the cost of the computer. This sequence is meant to be descriptive of one path towards the goal of fast, inexpensive computers, rather than prescriptive of all such paths. The last two steps in this particular sequence are claimed by the inventor. The inventor has established that these last two steps, together, increase considerably the throughput-to-area ratio exhibited by computers solving a broad range of well known and important problems demanding the fastest possible computations.
Typically, a uni-processor contains a variety of computation means, including data storage means for representing variables, calculation means for performing arithmetic operations on those variables in a totally programmable and reprogrammable way, and local control means for specifying the step-by-step operation of the variety of means in a serial manner. Local control means comprises program storage means and program sequencing means. The instructions comprising a uni-processor's controlling program are placed in program storage prior to the outset of the computation. Typically, the instructions contained in the program are applied singly during computation, in a sequence that is determined in part by the values of intermediate computation results. Typically also, a uni-processor is regulated by one system clock, and computation throughput is proportional to the rate of that system clock. The typical diversity of purposes for which a uni-processor is used causes flexibility to be more important than throughput-to-area ratio. In the 1970s, it became possible through integration to place substantial parts of all of the main subsystems of a uni-processor on a single chip. The preferred embodiments of uni-processors have since been microprocessor-based, due to the inherent speed and cost advantages of integration.
Unfortunately, integration alone does not always make computers fast enough, because the electrical characteristics of devices produced in a given chip-making process impose an upper bound on the throughput attainable with a single microprocessor. The need for yet higher throughput motivates the design of parallel computer systems, or multiprocessors, containing large numbers of coordinated and specialized processing elements, or PEs. Typically, each multiprocessor PE may comprise a microprocessor augmented with such inter-PE communication means as are required for the PEs to perform coordinated actions and with such means as required to transfer problem data into and out of the PEs. The most general multiprocessing architecture is known as multiple-instruction stream, multiple-data stream (MIMD), wherein each PE possesses data storage, calculation, and local control means similar to those of a microprocessor, in addition to inter-PE communication means and problem data input and output means.
In commercially realizable form, a MIMD computer comprises a plurality of chips called PE modules, each containing one or more PEs and interfaces to subsystems including inter-PE communication means and problem data input and output means. Each one of a MIMD computer's subsystems belongs to one of two classes: multi-chip subsystems (MCSs) and intra-chip subsystems. MCSs are distinguished from intra-chip subsystems in that each MCS comprises one or more chips and inter-chip wires connecting to at least one PE module. The operation rate of an intra-chip subsystem is not constrained by the typically slow electrical propagation characteristic of inter-chip wires. Typically, MIMD intra-chip subsystems include the data storage, calculation, and local control means individually associated with each PE. In a MIMD computation, the one or more PEs each executes a sequence of intra-chip calculations and transfers selected data to and from MCSs, independently from, but in coordination with, the other PEs. These calculations and transfers are typically regulated by a single system clock, and computation throughput is proportional to the rate of that system clock. While the system clock may be electrically standardized and buffered at each PE module, the system clock represents a single system-wide timing reference.
For some problems, MIMD computer throughput is roughly proportional to the number of PEs. Because the MIMD PE is an augmented microprocessor-like element, it occupies at least as much chip area as a microprocessor. Therefore, although MIMD computer throughput is higher than microprocessor throughput for some problems, MIMD computer throughput-to-area ratio cannot be appreciably greater than microprocessor throughput-to-area ratio.
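The scaling argument above can be illustrated with a short Python sketch. The function name `mimd_ratio`, the parameter `pe_overhead`, and all numeric values are hypothetical and chosen only for illustration. The sketch assumes ideal linear speedup, which the text says holds only for some problems.

```python
def mimd_ratio(n_pes, uproc_throughput, uproc_area, pe_overhead=0.0):
    """Throughput-to-area ratio of an idealized MIMD computer.

    Assumptions (illustrative, not taken from the source): throughput
    scales linearly with the number of PEs, and each PE occupies the area
    of one microprocessor plus a fractional overhead for inter-PE
    communication and problem data input/output means.
    """
    throughput = n_pes * uproc_throughput
    area = n_pes * uproc_area * (1.0 + pe_overhead)
    return throughput / area


# Arbitrary units: one microprocessor delivers throughput 1.0 in area 1.0.
single = mimd_ratio(1, 1.0, 1.0)
many = mimd_ratio(64, 1.0, 1.0, pe_overhead=0.25)
```

Under these assumptions the PE count cancels out of the ratio, so adding PEs raises throughput but leaves the throughput-to-area ratio at or below that of a single microprocessor.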
Frequently, problems solved by MIMD computations are data-parallel. A data-parallel problem is divisible into a collection of subproblems, each of which is associated with a subset of the problem-defining input data set. The data subsets associated with distinct subproblems overlap, and such overlap induces an inter-PE communication requirement when the subproblems have been partitioned among the PEs. The amount of inter-PE communication required is proportional to the amount of overlap among the subproblems' data subsets, and this overlap varies among data-parallel problems. For a given data-parallel problem, choosing a partition that minimizes the amount of required inter-PE communication is important for achieving efficient computation.
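The effect of the partition on communication can be made concrete with a small Python sketch. It assumes a hypothetical one-dimensional problem in which the subproblem for each input item also needs the items within `radius` positions of it. The function `remote_items` and the two example partitions are illustrative only and are not taken from any particular system.

```python
def remote_items(owner, radius):
    """Count the input items each PE needs but does not own, summed over PEs.

    owner[i] is the PE assigned item i. The subproblem for item i is assumed
    to need items i - radius .. i + radius (clipped to the data set), so
    neighboring subproblems have overlapping data subsets.
    """
    n = len(owner)
    total = 0
    for pe in set(owner):
        needed = set()
        for i in range(n):
            if owner[i] == pe:
                needed.update(range(max(0, i - radius), min(n, i + radius + 1)))
        # Items needed by this PE but owned elsewhere must be communicated.
        total += sum(1 for j in needed if owner[j] != pe)
    return total


# Two ways to partition 16 items among 4 PEs.
block = [i // 4 for i in range(16)]   # contiguous blocks of 4 items
cyclic = [i % 4 for i in range(16)]   # items dealt out round-robin

block_cost = remote_items(block, radius=1)
cyclic_cost = remote_items(cyclic, radius=1)
```

For this example the contiguous partition requires 6 remote items and the round-robin partition requires 30, which shows how strongly the choice of partition can affect the inter-PE communication requirement.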
Typically, a MIMD computation solving a data-parallel problem is structured as a single program replicated in every PE. MIMD computations structured in this way are of sufficient importance to merit designation as a unique class of computation known as single program, multiple-data stream (SPMD) computation. Although SPMD is a specialized method of using a MIMD computer rather than an improvement to the computer itself, SPMD's simplicity in some cases reduces the programming costs associated with computation. SPMD computations are commonly applied in solving demanding data-parallel problems such as those arising in weather forecasting, nuclear reactor dynamic simulation, pattern recognition, oceanography, seismology, image and signal processing, data compression, data encryption, and specialized mathematical operations on large sets of numbers.
In some SPMD computations, the replicated program executed on every PE progresses in identical sequence on every PE. For such computations, the physically replicated local control means associated with each MIMD PE is redundant. In a single-instruction stream, multiple-data stream (SIMD) computer, the redundant local control associated with each PE is removed in favor of a single shared control element called a system controller. (SIMD computation was identified as an alternative to MIMD computation as early as 1972 by Michael J. Flynn, in "Some Computer Organizations and Their Effectiveness," IEEE Transactions on Computers, C-21(9):948-960, September 1972, at page 954.) The system controller in a generic SIMD computer consolidates the PE local control that is replicated in a MIMD computer and is redundant in some SPMD computations. The system controller sequences instructions and broadcasts those instructions via a global instruction broadcast network to each of the plurality of PEs. To allow execution of data-dependent programs (programs wherein the sequence of executed instructions depends on values of intermediate computation results), the system controller also receives status information from the PEs via a response network.
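The interplay of the system controller, the instruction broadcast network, and the response network can be sketched in Python as follows. This is a minimal software model under stated assumptions, not a description of any particular machine: the class `PE`, the opcode `"HALVE"`, the function `run_simd`, and the choice of a global OR as the response are all hypothetical.

```python
class PE:
    """A processing element with data storage and calculation means only.

    It holds no program and no sequencer; it executes whatever instruction
    the system controller broadcasts.
    """

    def __init__(self, value):
        self.acc = value  # local data storage

    def execute(self, opcode):
        if opcode == "HALVE":
            self.acc //= 2
        else:
            raise ValueError(f"unknown opcode: {opcode}")

    def status(self):
        # One status bit reported to the controller (assumed meaning:
        # "my value is still greater than 1").
        return self.acc > 1


def run_simd(initial_values):
    """System controller: sequence instructions for all PEs in lockstep."""
    pes = [PE(v) for v in initial_values]
    broadcasts = 0
    # Response network modeled as a global OR of the PE status bits. The
    # loop is data-dependent: it continues while any PE reports True.
    while any(pe.status() for pe in pes):
        for pe in pes:  # instruction broadcast: same opcode to every PE
            pe.execute("HALVE")
        broadcasts += 1
    return broadcasts, [pe.acc for pe in pes]


broadcasts, final = run_simd([8, 3, 1])
```

In this model every PE executes every broadcast instruction, including PEs whose own status bit is already False, because the only sequencing decision is made once, centrally, by the controller.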
In any multiprocessor computation, the PEs collectively perform the majority of the calculations required to produce the result. The inherent advantage of a SIMD computer is that a maximum proportion of the total chip area is used for PE data storage and calculation means, in preference to having fewer PEs each having associated microprocessor-like local control mechanisms. Compared to its MIMD counterparts, a SIMD PE module realized in a given chip area contains at least 2 times, and possibly 5 times or more, the chip area allocated to PE data storage and calculation means.
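One way to read the factors of 2 and 5 is as a consequence of how much of a MIMD PE's area is devoted to local control. The following Python sketch works through that arithmetic. The function `datapath_area_gain` and the control fractions used are assumptions for illustration only; the source does not state these fractions, and the sketch ignores the area of the shared system controller and its network interfaces.

```python
def datapath_area_gain(control_fraction):
    """Gain in area available for PE data storage and calculation means.

    control_fraction is the assumed share of each MIMD PE's area occupied
    by local control means. If a SIMD PE module reclaims all of that area,
    the area left for data storage and calculation grows by
    1 / (1 - control_fraction) within the same chip area.
    """
    if not 0.0 <= control_fraction < 1.0:
        raise ValueError("control_fraction must be in [0, 1)")
    return 1.0 / (1.0 - control_fraction)


# Hypothetical fractions: local control taking half, or four fifths,
# of a MIMD PE's area.
gain_half = datapath_area_gain(0.5)
gain_four_fifths = datapath_area_gain(0.8)
```

Under this reading, a gain of 2 corresponds to local control occupying half of each MIMD PE, and a gain of about 5 corresponds to local control occupying four fifths of it.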
Neglecting physical constraints arising from MCSs' inter-chip connections and their associated wire delays, it can be assumed that instructions are broadcast to the PEs at the same rate at which a PE can execute them. A SIMD computer can thus be expected to exhibit maximum throughput-to-area ratio for some problems, by maximizing the number of fixed-design PEs operating at a given rate on the available chip area.
Despite the apparent inherent advantage of generic SIMD computers, commercial and academic results achieved to date have been disappointing: generic SIMD computers do not exhibit an appreciably higher throughput-to-area ratio than similar-cost MIMD counterparts. MIMD is currently the commercially favoured architecture for high-throughput programmable multiprocessors. Absent appreciably higher throughput, SIMD is seen as inferior to MIMD because of the relatively lower programming flexibility that results from consolidating PE local control into the single SIMD system controller.