The present invention relates generally to the field of computer systems and more particularly to computer processing elements for use in concurrent computing systems.
Although general purpose computing systems of the Von Neumann type have made great advances in both computing speed and cost per computation through improvements in VLSI circuit design and fabrication, these systems are still too slow to solve many real-time computational problems. Computer applications in the signal processing area often require more than a billion calculations per second. This is far above the throughput of currently available Von Neumann computers.
The classical Von Neumann computer consists of a memory connected to a central processing unit. Instructions and data are fetched from the memory by the central processing unit, which is responsible for essentially all of the computational tasks. The typical central processing unit is capable of executing hundreds of different instructions. However, it cannot execute these simultaneously; it typically executes only one instruction at a time. Hence, at any given time, most of the circuitry in the central processing unit is idle. In addition to reducing the cost effectiveness of the central processing unit, this idle circuitry reduces the speed at which the central processing unit can operate. The need to include this circuitry on the computer chip results in a larger chip with longer connecting paths between the various processing elements. These longer signal paths have significant parasitic capacitances which limit the speed at which they can be driven. Hence, as the size of the central processing unit is increased, the maximum clock rate at which it can run is decreased. This further reduces the cost effectiveness of the central processing unit.
In addition, the memory from which the central processing unit fetches instructions and data is typically located on a separate chip and hence is also limited in speed by the capacitances of the signal paths. This limitation can be reduced somewhat by including a small, fast cache memory on the central processing unit chip for holding instructions and/or data which would otherwise be repeatedly transferred between the central processing unit and the large system memory. However, the size of the cache memory needed to relieve the problems introduced by the off-chip system memory may be too large to be included on the central processing unit chip.
These limitations of the classical Von Neumann computer design have been overcome to some degree in the prior art by designing special purpose computing hardware which is optimized for a particular computational task. For example, a common problem in signal processing involves the construction of a digital filter which functions in a manner analogous to an analog band pass filter. The computations involve forming a sum of the products of the digital signal values multiplied by weighting factors which depend on the time at which each digital signal value was measured. This may be accomplished by constructing a linear array of N processors. At any given time, the Kth processor in the linear array contains the value of the digital signal as it existed K clock periods earlier. During each clock period, each processor computes one term in the sum, i.e., the product of the digital signal value it is currently storing and a constant inputted to it on a separate signal line and stored within the processing element. The result of this computation in the Kth processor is then added to the result from the (K-1)th processor and passed on to the (K+1)st processor together with the digital signal value that was stored in the Kth processor. A new digital signal value is inputted to the first processor after the start of each clock period. The value of the sum outputted from the Nth processor is used to calculate the filtered signal value corresponding to the digital signal value.
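The sum of products formed by such a linear array may be illustrated by the following minimal behavioral sketch. It is not a cycle-accurate model of the hardware: the accumulation across all stages is collapsed into one loop per clock period, stages are numbered from zero for convenience, and the names `fir_linear_array`, `samples` and `weights` are assumptions introduced here for illustration only.

```python
from collections import deque


def fir_linear_array(samples, weights):
    """Behavioral model of an N-stage linear array computing a sum of products.

    Assumption: stage k holds the sample that entered the array k clock
    periods ago (zero before any input has arrived) and a fixed weight.
    """
    n_stages = len(weights)
    # One stored signal value per stage; a full deque drops its oldest entry.
    stages = deque([0.0] * n_stages, maxlen=n_stages)
    outputs = []
    for x in samples:
        # A new sample enters the first stage; every stored value moves on
        # to the next stage and the oldest value leaves the array.
        stages.appendleft(x)
        # Each stage contributes one term; the partial sum is accumulated
        # from the first stage through to the last.
        partial_sum = 0.0
        for k in range(n_stages):
            partial_sum += weights[k] * stages[k]
        outputs.append(partial_sum)
    return outputs


impulse_response = fir_linear_array([1, 0, 0, 0], [0.5, 0.25, 0.25])
```

Feeding a single unit sample through the sketch returns the weights themselves in order, followed by zero, which is a convenient check that each stage sees the sample at the expected clock period.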
The problem with this type of special purpose hardware is its lack of versatility. For example, the array of processors described above cannot be easily reconfigured to perform a calculation requiring two multiplies and an addition in each processor. Similarly, if the particular problem does not require all N stages, the unused stages cannot be used to perform other multiplication and addition steps. As a result, these special purpose processors achieve a high efficiency, as measured by computations per second per square micron of integrated circuit area, only for a small class of problems.
There have been a number of attempts to construct more general purpose processors which avoid the limitations of the classical Von Neumann computer. The vector processing computer described by Cray (U.S. Pat. No. 4,128,880) is typical of such processors. This computer is optimized for repetitive calculations involving a small number of operations which are to be performed successively on each element of one or more vectors. For example, the process may involve adding corresponding elements of two vectors and then storing the result in the corresponding element of a third "result" vector. The data making up the vectors is transferred to a set of vector registers in this special purpose computer from a main memory which is usually part of a large computing system in which this special purpose system is incorporated. This architecture provides a substantial improvement over the classical Von Neumann architecture for a number of reasons. First, the vector registers provide a high speed memory system optimized for transferring successive elements of one or more vectors to one of a plurality of function units which performs the desired calculation and then transferring the results back to one of the vector registers. Since data that is repeatedly used is held in the vector registers until it is no longer needed, the need to transfer the same data back and forth between the slower system memory and the central processing unit is significantly reduced. Second, the instructions needed to carry out the operations on the vectors need not be repeatedly transferred between the system memory and the vector processor. Third, the function units may be optimized for the specific calculation. This allows smaller chip areas to be used and hence higher clock rates.
This type of vector processor may be reconfigured to a limited degree, which makes it applicable to a broader class of problems than the signal processing computer described above. In the "chaining mode" described by Cray, the results from one vector operation which are stored in a result vector register are immediately available as operands to a second function unit which may perform computations concurrently with other function units.
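The effect of chaining may be illustrated by the following minimal sketch, in which each element produced by a first function unit is consumed by a second function unit as soon as it is available, without waiting for the whole result vector. Python generators are used only to model this element-by-element hand-off; they do not model true concurrency, and the names `multiply_unit`, `add_unit` and `chained_multiply_add` are hypothetical and are not taken from the Cray patent.

```python
def multiply_unit(a_reg, b_reg):
    """First function unit: yields one product per step."""
    for a, b in zip(a_reg, b_reg):
        yield a * b


def add_unit(operand_stream, c_reg):
    """Second function unit: consumes each operand as soon as it is produced."""
    for product, c in zip(operand_stream, c_reg):
        yield product + c


def chained_multiply_add(a_reg, b_reg, c_reg):
    """Chain the two units so that a*b feeds the adder element by element."""
    return list(add_unit(multiply_unit(a_reg, b_reg), c_reg))


chained_result = chained_multiply_add([1, 2, 3], [4, 5, 6], [10, 20, 30])
```

In this sketch the adder begins work on the first element before the multiplier has produced the last, which is the property that distinguishes chaining from running the two vector operations one after the other.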
This type of vector processor, however, suffers from three significant problems. First, it is a special purpose system which is only optimized for a specific limited class of computational problems, i.e., those involving applying a small computational program successively to each element in one or more vectors. It is inefficient at carrying out computations not in this class, and there is no way to reconfigure it when a problem for which it is not optimized is encountered. For example, if the vectors in question are too long to fit into the vector registers, there is no simple way of combining two registers to form one long register. Similarly, if the code needed to carry out the computations does not fit in the internal memory allocated for code storage, there is no way to utilize free memory in the vector register area to provide additional code storage space. In these cases, the calculation must be broken into sub-calculations which are run in tandem on the processor.
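The need to break a calculation into sub-calculations when the vectors exceed the register length may be illustrated by the following minimal sketch. The register length of four elements and the names `VECTOR_REGISTER_LENGTH` and `vector_add_in_strips` are assumptions chosen for illustration; the sketch processes one register-sized strip at a time and also counts how many sub-calculations were required.

```python
VECTOR_REGISTER_LENGTH = 4  # hypothetical register length, for illustration


def vector_add_in_strips(a, b, register_length=VECTOR_REGISTER_LENGTH):
    """Add two long vectors using registers that hold only a few elements.

    Returns the result vector and the number of sub-calculations needed.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    result = []
    strips = 0
    for start in range(0, len(a), register_length):
        # Load one register-sized piece of each vector from system memory.
        a_reg = a[start:start + register_length]
        b_reg = b[start:start + register_length]
        # Perform the vector operation on the loaded registers only.
        result.extend(x + y for x, y in zip(a_reg, b_reg))
        strips += 1
    return result, strips


total, strip_count = vector_add_in_strips(list(range(10)), [1] * 10)
```

A ten-element vector with four-element registers requires three separate sub-calculations in this sketch, each with its own transfer between system memory and the registers, which is the overhead described above.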
Second, it is difficult to configure such a system so that all of the various function units operate concurrently. If the particular computational program does not utilize all of the function units present in the processor, there is no practical method for applying the idle computational power to another part of the overall program running on the main computer system to which the vector processor has been connected.
Finally, this type of vector processor may not be efficiently combined with other such processors to form a processing array similar to that described above with regard to digital filtering. There are numerous situations in which the optimum processor configuration consists of an array of processors in which each processor performs the same computation, but on different data. The digital filtering example is such a case. Because of the high costs inherent in designing and testing a new VLSI circuit, considerable economies of scale can be realized if an array of processors is used rather than constructing one large special purpose processor having the equivalent number of function units. This is particularly true when the individual processors are of a sufficiently general nature that they may be applied to a wide variety of problems. In such a case the design and initial fabrication costs can be spread over a large number of parts thus allowing significant economies of scale to be obtained. To obtain the maximum economies of scale in this case, the replicated processor unit should contain as little control circuitry as possible, since this control function can be applied at the array level by a single control processor which services all processors in the array, thus eliminating the need to replicate this control hardware in each processor.
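The division of labor between a single control processor and an array of replicated processors may be illustrated by the following minimal sketch. Instruction decoding is performed once in the controller, and each processor merely applies the decoded operation to its own data. The class names `ProcessingElement` and `ArrayController`, the two-instruction set and the broadcast of one shared operand are all assumptions made for illustration and do not describe any particular hardware.

```python
import operator


class ProcessingElement:
    """Replicated unit: holds local data, contains no instruction decoder."""

    def __init__(self, value):
        self.value = value

    def apply(self, operation, operand):
        # Apply an already decoded operation to the locally stored value.
        self.value = operation(self.value, operand)


class ArrayController:
    """Single controller that decodes each instruction once for the array."""

    OPERATIONS = {"add": operator.add, "mul": operator.mul}

    def __init__(self, elements):
        self.elements = elements

    def run(self, program):
        for opcode, operand in program:
            operation = self.OPERATIONS[opcode]  # decoded once, at array level
            for element in self.elements:        # same operation, different data
                element.apply(operation, operand)
        return [element.value for element in self.elements]


array = ArrayController([ProcessingElement(v) for v in (1, 2, 3)])
array_result = array.run([("mul", 2), ("add", 1)])
```

Every processor in the sketch executes the same two operations on its own stored value, so the decoding table exists in one place only, however many processors are added to the array.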
The vector processor design described above contains considerable control circuitry which is designed to allow it to run independently of system control for significant periods of time. This includes memory for storing the code of the program to be executed and instruction decoding circuitry which is different for different instructions. At most, an array of such vector processors requires one copy of this circuitry. The unnecessary replication of this circuitry requires larger computer chips which in turn leads to slower clock rates as well as higher design and construction costs. Furthermore, arrays of processors of this type would suffer from input/output bottlenecks, since one bus is used for transferring data and instructions to and from each processor.
Broadly, it is an object of the present invention to provide a reconfigurable computer processor.
It is a further object of the present invention to provide a reconfigurable computer processor which contains a minimum amount of control circuitry.
It is a still further object of the present invention to provide a computer processor that may be efficiently combined with other such processors to form a processing array which may be controlled by a single controller.
These and other objects of the present invention will become apparent from the following detailed description of the present invention and the accompanying drawings.