Numerous examples of single instruction, single data path processors exist. Intel™, MIPS™, ARM™ and IBM™ all produce well-known versions of these types of processors. In recent years, in the continuing push for higher performance, these standard processors have grown to include multiple execution units with individual copies of the registers, along with out-of-order instruction processing to maximize the use of the multiple execution units. In addition, many of these processors have increased the depth of their instruction pipelines. As a result, most of the execution units become underutilized when the processing becomes serialized by load stalls or branches. Furthermore, much of the computational capability of these execution units, which have grown from 16 to 32 and on up to 64 bits per word, is wasted when the required precision of the computation is significantly less than the size of the words processed.
On the other hand, array processor architectures also exist. CDC™ and later SGI™ produced notable versions of these types of computers. They consist of a single instruction unit and multiple execution units that all perform the same series of functions according to the instructions. While they are much larger than single instruction, single execution processors, they can also perform many more operations per second as long as the algorithms applied to them are highly parallel. Their execution, however, is highly homogeneous, in that all the execution units perform the same task, with the same limited data flow options.
On the other side of the computing spectrum there exist re-configurable compute engines such as described in U.S. Pat. No. 5,970,254, granted Oct. 19, 1999 to Cooke, Phillips, and Wong. This architecture combines standard single instruction, single execution unit processing with Field Programmable Gate Array (FPGA) routing structures that interconnect one or more Arithmetic Logic Units (ALUs), which allow for a nearly infinite variety of data path structures to speed up the inner loop computation. Unfortunately, the highly variable, heterogeneous nature of the programmable routing structure requires a large amount of uncompressed data to be loaded into the device whenever the data path needs to change. So, while these engines are faster than traditional processors, the large data requirements of their routing structures limit their usefulness.
This disclosure presents a new processor architecture that takes a fundamentally different approach, minimizing the amount of logic required while maximizing use of the parallel nature of most computation, resulting in a small processor with high computational capabilities.