In the field of large, very high performance computers, commonly referred to as supercomputers, a vector processing architecture is usually provided in order to achieve the very high data processing rates required for extremely computationally intensive applications such as modeling of physical phenomena. An example of a supercomputer vector processing architecture is disclosed in U.S. Pat. No. 4,128,880 by Seymour R. Cray, assigned to Cray Research, Inc. In that architecture, a plurality of vector registers are used to hold vectors for sending to a functional unit and for receiving and temporarily holding result vectors from functional units. For maximum speed, individual vector elements are transmitted as operands from vector register to functional unit at the rate of one element per clock period, and individual result elements for the result vector are transmitted from the functional unit at the same rate. In this manner, once the start-up time or functional unit time has passed, the functional unit can provide one result of the successive operations in each clock period. Because a single calculation generally requires several clock periods to complete, fully segmented functional unit designs are used. In a segmented design, all information arriving at the unit or moving within the unit is captured and held at the end of every clock period. Of course, the number of capture and hold operations for a given functional unit depends upon the type of unit, i.e., integer ADD, floating point, multiply, logical operations, etc., as well as the number of logic levels between latches. The number of clock periods a single operation takes to pass through the unit is referred to as the functional unit time, and in general it is desirable to keep the functional unit time short not only because it affects the start-up time for beginning to produce results in a vector operation, but also because it has a significant effect on scalar operations.
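The timing behavior described above can be illustrated with a minimal sketch. It assumes an idealized segmented functional unit that accepts one operand set per clock period and ignores memory access, instruction issue, and register reservation effects. The function name and the example figures (a six-clock functional unit time and a 64-element vector) are hypothetical and chosen only for illustration.

```python
def vector_op_clocks(n_elements: int, functional_unit_time: int) -> int:
    """Clock periods until the last result leaves an idealized segmented unit.

    Assumption: the first result appears after `functional_unit_time` clock
    periods, and each later result follows one clock period after the previous.
    """
    if n_elements < 1 or functional_unit_time < 1:
        raise ValueError("both arguments must be at least 1")
    return functional_unit_time + (n_elements - 1)


# Hypothetical figures: a six-clock functional unit and a 64-element vector.
vector_clocks = vector_op_clocks(64, 6)
# A scalar operation is the one-element case, so it pays the full
# functional unit time with nothing to amortize it over.
scalar_clocks = vector_op_clocks(1, 6)
print(vector_clocks, scalar_clocks)
```

Under these assumptions the start-up cost is spread over the whole vector, whereas a scalar operation bears all of it, which is why the functional unit time matters for scalar performance.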
On the other hand, reducing the number of clock periods in the functional unit time might cause an increase in the number of levels of logic between successive latches, which in turn could dictate a slower clock time to allow for propagation and settling of signals. It is therefore necessary to achieve a balanced design between clock speed and functional unit time for the segmented functional unit.
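The balance between clock speed and functional unit time can be sketched with a simple model. This is an illustration under stated assumptions, not a description of any actual design: the logic of a unit is taken to be a fixed number of gate levels divided evenly among the segments, each level contributes a fixed delay, and each latch adds a fixed overhead to the clock period. All names and numeric values (24 logic levels, 0.5 ns per level, 1.0 ns latch overhead) are hypothetical.

```python
import math


def clock_period_ns(logic_levels, segments, gate_delay_ns, latch_overhead_ns):
    """Shortest clock period that lets signals settle within one segment."""
    levels_per_segment = math.ceil(logic_levels / segments)
    return levels_per_segment * gate_delay_ns + latch_overhead_ns


def operation_time_ns(n_elements, logic_levels, segments,
                      gate_delay_ns=0.5, latch_overhead_ns=1.0):
    """Elapsed time for n_elements results from an idealized segmented unit.

    The functional unit time equals `segments` clock periods; after that,
    one result is produced per clock period.
    """
    period = clock_period_ns(logic_levels, segments,
                             gate_delay_ns, latch_overhead_ns)
    return (segments + n_elements - 1) * period


# Compare a coarsely and a finely segmented version of the same logic.
for segments in (2, 6):
    scalar_ns = operation_time_ns(1, 24, segments)
    vector_ns = operation_time_ns(64, 24, segments)
    print(segments, scalar_ns, vector_ns)
```

In this model, finer segmentation shortens the clock period and speeds up long vector operations, but each added latch contributes overhead that a scalar operation must pay in full. In a real machine the clock period is also shared by every unit, so it is set by the slowest segment in the system rather than by one functional unit in isolation.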
The need for high-speed operation in supercomputers has usually resulted in designs wherein the critical components, including functional units, are implemented in small or medium scale emitter coupled logic (ECL) integrated circuits. Such devices are characterized by very high switching speed, high power consumption and heat dissipation, and moderate scale of integration. Very large scale integration (VLSI) gate arrays, which have found widespread use in many computer applications, offer the potential advantages of lower cost, higher density, and lower power dissipation, all of which are advantageous and which can translate into greater packing density in a supercomputer. This higher density allows the CPU to be physically smaller, which means shorter and therefore faster interconnect paths and a faster overall computer, or it allows more CPUs in the same machine space, providing a more powerful system. However, functional unit design in logic devices such as VLSI gate arrays has been difficult because the cumulative delay in propagating a signal through the device often is greater than that in an equivalent design implemented in medium scale logic. This has made such devices unacceptable for replacing medium scale logic in supercomputer functional units. Also, VLSI gate arrays which use sequential logic, i.e., which have latches on the chips, have problems due to transition time skew in attempting to run at supercomputer clock frequencies.