The present invention relates to computer hardware, and more particularly, to a hardware structure for use in custom hardware accelerators used to expedite loops.
Computer programs often make use of “loops” to process data. A loop consists of a network of operations that are repeatedly applied to a stream of input data to generate a stream of results. Custom integrated circuits likewise make use of such loops.
Hardware arrangements designed to accelerate the computation of loops are known to the art. In general, these hardware structures employ a plurality of function units working on different iterations of the loop to reduce the time needed to compute the loop by overlapping the computations of a number of loop iterations. The highest degree of overlap is obtained when a distinct function unit executes each operation within the body of the loop, and a new iteration is initiated on every clock cycle. In this case, there is a simple one-to-one correspondence between hardware function units and operations within the program graph as well as a simple correspondence between dataflow edges in the program graph and actual hardware datapaths. Simple one-to-one solutions are very efficient because they feature a minimal set of resources that are all busy on every cycle. Such designs, however, are often too costly. Less costly designs utilize schemes in which a plurality of function units are used to provide overlapped computations; however, the ensemble of function units only initiates a loop iteration every II cycles, where II is greater than 1.
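The effect of the initiation interval II on overlap can be illustrated with a short Python sketch. It is only an illustrative model, not part of the described hardware: the function names and the `latency` parameter (the number of cycles one iteration takes from start to finish) are assumptions introduced here.

```python
def total_cycles(n_iterations, ii, latency):
    """Cycles needed when a new iteration starts every `ii` cycles.

    `latency` is a hypothetical per-iteration latency in cycles.
    The last iteration starts at (n_iterations - 1) * ii and then
    needs `latency` more cycles to finish.
    """
    if n_iterations == 0:
        return 0
    return (n_iterations - 1) * ii + latency


def iterations_in_flight(cycle, n_iterations, ii, latency):
    """Indices of the iterations that are executing during `cycle`."""
    return [
        i
        for i in range(n_iterations)
        if i * ii <= cycle < i * ii + latency
    ]


if __name__ == "__main__":
    n, latency = 100, 4
    print("no overlap:", n * latency)               # one iteration at a time
    print("II = 1    :", total_cycles(n, 1, latency))
    print("II = 2    :", total_cycles(n, 2, latency))
    print("in flight at cycle 3, II = 1:", iterations_in_flight(3, n, 1, latency))
    print("in flight at cycle 3, II = 2:", iterations_in_flight(3, n, 2, latency))
```

Under these assumptions, a larger II lowers the number of iterations in flight at once, which is what allows fewer function units to be shared among the operations at the cost of a longer total run time.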
In general, one iteration of the loop generates values that are needed in subsequent computations, either in the current iteration or in a subsequent iteration. These values must be stored in some form of high-speed storage that is accessible to all of the function units that require these values. The cost of this storage represents a significant fraction of the cost of a hardware loop accelerator.
Broadly, it is the object of the present invention to provide an improved hardware accelerator architecture for accelerating loops.
It is a further object of the present invention to provide a high-speed storage system for use in hardware accelerators and the like.
These and other objects of the present invention will become apparent to those skilled in the art from the following detailed description of the invention and the accompanying drawings.
The present invention is a computational unit for use in loop computations. The computational unit includes a function unit, a plurality of phase lines, and a storage register. The computational unit is programmed to initiate one iteration of the loop every II cycles. The function unit has a result output for outputting one computational result each cycle. There is one phase line corresponding to each of the II cycles. The storage register includes a linearly connected array of shift cells having a first shift cell. Each shift cell has an input port, an output port, a shift control port, and an OR gate. Each shift cell receives the value to be stored in the shift cell on the input port, the value being stored in response to a control signal on the shift control port. The OR gate has an output connected to the shift control port and one input for each cycle on which that shift cell is to receive the control signal, that input being connected to the phase line corresponding to that cycle. The input port of the first shift cell is connected to the result output. A plurality of such computational units can be connected together to form a loop accelerator. The accelerator includes a cross-connect circuit for coupling at least one shift cell output of one of the computational units to an input of a function unit of another of the computational units on a selected one of the II cycles.
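The behavior of the storage register summarized above can be sketched as a small cycle-level simulation in Python. This is a simplified behavioral model under stated assumptions, not the claimed circuit: the class names `ShiftCell` and `StorageRegister`, the `clock` method, and the assumption that all cells sample their inputs synchronously before any cell updates are introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ShiftCell:
    # Phase lines wired to this cell's OR gate (hypothetical representation).
    phases: frozenset
    value: object = None

    def shift_enabled(self, phase_lines):
        # OR gate: the cell loads if any connected phase line is asserted.
        return any(phase_lines[p] for p in self.phases)


class StorageRegister:
    def __init__(self, ii, cell_phases):
        self.ii = ii
        self.cells = [ShiftCell(frozenset(p)) for p in cell_phases]

    def clock(self, cycle, result):
        # Exactly one phase line is asserted per cycle, selected by cycle mod II.
        phase_lines = [p == cycle % self.ii for p in range(self.ii)]
        # Sample every input before updating, so all cells shift together:
        # the first cell sees the function unit result, the others see the
        # previous cell's stored value.
        inputs = [result] + [c.value for c in self.cells[:-1]]
        for cell, new_value in zip(self.cells, inputs):
            if cell.shift_enabled(phase_lines):
                cell.value = new_value

    def outputs(self):
        return [c.value for c in self.cells]


if __name__ == "__main__":
    # II = 2; the first cell loads on both phases, the second only on phase 1.
    reg = StorageRegister(ii=2, cell_phases=[{0, 1}, {1}])
    for cycle in range(4):
        reg.clock(cycle, result=cycle * 10)  # stand-in function unit result
        print(cycle, reg.outputs())
```

In this sketch, a cell wired to only some of the phase lines holds its value across the remaining cycles, which is how a result can be kept available for a consumer that reads it on a later cycle of the II-cycle schedule.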