1. Technical Field
The present invention relates generally to circuits for parallel computation. Specifically, the present invention provides a mesh topology for computing time- and wire-length-optimal cyclic segmented parallel prefix operations.
2. Description of the Related Art
Parallel prefix circuits have evolved as a generalization of efficient algorithms for binary arithmetic. Ladner and Fischer introduced parallel prefix computations as a class of parallel algorithms. Ladner, R. E. et al. “Parallel Prefix Computation,” J. of the ACM, 27(4):831-838, October 1980. See also Pippenger, N. “The Complexity of Computations by Networks,” IBM J. of Research and Development, 31(2):235-243, March 1987; Blelloch, G. E. “Scans as Primitive Parallel Operations,” IEEE Trans. on Computers, C-38(11):1526-1538, November 1989; Blelloch, G. E. “Prefix Sums and their Applications,” Technical Report CMU-CS-90-190, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pa. 15213, November 1990; Leighton, F. T. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann, 1992; and Cormen, Leiserson, and Rivest. Introduction to Algorithms, MIT Press 1990. Parallel prefix circuits were implemented in Thinking Machines' CM-5 supercomputer. See Leiserson, C. E. et al. “The Network Architecture of the Connection Machine CM-5,” J. of Parallel and Distributed Computing, 33(2):145-158, March 1996; U.S. Pat. No. 5,333,268 (DOUGLAS et al.) 1994-07. The Ultrascalar processor is based on the observation that cyclic segmented parallel prefix circuits can implement all the tasks of a typical superscalar processor, including register renaming, wake-up, scheduling, and committing, in an orderly, principled fashion. See Henry, D. S. et al. “Cyclic Segmented Prefix Circuits,” Ultrascalar Memo 1, Yale University, November 1998; Henry, D. S. et al. “The Ultrascalar Processor—An Asymptotically Scalable Superscalar Microarchitecture,” in 20th Anniversary Conference on Advanced Research in VLSI, pp. 256-278, Atlanta, Ga., March 1999; Henry, D. S. et al. “Circuits for Wide-Window Superscalar Processors,” in 27th Int'l Symposium on Computer Architecture, pp. 236-247, Vancouver, BC, June 2000; U.S. Pat. No. 6,609,189 (KUSZMAUL et al.) 2003-08. Parallel prefix circuits have also been applied to load/store disambiguation, although under the name “scan circuit.” See U.S. Pat. No. 6,038,657 (FAVOR et al.) 2000-03.
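By way of illustration only, the behavior of a cyclic segmented prefix operation may be modeled sequentially in software. The following Python sketch assumes one common inclusive formulation, in which each output combines the inputs from the nearest segment start at or before that position (wrapping around the end of the array) through the position itself. This formulation, the function name cyclic_segmented_prefix, and the parameter names are assumptions made for the example and are not taken from the references cited above, which should be consulted for the precise definition.

```python
from operator import add


def cyclic_segmented_prefix(values, segment_starts, op=add):
    """Sequential reference model of an inclusive cyclic segmented prefix.

    Assumed semantics (illustrative, not from the cited references):
    output[i] combines values[k..i], where k is the nearest index at or
    before i, wrapping around the array, whose segment bit is set.
    """
    n = len(values)
    if n != len(segment_starts):
        raise ValueError("values and segment_starts must have equal length")
    if not any(segment_starts):
        raise ValueError("at least one segment bit must be set")

    out = [None] * n
    # Start the walk at a segment start so the accumulator is always defined.
    first = next(i for i, flag in enumerate(segment_starts) if flag)
    acc = None
    for step in range(n):
        i = (first + step) % n  # cyclic index: wraps past the end of the array
        if segment_starts[i]:
            acc = values[i]           # a set segment bit restarts the accumulation
        else:
            acc = op(acc, values[i])  # otherwise extend the running result
        out[i] = acc
    return out


single = cyclic_segmented_prefix(
    [1, 2, 3, 4, 5, 6], [False, False, True, False, False, False]
)
# single == [19, 21, 3, 7, 12, 18]: the segment begins at index 2 and wraps around
```

The model visits the elements one at a time and therefore takes time linear in the array length; it is intended only to fix the input/output behavior against which a parallel circuit could be checked.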
Much of the appeal of parallel prefix computations stems from the fact that they can be implemented as a tree structure in VLSI with logarithmic complexity. Traditionally, complexity theory accounts for the number of nodes in the tree-structured circuit rather than the length of the wires. With increasing clock speeds, however, the lengths of the wires begin to dominate the critical path.
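The logarithmic depth referred to above may be illustrated, purely as a software model, by a prefix scan computed through recursive doubling. The sketch below is not the tree-structured circuit of the prior art, nor the topology of the present invention; recursive doubling is substituted here for brevity, and the function name log_depth_scan is hypothetical. It counts the rounds of combination, each of which consists of mutually independent operations that hardware could perform concurrently. The model counts only combining steps and, like the traditional accounting described above, ignores wire length.

```python
from operator import add


def log_depth_scan(values, op=add):
    """Inclusive prefix scan by recursive doubling.

    Returns (prefixes, rounds). Every combination within a round is
    independent of the others, so rounds models circuit depth.
    """
    result = list(values)
    n = len(result)
    stride, rounds = 1, 0
    while stride < n:
        # Each element combines with the partial result `stride` positions
        # to its left; operand order is preserved for non-commutative ops.
        result = [
            result[i] if i < stride else op(result[i - stride], result[i])
            for i in range(n)
        ]
        stride *= 2
        rounds += 1
    return result, rounds


prefixes, rounds = log_depth_scan([3, 1, 4, 1, 5, 9, 2, 6])
# prefixes == [3, 4, 8, 9, 14, 23, 25, 31], rounds == 3 for eight inputs
```

For n inputs the loop runs ceil(log2 n) rounds, which is the sense in which the depth is logarithmic; the operator is assumed to be associative.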
What is needed, therefore, is a circuit topology for computing a cyclic segmented parallel prefix operation that is time-optimal and is also optimal in terms of wire lengths and propagation delays. The present invention provides a solution to these and other problems, and offers other advantages over previous solutions.