Today's most advanced computers are used to model physical systems, such as, for example, the folding of a protein or the circulation of the global climate, but they are also physical systems themselves. The demands of high-performance computing have driven the frontiers of device physics from vacuum tubes to semiconductor heterostructures. Between these simulated and physical realities lie many layers of abstraction: materials are assembled into devices, devices into circuits, circuits into boards, boards into cases, cases into racks, and racks into systems, and, in corresponding layers of software, applications are implemented in algorithms, written in a high-level language, compiled into microcode, scheduled by an operating system, and then executed by processors.
Most computer science programming models hide the underlying physical reality of computation, and the corresponding layers of software serve to insulate programs and programmers from knowledge of the physical construction of the computer. This division of labor is now being challenged by the growing complexity of computing systems. While device performance has been improving exponentially for decades and has a firm future roadmap [Paul S. Peercy, “The Drive to Miniaturization”, Nature (406), pp. 1023-26 (2000)], this has not been true for software. Rather, cost overruns, shipping delays, and bugs have been recurring features of development efforts ranging from taping out chips to delivering operating systems. Along with programmer productivity, system scaling obstacles include interconnect bottlenecks and prohibitive power requirements.
As information technologies scale down in device size and up in system complexity, their computational and physical descriptions converge as the number of information-bearing degrees of freedom becomes comparable to the number of physical ones. It is already possible to store data in atomic nuclei and to use electron bonds as logical gates [N. Gershenfeld and I. Chuang, “Bulk Spin Resonance Quantum Computation”, Science (275), pp. 350-356 (1997)]. In such a computer, the information-bearing degrees of freedom are the same as the physical ones, and it is no longer feasible to account for them independently. The universe executes in linear time, independent of its size. A scalable computer architecture must similarly reflect the scaling of its contents. An explicit description of the spatial distribution, propagation, and interaction of information in a computer program offers portability across device technologies (which must satisfy the same physical laws), scalability across machine sizes (because physical dynamics are inherently parallel), and simplification of fabrication (since causality implies locality).
The performance of a computer is limited by the bandwidth and latency of the connection between where data is stored and where it is processed. Early computers were far more limited by speed and availability of processing and memory than the performance of the connections between them. Von Neumann or Harvard-style computer architectures, where for each cycle data is transmitted to and manipulated in a central processing unit, are well suited for computers built from slow and expensive processing elements (i.e. vacuum tubes) and comparatively fast and cheap communication (wires). However, faster modern building blocks (smaller transistors, improved logic families, and other emerging technologies) have outpaced the rate that data can be fetched from memory. The operating speeds of many modern computers are beyond even the relativistic limits for data to be retrieved from an arbitrary location in a single cycle. In modern computers, it can take hundreds or even thousands of cycles to fetch a piece of data. There are a wide variety of techniques that have been developed to anticipate what data will be needed and load it ahead of time (pipelining, caching, instruction reordering, branch prediction, speculative execution, etc.), but the availability and behavior of these features can vary widely from processor to processor as can their effectiveness with different program behaviors. Although the Von Neumann abstraction is a familiar model of computation, in order to write software that takes advantage of the aggressive performance possible with modern (and future) technologies, fundamentally different models of computation will be needed, as well as computer architectures that can efficiently run them.
Physics, above the atomic level, is inherently local, and computation, like every other process, relies on physics. Thus, programming models that assume non-local processes, such as data buses, random access memory, and global clocking, must be implemented at a slow enough speed to allow local interactions to simulate the non-local effects that are assumed. Since such models do not take physical locality into account, even local effects are limited to the speed of the false non-local effects, by a global clock that regulates all operations.
A second physical attribute of information is causality: there is a finite propagation velocity. The length of the wires in a computer introduces a distance and hence time scale to its operation. This is manifestly violated in the Parallel Random Access Machine (PRAM) model for parallel computing [R. M. Karp, M. Luby, F. Meyer auf der Heide, “Efficient PRAM Simulation on a Distributed Memory Machine”, Algorithmica (16), pp. 517-542 (1996)], which assumes instantaneous access from any processor to any memory location. If instead the only interconnect is between neighboring cells, then the information velocity is on the order of the gate propagation delay over the gate size, which can be picoseconds per micron [M. Sokolich, A. R. Kramer, Y. K. Boegeman, and R. R. Martinez, “Demonstration of Sub-5 ps CML Ring Oscillator Gate Delay with Reduced Parasitic AlInAs/InGaAs HBT”, IEEE Electron Device Letters (22), pp. 309-311 (2001)]. This is about 100 times slower than the speed of light, or nanoseconds to cross millimeter distances on a chip, which is comparable to the speed at which conventional chips are clocked. In return for using nearest-neighbor communication the interconnect becomes a programmable part of the computation rather than fixed wiring, and scaling is based on the cell size rather than the chip size.
In computing today, many observers agree that there is a practical physical speed limit for the venerable von Neumann model (see, e.g., Ronny Ronen, Avi Mendelson, Konrad Lai, Shih L. Lu, Fred Pollack, and John P. Shen, “Coming challenges in microarchitecture and architecture”, Proceedings of the IEEE, 89(3):325-340, 2001), and that the bulk of future speed increases will derive from parallelism in some form. Chipmakers are currently working to pack as many processors as they can into one box to achieve this parallelism, but in doing so, they are moving even further from the locality that is necessary for a direct implementation as physics. At the other end of the abstraction spectrum, while sequential programming models can be generalized to use multiple parallel threads, such models are often clumsy and do not reflect the physical location of the threads relative to each other or memory.
In addition, research has long suggested that asynchronous (or “self-timed”) devices consume less power and dissipate less heat than typical clocked devices [Tony Werner and Venkatesh Akella, “Asynchronous processor survey”, Computer, 30(11):67-76, 1997]. However, traditional microarchitectures require significant book-keeping overhead to synchronize various functional blocks, due to the nature of their instructions, which must be executed in sequence. Most asynchronous designs to present have derived their performance benefits from clever pipelining and power distribution rather than true asynchrony—known as “globally asynchronous, locally synchronous” design—and often this is not enough to offset the overhead [D. Geer, “Is it time for clockless chips? [asynchronous processor chips]”, Computer, 38(3):18-21, March 2005].
These shortcomings are accepted because of the tremendous body of existing code written in sequential fashion, which is expected to run on the latest hardware. However, by removing the assumption of backwards compatibility, there is an opportunity to create a new, disruptive programming model that is more efficient to physically implement. In particular, such a model could scale favorably to an arbitrary number of parallel elements, to larger problem sizes, and to faster, smaller process technologies. Potentially, this may have eventual impact across the computing industry, particularly in high-performance computing. In addition, it could be an enabling technology for the Singularity (see Ray Kurzweil, “The Singularity Is Near: When Humans Transcend Biology”, Viking Adult, September 2005).
Information in physics is an extensive quantity. Like mass, it scales with the system size. For a computer to do the same, it must be uniform, unlike the inhomogeneous combinations of processors, memory, storage, and communications that are the norm today. For this reason, cellular architectures have long been attractive as a model for computation [J. von Neumann, “Theory of Self-Reproducing Automata”, edited by A. W. Burks, Univ. of Illinois Press (Urbana, 1966)], and more recently for its implementation [M. Sipper, “The Emergence of Cellular Computing”, Computer (32), pp. 18-26 (1999)]. “Cellular Automata” was originally a discrete model in which space, time, and states were discretized, and update rules were carefully designed for studying complex phenomena [Neil Gershenfeld (1999), “The Nature of Mathematical Modeling”, Cambridge, UK: Cambridge University Press]. Cellular automata were found to be quite successful in modeling physical interactions governed by differential equations in a continuum limit, such as, for example, lattice gases for hydrodynamics [U.S. Pat. No. 6,760,032; U. Frisch, B. Hasslacher, and Y. Pomeau, “Lattice-Gas Automata for the Navier-Stokes Equation”, Phys. Rev. Lett. (56), pp. 1505-1508 (1986)] and spin dynamics [E. Domany and W. Kinzel, “Equivalence of Cellular Automata to Ising Models and Directed Percolation”, Phys. Rev. Lett. (53), pp. 311-314 (1984)]. Because of this great potential of computing as a physical system, cellular automata present a practical architecture for computation [N. Margolus, “Physics-Like Models of Computation”, Physica D (10), pp. 81-95 (1984)].
Relevant research in the 1970s demonstrated that universal Boolean logic could be implemented in cellular automata with one-bit states and just three local rules [R. E. Banks, “Information Processing and Transmission in Cellular Automata”, Ph.D. thesis, MIT (1971)]. The Banks Boolean cellular automata has only three rules, acting in 2D on one-bit states with 4 rectangular neighbors. The simplicity in primitive functioning unit, however, led to complexity in the implementation of wires and gates. In such a system, the logic functions are distributed, requiring many cells to realize them. The generality of a cellular automata's rule table allows many other behaviors to be modeled, such as hydrodynamics or graphics. Many more variants of cellular automata models/applications [see, e.g., U.S. Pat. No. 6,910,057] and hardware implementations [see, e.g., U.S. Pat. No. 7,509,479; U.S. Pat. No. 5,243,238] have been proposed. All of these implementations are based on Boolean logic.
If the goal is just computation, then this can be implemented more compactly in “logic automata” in which every cell can contain a logic gate and store its state, locally providing the interactions needed for computational universality. Logic automata are a subset of cellular automata [N. Gershenfeld, The Nature of Mathematical Modeling, Cambridge University Press, 1999] and quantize space and time with distributed cells connected locally, each performing a basic logic operation. Logic automata are therefore scalable, universal for digital computation [R. E. Banks, Information Processing and Transmission in Cellular Automata, Ph.D. thesis, Massachusetts Institute of Technology, 1971], and reflect the nature of many complex physical and biological systems [D. A. Dalrymple, N. Gershenfeld, and K. Chen, “Asynchronous logic automata,” Proceedings of AUTOMATA 2008 (14th International Workshop on Cellular Automata), pp. 313-322, June 2008; L. O. Chua, “CA belongs to CNN,” invited talk at AUTOMATA 2008 (14th International Workshop on Cellular Automata), June 2008]. Logic automata form a family of computer architectures that expose a cartoon version of physics that is easy for a programmer to work with but maintains the underlying physical relationship between the size of logic elements, their computation rates, and signal travel speeds. This allows programmers to work with abstractions that will have well defined behavior for both correctness and performance, regardless of which underlying technology is used to fabricate it.
The history begins with the cellular automata of von Neumann [John von Neumann, “Theory of Self-Reproducing Automata”, University of Illinois Press, 1966], designed to explore the theory of self-replicating machines in a mathematical way (though never finished). This was some time after he completed the architecture for the EDVAC project [John von Neumann, “First Draft of a Report on the EDVAC”, IEEE Annals of the History of Computing, 15(4):27-75, 1993], which has come to be known as “the von Neumann architecture.” Many papers since then can be found examining (mostly 2-state) cellular automata, and there are a few directions to prove simple cellular automata universality—Alvy Ray Smith's [Alvay Ray Smith, “Cellular Automata Theory”, PhD thesis, Stanford University, 1970], E. Roger Banks' [Edwin Roger Banks, “Cellular Automata”, Technical Report AIM-198, MIT, June 1970], and Matthew Cook's more recent Rule110 construction [Matthew Cook, “Universality in elementary cellular automata”, Complex Systems, 15(1), 2004]. However, while interesting from the point of view of computability theory, classical cellular automata clearly over-constrain algorithms to beyond the point of practicality, except in a small class of problems related to physical simulation (for instance, see Uriel Frisch, Dominique d'Humieres, Brasl Hasslacher, Pierre Lallemand, Yves Pomeau, and Jean-Pierre Rivet, “Lattice gas hydrodynamics in two and three dimensions”, pages 77-135, Addison-Wesley, 1990).
An earlier model of cellular automaton was known as the “Logic CA” [unpublished]. The Logic CA has ties to previous work [e.g., John von Neumann, “Theory of Self-Reproducing Automata”, University of Illinois Press, 1966], particularly if familiar with these types of constructions. A typical cellular automaton is an indefinite lattice (usually 2-dimensional) of “cells”, each of which has a state variable that takes one of a finite number of values at every discrete time step (0, 1, 2, . . . ), and a transition function which takes as arguments the states of a cell's nearest neighbors and itself at some time step, and returns the state of that cell in the next time step. The number of neighbors is typically 4 or 8 in 2-dimensional cellular automaton, depending on the inclusion of diagonal neighbors.
The Logic CA consists of cells with 8 neighbors and 9 bits of state. The state bits are divided into 8 configuration bits and 1 dynamic state bit. The configuration bits are further divided into 2 gate bits that choose among the four allowed Boolean functions (AND, OR, XOR, NAND) and 6 input bits that choose among the 36 possible pairs of (potentially identical) inputs chosen from the 8 neighbors (½·8·(8−1)+8). At each time step, a cell examines the dynamic state bit of its selected inputs, performs the selected Boolean operation on these inputs, and sets its own dynamic state to the result.
Mathematically, an instance of the Logic CA can be described as a series of global states St (tεN0), each composed of local states st,(i,j)ε{0,1}(i,jεZ), and a set of constant configuration elements
                                          c                          (                              i                ,                j                            )                                ∈          C                =                ⁢                  (                                    {                              AND                ,                OR                ,                XOR                ,                NAND                            }                        ×                          (                                                                    {                                                                  -                        1                                            ,                      0                      ,                      1                                        }                                    2                                -                                                      {                                          (                                              0                        ,                        0                                            }                                        )                                    2                                            )                                                              =                ⁢                              {                          AND              ,              OR              ,              XOR              ,              NAND                        }                    ×                                                ⁢                              {                                          (                                  1                  ,                  0                                )                            ,                              (                                  1                  ,                  1                                )                            ,                              (                                  0                  ,                  1                                )                            ,                              (                                                      -                    1                                    ,                  1                                )                            ,                              (                                                      -                    1                                    ,                  0                                )                            ,                              (                                                      -                    1                                    ,                                      -                    1                                                  )                            ,                              (                                  0                  ,                                      -                    1                                                  )                            ,                              (                                  1                  ,                                      -                    1                                                  )                                      }                    ×                                                ⁢                  {                                    (                              1                ,                0                            )                        ,                          (                              1                ,                1                            )                        ,                          (                              0                ,                1                            )                        ,                          (                                                -                  1                                ,                1                            )                        ,                          (                                                -                  1                                ,                0                            )                        ,                          (                                                -                  1                                ,                                  -                  1                                            )                        ,                          (                              0                ,                                  -                  1                                            )                        ,                          (                              1                ,                                  -                  1                                            )                                }                    (note that there is a bijection between C and {0,1}8, 8 bits) such that
      s                  t        +        1            ,      i      ,      j        =      {                                                      if              ⁢                                                          ⁢                                                (                                      c                                          (                                              i                        ,                        j                                            )                                                        )                                1                                      =            AND                                                              s                              t                ,                                                      (                                          i                      ,                      j                                        )                                    +                                                            (                                              c                                                  (                                                      i                            ,                            j                                                    )                                                                    )                                        2                                                                        ⋀                          s                              t                ,                                                      (                                          i                      ,                      j                                        )                                    +                                                            (                                              c                                                  (                                                      i                            ,                            j                                                    )                                                                    )                                        3                                                                                                                                      if              ⁢                                                          ⁢                                                (                                      c                                          (                                              i                        ,                        j                                            )                                                        )                                1                                      =            OR                                                              s                              t                ,                                                      (                                          i                      ,                      j                                        )                                    +                                                            (                                              c                                                  (                                                      i                            ,                            j                                                    )                                                                    )                                        2                                                                        ⋁                          s                              t                ,                                                      (                                          i                      ,                      j                                        )                                    +                                                            (                                              c                                                  (                                                      i                            ,                            j                                                    )                                                                    )                                        3                                                                                                                                      if              ⁢                                                          ⁢                                                (                                      c                                          (                                              i                        ,                        j                                            )                                                        )                                1                                      =            XOR                                                              s                              t                ,                                                      (                                          i                      ,                      j                                        )                                    +                                                            (                                              c                                                  (                                                      i                            ,                            j                                                    )                                                                    )                                        2                                                                        ⊕                          s                              t                ,                                                      (                                          i                      ,                      j                                        )                                    +                                                            (                                              c                                                  (                                                      i                            ,                            j                                                    )                                                                    )                                        3                                                                                                                                      if              ⁢                                                          ⁢                                                (                                      c                                          (                                              i                        ,                        j                                            )                                                        )                                1                                      =            NAND                                                              s                              t                ,                                                      (                                          i                      ,                      j                                        )                                    +                                                            (                                              c                                                  (                                                      i                            ,                            j                                                    )                                                                    )                                        2                                                                        ⋀                          s                              t                ,                                                      (                                          i                      ,                      j                                        )                                    +                                                            (                                              c                                                  (                                                      i                            ,                            j                                                    )                                                                    )                                        3                                                                                          
FIG. 1 depicts Logic CA gates AND 110, OR 120, XOR 130, and NAND 140. In FIG. 1, pixels 150 indicate the selected input directions for a given cell, central glyphs 160, 165, 170, 175 indicates the selected Boolean functions, and the color of the glyph indicates the dynamic state (e.g. blue=0, red=1).
Although the Logic CA is useful for many applications, two major problems were identified with it: lack of reversible/adiabatic logic and the requirement for a global clock. The system does not employ conservative logic [Edward Fredkin and Tommaso Toffoli, “Conservative logic”, International Journal of Theoretical Physics, 21(3):219-253, April 1982] or adiabatic computing [J. S. Denker, “A review of adiabatic computing”, In Low Power Electronics, Digest of Technical Papers, IEEE Symposium, pages 94-97, 1994], which is necessary to truly represent physical resources. The clock is global, which is clearly a non-local effect, and cellular automata are not fundamentally required to have a global clock in order to perform universal computation [Kenichi Morita and Katsunobu Imai, “Logical Universality and Self-Reproduction in Reversible Cellular Automata”, ICES '96: Proceedings of the First International Conference on Evolvable Systems, pages 152-166, London, UK, 1996, Springer-Verlag; Jia Lee, Ferdinand Peper, Susumu Adachi, Kenichi Morita, and Shinro Mashiko, “Reversible Computation in Asynchronous Cellular Automata”, UMC '02: Proceedings of the Third International Conference on Unconventional Models of Computation, pages 220-229, London, UK, 2002, Springer-Verlag].
Another related sub-field is that of field-programmable gate arrays (FPGAs). Gate arrays have evolved over time from sum-product networks such as Shoup's [R. G. Shoup, “Programmable Cellular Logic Arrays”, PhD thesis, Carnegie Mellon University, 1970] and other acyclic, memoryless structures such as Minnick's [R. C. Minnick, “Cutpoint Cellular Logic”, IEEE Transactions on Electronic Computers, EC13(6):685-698, December 1964] to the complex, non-local constructions of today's commercial offerings, yet skipping over synchronous and sequential, but simplified local-effect cells.
The tradition of parallel programming languages, from Occam [A. W. Roscoe and C. A. R. Hoare, “The laws of Occam programming”, Theoretical Computer Science, 60(2):177-229, September 1988] to Erlang [Joe Armstrong, Robert Virding, Claes Wikstrom, and Mike Williams, “Concurrent Programming in Erlang”, Second Edition, Prentice-Hall, 1996] to Fortress [Guy L. Steele, Eric Allen, David Chase, Victor Luchangco, Jan-Willem Maessen, Sukyoung Ryu, and Sam Tobin-Hochstadt, “The Fortress Language Specification”, Technical report, Sun Microsystems, March 2007] is also of interest. Although they are designed for clusters of standard machines (possibly with multiple processors sharing access to a single, separate memory), they introduce work distribution techniques and programming language ideas that are likely to prove useful in the practical application of the present invention.
The Connection Machine [W. Daniel Hillis, “The Connection Machine”, MIT Press, Cambridge, Mass., 1985] was designed with a similar motivation—merging processing and memory into a homogeneous substrate—but as the name indicates, it also included many non-local connections (“In an abstract sense, the Connection Machine is a universal cellular automaton with an additional mechanism added for non-local communication. In other words, the Connection Machine hardware hides the details.”) However, the implementation of Lisp on the Connection Machine [G. L. Steele and W. D. Hillis, “Connection Machine Lisp: fine-grained parallel symbolic processing”, ACM Press, 1986] does introduces concepts such as xectors, which are likely to be useful in the implementation of functional programming languages in a cellular automaton-based architecture.