The present invention relates to the field of massively parallel, spatially organized computation.
The field of massively parallel, spatially organized computation encompasses computations involving large sets of data items that are naturally thought of as distributed in physical space. Such computations often exhibit some degree of spatial locality during each computational step. That is, the processing to be performed at each point in space depends only upon data residing nearby. For example, lattice simulations of physical systems using techniques such as finite-difference calculations and lattice-gas molecular dynamics have such spatial organization and locality. Other interesting examples include lattice simulations of physics-like dynamics, such as virtual-reality models and volume rendering. Many other computations can be embedded efficiently into a spatial-lattice format with local interactions, including many kinds of image-processing and logic emulation problems. A variety of spatial lattice computations are discussed in a paper by Norman Margolus, entitled "CAM-8: A Computer Architecture Based on Cellular Automata," Fields Institute Communications, Vol. 6, American Mathematical Society, 1996, p. 167.
A natural hardware organization for performing such computations arranges processors in space to mimic the array of discrete lattice sites being emulated, one processor per lattice site. Each processor communicates with neighboring processors using fixed or "static" connections. This kind of architecture can be both fast and massively parallel, since the wires between neighboring processors remain short regardless of the array size. Even if connections are provided between adjacent processors only (mesh interconnect), communication between processors that are near to each other involves few computational steps, and so remains fast.
A significant simplification can be achieved when all processors are identical and perform the same operation at the same time, as noted in an article by S. H. Unger, entitled "A Computer Oriented Toward Spatial Problems," Proc. IRE, 1958, p. 1744. In such an organization, a single copy of the control circuitry can be shared among all of the processors. Omitting the control circuitry from the individual processors both reduces their size and simplifies their design. Shared control also allows communication between processors to be perfectly coordinated. That is, all processors transfer a bit in a given direction at the same time. Spatial non-uniformities in the computation are dealt with as differences in the data associated with each processor rather than as differences in the program that each processor follows. Such a shared-control lockstep processing style has been characterized as Single Instruction-stream Multiple Data-stream or SIMD. See an article by Michael J. Flynn, entitled "Some Computer Organizations and Their Effectiveness," IEEE Trans. on Computers, 1972, p. 948. Each processor in a SIMD machine may have several different functional units operating in a pipelined fashion.
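The lockstep style described above can be illustrated with a minimal sketch. It is not taken from any of the cited machines: the function names, the choice of an XOR operation, and the wraparound at the array edge are assumptions made only for illustration. Every processor receives the same instruction, all bits move east at the same time, and a per-processor mask (ordinary data) expresses the spatial non-uniformity.

```python
def simd_shift_east(grid):
    """Every processor passes its bit one step east at the same time.

    Wraparound at the array edge is an assumption of this sketch.
    """
    return [[row[(x - 1) % len(row)] for x in range(len(row))] for row in grid]


def simd_step(bits, mask):
    """One broadcast instruction: XOR each bit with the bit arriving from the west.

    The same operation is issued to every processor. `mask` is per-processor
    data that selects which processors keep the result, so non-uniform
    behavior comes from data rather than from different programs.
    """
    from_west = simd_shift_east(bits)
    return [
        [(b ^ w) if m else b for b, w, m in zip(brow, wrow, mrow)]
        for brow, wrow, mrow in zip(bits, from_west, mask)
    ]


bits = [[1, 0, 0, 0],
        [0, 1, 0, 0]]
mask = [[1, 1, 1, 1],   # top row of processors applies the result
        [0, 0, 0, 0]]   # bottom row keeps its old bits

print(simd_shift_east(bits))  # [[0, 1, 0, 0], [0, 0, 1, 0]]
print(simd_step(bits, mask))  # [[1, 1, 0, 0], [0, 1, 0, 0]]
```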
Since computer size is normally fixed while problem size is variable, it is common for an array of SIMD processors to be used to perform a calculation that corresponds naturally to a larger spatial array of processors, perhaps with more dimensions than the actual physical array. This can be achieved by having each of the processors simulate the behavior of some portion of the space. Several physical simulations on the ILLIAC IV computer were done in this manner, as described in R. M. Hord's book, The ILLIAC IV: The First Supercomputer, Computer Science Press (1982). Typically, the emulated space is split into equal-sized chunks, one per processor. In problems with only nearby-neighbor interactions in an emulated spatial lattice, such a data organization minimizes interprocessor communication. This point was discussed by Stewart F. Reddaway (in the context of the SIMD mesh DAP computer) in his article entitled "Signal Processing on a Processor Array," in the 1985 Les Houches proceedings entitled Traitement Du Signal/Signal Processing, Vol. 2, Lacoume et al. (eds.), Elsevier Science 1987. If the chunks are large, then short range communication in the physical processor array can correspond to much longer range communication in the emulated lattice.
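The effect of chunk size on interprocessor communication can be made concrete with a small counting sketch. The square two-dimensional lattice, the four-neighbor interaction, the periodic boundary, and the function names are all assumptions chosen for illustration; they do not describe the ILLIAC IV or the DAP. The sketch counts the fraction of lattice sites that have at least one nearest neighbor held by a different processor.

```python
def owner(x, y, chunk):
    """Processor coordinates that hold lattice site (x, y) for square chunks of side `chunk`."""
    return (x // chunk, y // chunk)


def boundary_fraction(lattice_side, chunk):
    """Fraction of sites with a 4-neighbor owned by a different processor.

    Assumes a square periodic lattice whose side is a multiple of `chunk`.
    """
    count = 0
    for y in range(lattice_side):
        for x in range(lattice_side):
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nx, ny = (x + dx) % lattice_side, (y + dy) % lattice_side
                if owner(nx, ny, chunk) != owner(x, y, chunk):
                    count += 1
                    break
    return count / lattice_side ** 2


for chunk in (1, 4, 8, 16):
    print(chunk, boundary_fraction(16, chunk))
# 1 1.0
# 4 0.75
# 8 0.4375
# 16 0.0
```

Under these assumptions, larger chunks leave a smaller share of sites needing data from another processor, and a single hop between adjacent processors spans `chunk` sites of the emulated lattice.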
A simple way to perform a calculation that maps naturally onto a large array of processors is to have each physical processor simulate several virtual processors. This idea is discussed by Steven L. Tanimoto and Joseph J. Pfeiffer, Jr., in an article entitled "An Image Processor Based on an Array of Pipelines," IEEE Computer Society Workshop on Computer Architecture for Pattern Analysis and Image Database Management, 1981, p. 201. In the virtual processor approach, the physical hardware emulates a virtual machine of the size and type needed to directly perform the calculation. Since virtual processors are simulated both sequentially by each physical processor and in parallel by all of them, hardware designed explicitly for virtual processing can take advantage of both multiple processors and multiple pipelined functional units within each processor. In such hardware, memory and communication latency (i.e., time delay) can be absorbed into the processing pipeline. This approach was used, for example, by Tommaso Toffoli and Norman Margolus in the design of their CAM-6 virtual processor cellular automata hardware, as is discussed in their book, Cellular Automata Machines, MIT Press (1987), p. 243.
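The virtual processor idea can be sketched as follows. This is an illustrative toy, not the CAM-6 design: the one-dimensional ring, the XOR-of-neighbors update rule, and the function names are assumptions. Each physical processor scans its own chunk of sites sequentially, while the loop over physical processors stands in for hardware that would run in parallel. The result matches a machine with one processor per site.

```python
def direct_step(cells):
    """Reference update: one processor per site, each site becomes the XOR of its two neighbors."""
    n = len(cells)
    return [cells[(i - 1) % n] ^ cells[(i + 1) % n] for i in range(n)]


def virtual_step(cells, num_physical):
    """Same update performed by `num_physical` processors, each emulating a chunk of sites.

    Assumes len(cells) is a multiple of num_physical.
    """
    n = len(cells)
    chunk = n // num_physical
    new = [0] * n
    for k in range(chunk):             # sequential scan within each chunk
        for p in range(num_physical):  # stands in for processors working in parallel
            i = p * chunk + k
            new[i] = cells[(i - 1) % n] ^ cells[(i + 1) % n]
    return new


cells = [0, 0, 0, 1, 0, 0, 0, 0]
print(direct_step(cells))       # [0, 0, 1, 0, 1, 0, 0, 0]
print(virtual_step(cells, 2))   # [0, 0, 1, 0, 1, 0, 0, 0]
```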
In these early cellular automata machines, programmers could choose from among a restricted set of communication patterns within a fixed-size emulated lattice (see Toffoli and Margolus, p. 55). The more recent CAM-8 machine, described in U.S. Pat. No. 5,159,690, in the name of Norman H. Margolus, uses a simpler communication scheme, in which sheets of bits move a given amount in a given direction in the emulated lattice (which has a programmable size and shape). This shifting bit-sheet scheme is implemented as a pipelined version of traditional SIMD mesh data movement. Because of the specialization to shifting entire sheets of bits, however, only a few parameters controlling a restricted set of repeated communication patterns (as opposed to detailed clock-by-clock SIMD control information) are broadcast to the processors.
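A shifting bit-sheet scheme of this general kind can be sketched in a few lines. The sketch is only a model of the data movement, not of the CAM-8 hardware or its pipelining: the two-dimensional lattice, the periodic wraparound, and the function names are assumptions. The point it illustrates is that each sheet's motion is fully described by a small offset, so only a few parameters per sheet are needed rather than step-by-step control.

```python
def shift_sheet(sheet, dx, dy):
    """Return a copy of a 2D bit sheet moved dx sites in +x and dy sites in +y.

    Periodic wraparound at the lattice edges is an assumption of this sketch.
    """
    h, w = len(sheet), len(sheet[0])
    return [[sheet[(y - dy) % h][(x - dx) % w] for x in range(w)] for y in range(h)]


def shift_all(sheets, offsets):
    """Move every sheet by its own (dx, dy) offset; the offsets are the only control data."""
    return [shift_sheet(s, dx, dy) for s, (dx, dy) in zip(sheets, offsets)]


sheet = [[1, 0, 0],
         [0, 0, 0]]
print(shift_sheet(sheet, 1, 0))    # [[0, 1, 0], [0, 0, 0]]
print(shift_sheet(sheet, 0, 1))    # [[0, 0, 0], [1, 0, 0]]
print(shift_sheet(sheet, -1, 0))   # [[0, 0, 1], [0, 0, 0]]
```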
In a virtual processor architecture such as CAM-8, in which the state of the emulated spatial lattice is held in memory devices, the speed of processing is limited primarily by the memory bandwidth. Recent developments in semiconductor technology allow processing logic and DRAM to be placed together on a single semiconductor chip, thus making enormous memory bandwidth potentially available to virtual processor lattice computations. In this context, the performance and flexibility of a mesh array of chip-scale processors may become limited by communications bandwidth between chips, and by the bandwidth of the control stream coming into the chips. A uniform SIMD communication architecture (like that of CAM-8) is not appropriate in this context, since a uniform array of SIMD processing nodes on each chip would make very uneven and inefficient use of inter-chip communication resources: nodes along an edge of the array on one chip would either all need to communicate off-chip simultaneously, or all need no communication simultaneously. Furthermore, a fixed virtual machine model architecture (like that of CAM-8) gives up much of the flexibility of a more general SIMD architecture. For flexible fine-grained control, a high control bandwidth is needed.
To achieve maximum memory bandwidth, on-chip DRAM must be used in a constrained fashion. For example, in a given block of DRAM, once any bit in a given DRAM row is accessed, bandwidth may be wasted if all of the bits of that row are not used before moving on to another row. Similarly, if memory rows are accessed as a sequence of memory words, then all of the bits in entire words may also need to be used together. These kinds of memory granularity constraints must be dealt with efficiently. Temporarily storing data that are read before they are needed, or that cannot yet be written back to the right block of memory, wastes the bandwidth of the temporary storage memories, and wastes the space taken up by these extra memories. Not having data available at the moment they are needed wastes processing and communications resources.
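The row-granularity constraint can be illustrated with a deliberately simplified model. The single open row per block, the bit-level addresses, the row size, and the function name are assumptions of this sketch, not properties of any particular DRAM. The sketch counts how many times a row must be opened for two access orders that touch exactly the same bits.

```python
def row_activations(addresses, row_bits):
    """Count row openings for a sequence of bit addresses.

    Simplified model (an assumption): one row is open at a time, and touching
    a bit in any other row costs one new activation.
    """
    activations = 0
    open_row = None
    for a in addresses:
        row = a // row_bits
        if row != open_row:
            activations += 1
            open_row = row
    return activations


ROW_BITS = 16
sequential = list(range(64))                         # finishes each row before moving on
strided = [16 * (i % 4) + i // 4 for i in range(64)]  # hops to a different row on every access

print(row_activations(sequential, ROW_BITS))  # 4
print(row_activations(strided, ROW_BITS))     # 64
```

In this model both orders read the same 64 bits, but the strided order opens a row sixteen times as often, which is the kind of waste the constrained access pattern is meant to avoid.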