Technical Field
This invention relates to computer systems and digital signal processors (DSPs), and more particularly, to multi-processor systems.
Description of the Related Art
The need for parallel computation arises from the need to perform software tasks with increased speed. Parallel computation may accelerate the processing of multiple complex signals in applications such as telecommunications, remote sensing, radar, sonar, video, cinema, medical imaging, and the like. Parallel computation also may provide greater computational throughput and may overcome certain limitations of the serial computation approach. The capability of computational systems may be compared by metrics of performance, usually for a set of specified test algorithms. The main performance metric of interest has been calculations per second. For battery-powered or thermally constrained equipment, however, the metric of calculations per unit of energy consumed (that is, calculations per second divided by the power consumed) may be preferred.
A parallel computer or signal processor, considered in the abstract, may be composed of multiple processors, multiple memories, and one or more interconnecting communication networks. These components have been combined in many different topologies, described in the literature on parallel-processor computing, also known as multiprocessing. All of these components have input to output latency due to internal delays that are related to electrical charge and discharge of conductor traces (wires) and transmission line effects, one of which is that no signal may travel faster than the speed of light. Consequently, smaller components generally exhibit lower latency than physically larger ones, and systems with fewer components will exhibit lower average latency than systems with more computational components. Although more components in the system may increase average latency, there are techniques of arranging computations to take advantage of low-latency communication between neighboring elements, such as pipeline and systolic processing.
In recent years, advances in integrated circuit manufacturing have made it possible to fabricate increasingly miniaturized components of parallel computers. With miniaturization, the components operate at lower power consumption, higher speed, and lower latency. Consequently, hundreds of processing elements (PEs) and supporting memories (SMs), along with a high bandwidth interconnection network (IN), may be fabricated on a single multi-processor IC chip. From such multiprocessor chips a wide variety of parallel computer systems can be built, ranging from small systems using part of a chip to multichip systems that include high speed and high capacity memory chips.
Increasingly, digital electronic systems, such as computers, digital signal processors (DSPs), and systems embedded in enclosing equipment, utilize one or more multi-processor arrays (MPAs). An MPA may be loosely defined as a plurality of processing elements (PEs), supporting memory (SM), and a high bandwidth interconnection network (IN). As used herein, the term “processing element” refers to a processor or CPU (central processing unit), microprocessor, or a processor core. The word array in MPA is used in its broadest sense to mean a plurality of computational units (each containing processing and memory resources) interconnected by a network with connections available in one, two, three, or more dimensions, including circular dimensions (loops or rings). Note that a higher dimensioned MPA can be mapped onto fabrication media with fewer dimensions. For example, an MPA with the shape of a four dimensional (4D) hypercube can be mapped onto a 3D stack of silicon integrated circuit (IC) chips, or onto a single 2D chip, or even a 1D line of computational units. Likewise, low dimensional MPAs can be mapped to higher dimensional media. For example, a 1D line of computational units can be laid out in a serpentine shape onto the 2D plane of an IC chip, or coiled into a 3D stack of chips. An MPA may contain multiple types of computational units and interspersed arrangements of processors and memory. Also included in the broad sense of an MPA is a hierarchy or nested arrangement of MPAs, especially an MPA composed of interconnected IC chips where the IC chips contain one or more MPAs which may also have deeper hierarchical structure.
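The serpentine layout of a 1D line of computational units onto a 2D chip, mentioned above, can be illustrated with a minimal Python sketch. The function names and the grid width of 4 are assumptions chosen for illustration and do not come from this specification; the point shown is only that consecutive units on the line remain nearest neighbors on the grid.

```python
def serpentine_position(index, width):
    """Map a position on a 1D chain to (row, col) on a 2D grid of the given width."""
    row, offset = divmod(index, width)
    # Even rows run left to right and odd rows run right to left,
    # so consecutive units in the chain are always adjacent on the grid.
    col = offset if row % 2 == 0 else width - 1 - offset
    return row, col


def is_adjacent(a, b):
    """True when two grid positions are nearest neighbors (Manhattan distance 1)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1


# Hypothetical example: a chain of 12 units laid out on a grid 4 units wide.
layout = [serpentine_position(i, 4) for i in range(12)]
for i, position in enumerate(layout):
    print(i, position)
```

Because each row reverses direction, the last unit of one row sits directly above the first unit of the next, which is what preserves neighbor-to-neighbor links in the folded layout.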
In general, the memory for computers and digital signal processors (DSPs) is organized in a hierarchy with fast memory at the top and slower but higher capacity memory at each step down the hierarchy. In an MPA, supporting memories at the top of the hierarchy are located nearby each PE. Each supporting memory may be specialized to hold only instructions or only data. Supporting memory for a particular PE may be private to that PE or shared with other PEs.
Further down the memory hierarchy there may be a larger shared memory typically composed of semiconductor synchronous dynamic random access memory (SDRAM) with a bit capacity many times larger than that of the supporting memory adjacent to each PE. Further down the memory hierarchy are flash memory, magnetic disks, and optical disks.
As described above, a multiprocessor array (MPA) may include an array of processing elements (PEs), supporting memories (SMs), and a primary interconnection network (PIN or simply IN) that supports high bandwidth data communication among the PEs and/or memories. Various embodiments of MPAs are illustrated in FIGS. 1 and 2, described below. Generally, a PE has registers to buffer input data and output data, an instruction processing unit (IPU), and means to perform arithmetic and logic functions on the data, plus a number of switches and ports to communicate with other parts of a system. The IPU fetches instructions from memory, decodes them, and sets appropriate control signals to move data in and out of the PE and to perform arithmetic and logic functions on the data. PEs suitable for large MPAs are generally more energy efficient than general purpose processors (GPPs), simply because of the large number of PEs that must fit on an IC chip containing a large MPA.
As used herein, the term MPA covers both relatively homogeneous arrays of processors and heterogeneous collections of general purpose and specialized processors that are integrated on so-called “platform IC” chips. Platform IC chips may contain from a few to many processors, typically interconnected with shared memory and perhaps an on-chip network. There may or may not be a difference between an MPA and a “platform IC” chip. However, a “platform IC” chip may be marketed to address specific technical requirements in a specific vertical market.
An interconnection network (IN) may be either fully-connected or switched. In a fully-connected network, all input ports are hardwired to all output ports. However, the number of wires in a fully-connected network increases as N^2/2, where N is the number of ports, and thus a fully-connected network quickly becomes impractical for even medium sized systems.
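The quadratic growth in wiring can be made concrete with a short Python sketch. It assumes, for illustration only, one bidirectional link between every pair of ports, which gives exactly N(N-1)/2 links and approaches N^2/2 as N grows; the function name is hypothetical.

```python
def fully_connected_links(num_ports):
    """Count distinct port pairs, assuming one bidirectional link per pair."""
    return num_ports * (num_ports - 1) // 2


# Compare the exact pair count with the N^2/2 approximation.
for n in (4, 16, 64, 256):
    print(n, fully_connected_links(n), n * n / 2)
```

Under this assumption, 16 ports need 120 links while 256 ports need 32640, so a 16-fold increase in ports costs roughly a 270-fold increase in wiring.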
A switched network is composed of links and switching nodes. The links may comprise wiring, transmission lines, waveguides (including optical waveguides), or wireless receiver-transmitter pairs. Switching nodes may be as simple as a connection to a bus during a time window, or as complex as a crossbar with many ports and buffer queues. A single-stage network is one where all the input ports and output ports reside on one large switching node. A multi-stage network is one in which a data-move traverses a first switching node, a first link, a second switching node, and possibly more link-node pairs to get to an output port. For example, the traditional wireline telephone system is a multistage network.
Interconnection networks for parallel computers vary widely in size, bandwidth, and method of control. If the network provides a data-path or circuit from input to output and leaves it alone until requested to tear it down, then it may be said to be “circuit-switched.” If the network provides a path only long enough to deliver a packet of data from input to output, then it may be said to be “packet-switched.” Control methods vary from completely deterministic (which may be achieved by programming every step synchronous to a master clock) to completely reactive (which may be achieved by responding asynchronously to data-move requests at the port inputs).
For a single stage network the request/grant protocol is a common way to control the switches. A request signal is presented to an input port and compared to request signals from all other input ports in a contention detection circuit. If there is no contention the IN responds with a grant signal. The port sends an address and the IN sets switches to connect input with output. When contention is detected then an arbitration circuit (or “arbiter”) will decide which one of the requesting ports gets a grant signal. Ports without a grant signal will have to wait. Ports that did not succeed in one cycle may try again in subsequent cycles. Various priority/rotation schemes are used in the arbiter to ensure that every port gets at least some service.
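The request/grant exchange with a rotating-priority arbiter can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the circuit of any particular network: the class name, the single contended output, and the specific round-robin rotation rule are all hypothetical choices among the "various priority/rotation schemes" mentioned above.

```python
class RoundRobinArbiter:
    """Grants one requesting port per cycle, rotating priority after each grant."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.next_priority = 0  # port with the highest priority this cycle

    def arbitrate(self, requests):
        """requests: set of port indices asserting a request this cycle.

        Returns the granted port, or None when no port is requesting.
        """
        for step in range(self.num_ports):
            port = (self.next_priority + step) % self.num_ports
            if port in requests:
                # Rotate so the port just served has the lowest priority next cycle.
                self.next_priority = (port + 1) % self.num_ports
                return port
        return None


arbiter = RoundRobinArbiter(num_ports=4)
pending = {0, 2, 3}           # three ports contend for the same output
grant_order = []
while pending:
    granted = arbiter.arbitrate(pending)
    grant_order.append(granted)
    pending.discard(granted)  # ports without a grant stay pending and retry
print(grant_order)
```

Rotating the starting point after every grant is one simple way to ensure that every port gets at least some service, since no port can be passed over more than `num_ports - 1` times in a row.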
For a multi-stage network a particular protocol called “wormhole routing” may be used. Wormhole routing is based on the idea that a message can be formed into a series or string of words with a header for navigation, a body to carry the payload data, and a tail to close down the path. The message “worms” its way through a network as follows. Presume a network laid out as a Cartesian grid, with a switching node and a memory located at each junction of the grid. The header may contain a string of simple steering directions (such as go-straight-ahead, turn-left, turn-right, or connect-to-local memory), which indicate where the worm should go at each node it encounters in the network. These steering directions are so simple that a node can decode them and set switches very rapidly with little circuitry. The path, or “hole,” set up by the header allows the passage of the payload data, the “body,” until a codeword “tail” is encountered which causes the node to close the hole after it. Closing the path may free up links and nodes for other paths to be created by the same wormhole routing protocol.
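The header, body, and tail roles described above can be illustrated with a small Python sketch on a Cartesian grid. All names here (`route_worm`, `send_message`, the direction strings, the `busy_nodes` set) are hypothetical. The sketch also makes a simplifying assumption: a message whose path crosses a node already in use is refused outright, whereas a real network might stall the worm in place.

```python
# Unit steps on the grid as (dx, dy), listed clockwise so that
# index + 1 is a right turn and index - 1 is a left turn.
HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # north, east, south, west


def route_worm(start, heading, header):
    """Follow the header's steering directions and return the nodes on the path.

    start   -- (x, y) of the node where the worm enters the network
    heading -- index into HEADINGS giving the initial direction of travel
    header  -- list of 'straight', 'left', 'right', ending with 'local'
    """
    x, y = start
    path = [(x, y)]
    for direction in header:
        if direction == "local":
            return path                  # connect to the memory at this node
        if direction == "right":
            heading = (heading + 1) % 4
        elif direction == "left":
            heading = (heading - 1) % 4
        elif direction != "straight":
            raise ValueError(f"unknown steering direction: {direction}")
        dx, dy = HEADINGS[heading]
        x, y = x + dx, y + dy
        path.append((x, y))
    raise ValueError("header ended without a 'local' direction")


def send_message(start, heading, header, payload, memories, busy_nodes):
    """Open a path with the header, deliver the payload, then close the path."""
    path = route_worm(start, heading, header)
    if busy_nodes & set(path):
        return False                     # simplification: refuse if any node is in use
    busy_nodes.update(path)              # the header opens the "hole"
    memories.setdefault(path[-1], []).extend(payload)  # the body passes through
    busy_nodes.difference_update(path)   # the tail closes the hole behind it
    return True


memories, busy_nodes = {}, set()
header = ["straight", "straight", "left", "local"]
delivered = send_message((0, 0), 1, header, [10, 20, 30], memories, busy_nodes)
print(delivered, memories)
```

In this example the worm enters at node (0, 0) heading east, goes straight twice, turns left to head north, and connects to the local memory at node (2, 1). Each node needs to inspect only one steering direction, which reflects why the decoding circuitry can be small.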
The bandwidth of an IN may be defined as the number of successful data moves that occur per unit time, averaged over long intervals. The bandwidth of a switched IN is difficult to estimate in any analytic way because it depends on many factors in the details of the IN and in the characteristics of data-move requests put to it. When the request rate is low, the chance of conflict over resources is low and almost 100% of the requests are successful. Measurements and simulations show that, as the rate of data-move requests increases, the fraction of data-moves that succeed decreases from 100%. Eventually the number of successful data-moves per second will saturate or peak, and the maximum is taken as the IN's bandwidth.
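The trend described above can be reproduced qualitatively with a toy Monte Carlo simulation in Python. The model is an assumption made purely for illustration: a single-stage network with equal numbers of input and output ports, uniformly random destinations, at most one grant per output port per cycle, and no retry of failed requests. The function name, port count, and request rates are hypothetical, and the numbers it produces describe only this toy model.

```python
import random


def simulate_crossbar(num_ports, request_rate, cycles, seed=0):
    """Return (success_fraction, moves_per_cycle) for a toy single-stage IN."""
    rng = random.Random(seed)
    requests = 0
    successes = 0
    for _ in range(cycles):
        requested_outputs = set()
        for _ in range(num_ports):
            if rng.random() < request_rate:
                requests += 1
                requested_outputs.add(rng.randrange(num_ports))
        # Each output port grants at most one request per cycle;
        # every other request for that output fails.
        successes += len(requested_outputs)
    fraction = successes / requests if requests else 1.0
    return fraction, successes / cycles


for rate in (0.01, 0.1, 0.5, 1.0):
    fraction, throughput = simulate_crossbar(16, rate, cycles=5000)
    print(f"rate={rate:.2f} success={fraction:.3f} moves/cycle={throughput:.2f}")
```

In this model the success fraction starts near 100% at low request rates and falls as the rate rises, while the number of successful moves per cycle levels off below the number of ports.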
An MPA may be programmed with software to perform specific functions for an application. There are two main types of software: application software and development tools. Application software comprises the source text, intermediate forms, and a final binary image that is loaded into MPA memory for execution by PEs in the MPA. Development tools are software programs used to design and test application software for targeted hardware, such as language compilers, linkers, concurrent task definition aids, communication pathway layout aids, physical design automation, simulators, and debuggers. Development tool software may or may not run on (be executable by) the target hardware of the application software.