This relates to a method and apparatus for providing both bit-serial and word-oriented processing in a parallel computer.
Parallel computers fall into two main groups, control parallel computers and data parallel computers, each with different processor and word width requirements.
Control parallel computers achieve increased performance by taking advantage of parallelism found in the control structure of programs. The Illiac IV, Cray X-MP, BBN Butterfly and CalTech Cosmic Cube are examples of control parallel computers. In a control parallel computer, each processor executes a portion of the overall program. Consequently, each processor must have capabilities comparable to the processor of a serial computer on which the same program could be run. This means that the factors which affect processor design for a control parallel computer are very similar to those of a serial machine.
In particular, control parallel computers typically use word-oriented processors in which each processor receives as an input a plurality of bits in parallel. This unit of input is often referred to as a word and the number of bits as the word width or word length. The number of bits varies widely. In microprocessors it has ranged from four bits in the earliest microprocessors such as the Intel 4004 to thirty-two bits in the most advanced microprocessors available today. In larger processors, even larger word widths have been the norm. For example, the Illiac IV used a word width of 64 bits.
Word-oriented processors tend to be special purpose. They are generally optimized for a fixed set of instructions/operations and data types or storage formats and handle those cases very efficiently. If other storage formats or data types are desired, however, large penalties in either performance or storage efficiency result. Depending on their use, word-oriented processors may be general purpose enough to emulate the functions which are not part of their instruction set, or they may be so special purpose as to be only useful for the small set of instructions for which they were designed. For example, a Motorola 68020 processor is capable of emulating floating point instructions, while many commercial floating point chips are incapable of efficiently performing a logical OR operation. Directly performing special functions rather than emulating them with a series of logical operations makes such word-oriented ALUs, in general, less flexible.
Data parallel computers achieve increased performance by taking advantage of parallelism found in the data of a problem. The Solomon computers, the Array Processor, the STARAN, the Massively Parallel Processor, and the Connection Machine System are examples of data parallel computers. Data parallel computers consist of a single instruction engine with hundreds or thousands of data processors. Each data processor has a local memory and is connected to a communications network over which it may exchange information with other processors. The factors which affect the design of data processors in a data parallel computer are quite different from those which affect the processors of a control parallel computer, for two reasons. First, the control aspects of a program on the data parallel computer may be executed by the instruction engine. This means that the data processors are not required to handle instructions or addresses, and may instead be tuned for data manipulation. Second, for data parallel problems tens of thousands of data elements may be operated on simultaneously. This implies that any parallelism which is made available can be used effectively.
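The division of labor described above, in which a single instruction engine handles control while many data processors each manipulate their own local data, can be illustrated with a minimal sketch. The class names `InstructionEngine` and `DataProcessor`, the dictionary used as local memory, and the `double_x` operation are hypothetical names chosen for illustration only; a sequential loop stands in for hardware that would act on all processors at once.

```python
from typing import Callable, Dict, List


class DataProcessor:
    """One data processor with its own local memory (a dict, as an assumption)."""

    def __init__(self, value: int) -> None:
        self.memory: Dict[str, int] = {"x": value}


class InstructionEngine:
    """Single instruction engine that issues each operation to every data processor."""

    def __init__(self, processors: List[DataProcessor]) -> None:
        self.processors = processors

    def broadcast(self, op: Callable[[DataProcessor], None]) -> None:
        # Conceptually simultaneous; the loop only simulates the parallel hardware.
        for processor in self.processors:
            op(processor)


def double_x(processor: DataProcessor) -> None:
    """Hypothetical data operation: double the value held in local memory."""
    processor.memory["x"] *= 2


processors = [DataProcessor(v) for v in range(8)]
engine = InstructionEngine(processors)
engine.broadcast(double_x)
results = [p.memory["x"] for p in processors]
```

Note that the data processors in this sketch never decode instructions or compute addresses; they only apply the operation handed to them, which is the property the paragraph above attributes to data parallel designs.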
Data parallel computers typically use a multitude of bit-serial processors each of which receives data one bit at a time and operates on this data to produce an output one bit at a time.
Bit-serial processors are very simple. A three-input ALU that operates on single-bit quantities has only eight possible combinations of input values. Therefore, an ALU operation such as an Add or a logical OR may be specified by providing the eight-bit truth tables for the particular function. This means that bit-serial processors can be implemented with minimal instruction decoding. There are no carry chains since only one bit from each operand is available on each cycle. This simplicity makes them fast, compact, and easy to implement. Since they implement all possible Boolean operations efficiently, bit-serial processors can support a wide variety of operations and data types. Bit-serial processors also use memory very efficiently because any sized word can be stored without wasting any bits.
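The truth-table approach can be sketched as follows. This is an illustrative model only, not the circuitry of any particular machine: the function names, the ordering of the three inputs within the table index, and the use of one table for the sum output and a second for the carry output are assumptions made for the example.

```python
def alu(truth_table: int, a: int, b: int, c: int) -> int:
    """Return one output bit selected from an eight-bit truth table.

    The three single-bit inputs form an index from 0 to 7 (assumed ordering:
    a is the high bit, c the low bit); bit `index` of the table is the output.
    """
    index = (a << 2) | (b << 1) | c
    return (truth_table >> index) & 1


def make_table(fn) -> int:
    """Build the eight-bit truth table for any three-input Boolean function."""
    table = 0
    for index in range(8):
        a, b, c = (index >> 2) & 1, (index >> 1) & 1, index & 1
        table |= (fn(a, b, c) & 1) << index
    return table


SUM_TABLE = make_table(lambda a, b, c: a ^ b ^ c)                      # 0x96
CARRY_TABLE = make_table(lambda a, b, c: (a & b) | (a & c) | (b & c))  # 0xE8
OR_TABLE = make_table(lambda a, b, c: a | b)                           # 0xFC, ignores c


def bit_serial_add(x: int, y: int, width: int) -> int:
    """Add two unsigned values of any width, one bit per step, modulo 2**width."""
    carry = 0
    result = 0
    for i in range(width):
        a = (x >> i) & 1
        b = (y >> i) & 1
        s = alu(SUM_TABLE, a, b, carry)
        carry = alu(CARRY_TABLE, a, b, carry)  # carried into the next step
        result |= s << i
    return result
```

Because `width` is an ordinary parameter, the same loop handles a 5-bit or a 37-bit operand equally well, which reflects the point that a bit-serial processor is not tied to a fixed word size.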
As shown in FIG. 1A of the above-referenced U.S. Pat. No. 4,598,400, one type of bit-serial parallel computer comprises a mainframe computer 10, a microcontroller 20, and an array 30 of parallel processing integrated circuits 35. Mainframe computer 10 may be a suitably programmed commercially available general purpose computer such as a VAX (TM) computer manufactured by Digital Equipment Corp. Microcontroller 20 is an instruction sequencer of conventional design for generating a sequence of instructions that are applied to array 30 by means of a thirty-two bit parallel bus 22. Microcontroller 20 receives from array 30 a signal on line 26. This signal is a general purpose or GLOBAL signal that can be used for data output and status information. Bus 22 and line 26 are connected in parallel to each IC 35. As a result, signals from microcontroller 20 are applied simultaneously to each IC 35 in array 30 and the signal applied to microcontroller 20 on line 26 is formed by combining the signal outputs from all of ICs 35 of the array.
Array 30 contains thousands of identical ICs 35; and each IC 35 contains several identical processor/memories 36. In the embodiment disclosed in the '400 patent, it is indicated that the array may contain up to 32,768 (=2^15) identical ICs 35; and each IC 35 may contain 32 (=2^5) identical processor/memories 36. At the time of filing of this application for patent, arrays containing up to 4,096 (=2^12) identical ICs 35 containing 16 (=2^4) identical processor/memories each have been manufactured and shipped by the assignee as Connection Machine (Reg. TM) computers.
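The total number of processor/memories implied by these figures follows from multiplying the IC count by the number of processor/memories per IC. The short calculation below makes that arithmetic explicit; the variable names are illustrative only.

```python
# Maximum configuration indicated in the '400 patent.
max_ics = 2**15                 # 32,768 ICs
max_per_ic = 2**5               # 32 processor/memories per IC
max_processors = max_ics * max_per_ic

# Largest configuration stated to have been shipped.
shipped_ics = 2**12             # 4,096 ICs
shipped_per_ic = 2**4           # 16 processor/memories per IC
shipped_processors = shipped_ics * shipped_per_ic

print(max_processors)       # 1048576, i.e. 2**20
print(shipped_processors)   # 65536, i.e. 2**16
```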
Each IC 35 contains a plurality of processor/memories that are disclosed in greater detail in FIG. 7A of U.S. Pat. No. 4,598,400 and in FIGS. 4 and 6 of the '090 application for "Massively Parallel Processor". As shown in FIG. 7A, processor/memory 36 comprises a random access memory (RAM) 250, an arithmetic logic unit (ALU) 280 and a flag controller 290. The inputs to RAM 250 include a message packet input line 122 from a communication interface unit (CIU) 180 of FIG. 6B of that patent; and the outputs from RAM 250 are lines 256, 257 to ALU 280. The ALU operates on data from three sources, two registers in the RAM and one flag input, and produces two outputs, a sum output on line 285 that is written into one of the RAM registers and a carry output on line 287 that is made available to certain registers in the flag controller and can be supplied to communications interface unit 180 via message packet output line 123.
An alternative design for the processor/memory is disclosed in the '090 application for "Massively Parallel Processor". As shown in FIGS. 4 and 6 thereof, the processors and memories are located in separate integrated circuits 334, 340 mounted on the same circuit board. In particular, each integrated circuit 334 comprises sixteen identical processors 336, a control unit 337, a router 338 and a memory interface 339. The memory interface connects the sixteen processors of an integrated circuit 334 to their memories which, illustratively, are located on sixteen separate integrated circuits 340. The router 338 connects the sixteen processors to twelve nearest neighbor routers connected in a twelve-dimension hypercube.
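The nearest-neighbor relationship in a hypercube can be sketched under a common addressing assumption: each router is given a binary address with one bit per dimension, and two routers are neighbors when their addresses differ in exactly one bit. The function name and the addressing scheme are assumptions for illustration; the referenced application should be consulted for the actual wiring.

```python
from typing import List


def hypercube_neighbors(address: int, dimensions: int = 12) -> List[int]:
    """Return the addresses reachable by flipping one address bit per dimension."""
    return [address ^ (1 << d) for d in range(dimensions)]


# A twelve-dimension hypercube has 2**12 = 4,096 nodes, each with twelve neighbors.
node_count = 2**12
neighbors_of_zero = hypercube_neighbors(0)
neighbors_of_last = hypercube_neighbors(node_count - 1)
```

Under this scheme the number of differing address bits between two routers equals the minimum number of hops between them, so no two routers are more than twelve hops apart.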
While a properly programmed bit-serial processor is able to perform many mathematical or logic operations, it has to perform these operations one bit at a time. As a result, it is not able to take advantage of any optimized procedure that might be useful, for example, in multiplying multi-digit numbers. At the same time, as noted above, word-oriented processors, which can be optimized for performing certain functions, are not as flexible as bit-serial processors in performing all types of arithmetic and logic operations.
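The cost of working one bit at a time can be illustrated with a naive shift-and-add multiplication built only from single-bit steps. This is a sketch under stated assumptions, not a description of how any particular bit-serial machine multiplies: the function name, the fixed double-width accumulator, and the step-counting convention (one step per single-bit add) are all hypothetical.

```python
from typing import Tuple


def bit_serial_multiply(x: int, y: int, width: int) -> Tuple[int, int]:
    """Multiply two unsigned `width`-bit values using only single-bit add steps.

    Returns (product, steps), where `steps` counts the single-bit add steps.
    """
    product = 0
    steps = 0
    for i in range(width):                     # one pass per multiplier bit
        addend = (x << i) if (y >> i) & 1 else 0
        carry = 0
        for j in range(2 * width):             # add into a double-width accumulator
            a = (product >> j) & 1
            b = (addend >> j) & 1
            s = a ^ b ^ carry
            carry = (a & b) | (a & carry) | (b & carry)
            product = (product & ~(1 << j)) | (s << j)
            steps += 1
    return product, steps


product, steps = bit_serial_multiply(13, 11, 4)
```

In this naive scheme the step count is 2 × width × width, so it grows with the square of the operand width (32 steps for 4-bit operands, 2,048 for 32-bit operands), whereas a word-oriented multiplier can handle many bits of each operand per cycle.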