This invention relates to array processors in general and more particularly to a cellular array processor having a highly parallel, highly regular design with a single instruction multiple data (SIMD) architecture.
In the present state of the technology, associative processors and array processors have been widely investigated. Essentially, such a processor comprises a plurality of individual processing cells arranged in a matrix. This combination of processing cells can be programmed to solve complex mathematical problems. Many excellent articles in the literature relate to various forms of such processors.
For example, reference is made to an article entitled "Multiprocessing Technology" by Chuan-Lian Wu, which appeared in the June 1985 issue of IEEE COMPUTER. A further article, entitled "Parallel Processing Gets Down to Business" by E. J. Lerner, appeared in HIGH TECHNOLOGY, July 1985, on pages 20-28.
Such processors, while capable of solving complicated problems, carry many different characteristics and requirements. Many present processors employ the single instruction, multiple data (SIMD) architecture. This particular architecture is well suited for regular applications. It is inherently highly structured and can be configured into different sizes without much additional cost.
In applications suited to such a structure, the data elements are processed in large blocks, the volume of the input data is very large, the desired response time may be very short and critical, and the computation requirements per datum are relatively uniform. Within SIMD machines there are both array processors and cellular array processors. Array processors generally have a high performance pipeline of arithmetic elements, little parallelism, and operate upon an array of data.
A cellular array processor is highly parallel, having an array of processors each operating upon an array of data. This multiplicity of processors benefits greatly from highly structured VLSI design, especially as extended by the fault tolerance techniques to be described.
As indicated, the prior art has provided numerous types of array processors. There are, however, only a few cellular array processors. One such device is manufactured by Goodyear and designated as the MPP. See an article entitled "Design Of A Massively Parallel Processor" which appeared in the IEEE COMPUTER SOCIETY, 1980, pages 80 to 85 by K. E. Batcher. This article describes a cellular array processor.
Such processors operate by storing and processing data streams. The above described processor is designed to operate in a bit serial, word parallel fashion. Each word is stored one bit after another through a succession of memory locations. This increases operating time and presents a number of problems in construction. Hence the processor to be described in this application operates in a bit parallel, word parallel manner and, therefore, has more flexibility in memory addressing and can be programmed in a simpler and more efficient manner.
According to this invention, an array chip is provided which will be utilized as a building block in a highly parallel processor of the cellular array type. The processor according to the invention employs a single instruction multiple data (SIMD) architecture. Such a structure requires a multiplicity of arithmetic logic units and memory to operate in parallel on multiple data streams from a single instruction stream. Such a system requires a large number of identical processing elements.
These processing elements must be highly interconnected so that they may flexibly pass data between one another. In addition, it is imperative that a high speed means of moving data into and out of the machine be provided so that the processing elements can be fully and efficiently employed.
Thus, as will be shown, the architecture utilizes the processing elements in a most efficient manner, thereby preventing the processing elements from being idle for long periods of time.
As one will ascertain, it is therefore one object of the present invention to maximize the number of processing elements that may be integrated into a single integrated circuit.
It is a further object to maximize the performance of each of these processing elements.
It is still another object to provide local memory for the processing elements on the same chip so that no delays are encountered in going off the chip to acquire the data.
As will be explained, a high speed input/output structure is provided to allow one to move new data into and out of the on-board memory. As will be seen, a typical system employs twenty 16-bit processor cells on a single array chip, with the chip having 256K bits of DRAM (Dynamic Random Access Memory) available to the user. The number of cells and the amount of DRAM are relatively arbitrary.
The Preferred Embodiment requires at least eighteen 16-bit processors plus two spare processors on a chip. The processors are 16-bits wide in order to maximize the performance of floating point arithmetic wherein for both single precision and double precision operation, the exponent is contained within the most significant 16 bits of the word. The structure utilizes a dynamic fault tolerance technique which provides software control of the array configuration. Any number of cells in one chip may operate together to increase word size, although the typical configuration would be sixteen 16-bit processors, eight 32-bit processors, or four 64-bit processors.
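By way of illustration only, the word-size configurations named above can be checked with a short calculation. The sketch assumes sixteen active 16-bit cells are available for data words, as in the typical configuration; the function name `words_per_chip` and its parameters are hypothetical and do not appear in the specification.

```python
CELL_BITS = 16      # width of one processor cell, per the text
ACTIVE_CELLS = 16   # assumed number of cells available for data words


def words_per_chip(word_bits, active_cells=ACTIVE_CELLS, cell_bits=CELL_BITS):
    """Return how many words of `word_bits` bits the active cells can form."""
    if word_bits <= 0 or word_bits % cell_bits != 0:
        raise ValueError("word size must be a positive multiple of the cell width")
    cells_per_word = word_bits // cell_bits
    return active_cells // cells_per_word


if __name__ == "__main__":
    for bits in (16, 32, 64):
        print(bits, "bit words:", words_per_chip(bits))
```

Under these assumptions the three typical configurations correspond to one, two, or four cells ganged together per word.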
In addition, two of the processors cooperate to generate addresses. These addresses may be used to address the on-board memory or to address data that is off the array chip. When data is being addressed from off the array chip, the main memory bus of the chip operates in a time division multiplex fashion wherein a succession of memory cycles is required to provide the data to all of the cells on board the array chip. This time division multiplexing therefore dramatically increases the time required to fetch data for all of the cells, and the objective in programming this machine is to minimize the number of such memory accesses.
The provision of two spare processing elements on the chip as a means of overcoming manufacturing defects dramatically increases the number of processors that may be economically placed on a chip. This provision furthermore improves the performance and reduces the size of the system by enabling a large number of processors to be co-located on a single chip rather than being contained in multiple chips.
Many pins coming into the chip are common to all of the processing cells on the chip. If one instead had a single 32-bit processor with memory on a chip, one would need to replicate the bus connections and instruction connections on each chip in order to provide the same connectivity as the present chip. Therefore, one would need at least eight chips, rather than a single chip, each having roughly 100 pins, in order to accomplish the same functional operation.
It is, therefore, another object of the present apparatus to provide maximum performance by having a very large number of very inexpensive processors, each equipped with a modest amount of memory in the one kiloword region, although this amount is also arbitrary and depends upon the current state of RAM fabrication capability. Static or dynamic RAM designs may be used. It is a further object of this invention to minimize pin count and to reduce power and radiated noise. In order to accomplish this, there will be described a 2:4 level converter. This 2:4 level converter enables one to reduce the device pin count and thus reduce the package size and cost.
In addition, by employing this converter, one can use a technique which will be designated as a 2/3/4 bus architecture, which provides for multiple ways of signalling on as many as four individual buses. The strategy is to provide a means of passing a maximum amount of data on a minimum number of pins, which pins are associated with chips that are closely located. That is, this is a technique that would be applicable on a single circuit board rather than across a multiplicity of circuit boards.
In this sense of having a single circuit board, it is as though these various chips were, in fact, on the same wafer, in that a signalling scheme is devised which is not intended for general use. This signalling scheme places two data bits on a single pin. In conventional interface levels such as TTL, one has either a logic 0 or a logic 1 placed on a particular pin. In the technique to be described, one of four logic levels, namely logic 00, logic 01, logic 10 and logic 11, is placed on a single pin. In effect, there is a 2-bit digital-to-analog converter that places information on a pin, and likewise there is a 2-bit analog-to-digital converter that receives the information from a pin. The D-to-A converter is designed in such a way that a minimum of power is consumed by providing a multiplicity of power pins, one for each voltage level, rather than having an analog circuit. The noise immunity of such a system is more than sufficient in a closely located environment where one is not contending with back plane noise.
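As an illustrative sketch only, the two-bits-per-pin scheme can be modelled as packing a word into four-level symbols and unpacking it again. The assignment of bit pairs to level indices 0 through 3, the ordering of pairs across pins, and all function names here are assumptions made for the example; the text does not specify them.

```python
def encode_pin(bit_hi, bit_lo):
    """Model of the 2-bit D-to-A: map two bits to a level index 0..3 (assumed mapping)."""
    return (bit_hi << 1) | bit_lo


def decode_pin(level):
    """Model of the 2-bit A-to-D: recover the two bits from a level index."""
    return (level >> 1) & 1, level & 1


def word_to_levels(word, pins=8):
    """Split a word of 2*pins bits into `pins` four-level symbols, most significant pair first."""
    return [(word >> (2 * (pins - 1 - i))) & 0b11 for i in range(pins)]


def levels_to_word(levels):
    """Reassemble the word from its four-level symbols."""
    word = 0
    for level in levels:
        word = (word << 2) | level
    return word


if __name__ == "__main__":
    symbols = word_to_levels(0xBEEF)
    print(symbols)
    print(hex(levels_to_word(symbols)))
```

With eight pins, a 16-bit word travels in a single transfer, which is the trade the four-level scheme makes against a conventional two-level interface.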
The four level signals are intended for communication between like chips. For communication with dissimilar chips, a 2:4 level converter buffer chip is required. Another feature of providing four logic levels on a pin is that the noise generation is reduced, since the average voltage transition is one half of the power supply rather than equal to the power supply as in a conventional CMOS chip. This furthermore reduces the power that is necessary to drive a line, since for a highly capacitive line the energy consumed is proportional to the capacitance of the line times the voltage squared. By halving the voltage swing, the energy consumed per transition falls to roughly one quarter.
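The capacitance-times-voltage-squared relationship can be illustrated numerically. This is a minimal sketch: the 5 V supply and the 50 pF line capacitance are assumed values chosen only for the example, and the constant of proportionality is omitted since only the ratio matters.

```python
def switching_energy(capacitance_farads, swing_volts):
    """Energy per transition, proportional to C * V^2 (constant factor omitted)."""
    return capacitance_farads * swing_volts ** 2


VDD = 5.0          # assumed supply voltage, in volts
LINE_C = 50e-12    # assumed line capacitance, in farads

full_swing = switching_energy(LINE_C, VDD)
half_swing = switching_energy(LINE_C, VDD / 2)
ratio = half_swing / full_swing

if __name__ == "__main__":
    print("energy ratio for half swing:", ratio)
```

The ratio does not depend on the assumed capacitance or supply; halving the swing quarters the energy for any line.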
There is a compatibility mechanism provided in the chip whereby either a small number of 2-state buses or a large number of 4-state buses are provided. In addition, the high speed input/output (I/O) bus, which is narrower than the other buses, is controlled in such a way that it may be used to carry either a byte at a time at two levels or a 16-bit word at a time at four levels. This enables one to trade off interface levels against bus bandwidth, provided that the number of transfers per second is the same.
The clock rate of these buses is minimized by providing four levels rather than half the number of bits at double the clock rate. This choice avoids difficulties with clock skew between multiple chips at higher clock rates, and the higher clock rate would also have dramatically increased the power required to drive the bus. Additionally, the driving elements would have to be much larger in order to provide the very fast response time that a double frequency clock would have required.
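The byte-versus-word trade-off on the I/O bus follows from the number of bits each pin carries per transfer. The sketch below assumes an eight-pin I/O bus, inferred from the statement that it carries a byte at a time at two levels; the function name is hypothetical.

```python
def bits_per_transfer(pins, levels):
    """Bits moved in one bus transfer when each pin carries one of `levels` states."""
    if levels < 2 or levels & (levels - 1) != 0:
        raise ValueError("levels must be a power of two, at least 2")
    bits_per_pin = (levels - 1).bit_length()  # 2 levels -> 1 bit, 4 levels -> 2 bits
    return pins * bits_per_pin


IO_PINS = 8  # assumed pin count of the high speed I/O bus

if __name__ == "__main__":
    print("two-level:", bits_per_transfer(IO_PINS, 2), "bits per transfer")
    print("four-level:", bits_per_transfer(IO_PINS, 4), "bits per transfer")
```

At a fixed number of transfers per second, moving from two levels to four therefore doubles the bandwidth without adding pins or raising the clock rate.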
Thus in order to improve operation and in order to provide for a simple architecture in a cellular array processor, there are shown various techniques which are incorporated in the present specification. These techniques will be explained in detail which will enable one to provide the above described desirable features and hence improve operation as well as reduce cost.
One aspect of this invention is to provide processing elements on a single integrated circuit chip. These processing elements are controlled by software to overcome manufacturing defects, to cooperate together to form words of varying sizes, and to replace cells that become defective during the lifetime of the processor, thereby prolonging the effective life of the machine.
These cells communicate with external memory via a time division multiplex bus. The bus is 32 bits wide, and each cell is connected to both the upper half and the lower half of the bus. According to configuration bits that are loaded into the cell, the cell will communicate over the top or the bottom half of the bus according to the significance of the bits placed in the cell. Hence such cells will form words between 16 bits and 256 bits in the case where 20 such cells are implemented on a single chip with 4 of the cells being deemed to be spare parts.
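A rough sketch of the time division multiplexing follows. It assumes that each memory cycle carries two 16-bit slices, one on each half of the 32-bit bus, and that a slice selects its half by the parity of its position within the word. That parity rule is purely an assumption for illustration, as are the names `bus_half` and `bus_cycles`; the text states only that the selection depends on configuration bits and on significance.

```python
import math

CELL_BITS = 16   # width of one cell, per the text
BUS_BITS = 32    # width of the external memory bus, per the text


def bus_half(slice_index):
    """Assumed rule: even-numbered slices drive the lower half, odd-numbered the upper."""
    return "lower" if slice_index % 2 == 0 else "upper"


def bus_cycles(active_cells):
    """Memory cycles needed to serve every active cell once, two slices per cycle."""
    cells_per_cycle = BUS_BITS // CELL_BITS
    return math.ceil(active_cells / cells_per_cycle)


if __name__ == "__main__":
    print([bus_half(i) for i in range(4)])
    print("cycles for 16 cells:", bus_cycles(16))
```

Under these assumptions, a full off-chip fetch for sixteen cells costs eight memory cycles, which is why minimizing such accesses matters when programming the machine.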
A second technique, to be described, involves placing substantial amounts of dynamic random access memory (DRAM) on board the same chip. It is a key point of the disclosure, as will be explained, that two of the 16-bit cells may cooperate together as an address generator so that large amounts of memory external to the array chip may be addressed, and in addition so that an address may be generated on board for use by the DRAM.
A further aspect of the invention is a technique and apparatus that is integrated and employed with the multiplicity of dynamically reconfigurable 16-bit slices which will enable and disable arbitrary collections of processing cells to respond according to the data on which they are operating. The objective is to allow a collection of word sizes to be defined and then for certain of those processing elements to be enabled or disabled according to the data that they are operating on. As will be explained, this technique allows one to perform complicated functions while providing for a most efficient use of all processor cells located in the array.
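The data-dependent enabling of processing elements can be pictured as a masked SIMD step: every element receives the same instruction, but only the enabled elements apply it. The following is a minimal software sketch of that idea, not a description of the hardware mechanism; the names `simd_step`, `data` and `mask` are hypothetical.

```python
def simd_step(values, operation, enabled):
    """Apply one instruction to every enabled element; disabled elements keep their value."""
    return [operation(v) if on else v for v, on in zip(values, enabled)]


# Each element decides from its own data whether it takes part in the next step.
data = [5, -3, 8, -1]
mask = [v < 0 for v in data]                  # enable only elements holding negative values
result = simd_step(data, lambda v: -v, mask)  # negate where enabled: an absolute value

if __name__ == "__main__":
    print(result)
```

Here a single "negate" instruction, combined with a mask computed from the data, yields an absolute value across the whole array without any element branching independently.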
A further aspect of the invention is the ability of each of the array chips to have programmed into it, at the time of manufacturing tests, the location of its defective elements. This data may be read out at system initialization time so that tests do not need to be performed in order to redetermine the location of defective elements. Furthermore, a technique is described wherein a collection of these chips, each presumably having a different collection of defective elements, may be combined together in a system with a simple means provided to read out the defect information from all of the chips.
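One way software might use the stored defect information is to build a table of working cells at initialization and draw the required cells from it. The sketch below is an assumption about such software, not the method of the specification; the function name and the choice of assigning the lowest-numbered good cells first are hypothetical.

```python
def map_logical_cells(total_cells, defective, needed):
    """Return the physical cell numbers assigned to logical cells 0..needed-1."""
    good = [cell for cell in range(total_cells) if cell not in defective]
    if len(good) < needed:
        raise ValueError("not enough working cells on this chip")
    return good[:needed]


if __name__ == "__main__":
    # Example: a 20-cell chip whose stored defect data marks cells 3 and 11 as bad.
    mapping = map_logical_cells(total_cells=20, defective={3, 11}, needed=18)
    print(mapping)
```

Because the defect locations are read out rather than rediscovered, a table of this kind can be built for every chip in the system without rerunning tests.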
A further aspect of the invention is the provision for testing multiple cells simultaneously in order to reduce the test time. On-chip test logic is provided so that the outputs of multiple cells may be monitored simultaneously on a common bus. As defective elements are located, they may be excluded under software control so that testing of the remaining elements may proceed simultaneously. Substantial reductions in test time may thus be obtained, reducing the cost of the chips.
A further aspect of the invention is the unique structure of the multiport RAM. A memory with two read ports and one write port is built from static memory cells where both read ports are used to read out two different locations, and are then used in concert to write into a single location.
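A simple behavioural model of a memory with two read ports and one write port is sketched below. It captures only the externally visible behaviour (two locations read together, one location written) and says nothing about how the read ports are combined at the circuit level to perform the write. The class name, the 16-bit word width, and the memory depth are assumptions for the example.

```python
class MultiportRAM:
    """Behavioural model: two simultaneous reads, one write, 16-bit words (assumed width)."""

    WORD_MASK = 0xFFFF

    def __init__(self, words):
        self.cells = [0] * words

    def read2(self, addr_a, addr_b):
        """Read two (possibly different) locations in the same cycle."""
        return self.cells[addr_a], self.cells[addr_b]

    def write(self, addr, value):
        """Write a single location, truncating the value to the word width."""
        self.cells[addr] = value & self.WORD_MASK


if __name__ == "__main__":
    ram = MultiportRAM(words=32)
    ram.write(1, 7)
    ram.write(2, 35)
    a, b = ram.read2(1, 2)   # fetch two operands together
    ram.write(3, a + b)      # store the single result
    print(ram.read2(3, 3))
```

The usage shown reflects the usual reason for such a memory: an arithmetic operation needs two operands at once and produces one result.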