This invention relates to computer structures, and in particular to a parallel processing memory chip containing single instruction, multiple data path processors.
In conventional Von Neumann computer architectures, the speed of the processor is often restricted by the bandwidth of the interconnecting data bus, which is typically 8 to 64 bits in word width. In order to increase the speed of computers restricted by such constraints, parallel computer architectures have been designed, for example, those described briefly below.
In a structure called The Connection Machine, 64K processors are used with 4K bits of memory allocated to each processor. The memory permits two read functions and a write function in one processor cycle to support three operand instructions. The Connection Machine integrated circuit chip contains 16 processors and a hypercube routing node. A high performance interconnect network is a major feature of the architecture. The peak performance of The Connection Machine is about 1,000 MIPS, using a 32 bit addition function as a reference. A description of The Connection Machine may be found in the Scientific American article "Trends in Computers", by W. Daniel Hillis, Special Issue/Vol. 1, page 24ff.
A structure referred to as the Massively Parallel Processor (MPP) constructed by Goodyear Aerospace contains several 128×128 processor planes. The MPP was designed to process Landsat images; it makes heavy use of its two dimensional grid connectivity. Processors are packaged eight to a chip.
The ICL Distributed Array Processor was designed to be an active memory module for an ICL type 29000 mainframe. Its first implementation was a 32×32 grid built from MSI TTL components. A CMOS version has since been made containing 16 processors. Each 1 bit processor consists of a full adder, a multiplexer to select data from neighbors, and three registers.
A computer MP-1 is described by MasPar Computer Corporation in preliminary product literature, the product being formed of chips containing 32 processors which will be assembled into machines with 1K-16K processors. The machine utilizes two instruction streams. Each processing element can elect to obey either of the streams, so both halves of an if-then-else statement can be concurrently followed without nesting.
NCR Corporation has produced a chip containing 6×12 serial processors which is called the Geometric Arithmetic Parallel Processor (GAPP). Each processor can communicate with its four nearest neighbors on its two dimensional grid and with a private 128 bit memory. The processing elements operate on instructions with five fields. Due to their complexity, these processing elements take up slightly more than half the chip. It has been found that yields are low and the cost is high.
In an article entitled "Building a 512×512 Pixel-Planes System" in Advanced Research in VLSI: Proceedings of the 1987 Stanford Conference, pages 57-71, 1987, by John Poulton et al, a pixel planes machine is described which integrates processing elements with memory. The machine was designed for computer graphics rendering. The pixel planes machine is connected to a host processor via a DMA channel. It is noted that for many operations, data transfer between the host and the pixel planes machine dominates the execution time.
In the aforenoted structures, while each uses plural processors, separate memory is accessed by the processors. Locating memory on different chips than the processor elements limits the degree of integration. The data path between the memory chips and the processors limits the bandwidth available at the sense amplifiers. In contrast, in an embodiment of the present invention, one processing element per sense amplifier can be achieved, the processing elements carrying out the same instruction on all bits of a memory row in parallel. Therefore an entire memory row (e.g. word) at a time can be read and processed in a minimum time, maximizing the parallel processing throughput to virtually the maximum bandwidth capacity of the memory.
While in prior art structures an entire memory row is addressed during each operation, typically only one bit at a time is operated on. The present invention exploits the unused memory bandwidth by operating on all bits in the entire row in parallel. Further, the memory is the same memory accessed by the main computer processor, and not special memory used for the parallel processing elements as in the prior art.
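The unused bandwidth referred to above can be illustrated with a small arithmetic sketch. The row width of 1024 bits and the function name `bandwidth_utilization` are assumptions chosen for illustration only; they do not appear in the description.

```python
def bandwidth_utilization(row_width_bits: int, bits_operated_on: int) -> float:
    """Fraction of an addressed memory row that is actually operated on."""
    if not 0 <= bits_operated_on <= row_width_bits:
        raise ValueError("bits_operated_on must lie between 0 and row_width_bits")
    return bits_operated_on / row_width_bits


# Hypothetical row width, chosen only for illustration.
ROW_WIDTH = 1024

# Prior art case: the whole row is addressed but one bit is operated on.
one_bit_at_a_time = bandwidth_utilization(ROW_WIDTH, 1)

# Row-parallel case: every bit of the addressed row is operated on.
whole_row = bandwidth_utilization(ROW_WIDTH, ROW_WIDTH)

print(one_bit_at_a_time, whole_row)
```

Under these assumed figures, operating on one bit per row access uses 1/1024 of the bits delivered to the sense amplifiers, whereas operating on the entire row uses all of them.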
By locating the processors on the same chip as the memory, the present invention exploits the extremely wide data path and high data bandwidth available at the sense amplifiers.
In one embodiment of the present invention, integrated into the memory chip is one processing element per sense amplifier. The memory is preferred to be the main computer memory, accessible by the central processing unit.
Alternatively, each processor element can be connected to more than one sense amplifier. When sense amplifiers belong to different arrays (or "cores") of memory, some of those cores need not perform a memory cycle, thereby reducing sensing power draw from a power supply.
In the prior art each parallel processor has its own memory, and the processors must communicate with each other, slowing down communication and being limited by inter-processor bus word length. In the present invention the main memory is used directly and may be accessed by a conventional single microprocessor at the same rate as conventional memories. Yet virtually the maximum bandwidth of the memory can be utilized using the parallel on-chip processing elements.
It should be noted that in the aforenoted NCR GAPP device, processors are located on the same chip as the memory. However because of the size of the processors, each processor communicates with 8 sense amplifiers, and requires extensive multiplexing. This slows the chip down because the maximum bandwidth of the memory cannot be utilized. In order to minimize the number of sense amplifiers dealt with by a single processor, the structure is limited to use with static memory cells, since the static memory cells are considerably wider in pitch than dynamic memory cells. Still, a very large number of sense amplifiers must be multiplexed to each processor element. Due to the smaller sense amplifier pitch required in a prior art DRAM chip, processors have not been put into a DRAM chip.
The present invention utilizes a unique form of processing element, based on a dynamic multiplexer, which we have found can be made substantially narrower in pitch than previous processing elements, such that the number of sense amplifiers per processing element can be reduced to 1 for static random access memories, and to 4 or fewer for dynamic random access memories. For the 1:1 ratio no multiplexing is required, and therefore in 1 memory cycle, with a single instruction given to all the processing elements, all the bits of a row can be read, processed and written back to memory in parallel. For the larger ratio, multiplexing of sense amplifiers to processing elements is required, but for the first time dynamic random access memories can have processing elements on the same chip, and can have a substantially increased number of parallel processing elements. For the dynamic memory, a typical ratio of sense amplifiers to processing elements would be 8:1 or 4:1, although as close to 1:1 as possible is preferred. The bandwidth of the processor to memory interface is thereby substantially increased, enormously increasing the processing speed.
Further, the invention allows direct memory access of the same memory having the on-chip processors by a remote processor. This renders the memory even more versatile, allowing flexibility in programming and applications.
In accordance with another embodiment of the invention, a novel simultaneous bidirectional buffer is described, which can logically connect two buses and actively drive the signal in either direction, either into or out from each processing element without prior knowledge of which direction the signal must be driven. Previously, bidirectional bus drivers utilized transmission gates or pass transistors, or bidirectional drivers which amplify but must be signalled to drive in one direction or the other.
As a result, the present invention provides a memory bandwidth or data rate which is several orders of magnitude higher than the bandwidth available with off-chip processing elements and prior art parallel processing designs. This is obtained in the present invention by connecting an on-chip processing element to each sense amplifier of a static random access memory, or to a very few of a dynamic random access memory. Each time the number of sense amplifiers per processing element doubles, the performance is halved. Wider processing elements are achieved to the detriment of speed. For this reason it is preferred that the number of sense amplifiers connected to each processing element should be no greater than four. Nevertheless it is preferred that there should be an equal number of processing elements, e.g. 1, for each sense amplifier (memory bit line). The processing elements thus each process a word 1 bit wide.
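The relationship between the number of sense amplifiers per processing element and performance can be sketched in a short simulation. The sketch assumes that a processing element multiplexed across n sense amplifiers needs n sequential passes to cover its share of a row; the names `process_row` and `relative_throughput`, and the 8 bit example row, are hypothetical and are not taken from the description.

```python
def process_row(row, op, sense_amps_per_pe):
    """Apply one-bit operation `op` to every bit of `row`.

    Each simulated processing element is multiplexed across
    `sense_amps_per_pe` adjacent sense amplifiers, so it handles one of
    them per pass. Returns the processed row and the number of passes.
    """
    if sense_amps_per_pe < 1 or len(row) % sense_amps_per_pe != 0:
        raise ValueError("row width must be a multiple of sense_amps_per_pe")
    num_pes = len(row) // sense_amps_per_pe
    result = list(row)
    passes = 0
    for slot in range(sense_amps_per_pe):
        # In one pass all processing elements work in parallel, each on the
        # sense amplifier at offset `slot` within its own group.
        for pe in range(num_pes):
            column = pe * sense_amps_per_pe + slot
            result[column] = op(row[column])
        passes += 1
    return result, passes


def relative_throughput(sense_amps_per_pe):
    """Throughput relative to the 1:1 case, assuming one pass per slot."""
    return 1.0 / sense_amps_per_pe


def invert(bit):
    return bit ^ 1


example_row = [0, 1, 1, 0, 1, 0, 0, 1]
for ratio in (1, 2, 4):
    _, passes = process_row(example_row, invert, ratio)
    print(ratio, passes, relative_throughput(ratio))
```

Under this assumption each doubling of the ratio doubles the number of passes and halves the relative throughput, which is consistent with the preference for no more than four sense amplifiers per processing element.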
A novel processing element has been realized using a dynamic logic multiplexer for performing arithmetic and logical (ALU) operations, which results in a physically narrow processor element design. In an embodiment of the present invention the ALU instruction is multiplexed through the address pins in the memory. This considerably reduces the number of pins required per chip. In addition, one or a multiple of columns can be selected for read, write or communication with separate control of address lines and their complements.
Due to system power constraints and integrated circuit pin current constraints, high density dynamic random access memories (DRAMs), for example in excess of 256 Kb, typically use only half or fewer of the sense amplifiers per memory cycle. It is desirable in an embodiment of the present invention to have all processing elements active in each cycle. In one embodiment of the present invention, half of the sense amplifiers and half of the memory element arrays can be active during each cycle, and the processing elements communicate with either of two adjacent memory arrays. Only one of those two memory element arrays has its bit lines precharged or has a word line asserted.
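One way a multiplexer can serve as an ALU is sketched below. It is an illustration under stated assumptions, not the circuit of the invention: the sketch assumes three one-bit inputs acting as the select lines of an 8-to-1 multiplexer and an 8 bit instruction acting as a truth table on the data inputs. The function name `mux_alu`, the input names and the opcode constants are hypothetical.

```python
def mux_alu(truth_table: int, a: int, b: int, c: int) -> int:
    """8-to-1 multiplexer used as a one-bit ALU.

    The three one-bit inputs form a select value from 0 to 7, and the
    output is the bit of the 8 bit `truth_table` at that position.
    """
    select = (a << 2) | (b << 1) | c
    return (truth_table >> select) & 1


# Bit i of each opcode is the desired output for select value i.
XOR3 = 0b10010110  # sum output of a full adder: a XOR b XOR c
MAJ3 = 0b11101000  # carry output of a full adder: majority of a, b, c
AND_AB = 0b11000000  # a AND b, ignoring c

# Adding the one-bit values 1 and 1 with a carry-in of 0.
sum_bit = mux_alu(XOR3, 1, 1, 0)
carry_bit = mux_alu(MAJ3, 1, 1, 0)
print(sum_bit, carry_bit)
```

Because any function of three one-bit inputs is just a different 8 bit truth table, the same multiplexer covers both arithmetic and logical operations in this model, and a bit-serial addition needs only the two opcodes shown for sum and carry.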
In an embodiment of the present invention two processing elements are stacked to permit plural ones per memory array, permitting use of wider processing elements.
In another embodiment a processing element can be connected to more than one memory array permitting some memory arrays to be inactive during a given processor/memory cycle, thus offering potential saving of power.
In summary, an embodiment of the invention is a random access memory chip comprised of static random access storage elements, word lines and bit lines being connected to the storage elements, a sense amplifier connected to corresponding bit lines, a separate processor element connected to each of the sense amplifiers, apparatus for addressing a word line, and apparatus for applying a single instruction to the processor elements, whereby the instructed processor elements are enabled to carry out a processing instruction in parallel on separate bits stored in the storage elements of the addressed word line.
In accordance with an embodiment of the invention, a method of operating a digital computer comprises: addressing a memory; reading a row of data from the memory; providing the same computational instruction simultaneously to each processor element of a plurality of processor elements, each of said processor elements being selectively coupled to a corresponding bit of said memory row of data; performing the same computational operation on a selected plurality of bits of the data in parallel to provide a result; and writing said result in the memory at the same address from which the selected plurality of bits were read.
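The steps of this method can be sketched in software as a read, process and write-back sequence on one row. The sketch models the memory as a list of rows of bits and the selective coupling as a per-column mask; these representations, and the name `simd_row_op`, are assumptions made for illustration.

```python
def simd_row_op(memory, row_address, instruction, select_mask):
    """Read a row, apply one instruction to the selected bits, write it back.

    `memory` is a list of rows, each a list of bits. `instruction` is a
    one-bit function given identically to every processing element.
    `select_mask` holds 1 for each column whose processing element is
    coupled to its bit and 0 for each column left unchanged.
    """
    row = memory[row_address]  # address the memory and read the row
    if len(select_mask) != len(row):
        raise ValueError("select_mask must have one entry per column")
    # Every selected column applies the same instruction to its own bit.
    result = [instruction(bit) if selected else bit
              for bit, selected in zip(row, select_mask)]
    memory[row_address] = result  # write back at the same address
    return result


memory = [[0, 1, 1, 0],
          [1, 1, 0, 0]]
# Invert the bits in columns 0 and 2 of row 1; row 0 is not touched.
simd_row_op(memory, 1, lambda bit: bit ^ 1, [1, 0, 1, 0])
print(memory)
```

In this model the list comprehension stands in for the processing elements acting in parallel; in hardware all selected columns would be processed in the same cycle rather than one after another.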