During the detailed discussion of our inventions, we will reference other work including our own unpublished works, as mentioned above. These background literature references are incorporated herein by reference.
In the never-ending quest for faster computers, engineers are linking hundreds, and even thousands, of low cost microprocessors together in parallel to create super supercomputers that divide in order to conquer complex problems that stump today's machines. Such machines are called massively parallel. Multiple computers operating in parallel have existed for decades. Early parallel machines included the ILLIAC, which was started in the 1960s. Other multiple processors include (see a partial summary in U.S. Pat. No. 4,975,834 issued Dec. 4, 1990 to Xu et al) the Cedar, Sigma-1, the Butterfly and the Monarch, the Intel iPSC, the Connection Machines, the Caltech COSMIC, the N Cube, IBM's RP3, IBM's GF11, the NYU Ultra Computer, the Intel Delta and Touchstone.
Large multiple processors beginning with ILLIAC have been considered supercomputers. Supercomputers with the greatest commercial success have been based upon multiple vector processors, represented by the Cray Research Y-MP systems, the IBM 3090, and other manufacturers' machines including those of Amdahl, Hitachi, Fujitsu, and NEC.
Massively Parallel (MP) processors are now thought of as capable of becoming supercomputers. These computer systems aggregate a large number of microprocessors with an interconnection network and program them to operate in parallel. There have been two modes of operation of these computers: some of these machines have been MIMD mode machines, and some have been SIMD mode machines. Perhaps the most commercially acclaimed of these machines has been the Connection Machines series 1 and 2 of Thinking Machines, Inc. These have been essentially SIMD machines. Many of the massively parallel machines have used microprocessors interconnected in parallel to obtain their concurrency or parallel operations capability. Intel microprocessors like the i860 have been used by Intel and others. N Cube has made such machines with Intel '386 microprocessors. Other machines have been built with what is called the "transputer" chip. The Inmos Transputer IMS T800 is an example. The Inmos Transputer T800 is a 32 bit device with an integral high speed floating point processor.
As an example of the kind of systems that are built, several Inmos Transputer T800 chips each would have 32 communication link inputs and 32 link outputs. Each chip would have a single processor, a small amount of memory, and communication links to the local memory and to an external interface. In addition, in order to build up the system, communication link adaptors like the IMS C011 and C012 would be connected. In addition, switches like an IMS C004 would be provided to supply, say, a crossbar switch between the 32 link inputs and 32 link outputs to provide point to point connection between additional transputer chips. In addition, there will be special circuitry and interface chips for transputers, adapting them to be used for a special purpose tailored to the requirements of a specific device, such as a graphics or disk controller. The Inmos IMS M212 is a 16 bit processor, with on chip memory and communication links. It contains hardware and logic to control disk drives and can be used as a programmable disk controller or as a general purpose interface. In order to use the concurrency (parallel operations), Inmos developed a special language, Occam, for the transputer. Programmers have to describe the network of transputers directly in an Occam program.
Some of these MP machines use parallel processor arrays of processor chips which are interconnected with different topologies. The transputer provides a crossbar network with the addition of IMS C004 chips. Some other systems use a hypercube connection. Others use a bus or mesh to connect the microprocessors and their associated circuitry. Some have been interconnected by circuit switch processors that use switches as processor addressable networks. Generally, as with the 14 RISC/6000s which were interconnected last fall at Lawrence Livermore by wiring the machines together, the processor addressable networks have been considered as coarse-grained multi-processors.
Some very large machines are being built by Intel and nCube and others to attack what are called "grand challenges" in data processing. However, these computers are very expensive. Recent projected costs are on the order of $30,000,000.00 to $75,000,000.00 (Tera Computer) for computers whose development has been funded by the U.S. Government to attack the "grand challenges". These "grand challenges" would include such problems as climate modeling, fluid turbulence, pollution dispersion, mapping of the human genome and ocean circulation, quantum chromodynamics, semiconductor and supercomputer modeling, combustion systems, vision and cognition.
One problem area involved in the implementation of a massively parallel processing system is visual information processing which can be considered to consist of three different processing domains: image processing, pattern recognition, and computer graphics. The merger of image processing, pattern recognition and computer graphics is referred to as image computing and represents a capability required by the multimedia workstations of the future. "Multimedia refers to a technique that presents information in more than one way, such as via images, graphics, video, audio, and text, in order to enhance the comprehensibility of the information and to improve human-computer interaction" (See Additional Reference 1).
Sorting is another area suitable for massively parallel processing.
Problems addressed by our Massively Parallel Multiple-Folded Clustered Processor Mesh Array
It is a problem for massively parallel array processors to attack adequately the image processing, finite difference method problems, and sorting problems which exist.
One particular algorithm used in image processing is convolution, which replaces each image pixel value with a weighted sum of the pixels in a defined surrounding area or window of pixels. An M×M square convolution window consists of a set of M×M weights, each corresponding to the associated pixel located in the window (Additional Cypher et al.). For an N by N array of pixels, the convolution algorithm requires M²N² multiplication operations. Assuming an N of 1024 and an M of 3, a single image frame convolution would take approximately 9 million multiplications and sum of product calculations, and if the processing is on video data occurring at a rate of 30 frames per second, then approximately 270 million multiplications and sum of product calculations per second would be required. For a uniprocessor to process this data, where each convolution window weight value must be fetched separately, with the multiply and add treated as separate operations, and followed by a write of the weighted average pixel result, the convolution would consist of 27 separate operations per pixel (9 reads, 9 multiplies, 8 adds, and 1 write), resulting in 27 million × 30 operations per second, or 810 million operations per second (Additional Gove et al.). Due to the high computational load, special purpose processors have been proposed to off-load the image processing task from the system processor and to provide the adequate throughput required for image computing. One of these special purpose processors is the nearest neighbor mesh connected computer (See Additional Cypher et al., Batcher, and Uhr-pp. 97), where multiple Processor Elements (PEs) are connected to their north, south, east, and west neighbor PEs and all PEs are operated in a synchronous Single Instruction Multiple Data (SIMD) fashion. It is assumed that a PE can communicate with any of its neighboring PEs, but with only one neighbor PE at a time.
For example, each PE can communicate with its east neighbor PE in one communication cycle. It is also assumed that a broadcast mechanism is present such that data and instructions can be communicated simultaneously to all PEs in one broadcast communication period. Bit serial interfaces are typical, as they were present in the Thinking Machines CM-1 family.
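The operation counts and the nearest neighbor communication model described above can be illustrated with a minimal sketch. It is an illustration only and not the implementation of any machine cited herein. The function names (uniprocessor_op_count, hop, mesh_convolve_3x3, direct_convolve_3x3), the wrap-around (torus) handling of array edges, and the particular eight-hop path through the 3×3 window are all assumptions introduced for this example. Each call to hop models one communication cycle in which every PE reads from the same single neighbor, and each weight is treated as broadcast to all PEs.

```python
def uniprocessor_op_count(n, m):
    """Per-frame operation counts under the uniprocessor model in the text."""
    window = m * m
    # window reads + window multiplies + (window - 1) adds + 1 write
    per_pixel = window + window + (window - 1) + 1
    return {
        "multiplies": window * n * n,
        "per_pixel": per_pixel,
        "total": per_pixel * n * n,
    }


def hop(grid, sr, sc):
    """One communication cycle: every PE takes the value held by its
    neighbor at offset (sr, sc). Edges wrap around (assumption)."""
    n = len(grid)
    return [[grid[(r + sr) % n][(c + sc) % n] for c in range(n)]
            for r in range(n)]


# Hypothetical path of single-neighbor hops that visits the eight
# non-center window offsets exactly once, starting from the center.
PATH = [(0, 1), (-1, 0), (0, -1), (0, -1), (1, 0), (1, 0), (0, 1), (0, 1)]


def mesh_convolve_3x3(image, weights):
    """SIMD-style 3x3 weighted sum: broadcast a weight, multiply-accumulate
    in every PE, then hop the pixel data one neighbor along PATH."""
    n = len(image)
    held = [row[:] for row in image]   # pixel currently held by each PE
    dr = dc = 0                        # window offset of the held pixel
    acc = [[weights[1][1] * held[r][c] for c in range(n)] for r in range(n)]
    hops = 0
    for sr, sc in PATH:
        held = hop(held, sr, sc)
        hops += 1
        dr, dc = dr + sr, dc + sc
        w = weights[dr + 1][dc + 1]    # weight broadcast to all PEs
        for r in range(n):
            for c in range(n):
                acc[r][c] += w * held[r][c]
    return acc, hops


def direct_convolve_3x3(image, weights):
    """Reference weighted sum computed pixel by pixel (same edge wrap)."""
    n = len(image)
    return [[sum(weights[dr + 1][dc + 1] * image[(r + dr) % n][(c + dc) % n]
                 for dr in (-1, 0, 1) for dc in (-1, 0, 1))
             for c in range(n)] for r in range(n)]
```

With N of 1024 and M of 3, uniprocessor_op_count gives 27 operations per pixel, consistent with the count above; the per-frame totals come out slightly above the rounded figures in the text because 1024 × 1024 is 1,048,576 pixels rather than an even one million. In this sketch the mesh version needs eight single-neighbor communication cycles per 3×3 window regardless of N, since all PEs hop and accumulate in lockstep.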
In the Massively Parallel Array Processor (Pechanek et al. 1992), a single diagonal-fold processor array provided the computational needs for the image processing convolution and finite difference method applications. What is needed is a method of scaling and enhancing the connectivity of the Massively Parallel Array Processor, and of providing a general purpose processing node architecture that encompasses the image processing and finite difference method requirements while extending the capabilities to cover more general purpose applications such as sorting.