The invention relates to computers and particularly to parallel array processors.
The present U.S. patent application claims priority as a continuation-in-part application and is related to the following applications:
U.S. Ser. No. 07/526,866 filed May 22, 1990, of S. Vassiliadis et al, entitled: Orthogonal Row-Column Neural Processor (now U.S. Pat. No. 5,065,339, issued Nov. 12, 1991); and
U.S. Ser. No. 07/740,355 filed Aug. 5, 1991, of S. Vassiliadis et al, entitled: Scalable Neural Array Processor, issued as U.S. Pat. No. 5,146,543; and
U.S. Ser. No. 07/740,556 filed Aug. 5, 1991, of S. Vassiliadis et al, entitled: Adder Tree for a Neural Array Processor, issued as U.S. Pat. No. 5,146,420; and
U.S. Ser. No. 07/740,568 filed Aug. 5, 1991, of S. Vassiliadis et al, entitled: Apparatus and Method for Neural Processor, abandoned in favor of U.S. Ser. No. 08/000,915, filed Jan. 6, 1993, issued as U.S. Pat. No. 5,251,287; and
U.S. Ser. No. 07/740,266 filed Aug. 5, 1991, of S. Vassiliadis et al, entitled: Scalable Neural Array Processor and Method, issued as U.S. Pat. No. 5,148,515; and
U.S. Ser. No. 07/682,786 filed Apr. 8, 1991, of G. G. Pechanek et al, entitled: Triangular Scalable Neural Array Processor, abandoned in favor of continuation application U.S. Ser. No. 08/231,853, filed Apr. 22, 1994 (now co-pending).
These applications and the present continuation-in-part application are owned by one and the same assignee, namely, International Business Machines Corporation of Armonk, N.Y.
The descriptions set forth in these above applications are hereby incorporated into the present application.
ALU
ALU is the arithmetic logic unit portion of a processor.
Array
Array refers to an arrangement of elements in one or more dimensions.
Array processors are computers which have many functional units or PEs arranged and interconnected to process in an array. Massively parallel machines use array processors for parallel processing of data arrays by array processing elements or array elements. An array can include an ordered set of data items (array elements) which in languages like Fortran are identified by a single name; in other languages, the name of such an ordered set refers to an ordered collection or set of data elements, all of which have identical attributes. A program array has its dimensions specified, generally by a number or dimension attribute. The declarator of the array may also specify the size of each dimension of the array in some languages. In some languages, an array is an arrangement of elements in a table. In a hardware sense, an array is a collection of structures (functional elements) which are generally identical in a massively parallel architecture. Array elements in data parallel computing are elements which can be assigned operations and which, when operating in parallel, can each independently execute the operations required. Generally, arrays may be thought of as grids of processing elements. Sections of the array may be assigned sectional data, so that sectional data can be moved around in a regular grid pattern. However, data can be indexed or assigned to an arbitrary location in an array.
Functional unit
A functional unit is an entity of hardware, software, or both, capable of accomplishing a purpose.
MIMD
A processor array architecture wherein each processor in the array has its own instruction stream, thus Multiple Instruction stream, to execute Multiple Data streams located one per processing element.
Module
A module is a program unit that is discrete and identifiable, or a functional unit of hardware designed for use with other components.
PE
PE is used for processing element. We use the term PE to refer to a single processor which has interconnected allocated memory and an I/O capable system element or unit, and which forms one of our parallel array processing elements. In our system, symmetric replicatable elements are wired together so that they share interconnection paths.
SIMD
A processor array architecture wherein all processors in the array are commanded from a Single Instruction stream, to execute Multiple Data streams located one per processing element.
In the detailed description which follows, the following works will be referenced as an aid for the reader. These additional references are:
1. R. J. Gove, W. Lee, Y. Kim, and T. Alexander, "Image Computing Requirements for the 1990s: from Multimedia to Medicine," Proceedings of the SPIE Vol. 1444, Image Capture, Formatting, and Display, pp. 318-333, 1991.
2. R. Cypher and J. L. C. Sanz, "SIMD Architectures and Algorithms for Image Processing and Computer Vision," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 37, No. 12, pp. 2158-2174, December 1989.
3. K. E. Batcher, "Design of a Massively Parallel Processor," IEEE Transactions on Computers, Vol. C-29, No. 9, pp. 836-840, September 1980.
4. L. Uhr, Multi-Computer Architectures for Artificial Intelligence, New York, N.Y.: John Wiley and Sons, chap. 8, p. 97, 1987.
5. S.-Y. Lee and J. K. Aggarwal, "Parallel 2-D Convolution on a Mesh Connected Array Processor," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-9, No. 4, pp. 590-594, July 1987.
6. E. B. Eichelberger and T. W. Williams, "A Logic Design Structure for Testability," Proc. 14th Design Automation Conference, IEEE, 1977.
7. D. M. Young and D. R. Kincaid, "A Tutorial on Finite Difference Methods and Ordering of Mesh Points," Proceedings of the Fall Joint Computer Conference, pp. 556-559, Dallas, Tex.: IEEE Press, November 1986.
8. E. Kreyszig, Advanced Engineering Mathematics, New York, N.Y.: John Wiley and Sons, chap. 9.7, pp. 510-512, 1968.
9. U.S. Ser. No. 07/799,602, filed Nov. 27, 1991, by H. Olnowich, entitled: "Multi-Media Serial Line Switching Adapter for Parallel Networks and Heterogenous and Homologous Computer Systems". systems which allow dynamic switching between MIMD, SIMD, and SISD.
10. U.S. Ser. No. 07/798,788, filed Nov. 27, 1991, by P. M. Kogge, entitled: "Dynamic Multi-mode Parallel Processor Array Architecture".
These additional references are incorporated by reference.
As background for our invention, the processing of visual information can be considered to consist of three different processing domains: image processing, pattern recognition, and computer graphics. The merger of image processing, pattern recognition and computer graphics is referred to as image computing and represents a capability required by the multimedia workstations of the future. "Multimedia refers to a technique that presents information in more than one way, such as via images, graphics, video, audio, and text, in order to enhance the comprehensibility of the information and to improve human-computer interaction" (see Additional Reference 1).
In the never-ending quest for faster computers, engineers are linking hundreds, and even thousands, of low cost microprocessors together in parallel to create super supercomputers that divide in order to conquer complex problems that stump today's machines. Such machines are called massively parallel. Multiple computers operating in parallel have existed for decades.
Early parallel machines included the ILLIAC, which was started in the 1960s. Other multiple processors include (see a partial summary in U.S. Pat. No. 4,975,834 issued Dec. 4, 1990 to Xu et al) the Cedar, Sigma-1, the Butterfly and the Monarch, the Intel iPSC, the Connection Machines, the Caltech COSMIC, the N Cube, IBM's RP3, IBM's GF11, the NYU Ultra Computer, and the Intel Delta and Touchstone.
Large multiple processors beginning with ILLIAC have been considered supercomputers. The supercomputers with the greatest commercial success have been based upon multiple vector processors, represented by the Cray Research Y-MP systems, the IBM 3090, and other manufacturers' machines including those of Amdahl, Hitachi, Fujitsu, and NEC.
Massively Parallel Processors (MPPs) are now thought of as capable of becoming supercomputers. These computer systems aggregate a large number of microprocessors with an interconnection network and program them to operate in parallel. These computers have operated in two modes: some have been MIMD mode machines, and others have been SIMD mode machines. Perhaps the most commercially acclaimed of these machines has been the Connection Machines series 1 and 2 of Thinking Machines, Inc. These have been essentially SIMD machines. Many of the massively parallel machines have used microprocessors interconnected in parallel to obtain their concurrency or parallel operations capability. Intel microprocessors like the i860 have been used by Intel and others. N Cube has made such machines with Intel '386 microprocessors. Other machines have been built with what is called the "transputer" chip. The Inmos Transputer IMS T800 is an example. The Inmos Transputer T800 is a 32 bit device with an integral high speed floating point processor.
As an example of the kind of systems that are built, several Inmos Transputer T800 chips each would have 32 communication link inputs and 32 link outputs. Each chip would have a single processor, a small amount of memory, and communication links to the local memory and to an external interface. In order to build up the system, communication link adaptors like the IMS C011 and C012 would be connected. Switches like an IMS C004 would also be included to provide, say, a crossbar switch between the 32 link inputs and 32 link outputs, giving point to point connection between additional transputer chips. In addition, there will be special circuitry and interface chips for transputers, adapting them to be used for a special purpose tailored to the requirements of a specific device, such as a graphics or disk controller. The Inmos IMS M212 is a 16 bit processor with on chip memory and communication links. It contains hardware and logic to control disk drives and can be used as a programmable disk controller or as a general purpose interface. In order to use the concurrency (parallel operations), Inmos developed a special language, Occam, for the transputer. Programmers have to describe the network of transputers directly in an Occam program.
Some of these massively parallel machines use parallel processor arrays of processor chips which are interconnected with different topologies. The transputer provides a crossbar network with the addition of IMS C004 chips. Some other systems use a hypercube connection. Others use a bus or mesh to connect the microprocessors and their associated circuitry. Some have been interconnected by circuit switch processors that use switches as processor addressable networks. Generally, as with the 14 RISC/6000s which were interconnected last fall at Lawrence Livermore by wiring the machines together, the processor addressable networks have been considered as coarse-grained multiprocessors.
Some very large machines are being built by Intel and nCube and others to attack what are called "grand challenges" in data processing. However, these computers are very expensive. Recent projected costs are in the order of $30,000,000.00 to $75,000,000.00 (Tera Computer) for computers whose development has been funded by the U.S. Government to attack the "grand challenges". These "grand challenges" would include such problems as climate modeling, fluid turbulence, pollution dispersion, mapping of the human genome and ocean circulation, quantum chromodynamics, semiconductor and supercomputer modeling, combustion systems, vision and cognition.
It is a problem for massively parallel array processors to attack adequately the image computing problems which exist. One particular algorithm used in image processing is convolution, which replaces each image pixel value with a weighted sum of the pixels in a defined surrounding area or window of pixels. An M×M square convolution window consists of a set of M×M weights, each corresponding to the associated pixel located in the window (Additional Reference 2). For an N by N array of pixels, the convolution algorithm requires M²N² multiplication operations. Assuming an N of 1024 and an M of 3, a single image frame convolution would take about 9 million multiplications and sum of product calculations, and if the processing is on video data occurring at a rate of 30 frames per second, then about 270 million multiplications and sum of product calculations per second would be required. For a uniprocessor to process this data, where each convolution window weight value must be fetched separately, with the multiply and add treated as separate operations, and followed by a write of the weighted average pixel result, the convolution would consist of 27 separate operations per pixel (9 reads, 9 multiplies, 8 adds, and 1 write), resulting in 27 million × 30 operations per second, or 810 million operations per second (Additional Reference 1). Due to the high computational load, special purpose processors have been proposed to off-load the image processing task from the system processor and to provide the throughput required for image computing. One of these special purpose processors is the nearest neighbor mesh connected computer (see Additional References 2, 3, and 4, p. 97), where multiple Processor Elements (PEs) are connected to their north, south, east, and west neighbor PEs and all PEs are operated in a synchronous Single Instruction Multiple Data (SIMD) fashion.
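The workload arithmetic above can be checked with a short sketch. The function name `convolution_workload` and its dictionary keys are our own illustrative choices, and the per-pixel operation breakdown (reads, multiplies, adds, one write) simply follows the uniprocessor assumptions stated in the text.

```python
def convolution_workload(n, m, frames_per_second):
    """Estimate the convolution workload for an n-by-n image and an m-by-m window.

    Hypothetical helper for illustration only; it restates the counting
    argument in the text and is not part of the described apparatus.
    """
    pixels = n * n                      # N^2 pixels per frame
    window = m * m                      # M^2 weights per window
    multiplies_per_frame = window * pixels              # M^2 * N^2
    multiplies_per_second = multiplies_per_frame * frames_per_second
    # Uniprocessor model from the text: M^2 reads, M^2 multiplies,
    # M^2 - 1 adds, and 1 write per output pixel.
    ops_per_pixel = window + window + (window - 1) + 1
    ops_per_second = ops_per_pixel * pixels * frames_per_second
    return {
        "multiplies_per_frame": multiplies_per_frame,
        "multiplies_per_second": multiplies_per_second,
        "ops_per_pixel": ops_per_pixel,
        "ops_per_second": ops_per_second,
    }


if __name__ == "__main__":
    for name, value in convolution_workload(1024, 3, 30).items():
        print(f"{name}: {value:,}")
```

With N = 1024 the exact pixel count is 1,048,576, so the exact totals (about 9.4 million multiplications per frame and about 849 million uniprocessor operations per second) come out slightly above the rounded figures in the text, which treat the frame as one million pixels.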
It is assumed that a PE can communicate with any of its neighboring PEs but only one neighbor PE at a time. For example, each PE can communicate with its east neighbor PE in one communication cycle. It is also assumed that a broadcast mechanism is present such that data and instructions can be communicated simultaneously to all PEs in one broadcast communication period. Bit serial interfaces are typical, as they were present in the Thinking Machines CM-1 family.
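The communication model just described can be simulated in software. The following is a minimal sketch, assuming the PE values are held in a list of lists indexed as `grid[row][column]`, that row 0 is the north edge and column 0 the west edge, and that the mesh wraps around at its edges. The function names are hypothetical. Each call models one communication cycle in which every PE sends to the same single neighbor.

```python
def shift_east(grid):
    """One cycle: every PE sends its value to its east neighbor.

    After the cycle, the PE in column c holds the value that was in
    column c - 1; the east edge wraps around to the west edge.
    """
    n = len(grid)
    return [[grid[r][(c - 1) % n] for c in range(n)] for r in range(n)]


def shift_west(grid):
    """One cycle: every PE sends its value to its west neighbor."""
    n = len(grid)
    return [[grid[r][(c + 1) % n] for c in range(n)] for r in range(n)]


def shift_south(grid):
    """One cycle: every PE sends its value to its south neighbor."""
    n = len(grid)
    return [[grid[(r - 1) % n][c] for c in range(n)] for r in range(n)]


def shift_north(grid):
    """One cycle: every PE sends its value to its north neighbor."""
    n = len(grid)
    return [[grid[(r + 1) % n][c] for c in range(n)] for r in range(n)]


if __name__ == "__main__":
    values = [[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]]
    for row in shift_east(values):
        print(row)
```

Because all PEs move data in the same direction during a cycle, no PE receives from more than one neighbor at a time, which matches the one-neighbor-at-a-time assumption.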
As is thus recognized, what is needed is a PE which can improve image computing, improve speed, and be adaptable to be replicated as part of a parallel array processor in a massively parallel environment. There is a need to improve the system apparatus for use in solving differential equations. We think a new kind of PE is needed for this problem. Creation of a new PE and massively parallel computing system apparatus built with new thought will improve the complex processes which need to be handled in the multi-media image computer field, and still be able to process general purpose applications.
The improvements which we have made result in a new machine apparatus. We call the machine which implements our invention the Oracle machine and we will describe it below. Our present invention relates to the apparatus which enables making a massively parallel computing system. We present a new PE and related organizations of computer systems which can be employed in a parallel array computer system or massively parallel array processor.
We provide a massively parallel computer system for multi-media and general purpose applications, including the use of a finite difference method of solving differential equations. Our processor is a triangular processor array structure. Our processor array structure has single and dual processing elements that contain instruction and data storage units, receive instructions and data, and execute instructions. It also has a processor interconnection organization and a way to support data initialization, parallel functions, wrap-around, and broadcast processor communications.
The computer preferably has N² processing units placed in the form of an N by N matrix that has been folded along the diagonal. The folded matrix is made up of single processor diagonal units and dual processor general units that are interconnected by a nearest neighbor with wrap-around interconnection structure. In the computer each processing element or PE is a unit of the matrix. Each processor is identified with a reference notation to the original N by N matrix prior to folding, which supports the transport of N by N matrix algorithms to triangular array algorithms.
Prior to folding, there are N² processing units placed in the form of an N by N matrix, each PE possessing four ports (North, South, East and West I/O ports) for nearest neighbor with wrap-around communications. Folding the matrix along the diagonal allows the sharing of the North and South I/O ports with the East and West I/O ports.
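The folding can be illustrated with a small sketch. We assume here that folding along the diagonal superimposes the PE at (column, row) on the PE at (row, column), so that diagonal positions remain single processor units and each off-diagonal pair forms one dual processor unit. The function name `fold_along_diagonal` and the 1-based indexing are our own illustrative choices.

```python
def fold_along_diagonal(n):
    """Group the PEs of an n-by-n matrix into the units of the folded array.

    Returns a dict mapping a unit key (low index, high index) to the list
    of original (column, row) positions that the unit contains.
    Assumption: the fold pairs position (i, j) with position (j, i).
    """
    units = {}
    for column in range(1, n + 1):
        for row in range(1, n + 1):
            key = (min(column, row), max(column, row))
            units.setdefault(key, []).append((column, row))
    return units


if __name__ == "__main__":
    units = fold_along_diagonal(4)
    singles = [k for k, members in units.items() if len(members) == 1]
    duals = [k for k, members in units.items() if len(members) == 2]
    print("single processor diagonal units:", len(singles))
    print("dual processor general units:", len(duals))
```

Under this assumption an N by N matrix folds into N single processor diagonal units and N(N-1)/2 dual processor general units, which together still account for all N² PEs.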
For our processor with an N by N matrix, the processors are connected in a way that provides a non-conflicting interprocessor communications mechanism. For example, a mechanism that utilizes a unidirectional communication strategy between processors can be utilized on the Oracle array processor. The non-conflicting interprocessor communications mechanism can be obtained by requiring that all processors utilize a unidirectional and same direction communication strategy.
With our notation each said processing unit is identified by a two subscript notation PE_column,row in reference to the original N by N matrix prior to folding. Accordingly the computing apparatus will have K(N²) interconnection wires, where K is the number of wires between processors; for bit-serial interfaces K can be one (K=1). We support single processor diagonal units. The apparatus has single processor diagonal units, identified as PE_i,j, including data storage elements, an execution unit, a broadcast interface for the communications of instructions and data, a data storage interface supporting initialization, and a nearest-neighbor with wrap-around interface, termed the interprocessor interface, and communications means.
We have also provided new facilities for computation, and these are described in detail below.
These and other improvements are set forth in the following detailed description. For a better understanding of the invention with advantages and features, reference may be had to the description and to the drawings.