Large grain vector processing computers are well known in the art. The term "large grain" used in the art refers to a parallel processing machine having a small number of fast processors. The term "fine grain", also called massively parallel, refers to a parallel processor computer using a very large number of relatively slow processors. The single processor of the Cray-1 supercomputer produced by Cray Research, Inc., the assignee of the present invention, contains a plurality of vector registers each of which is adapted for holding a plurality of elements in an ordered set of data. These vector registers are typically one word wide by n words deep, where the word length is 64 bits. The vector registers are connected to a plurality of functional units which receive the vector operands for executing instructions in response to opcodes and which have outputs for delivering the results computed or processed by the functional units. The operands presented from the vector registers to the functional units and received as output from the functional units may be queued up in a mode of operation known as vector chaining to increase throughput of the functional units. By using chaining, more than one result can be obtained per clock period. A detailed description of the Cray-1 supercomputer architecture is contained in U.S. Pat. No. 4,128,880 which is assigned to the assignee of the present invention and which is hereby incorporated by reference.
Multiprocessing vector register supercomputers are known in the prior art which combine a plurality of vector register processors to operate in parallel and to share a common local memory. Interaction between the vector register processors is accomplished by common semaphore and information registers: a plurality of semaphore registers and a plurality of information registers may be accessed by any one of the processors to facilitate interprocessor communication. Each processor then uses a local control circuit to accomplish, among other things, coordination of processors whereby delays associated with communicating between processors are avoided. A system for multiprocessor communication between register processors is described in U.S. Pat. No. 4,754,398 assigned to the assignee of the present invention, and which is hereby incorporated by reference.
In multiprocessor vector register supercomputers, each processor accesses a central common memory through a plurality of memory reference ports. These ports are connected to a plurality of registers which are directly addressed by the processor. The registers are used for holding information which can be used by the processor as operands. The shared memory includes a memory access conflict resolution circuit which senses and prioritizes conflicting references to the central memory from the processors, thereby eliminating memory access collisions by the processors. A detailed description of vector register multiprocessing control and memory access is contained in U.S. Pat. No. 4,636,942 assigned to the assignee of the present invention, and which is hereby incorporated by reference.
Vector register supercomputers of the prior art are designed to operate on large grain data which is arranged as vectors. The vectors are typically 64 bits wide and n words deep where n depends upon the machine vector length capabilities. In the prior art, the Y-MP supercomputer produced by Cray Research, Inc., the assignee of the present invention, allows a vector register length of 64 words thereby allowing a vector of 1 word by 64 words. The functional units of a single processor of the Y-MP supercomputer operate on large grain operands contained in the vector registers which may be at the most 1 word by 64 words. The functional units operate on the vector operands to perform such operations as floating point multiply, floating point add, vector add, logical operations and other operations.
The Cray multiprocessing supercomputers accomplish parallel processing of data in a form commonly termed MIMD (Multiple Instruction stream/Multiple Data stream). The MIMD process is the ability of a multiprocessor computer to simultaneously process different instructions operating on different data streams. The granularity of parallelism in the Cray vector processor machines is a word-wide minimum granule of data. Matrix mathematics performed on the Cray MIMD machines uses 1-word width elements as the matrix elements. Matrix multiplication, therefore, is a serial process of operating on the 1-word wide elements of the matrix until all elements of the matrix have been processed to create a resultant matrix. Depending on the matrix operation, the MIMD architecture will allow plural processors to operate on portions of the matrix at any given time.
In contrast to this architecture, a SIMD (Single Instruction stream/Multiple Data stream) machine can process multiple data streams using the same set of instructions simultaneously. In a fine grain SIMD parallel architecture, each element of a matrix is processed by a single processor. The plurality of processors operating in parallel all execute the same instructions on the data at the same time to produce parallel results. If the number of processors equals the number of elements in a matrix, the entire matrix can be processed in parallel simultaneously.
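The distinction between the two execution models can be illustrated with a minimal sketch. The function names simd_step and mimd_step are hypothetical and introduced only for illustration, and the loops below emulate sequentially what the hardware would perform in parallel.

```python
from operator import add, mul

def simd_step(instruction, elements, operand):
    """SIMD model: one instruction is applied to every data element.

    Each list entry stands in for the element held by one processor.
    """
    return [instruction(x, operand) for x in elements]

def mimd_step(instructions, streams, operand):
    """MIMD model: each processor applies its own instruction to its own stream.

    instructions[i] is the (possibly different) operation run by processor i
    on the data stream streams[i].
    """
    return [
        [instruction(x, operand) for x in stream]
        for instruction, stream in zip(instructions, streams)
    ]

# Same instruction (add) across all elements.
simd_result = simd_step(add, [1, 2, 3], 10)

# Different instructions (add, mul) on different data streams.
mimd_result = mimd_step([add, mul], [[1, 2], [3, 4]], 10)
```

In the SIMD case the only degree of freedom is the data; in the MIMD case both the instruction and the data may differ from processor to processor.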
The data granularity of parallelism and the number of processors in SIMD and MIMD architectures determine the speed at which the architecture can process a matrix problem. In the prior art, the Connection Machine CM-2 produced by Thinking Machines Corporation of Cambridge, Mass. is a fine grain SIMD architecture with 64K (65,536) processors that operate on data in parallel. Each processor operates on a single bit of data and all processors execute the same instruction in parallel. This type of parallel operation is fast; however, the amount of external communication required between executions is enormous and at least partially offsets the advantage of the massively parallel execution.
The CM-2 machine is a special purpose processor which requires a front-end or support processor to download the data for specialized execution. The support processor handles the user interface, executes the user's programs, handles the I/O with the CM-2 and performs the scalar operations. Additional communication time is required, therefore, to download the operation to the specialized CM-2 processor to take advantage of the massively parallel operation of the single bit processors. One of the contrasts between the Y-MP MIMD and CM-2 SIMD architectures is the granularity of parallelism. The Y-MP computer has 8 parallel processors operating on 64-bit words arranged as 64-word vectors. All 8 processors can simultaneously operate on different vector data streams. The CM-2 machine can operate on 64K (65,536) separate single-bit data streams; however, all processors execute the same instruction in parallel.
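The granularity figures quoted above can be put side by side with a short calculation. This is a rough count of bits only, using the processor counts, word length and vector length stated in the text; it says nothing about clock rates or throughput, and the constant and function names are hypothetical.

```python
# Figures taken from the description above.
YMP_PROCESSORS = 8            # parallel processors in the Y-MP
WORD_BITS = 64                # bits per word
VECTOR_LENGTH_WORDS = 64      # words per vector register
CM2_PROCESSORS = 64 * 1024    # single-bit processors in the CM-2

def bits_per_vector(word_bits=WORD_BITS, length=VECTOR_LENGTH_WORDS):
    """Number of individual bits held in one full-length vector."""
    return word_bits * length

def ymp_bits_in_flight(processors=YMP_PROCESSORS):
    """Bits covered when every processor holds one full-length vector."""
    return processors * bits_per_vector()

one_vector = bits_per_vector()        # bits in one 64-word vector
all_processors = ymp_bits_in_flight() # bits across 8 such vectors
cm2_streams = CM2_PROCESSORS          # single-bit data streams in the CM-2
```

Under these figures, one vector holds 4,096 bits and eight such vectors hold 32,768 bits, compared with the 65,536 single-bit data streams of the CM-2.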
There is a need in the prior art, therefore, to implement SIMD-style bit manipulation instruction sets in large grain MIMD type computers to allow large grain MIMD-type computers to emulate fine grain SIMD operation. In this fashion, there is a need in the prior art for instructions which will treat vectors as a plurality of independent multiple data streams and operate on the data in those vectors in parallel. In particular, there is a need in the prior art for bit manipulation instructions in a MIMD-type machine. These can be used to accomplish SIMD-style operation on the data, among other things. For example, there is a need in the prior art for array operations which treat each element of the array as a single bit and operate on the entire array in parallel using MIMD architectures.
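As an illustration of what treating each bit of a word as an independent data stream can mean, the following minimal sketch performs a one-bit full addition in 64 lanes at once using only word-wide logical operations. It is an assumption-laden example of the general bit-parallel technique, not a description of any particular instruction; the names bit_sliced_full_add and vector_full_add are hypothetical.

```python
WORD_BITS = 64
MASK = (1 << WORD_BITS) - 1  # keep Python integers to one 64-bit word

def bit_sliced_full_add(a, b, carry_in):
    """One-bit full adder applied to all 64 bit positions of a word at once.

    Bit i of each argument is the input of lane i; lanes do not interact.
    Returns (sum_bits, carry_out), each a 64-bit word.
    """
    sum_bits = (a ^ b ^ carry_in) & MASK
    carry_out = ((a & b) | (carry_in & (a ^ b))) & MASK
    return sum_bits, carry_out

def vector_full_add(va, vb, vc):
    """Apply the same bit-sliced adder to every word of three vectors.

    With 64-word vectors this covers 64 x 64 = 4,096 single-bit lanes.
    """
    return [bit_sliced_full_add(a, b, c) for a, b, c in zip(va, vb, vc)]

# Four lanes shown for brevity; the remaining 60 lanes are zero.
s, c = bit_sliced_full_add(0b1100, 0b1010, 0b1001)
# s == 0b1111 and c == 0b1000
```

Every lane executes the same logical operation on its own bit, which is the SIMD-style behavior the paragraph above calls for, here obtained from ordinary word-wide operations.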