1. Field of the Invention
This invention relates to image processors which process data streams passing through arrays of processing elements, each processing element executing an assigned instruction, and more particularly relates to an architecture for adaptively manipulating the instruction assignment of each processing element in response to spatial and data values in the data stream, using an instruction adapter, individual to each processing element, to derive a new instruction as a composite function of processor identification and status and of the data stream.
2. Description of the Prior Art
The following publications are representative of the prior art:
U.S. Pat. No. 3,287,702, Borck, Jr., et al, COMPUTER CONTROL, Nov. 22, 1966, shows an array processor with computer control of an array of conventional processing elements.
U.S. Pat. No. 3,287,703, D. L. Slotnick, COMPUTER, Nov. 22, 1966 shows a similar array processor.
U.S. Pat. No. 3,970,993, C. A. Finnila, COOPERATIVE-WORD LINEAR ARRAY PARALLEL PROCESSOR, July 20, 1976 shows an array processor in which each processing element includes a flag register which can modify the operations on the common control lines.
U.S. Pat. No. 4,187,539, J. R. Eaton, PIPELINED DATA PROCESSING SYSTEM WITH CENTRALIZED MICROPROGRAM CONTROL, Feb. 5, 1980 shows a plural dataflow pipelined processor in which each dataflow includes a shift register which provides a sequence of microinstructions, and a common microprogram control unit includes a flag register which helps keep the instruction size small by providing instruction information which does not change often.
U.S. Pat. No. 4,287,566, G. J. Culler, ARRAY PROCESSOR WITH PARALLEL OPERATIONS PER INSTRUCTION, Sept. 1, 1981, shows an array processor having subarrays used to calculate vector addresses.
U.S. Pat. No. 4,344,134, G. H. Barnes, PARTITIONABLE PARALLEL PROCESSOR, Aug. 10, 1982, shows a partitionable array processor in which each processor in a node tree issues a ready signal when the dataflow has passed it, thus invoking the next instruction.
U.S. Pat. No. 4,380,046, L-W. Fung, MASSIVELY PARALLEL PROCESSOR COMPUTER, Apr. 12, 1983, shows an array processor with each processing element equipped with a mask bit register, identified as G-register, to disable the processing element and thus distinguish between executing the current instruction or no-operation; that is, each processing element has a G-register with an OP/NOP flag.
U.S. Pat. No. 4,467,409, Potash et al, FLEXIBLE COMPUTER ARCHITECTURE USING ARRAYS OF STANDARDIZED MICROPROCESSORS CUSTOMIZED FOR PIPELINE AND PARALLEL OPERATIONS, Aug. 21, 1984, shows a flexible architecture for a sequential processor, using standard units with "soft functional structures" which customize a unit for a command. The units thus can be manufactured as standard units and customized by means of a mask which sets contacts in the soft functional structure.
U.S. Pat. No. 4,558,411, Farber et al, POLYMORPHIC PROGRAMMABLE UNITS EMPLOYING PLURAL LEVELS OF SUB-INSTRUCTION SETS, Dec. 10, 1985, shows a multiple-level programmable unit to provide a hierarchy of sub-instruction sets of microprogramming, to change, for example, from input/output mode to processing mode or to exchange programs written in differing languages.
European Patent Application No. 84301600.7, Holsztynski, DATA PROCESSING CELLS AND PARALLEL DATA PROCESSORS INCORPORATING SUCH CELLS, Oct. 17, 1984, shows an array processor in which each processing element includes a full adder and storage device for N-S (north-south), E-W (east-west), and C (carry), so that the processing element can carry out both arithmetic and logic functions.
U.S.S.R. Author's Certificate No. 83-721416/30, ASSOCIATIVE PROCESSORS MICROPROGRAM CONTROL APPARATUS, Tbilisi Elva Combine, Sept. 15, 1982, shows first and second control instruction registers in instruction memory to allow the same microinstruction to be used for different instructions, reducing the overall volume of memory.
Davis et al, SYSTOLIC ARRAY CHIP MATCHES THE PACE OF HIGH-SPEED PROCESSING, Electronic Design, Oct. 31, 1984, pp. 207-218, shows a representative array processor.
NCR GEOMETRIC ARITHMETIC PARALLEL PROCESSOR, product specification NCR45CG72, NCR Corp., Dayton, OH, 1984, pp. 1-12, shows physical characteristics of a representative array processor.
Cloud et al, HIGHER EFFICIENCY FOR PARALLEL PROCESSORS, IEEE Southcon, reprint published by NCR Corporation, Microelectronics Div., Fort Collins, CO, pp. 1-7, shows details of operation of NCR's geometric arithmetic parallel processor (GAPP).
The prior art shows a variety of array processors, with individual processing elements controllable externally in a variety of manners, and with the possibility of OP/NOP according to a flag in the individual processing element. The prior art does not, however, teach the use of instruction adaptation within each individual adaptive processing element to make an array processor dynamically optimizable to spatial and data dependencies through an instruction derived within the adaptive processing element.
Current computer systems are categorized, according to instruction stream and data stream, into four classes. They are:
SISD (Single Instruction stream Single Data stream).
SIMD (Single Instruction stream Multiple Data stream).
MISD (Multiple Instruction stream Single Data stream).
MIMD (Multiple Instruction stream Multiple Data stream).
Except for SISD, these architectures are parallel processing systems. However, none of them can perform parallel operations which are adaptive to the spatial condition of a processing element (spatial adaptation, e.g. data are at the border of an image or the processing element is at the first column of an array). Neither can they perform parallel operations adaptive to the nature of the data (data adaptation, e.g. data positive/data negative; flag true/flag false).
Supercomputers are commercially available now and exemplified by the Cyber series from CDC, the CRAY series from CRAY Research and the NEC AP series. All these machines are of MISD architecture and require a long setup time for setting up the instruction pipe to process a vector. The overhead is large if the frequency of the pipe setup is high or the vector is short; the performance is consequently low in such cases.
Data dependence in a loop degrades the performance of these supercomputers. The machines are either prevented from presetting the pipe until the data dependence is resolved (e.g. status is known exactly) or will set up the pipe for the path with the higher probability (e.g., status is true). The former case delays the execution, while the latter case involves resetting the pipe (i.e. increasing the pipe setup frequency) if the "guess" is wrong. Both cases degrade the performance.
The lack of spatial and/or data adaptation leads to the following drawbacks:
1. Data-dependent operations are processed sequentially, which leads to a waste of the parallel hardware, hence to lower performance;
2. Data with spatial significance are treated as exceptions, which prevents them from being processed in parallel;
3. Interconnections of parallel computers are fixed, which restricts the algorithm versatility;
4. Complementary operations (e.g. SEND/RECEIVE pair) caused by data or spatial dependence are performed sequentially, which implies longer execution time;
5. Communication bandwidth is accordingly wasted;
6. Different copies of the program must be generated for processing elements (PEs) with different spatial conditions, which leads to larger software effort.
The prior art neither teaches nor suggests the invention, which provides for instruction adaptation at the processing element level for spatial and data dependencies, by providing each of a finite number of processing elements with conditional instruction modification means.
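Spatial adaptation of the kind described above can be illustrated with a minimal C sketch. It is not taken from the specification: the opcode names, the pe_state_t structure, and the rule that a first-column processing element turns a broadcast "shift from west" into "load zero" are all assumptions chosen only to show one broadcast instruction being modified locally by each processing element according to its position.

```c
/* Hypothetical opcodes; the names are illustrative, not from the source. */
typedef enum { OP_NOP, OP_SHIFT_FROM_WEST, OP_LOAD_ZERO } opcode_t;

/* Hypothetical per-PE identification: the PE's position in the array. */
typedef struct {
    int row;
    int col;
} pe_state_t;

/* Derive the instruction a PE actually executes from the broadcast one.
 * Assumed rule: a PE in the first column has no west neighbour, so a
 * broadcast "shift from west" becomes "load zero" at that PE only. */
opcode_t derive_instruction(opcode_t broadcast, const pe_state_t *pe)
{
    if (broadcast == OP_SHIFT_FROM_WEST && pe->col == 0)
        return OP_LOAD_ZERO;
    return broadcast;
}

/* Sequentially simulate one broadcast step over a single row of PEs.
 * in[] holds each PE's value before the step, out[] the value after. */
void step_row(opcode_t broadcast, const int *in, int *out, int ncols)
{
    for (int c = 0; c < ncols; c++) {
        pe_state_t pe = { 0, c };
        switch (derive_instruction(broadcast, &pe)) {
        case OP_SHIFT_FROM_WEST: out[c] = in[c - 1]; break; /* c > 0 here */
        case OP_LOAD_ZERO:       out[c] = 0;         break;
        default:                 out[c] = in[c];     break; /* OP_NOP */
        }
    }
}
```

In this sketch the same program is issued to every processing element, and the border column is handled without a separate copy of the program or a sequential exception path.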
To facilitate a quick understanding of the invention, it is helpful to describe the situations where data-dependent parallel processing and spatial-dependent parallel processing are involved, and where improved solutions, such as by means of the invention, are most desirable.
With adaptive instruction processing, the above problem could be handled in a parallel fashion as follows:
An instruction is defined as +/- (add or subtract) while, using the "status" as the "agreement bit," the derived instruction is defined as + (add) if the "status" is true, or is defined as - (subtract) if the "status" is false. The loop with data dependence can then be rewritten as
    for(i=0;i<300;i++)
        for(j=0;j<500;j++)
            c[i,j]=a[i,j]+/-b[i,j];
and parallel processing can be applied efficiently.
This example demonstrates one instance of how data dependence can be resolved, and how the data dependent loops that were processed sequentially can now be parallelized. The opportunity of exploiting the parallelism that involves data dependence is not limited to the above example and is much wider in application.
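The +/- instruction above is notation rather than compilable C. The following minimal sketch shows one way its behavior could be simulated sequentially, assuming a per-element status array that plays the role of the agreement bit. The function names, the flattened row-major array layout, and the use of int data are assumptions made for illustration; in the architecture described, each processing element would derive its own instruction in parallel.

```c
/* One PE's derived instruction: the broadcast "+/-" becomes "+" when the
 * agreement bit (status) is true and "-" when it is false. */
static int add_or_sub(int a, int b, int status)
{
    return status ? a + b : a - b;
}

/* Sequential simulation of the adaptive loop over a rows x cols array.
 * Arrays are flattened in row-major order; status[] holds one agreement
 * bit per element (hypothetical layout, not from the source). */
void adaptive_add_sub(int rows, int cols,
                      const int *a, const int *b,
                      const unsigned char *status, int *c)
{
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            int k = i * cols + j;
            c[k] = add_or_sub(a[k], b[k], status[k]);
        }
    }
}
```

For the loop in the text, the call would use rows = 300 and cols = 500. Every iteration issues the same instruction, so the loop body contains no branch that depends on another iteration.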