1. Field of Invention
This invention relates to architecture for array processors made up of interconnected processing elements where all processing elements share a common clock, and particularly relates to architecture for a programmable skew-tolerant array processing system, which uses a variable-cycle clock to maintain synchronism among processing elements scattered among several chassis, or otherwise subject to unacceptable wire delays.
2. Description of the Prior Art
There is a significant body of patent and publication art dealing with array processors, sometimes called image processors. Within this art, a body of work deals with the interconnections among the great numbers of simple computer cells, each known generally as a "processing element" or "PE." Such connections include the simple four-neighbor linear mesh (NESW), the hexagonal mesh and the "hypercube" of sixteen permanent connections. Such connections also include a reconfigurable arrangement of semipermanent assigned connections, published as the Morphic Image Transform Engine (MITE). A prior patent application by one of the inventors provides program variability at the processing element level by making each processing element separately addressable. Array processors thus have become extremely variable in their processing-element-to-processing-element connectivity and data transfer demands. Enormously variable process delays add greatly to the already considerable skew problems caused by differing speed-of-light electrical transmission lengths, and thus mandate relatively slow clock cycles.
"Skew" is a phenomenon due to the unequal distance a common signal must travel to reach different destinations.
There is no known prior art implementing a programmable variable main clock cycle in an array processor for skew acceptance or control.
There is a general awareness in all data transmission arts of the desire for synchronism, or, failing synchronism, of the desire for return to synchronism by correction or acceptance. The generic term for loss of synchronism is "skew." There are a number of skew correction and skew acceptance techniques known in related arts. Magnetic tape, for example, is subject to linear physical deformation which may cause errors in reading a byte recorded transversely. The solution, in essence, is to skew the reading to match the skew of the tape, using electronic delays which vary track-to-track as a function of measured or predicted skew. In system-to-system communications, such as tape reader input to a computer, provision is usually made to operate with a buffer which can accept data at a first rate and hold it for transfer at a different rate. When the buffer fills, the faster system stops and waits for the slower system to clear the buffer.
In array processors, however, there is little chance to buffer the massive amounts of data passing from processing element to processing element. The "image" applied at the entrance to the array is moved through the array, with a great number of processing changes, until it exits the array or is dissipated in the array. A master clock usually provides the drumbeat which controls the cadence of image march through the PEs of the array. The master clock has a beat frequency slow enough to allow every PE to accomplish its assigned computation and accomplish its assigned data output and input transfers.
A problem arises when the array of processing elements is reconfigurable, as shown by the Kimmel et al MITE publication, because the speed-of-light electrical signal transfer is of variable length, and thus requires variable time. This problem is compounded when the processing element is individually accessible for varying the job and the data transfer characteristics according to programming.
A great number of processing elements, usually the entire array, is subject to the common clock cycle. Error-free array processor operation requires that the clock cycle be long enough to carry out the worst-case calculation and data transfer called for by the program. Conversely, efficient array processor operation requires that the clock cycle be as short as possible. These competing requirements present a clock cycle dilemma.
The prior art has not solved the clock cycle dilemma for array processors; prior array processors operate on a common clock cycle which does not vary.
The following patents are representative of prior art:
U.S. Pat. No. 4,024,498, McIntosh, APPARATUS FOR DEAD TRACK RECOVERY, May 17, 1977, shows an automatic variable synchronized multiple clock scheme, for the NRZI tape data format, to recover a lost track by using for a given track a clock signal ranging from 1 unit of length to 2 units, 4 units up to 8 units.
U.S. Pat. No. 4,040,032, Kreiker, PERIPHERAL DEVICE CONTROLLER FOR A DATA PROCESSING MACHINE, Aug. 2, 1977, shows multiple synchronized clocks in a bus protocol to control peripheral devices attached to the bus.
U.S. Pat. No. 4,201,948, Natens, PHASE-LOCKED LOOP CLOCK PULSE EXTRACTION CIRCUIT, May 6, 1980, shows a phase-locked loop scheme to correct the clock skew. The phase comparator produces first and second intermediate pulse waveforms constituted by different portions of the pulses of the input waveform and proportional to the pulse density.
U.S. Pat. No. 4,313,206, Woodward, CLOCK DERIVATION CIRCUIT FOR DOUBLE FREQUENCY ENCODED SERIAL DIGITAL DATA, Jan. 26, 1982, shows a circuit using four clock inputs (the unit clock, the 1/4 clock, the 2/4 clock and the 3/4 clock waveforms) to derive a combination of all possible clock edges. This operates on a non-return-to-zero data signal to reconstitute a fixed-frequency clock from guaranteed events (transitions) of the data stream.
U.S. Pat. No. 4,393,419, Arai et al, SYNCHRONIZING SIGNAL DETECTION PROTECTIVE CIRCUIT, July 12, 1983, shows an automatic circuit to correct "microscopic" clock skew, on the order of the delay of a few transistors. When skew does not occur, the signal with a short gating duration is selected; only when skew, dropout, jitter or the like occurs is the signal with a long gating duration selected.
U.S. Pat. No. 4,464,739, Moorcroft, SAMPLED TOWED ARRAY TELEMETRY, Aug. 7, 1984, shows a data acquisition scheme for an acoustic detection system. The triggering edge of the clock is applied to all data acquisition modules simultaneously; each module controls the delay of the trailing edge, which thus adjusts the standard clock to a variable cycle appropriate for the module. The modules are in series for sequential data collection.
U.S. Pat. No. 4,468,737, Bowen, CIRCUIT FOR EXTENDING A MULTIPLEXED ADDRESS AND DATA BUS TO DISTANT PERIPHERAL DEVICES, Aug. 28, 1984, shows a circuit and a bus protocol to eliminate clock skew for remote peripheral devices. "Retime logic" regenerates the timing and control signals.
U.S. Pat. No. 4,482,819, Oza, DATA PROCESSING SYSTEM CLOCK CHECKING SYSTEM, Nov. 13, 1984, shows a scheme to correct skew by sending a reference clock through a set of wires with identical length.
U.S. Pat. No. 4,493,048, Kung et al, SYSTOLIC ARRAY APPARATUSES FOR MATRIX COMPUTATIONS, Jan. 8, 1985, shows a systolic array system forming a mesh for digital signal processing problems via synchronizing the data flow in the mesh along simple and regular (hexagonal mesh or linear mesh) communication paths.
U.S. patent application Ser. No. 06/902,343, POLYMORPHIC MESH NETWORK IMAGE PROCESSING SYSTEM, by H. Li, filing date, Aug. 29, 1986. Li shows an architecture for an array processor in which each PE has a mesh connection determined by an addressable switching network within the PE.
The following publication is representative of the prior art:
1. Kimmel, Jaffe, Mandeville and Lavin, MITE: MORPHIC IMAGE TRANSFORM ENGINE AN ARCHITECTURE FOR RECONFIGURABLE PIPELINES OF NEIGHBORHOOD PROCESSORS, IEEE Computer Society Workshop on Computer Architecture for Pattern Analysis and Image Database Management--CAPAIDM, Miami Beach, Fla., Nov. 18-20, 1985.
The prior art provides for extending clock edges by a limited length. The prior art solves the skew problem at a "microscopic" level, where the skew is on the order of the delay of a few transistors.
In the usual digital system, the skew is taken care of by considering the worst case skew in related components of the system. Since the multiplicity of the usual digital system is not significantly great, the skew can usually be properly handled within the range of a few transistors. This is called "microscopic" skewing.
In a mesh-type Single-Instruction-Multiple-Datastream (SIMD) parallel processor, there is an array of individual processing elements which operate during a common clock cycle to carry out a common instruction on dynamically changing image data as the data is processed during passage through the array. It is common for data to pass from a first processing element to its immediate neighbor processing element, but data transfers from a first processing element to a remote processing element are also useful. The skew in data transfer time in such short or long transfers must be accommodated, and usually is accommodated by selecting the clock cycle so as to provide sufficient time for the worst case, that is, for the data transfer of longest possible duration.
However, when multiplicity is great, such as in an SIMD parallel processing system consisting of 512 x 512 processing elements, each a small computer, the skew phenomenon becomes "macroscopic" and can range from a few transistors in a chip to several thousands of logic gates in the system, spread across many chips interconnected by boards, wires or even cables to other frames. Without proper handling, the "macroscopic" skew can cause significant performance degradation.
An SIMD parallel processing system consisting of N processing elements is very popular for image processing and computer vision applications. For such applications, it is usually organized as an M x M square mesh where M is the square root of N and each processing element is designated as PE(i, j) where both i and j run from 1 to M. Each processing element in the SIMD system receives a clock signal and a broadcast instruction, both of which are distributed by a central controller. It is the unequal amount of time required by the clock and the instruction to travel from the central controller to an arbitrary group of processing elements that causes the skew.
The skew for a pair of processing elements PE(s, t) and PE(m, n) is the difference, d, between the times at which the clock and the instruction from the central controller arrive at the two processing elements. Accordingly, the skew "D" of an SIMD parallel processing system is the maximum of "d" over all s, t, m and n.
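The definition of pairwise skew d and system skew D can be illustrated with a minimal sketch. It assumes, for simplicity, a single arrival time per processing element; the function names and the arrival times shown are hypothetical and are not taken from any described system.

```python
from itertools import combinations

def pairwise_skew(arrival, a, b):
    """Skew d for two PEs: absolute difference of their arrival times."""
    return abs(arrival[a] - arrival[b])

def system_skew(arrival):
    """System skew D: the maximum of d over all pairs of PEs."""
    return max(
        (pairwise_skew(arrival, a, b) for a, b in combinations(arrival, 2)),
        default=0,
    )

# Hypothetical arrival times (in ns) for a 2 x 2 mesh, keyed by (i, j)
# with i and j running from 1 to M as in the text.
arrival_ns = {(1, 1): 3, (1, 2): 5, (2, 1): 4, (2, 2): 9}

d = pairwise_skew(arrival_ns, (1, 1), (1, 2))
D = system_skew(arrival_ns)
print(d, D)
```

Because D is a maximum over all pairs, it equals the difference between the latest and the earliest arrival time in the array, so a single distant processing element sets the skew for the whole system.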
The performance degradation of an SIMD parallel processing system caused by the skew is profound. An activity that can be completed in an amount of time "R" can now only be completed in "D+R" to accommodate the skew. As a result, a symmetrical clock that considers the skew pays an overhead of D/R and incurs a performance degradation of that amount.
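The arithmetic of this overhead can be shown with a short sketch. The function names and the numeric values of R and D are illustrative assumptions only.

```python
def skewed_cycle(R, D):
    """Time needed for an activity of duration R once skew D is accommodated."""
    return D + R

def relative_overhead(R, D):
    """Overhead D/R paid by a clock that allows for the skew."""
    return D / R

# Hypothetical values: a 100 ns activity in a system with 25 ns of skew.
R_ns, D_ns = 100, 25
print(skewed_cycle(R_ns, D_ns))       # cycle stretched from R to D + R
print(relative_overhead(R_ns, D_ns))  # fraction of R lost to skew
```

With these assumed values the activity takes 125 ns instead of 100 ns, an overhead of one quarter of the useful time.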
The overhead due to the skew in a single cycle accumulates in the interprocessor communication of an SIMD parallel processing system. This leads to a very significant performance degradation. For example, when PE(s, p) wishes to communicate with PE(s, q), the value of PE(s, p) is passed to PE(s, p+1), then PE(s, p+2), and eventually reaches PE(s, q). Consequently, it takes (q-p) cycles (assuming q>p), each of which includes an overhead "2D". The total overhead caused by skew in interprocessor communication is (q-p) * (2D). Such an overhead is the most significant reason why large SIMD systems suffer a high interprocessor communication penalty.
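The accumulation along a row can be sketched as follows, taking the per-cycle overhead of 2D and the cycle count of (q-p) directly from the paragraph above. The function names and the example values are hypothetical.

```python
def hop_count(p, q):
    """Cycles needed to pass a value from PE(s, p) to PE(s, q), one neighbor per cycle."""
    if q <= p:
        raise ValueError("this sketch assumes q > p, as in the text")
    return q - p

def total_skew_overhead(p, q, D):
    """Total skew overhead (q - p) * (2D) for a row transfer."""
    return hop_count(p, q) * (2 * D)

# Hypothetical example: transfer across a full row of a 512 x 512 mesh
# with an assumed system skew of 6 ns.
print(hop_count(1, 512))
print(total_skew_overhead(1, 512, 6))
```

Under these assumptions the transfer takes 511 cycles and accumulates 6132 ns of skew overhead, which shows how the per-cycle penalty grows linearly with the distance between the communicating processing elements.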
It is also true that the skew is heavily dependent on the size of the SIMD system (total number of PEs), the relative layout of PEs, the packaging technology and the local skew caused by the components within the system. A conventional non-programmable approach in handling the skew needs a redesign for each new system with different parameters.