The present invention relates to a parallel-processing apparatus that accumulates its processing results and, more particularly, to a parallel-processing apparatus and method for accumulating processing results at a high speed with low power consumption.
Recently, many parallel-processing apparatuses that execute processing such as calculations in parallel have been studied and developed to achieve higher-speed processing in the computer field. In one arrangement of such apparatuses, a plurality of cells (also called processing elements), each capable of executing processing on its own, are arranged in a matrix to form an array, and the respective cells in this cell array operate in parallel to carry out calculation processing. A parallel-processing apparatus constituted by the cell array can perform, at a high speed with low power consumption, SIMD (Single Instruction Multiple Data) processing, in which a common calculation is executed in parallel on many data items, as in image processing or the like.
Examples of the parallel-processing apparatus are a processing circuit (Sigematsu et al., U.S. Ser. No. 091472.392) which has a fingerprint sensor and fingerprint authentication circuit in each cell, and processes by parallel operation of all the cells whether a fingerprint obtained by the fingerprint sensor coincides with a registered fingerprint, and an apparatus (J. C. Gealow et al., "System Design for Pixel-Parallel Image Processing", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 4, no. 1, 1996) in which each cell has an image processing circuit, and various image processes are done for an image acquired by an optical sensor or the like by parallel operation of all the cells.
A parallel-processing apparatus constituted by the cell array will be explained briefly. In this parallel-processing apparatus, as shown in FIG. 17, a plurality of cells 1701 each having a processing circuit are arrayed in a matrix, and perform parallel processing on the basis of data and an instruction supplied from a control circuit 1702. After parallel processing of the respective cells 1701, the control circuit 1702 accumulates processing results output from the processing circuits of the cells 1701, and generates and outputs the total processing results.
If the parallel-processing apparatus has many cells, the processing circuit in each cell is simplified, and the processing result of the processing circuit in the cell represents only true/false or a number having several digits. A parallel-processing apparatus with the above cell array arrangement is often applied to image processing. In image processing, each cell executes predetermined processing for several dots forming an image to be processed. For example, in image processing such as pattern matching, each cell performs image processing for the dots of the image that are assigned to it, and outputs "true/false" or the like as a comparison result. After the cells complete their parallel processing, the control circuit accumulates the "true" outputs from the processing circuits of the cells, calculates the image matching ratio on the basis of the number of accumulated "true" outputs, and outputs the image matching ratio as the pattern matching processing result.
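The accumulation step in the pattern matching example can be illustrated with a minimal software sketch. It assumes that the matching ratio is the number of "true" outputs divided by the total number of cells; the text above states only that the ratio is calculated on the basis of the number of "true" outputs, so this formula and the function name `matching_ratio` are illustrative assumptions.

```python
def matching_ratio(cell_results):
    """Return the fraction of cells whose comparison result is true.

    Assumption (not stated in the text): ratio = true outputs / total cells.
    """
    results = list(cell_results)
    if not results:
        raise ValueError("at least one cell result is required")
    # The control circuit accumulates the "true" outputs of all cells.
    true_count = sum(1 for r in results if r)
    return true_count / len(results)


# Four cells, three of which report a match.
ratio = matching_ratio([True, False, True, True])
```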
When a large number of processing circuits are independently distributed, as in the above parallel-processing apparatus, the data processed by the respective processing circuits must be collected at one location. If the data cannot be collected at a high speed, the benefit of high-speed calculation by parallel processing is diminished.
To collect the processed data at one location, accumulation processing in the parallel-processing apparatus adopts either a method of reading out processing results from the cell array and accumulating them, like a DRAM (Dynamic Random Access Memory), or a method of transferring processing results from cell to cell in a bucket brigade manner and accumulating them.
According to the first method of reading out processing results from the cell array and accumulating them, like a DRAM, processing results are read out from respective cells as follows. In the first method, as shown in FIG. 18, a processing circuit 1802 in each cell 1801 is connected to a corresponding data bus 1822 via a switching element 1803 controlled by a select signal sent via a control line 1821. The select signal is generated by a select signal generation circuit 1812 in accordance with a signal from a control circuit 1811. The same select signal is input to cells 1801 on the same row of the cell array.
Each data bus 1822 connected via the switching elements 1803 is commonly connected to each column of the cell array, and is connected to a selector 1813. The selector 1813 connected to the respective data buses 1822 sequentially selects one data bus 1822 in accordance with a signal from the control circuit 1811, and connects the selected data bus 1822 to a counter 1811a in the control circuit 1811.
In the parallel processing circuit of FIG. 18 in which the cells 1801, control lines 1821, and data buses 1822 are connected, the control circuit 1811 controls the select signal generation circuit 1812 to enable the control lines 1821 in units of rows after processing of all the cells 1801, and turns on the switching elements 1803 of the cells 1801 connected to the enabled control line 1821. Each cell 1801 whose switching element 1803 is ON outputs the processing result of the processing circuit 1802 to the data bus 1822 via the switching element 1803.
The processing result output to the data bus 1822 is input to the selector 1813. The selector 1813 sequentially selects processing results output to the data buses 1822 of respective columns in units of columns, and sends the selected results to the counter 1811a. The counter 1811a counts the processing results sequentially sent in units of columns, thereby accumulating the processing results of all the cells 1801.
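The readout sequence of the first method can be modeled behaviorally as follows. This is a sketch only: it assumes one-bit processing results, and the function name `accumulate_by_readout` and the returned count of bus selections are hypothetical additions for illustration, not part of the described circuit.

```python
def accumulate_by_readout(cell_array):
    """Model of DRAM-like readout for a matrix of one-bit cell results.

    cell_array is a list of rows; each row is a list of 0/1 results.
    Returns (counter value, number of data-bus selections performed).
    """
    counter = 0
    bus_selections = 0
    for row in cell_array:
        # The select signal enables one control line, so every cell on
        # this row drives its result onto the data bus of its column.
        data_buses = list(row)
        for value in data_buses:
            # The selector connects one data bus at a time to the counter.
            counter += 1 if value else 0
            bus_selections += 1
    return counter, bus_selections


total, selections = accumulate_by_readout([[1, 0, 1], [0, 0, 1]])
```

In this model the number of bus selections equals the number of cells, which reflects why readout time grows with the size of the cell array.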
However, the first method requires a select signal generation circuit for selecting a control line and a selector for selecting a data bus, which increases the area of the parallel-processing apparatus. In addition, the processing circuit of each cell must drive a data bus in order to output a processing result, which decreases the speed and increases power consumption.
According to the second method of transferring processing results by respective cells in a bucket brigade manner, processing results are read out from respective cells as follows. In the second method, as shown in FIG. 19, each cell 1901 has a register 1903 and selector 1904 in addition to a processing circuit 1902. The selector 1904 selects either data from an adjacent cell 1901 that is input via an input signal line 1921, or a processing result from the processing circuit 1902, and outputs the selected data to the register 1903. The register 1903 holds a signal from the selector 1904 in accordance with a write signal sent from a control circuit 1911 via a write signal line 1922, and outputs the held signal to an adjacent cell 1901. All the cells 1901 are connected in an array, and an output from the final cell 1901 is input to a counter 1911a in the control circuit 1911.
According to the second method, in the parallel-processing apparatus, after the processes of all the cells 1901 are completed, the processing result of each processing circuit 1902 is selected by the selector 1904 and held by the register 1903. Then, each selector 1904 selects the signal from the adjacent cell 1901, and a write signal is sent to the registers 1903 in all the cells 1901 to transfer the processing result held by each register 1903 to the adjacent cell 1901. Repeating this transfer as many times as the total number of cells 1901 transmits the processing results of all the cells 1901 to the counter 1911a. The counter 1911a counts the transmitted processing results to accumulate them.
However, the second method must transmit the write signal for the registers 1903 to all the cells 1901 as many times as the total number of cells 1901. If the number of cells 1901 is large, power consumption becomes large. If a skew caused by a delay or the like occurs in the write signal during transmission, a register 1903 may fail to write. To prevent this write failure, a multilevel write signal must be used, or a delay circuit or the like must be inserted in the write signal line, resulting in a low accumulation speed.
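The bucket brigade transfer can be sketched as a simple shift-register model. The sketch assumes one-bit processing results and a single chain of cells ending at the counter; the function name `accumulate_by_bucket_brigade` and the returned pulse count are hypothetical and serve only to illustrate the sequence described above.

```python
def accumulate_by_bucket_brigade(results):
    """Model of bucket brigade accumulation over a chain of cells.

    results holds the one-bit result of each cell, first cell first.
    Returns (counter value, number of write pulses issued).
    """
    # Each selector picks its own processing result and the register holds it.
    registers = [1 if r else 0 for r in results]
    counter = 0
    write_pulses = 0
    for _ in range(len(registers)):
        # The output of the final cell is input to the counter.
        counter += registers[-1]
        # One write pulse reaches every register; each register takes the
        # value held by the preceding cell, and the first cell takes 0.
        registers = [0] + registers[:-1]
        write_pulses += 1
    return counter, write_pulses


total, pulses = accumulate_by_bucket_brigade([1, 0, 1, 1, 0])
```

Because every pulse is distributed to every register, the number of register write events in this model is the square of the number of cells.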
As described above, to read out and accumulate processing results from a cell array, like a DRAM, conventional parallel processing requires a select signal generation circuit for selecting a control line and a selector for selecting a data bus, which increases the area of the parallel-processing apparatus. In this method, the processing circuit of each cell must drive a data bus in order to output a processing result, which decreases the speed and increases power consumption.
In the method of transferring processing results by respective cells in a bucket brigade manner and accumulating them, a register write signal must be transmitted as many times as the total number of cells. If the number of cells is large, power consumption becomes large. If a skew caused by a delay or the like occurs in the write signal during transmission, a register may fail to write. To prevent this write failure, a multilevel write signal must be used, or a delay circuit or the like must be inserted in the write signal line, resulting in a low accumulation speed.
The present invention has been made to overcome the conventional drawbacks, and has as its object to accumulate the processing results of all the cells at a high speed with low power consumption in a parallel-processing apparatus constituted by a plurality of cells for performing processing.
To achieve the above object, according to the present invention, there is provided a parallel-processing apparatus comprising: a plurality of cells each having a processing circuit for performing arbitrary processing; variable-delay circuits which are respectively arranged in the cells, change a signal propagation delay in accordance with processing results of the processing circuits in corresponding cells, and are series-connected over the plurality of cells; signal output means for outputting a measurement input signal to a first variable-delay circuit of a variable-delay circuit array constituted by series-connecting all the variable-delay circuits; a delay counter for receiving the measurement input signal output from the signal output means and a measurement output signal output from a final variable-delay circuit of the variable-delay circuit array upon input of the measurement input signal to the first variable-delay circuit of the variable-delay circuit array, and obtaining a signal propagation delay time of the variable-delay circuit array on the basis of the measurement input and output signals; and accumulation means for accumulating processing results of the processing circuits in the plurality of cells on the basis of the signal propagation delay time of the variable-delay circuit array obtained by the delay counter.
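The principle of accumulating results from a propagation delay can be illustrated with a simplified model. The sketch assumes that each variable-delay circuit takes one of two known delays, `d_true` or `d_false`, according to a one-bit processing result; the text above states only that the delay changes in accordance with the processing result, so the two-valued delay, the integer time unit, and both function names are assumptions made for illustration.

```python
def chain_delay(results, d_false, d_true):
    """Total propagation delay of the series-connected variable-delay circuits.

    Assumption: a cell contributes d_true when its result is true and
    d_false otherwise (integer time units, e.g. picoseconds).
    """
    return sum(d_true if r else d_false for r in results)


def count_from_delay(total_delay, n_cells, d_false, d_true):
    """Recover the number of true results from the measured total delay."""
    if d_true == d_false:
        raise ValueError("the two delays must differ to distinguish results")
    # total_delay = n_cells * d_false + count * (d_true - d_false)
    return round((total_delay - n_cells * d_false) / (d_true - d_false))


results = [True, False, True, True, False]
measured = chain_delay(results, d_false=100, d_true=150)
true_count = count_from_delay(measured, len(results), d_false=100, d_true=150)
```

Under these assumptions a single measurement of the chain delay yields the accumulated count, without a per-cell readout or a per-cell write signal.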