1. Field of the Invention
The present invention relates to a scanning system using tree structures which is constituted by propagation operating units and designed to execute scan operations (also called associative processing, prefix processing, or propagation processing) for calculating a1, (a1 ∘ a2), (a1 ∘ a2 ∘ a3), . . . , and (a1 ∘ a2 ∘ a3 ∘ . . . ∘ aM) (where ∘ is an operator, called an associative operator, representing "+", " ", " ", MIN (for obtaining a minimum value), MAX (for obtaining a maximum value), or the like) with respect to a large-scale data array of a1, a2, a3, . . . , and aM.
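As an illustration only, the scan operation defined above can be sketched in Python as follows. The function name `scan` and the sample array are assumptions introduced for this sketch and do not appear in the specification.

```python
from itertools import accumulate
import operator

def scan(values, op):
    """Return [a1, a1 op a2, ..., a1 op ... op aM] for an associative op."""
    return list(accumulate(values, op))

data = [3, 1, 7, 0, 4, 1, 6, 3]          # hypothetical scan target data array
sum_scan = scan(data, operator.add)      # associative operator "+"
min_scan = scan(data, min)               # associative operator MIN
max_scan = scan(data, max)               # associative operator MAX

print(sum_scan)  # [3, 4, 11, 11, 15, 16, 22, 25]
print(min_scan)  # [3, 1, 1, 0, 0, 0, 0, 0]
print(max_scan)  # [3, 3, 7, 7, 7, 7, 7, 7]
```

This sequential form needs M-1 dependent operations; the tree arrangements discussed below aim to shorten that dependency chain.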
2. Description of the Related Art
Scan operations are one type of basic parallel operation. Increasing the speed of these scan operations shortens the processing time required for various types of operations in a parallel data processing apparatus.
As a conventional method of increasing the speed of scan operations, a method based on processing units arranged in a binary tree structure is known, which is disclosed in, e.g., G. E. Blelloch, "Scans as Primitive Parallel Operations", Proc. Int. Conf. Parallel Process., Aug. 1987, pp. 355-362 (reference 1).
In a scanning system using tree structures according to Blelloch, processing units 1, each constituted by one (or two) arithmetic and logic units (ALUs) and an m-bit memory, are coupled to each other, forming a binary tree structure, as shown in FIGS. 1(A) and (B). This system performs two types of processing. In the first, up sweep processing (FIG. 1(A)), scan operations are performed sequentially from the lowermost layer to the uppermost layer: each processing unit performs predetermined associative processing between the two inputs supplied from the lower adjacent layer and outputs the processing result to the upper adjacent layer, while the input supplied from the processing unit at the lower left position of the lower adjacent layer (represented by a numerical value in the box representing the processing unit in FIG. 1(A)) is stored in the memory of the unit. In the second, down sweep processing (FIG. 1(B)), scan operations are performed sequentially from the uppermost layer to the lowermost layer: the offset of the processing unit 1 located at the lower right position of the lower adjacent layer is obtained on the basis of the data obtained by the up sweep processing and stored in the memory. Through these two types of processing, the scan processing results are output from the respective processing units 1 of the lowermost layer. As is apparent from FIGS. 1(A) and (B), since the number of layers to which the processing units 1 belong is represented by log.sub.2 M, if processing for each layer requires one step, scan operations for the overall array can be executed in the 2log.sub.2 M steps required for both up sweep and down sweep. In this case, the offset of each processing unit 1 corresponds to the scan processing result obtained immediately before the scan processing region covered by the processing unit 1.
Therefore, the scan processing result at a given point can be obtained by combining a corresponding offset with the local scan processing result of the corresponding processing unit 1.
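The up sweep and down sweep described above may be sketched as follows. This is a minimal software model written for illustration, not the hardware of reference 1; the names `tree_scan`, `stored_left`, and `identity` are assumptions, and the array length is assumed to be a power of two.

```python
import operator

def tree_scan(values, op, identity):
    """Model of a binary-tree scan; returns (offsets, results, steps).

    offsets[i] is the scan result immediately before element i, and
    results[i] combines that offset with element i.
    Assumes len(values) is a power of two.
    """
    m = len(values)
    assert m > 0 and m & (m - 1) == 0, "length must be a power of two"

    # Up sweep: each layer combines pairs of inputs and passes the result
    # upward, while every unit stores the input from its lower left side.
    stored_left = []
    level = list(values)
    while len(level) > 1:
        half = len(level) // 2
        stored_left.append([level[2 * j] for j in range(half)])
        level = [op(level[2 * j], level[2 * j + 1]) for j in range(half)]

    # Down sweep: the uppermost unit receives the identity as its offset.
    offsets = [identity]
    for stored in reversed(stored_left):
        lower = []
        for off, left in zip(offsets, stored):
            lower.append(off)            # lower left unit inherits the offset
            lower.append(op(off, left))  # lower right unit: offset op stored input
        offsets = lower

    results = [op(off, v) for off, v in zip(offsets, values)]
    steps = 2 * len(stored_left)         # 2 * log2(M), one step per layer
    return offsets, results, steps

offsets, results, steps = tree_scan([3, 1, 7, 0, 4, 1, 6, 3], operator.add, 0)
print(offsets)  # [0, 3, 4, 11, 11, 15, 16, 22]
print(results)  # [3, 4, 11, 11, 15, 16, 22, 25]
print(steps)    # 6
```

For M = 8 the model takes 2log.sub.2 M = 6 layer steps, which matches the step count stated above.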
In this scheme, special ALUs are arranged for up sweep and down sweep, respectively, and registers are arranged between the respective layers to realize pipeline processing, thus equivalently reducing the number of required steps to 1. However, in general processing in which scan operations are not performed repeatedly, a number of steps corresponding to the pipeline latency of 2log.sub.2 M is required in practice. This means that even if the number of required steps is made far smaller than the latency, the contribution of such a reduction to high-speed processing is small.
For this reason, in the reference 1, Blelloch proposes a bit pipelining method (sequentially performing scan operations from the least significant digit of a scan target data array in units of bits), wherein a reduction in hardware size can be achieved, although the number of required steps increases in proportion to the word length of the scan target data array. Even this method, for example, has the following problems:
1) When a data array having a short word length, such as image data, is to be processed, the latency cannot be neglected.
2) Although the bit width of each ALU can be reduced to 1 bit, the memory capacity, of each processing unit, required to store an associative processing result cannot be reduced, resulting in an insufficient reduction in hardware size of each processing unit.
As an apparatus which can solve the problems of pipeline processing, a scanning system using tree structures is proposed by the present inventor in "Scanning Apparatus Using Tree Structures Suitable for LSI", the transactions of the Institute of Electronics, Information and Communication Engineers C-II, Vol. J74-C-II, No. 5, pp. 388-397, May 1991. This apparatus executes scan operations with a delay step count of the order of N (log.sub.N M) (the delay time per step can be reduced to several ns or less, so that scan operations with M of several hundred can be executed in a hundred ns or less, i.e., within a few clock cycles, without performing pipeline processing). FIG. 2 shows a parallel data processing apparatus comprising a processing element (PE) array and a scanning system. In the scanning system, propagation operating units for executing scan operations of the scheme described in the above literature of the present inventor are arranged in a tree structure.
In this apparatus, the propagation operating units (POUs) 10a are coupled, forming a tree structure, and the respective units are connected to the processing element (PE) array 60.
As shown in FIG. 3, each propagation operating unit 10a is constituted by (N-1) cascaded propagation elements 30a, and has a base-N arrangement for performing scan operations for each local region (a sub array constituted by N cascaded elements of a scan target array) by performing propagation processing between the respective propagation elements 30a.
Each propagation element 30a comprises scan processing ALUs 32a and 33a, an AND gate 31 for propagation start point control, selectors (SELs) 34 and 35, and the like, as is shown in FIG. 4. Referring to FIG. 4, reference symbol FL.sub.i denotes propagation start point attribute data representing a propagation start point at which a scan region is divided into small regions; DI.sub.i, scan target data; and fd.sub.i, the offset of a scan processing result to be output to a propagation operating unit of the lower adjacent layer. The propagation start point attribute data FL.sub.i and the scan target data DI.sub.i are input from a propagation operating unit of the lower adjacent layer or a processing element (PE) of the PE array (denoted by reference numeral 60 in FIG. 2).
Referring to FIG. 4, reference symbol CD.sub.i denotes the logical OR of propagation start point attribute data between the propagation elements 30a; and fu.sub.i denotes an intermediate result of scan operations.
In addition, reference symbol DUI denotes an offset received from the propagation operating unit 10a of the upper adjacent layer.
As is apparent from the tree structure of the propagation operating units 10a and the arrangement of each propagation element 30a, each propagation operating unit 10a of the lowermost layer executes scan operations with respect to scan target data DI.sub.i of a scan processing target data array in a local region covered by each unit through the ALUs 32a and the selectors 34 of the propagation elements 30a. Results CD.sub.N and fu.sub.N are output as start point attribute data FL.sub.i and scan target data DI.sub.i with respect to a corresponding one of the propagation operating units 10a of the second lowest layer. In this case, if the start point attribute data FL.sub.i from the lower adjacent layer represents an active state ("0" in this arrangement, representing that the scan target data DI.sub.i is propagation start point data), the selector 34 is controlled to stop the propagation of the result obtained at the previous stage, and the corresponding propagation element is set as a new start point.
In the second and subsequent stages, CD.sub.N and fu.sub.N are obtained and output as the propagation start point attribute data FL.sub.i and the scan target data DI.sub.i with respect to the propagation operating unit 10a of the upper adjacent layer.
As a result, the scan operating results of the local regions are accumulated in the propagation operating unit 10a of the uppermost layer. In contrast, if "0" is supplied, as the scan target datum DUI, to the uppermost layer, a terminal fd.sub.i outputs an offset to each propagation operating unit 10a of the lower adjacent layer, as is apparent from the arrangement of the propagation element 30a constituted by the ALU 33a, the selector 35, and the AND gate 31. Since each propagation operating unit 10a of the lower adjacent layer calculates and outputs the offset of a corresponding propagation operating unit of the lower adjacent layer on the basis of the received offset, the scan operating result with respect to the data array immediately before the terminal fd.sub.i is obtained, as an offset, at the terminal fd.sub.i of each propagation operating unit of the lowermost layer.
The propagation operating units 10a of layers other than the uppermost layer are designed such that when the propagation start point attribute data FL.sub.i is "0", an offset from an upper layer is ignored in the subsequent processing, and scan operations are newly started from the propagation start point.
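The effect of the propagation start point attribute data can be illustrated by the following sequential sketch. It models only the input/output behavior (a scan restarted at every start point), not the tree of propagation operating units; the function name `segmented_scan` and the sample data are assumptions. As in the arrangement described above, a flag value of "0" is treated as active.

```python
import operator

def segmented_scan(values, flags, op):
    """Scan that restarts wherever flags[i] == 0 (active-low start point)."""
    out = []
    acc = None
    for value, fl in zip(values, flags):
        if fl == 0 or acc is None:
            acc = value              # new start point: earlier results are ignored
        else:
            acc = op(acc, value)     # continue propagation within the region
        out.append(acc)
    return out

values = [3, 1, 7, 0, 4, 1, 6, 3]    # hypothetical scan target data DI
flags  = [0, 1, 1, 0, 1, 1, 0, 1]    # hypothetical start point attribute data FL
print(segmented_scan(values, flags, operator.add))
# [3, 4, 11, 0, 4, 5, 6, 9]
```

Each run of elements beginning at a "0" flag is scanned independently, which corresponds to performing parallel scan operations in the respective local regions of the array.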
In this arrangement, since scan operations are performed in tree-like hierarchical layers, the number of propagation elements through which the operating data pass in the scan operations is reduced to about the order of Nlog.sub.N M (where M is the array size of the scan target data array).
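A rough software model of the base-N hierarchy, together with the path-length estimate stated above, might look like the following. It is a sketch under simplifying assumptions: start point control is omitted, and the names `base_n_scan` and `path_length_estimate` are introduced here for illustration only.

```python
import operator

def base_n_scan(values, op, n):
    """Inclusive scan computed hierarchically with groups of up to n elements."""
    if len(values) <= n:
        out = [values[0]]
        for v in values[1:]:
            out.append(op(out[-1], v))   # cascaded propagation inside one unit
        return out
    groups = [values[i:i + n] for i in range(0, len(values), n)]
    local = [base_n_scan(g, op, n) for g in groups]   # local-region scans
    totals = [scanned[-1] for scanned in local]       # passed to the upper layer
    upper = base_n_scan(totals, op, n)                # scan of the group totals
    out = list(local[0])
    for g in range(1, len(local)):
        offset = upper[g - 1]            # result immediately before this group
        out.extend(op(offset, x) for x in local[g])
    return out

def path_length_estimate(m, n):
    """Approximate N * log_N(M), rounding the layer count up."""
    layers = 1
    while n ** layers < m:
        layers += 1
    return n * layers

print(base_n_scan(list(range(1, 11)), operator.add, 3))
# [1, 3, 6, 10, 15, 21, 28, 36, 45, 55]
print(path_length_estimate(512, 8))   # 24, versus 511 for a single linear chain
```

The comparison in the last line is only an order-of-magnitude illustration of why the hierarchical arrangement shortens the propagation path.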
The number of propagation elements can be further reduced equivalently by introducing a selective propagation scheme into the propagation processing system of each propagation operating unit 10a. This scheme is disclosed in "Parallel Data Processing Apparatus" of Published Unexamined Japanese Patent Application No. 63-193232 of the present inventor. In the selective propagation scheme, propagation operations are performed for all the combinations of input signal values before an input from the forwarding stage is determined, and propagation is realized by only selecting the forwarding processing results when the input is determined. In the selective propagation scheme applied to each propagation operating unit, however, even if propagation is limited to a 1-bit signal, two combinations of an input are present, and the propagation operating paths corresponding to the number of combinations must be arranged in parallel. Therefore, the number of operating units and selectors required for scan operations is doubled as compared with a case wherein selective propagation is not utilized. In addition, the arrangement of each propagation element is complicated.
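The selective propagation idea can be sketched for a 1-bit propagated signal as follows. The stage function used here (output = g OR (p AND input)) is a hypothetical example chosen for this sketch, not the circuit of the cited application, and the function names are likewise assumptions.

```python
def direct_propagate(stages, first_input):
    """Ordinary propagation: each stage waits for the previous output."""
    outputs, signal = [], first_input
    for g, p in stages:
        signal = g | (p & signal)        # one operation per stage
        outputs.append(signal)
    return outputs

def selective_propagate(stages, first_input):
    """Selective propagation: precompute both candidates, then only select."""
    # Phase 1: every stage evaluates its output for input 0 and for input 1.
    # In hardware these evaluations would proceed in parallel.
    candidates = [(g | (p & 0), g | (p & 1)) for g, p in stages]
    # Phase 2: once the real input is known, each stage is a 2-input selector.
    outputs, signal = [], first_input
    for out0, out1 in candidates:
        signal = out1 if signal else out0
        outputs.append(signal)
    evaluations = 2 * len(stages)        # twice as many operating paths
    return outputs, evaluations

stages = [(0, 1), (1, 0), (0, 1), (0, 0)]   # hypothetical (g, p) per stage
print(direct_propagate(stages, 1))           # [1, 1, 1, 0]
print(selective_propagate(stages, 1))        # ([1, 1, 1, 0], 8)
```

Both functions give the same outputs, but the selective version evaluates two candidates per stage, which reflects the doubling of operating units noted above.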
As described above, in the conventional scanning system using tree structures, when the size of a scan target data array is large, even if a pipeline structure is formed, the latency in which scan operations are performed through the propagation operating unit of the tree is long. Hence, the scan operating performance corresponding to the degree of parallel processing capability cannot be obtained. Since processing units, memories, pipeline registers, and the like must be incorporated in the respective units constituting a tree, a hardware amount as much as that of a processor section is required. In order to set a scan start point at an arbitrary position (required to perform parallel scan operations in the respective local regions of the array), a start point setting control logic must be incorporated into each propagation operating unit. The above-described problems and the like are still left unsolved.
Incidentally, a method of performing high-speed addition of two data values by using techniques similar to the present invention is disclosed in Published Unexamined Japanese Patent Application No. 1-300337 or "Carry-Select Adder", IRE TRANSACTIONS ON ELECTRON COMPUTERS, June, pp. 340-344, which realizes high-speed processing by selecting previous processing results. The arrangement of this conventional apparatus is similar to that of the scanning system of the present invention. However, they are different from each other in the following respects:
1) In order to increase the processing speed and to decrease the hardware size, the conventional apparatus also utilizes the intrinsic characteristics of carry generation. As a result, the conventional apparatus does not employ a regular tree structure in which each 2-input selector is controlled by an offset value as in the present invention.
2) The present invention is designed for scan operations on each bit of a large number of data elements constituting a data array. By contrast, the carry-select adder is basically designed for addition of two data values.
With regard to point 1), however, an adder using a tree constituted by only selectors, similar to the present invention, has been proposed, although the symmetry of the tree is slightly low. Since this adder employs a scheme for performing addition by only up sweep processing, even though it is possible to increase the processing speed, various problems are posed, e.g., low regularity of the tree and the necessity of using large-sized hardware which is proportional to the square of the word length.
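For comparison, the general carry-select principle mentioned above (selecting between sums prepared in advance for both possible carry inputs) can be sketched as follows. This is a simplified software illustration, not the circuit of either cited document; the function name `carry_select_add` and the parameters `width` and `block` are assumptions.

```python
def carry_select_add(a, b, width=16, block=4):
    """Add two unsigned integers block by block, selecting precomputed sums.

    Returns (sum modulo 2**width, carry out). Assumes width is a multiple
    of block and that a and b fit in width bits.
    """
    mask = (1 << block) - 1
    result, carry = 0, 0
    for shift in range(0, width, block):
        x = (a >> shift) & mask
        y = (b >> shift) & mask
        sum_if_0 = x + y          # candidate prepared for carry-in 0
        sum_if_1 = x + y + 1      # candidate prepared for carry-in 1
        chosen = sum_if_1 if carry else sum_if_0   # selection by the real carry
        result |= (chosen & mask) << shift
        carry = chosen >> block   # carry out of this block
    return result, carry

print(carry_select_add(0x0FFF, 0x0001))  # (4096, 0), i.e. 0x1000
print(carry_select_add(0xFFFF, 0x0001))  # (0, 1)
```

In hardware the two candidates of every block would be computed concurrently, so that only the chain of selections depends on the incoming carry.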