1. Field of the Invention
The present invention relates to a dynamically reconfigurable operation apparatus capable of performing diverse processing by dynamically reconfiguring itself.
2. Description of the Related Art
Methods for making a reconfigurable operation apparatus execute a plurality of tasks have conventionally been classified into two types: first, space parallelism, in which the circuit is divided spatially and a task is allocated to each resulting circuit; and second, time parallelism, in which the configuration is switched in a time-sharing manner to execute different tasks. The present invention relates to an operation apparatus belonging to the latter type, which enables a plurality of tasks to be executed in a time-shared manner by changing over the configuration in synchronism with the machine clock, and to the dynamic reconfiguration thereof.
An example of such an operation apparatus is disclosed in Japanese patent laid-open application publication No. 2001-312481. FIG. 1 is an illustration showing an example of a processor element array section of the array processor presented in the aforementioned publication. In FIG. 1, the processor element array (hereinafter called “PE array”) section is configured so that each processor element 990 is surrounded by eight programmable switch elements 991. Adjacent programmable switch elements 991 are electrically connected to each other, and each processor element 990 is electrically connected to its adjacent programmable switch elements 991, by data buses 992. The programmable switch elements 991 and the processor element 990 are also hard-wired to an operation control bus 993. Because one processor element 990 is connected to a plurality of the programmable switch elements 991, it gains greater freedom in connecting its external data inputs and outputs.
FIGS. 2A and 2B describe the problem of data transfer delay in the conventional PE array shown in FIG. 1. FIG. 2A shows the operations of the processor elements engaged in the PE array processing, while FIG. 2B shows how the operation progresses in each processing cycle. FIGS. 2A and 2B take as an example the computation of the expression (a+b)+(c−d)+(e+f) while data is inputted from the left part of the PE array. In FIG. 2A, when the six values a, b, c, d, e and f are inputted, two values at a time, into the three consecutive switch elements PE1, PE2 and PE3 on the left side, the processor elements located to the lower right of the switch elements receiving the data perform, respectively, the addition a+b (=A), the subtraction c−d (=B) and the addition e+f (=C) in the cycle 1. The switch element PE4 performs the addition A+B (=D) in the cycle 2, and at the same time the data transfer 1 is performed for the value C. The data transfer 2 is performed for the value C in the cycle 3, the switch element PE5 performs the addition D+C (=E) in the cycle 4, the data transfers 3 and 4 are performed for the operation result E in the cycles 5 and 6, respectively, and the operation result E is outputted in the cycle 7.
It is apparent from the above that the data transfer in the cycle 3 delays the processing, and that the data transfers 3 and 4 in the cycles 5 and 6 cause a further delay even though the computation itself is completed in the cycle 4, so that the output of the operation result E is delayed.
That is, as observed in the Japanese patent laid-open application publication No. 2001-312481, a configuration that includes not only the processor elements but also switch elements used for data transfers between the processor elements has a high probability of incurring processing delays associated with data transfers.
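The cycle accounting described for FIG. 2B can be tallied in a short sketch. The schedule below is transcribed from the description above; the function names and the sample input values are hypothetical and serve only to illustrate which cycles are spent purely on data transfer.

```python
# Cycle-by-cycle schedule transcribed from the description of FIG. 2B.
# Each entry: (cycle, operations performed, data transfers performed).
SCHEDULE = [
    (1, ["A=a+b", "B=c-d", "C=e+f"], []),
    (2, ["D=A+B"], ["transfer 1 (C)"]),
    (3, [], ["transfer 2 (C)"]),
    (4, ["E=D+C"], []),
    (5, [], ["transfer 3 (E)"]),
    (6, [], ["transfer 4 (E)"]),
    (7, [], []),  # the result E is outputted
]


def transfer_only_cycles(schedule):
    """Return the cycles in which data is moved but no arithmetic is done."""
    return [cycle for cycle, ops, transfers in schedule if transfers and not ops]


def evaluate(a, b, c, d, e, f):
    """Compute (a+b)+(c-d)+(e+f) in the order used by the schedule."""
    A, B, C = a + b, c - d, e + f  # cycle 1
    D = A + B                      # cycle 2
    E = D + C                      # cycle 4
    return E


stalls = transfer_only_cycles(SCHEDULE)
print("transfer-only cycles:", stalls)
print("result for sample inputs:", evaluate(1, 2, 3, 4, 5, 6))
```

Under this reading, three of the seven cycles perform no arithmetic at all, which is the delay the passage attributes to routing data through switch elements.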
FIGS. 3A through 3D show, taking the computation of the expression (a+b)*(c−d) as an example, how a problem can arise depending on the processing content because of the allocation of functions to the processor elements constituting the PE array in a conventional operation apparatus. FIGS. 3A and 3B show the operations of the PE array and the operation in each processing cycle for a case where no problem occurs, while FIGS. 3C and 3D show the same for a case where a problem occurs. In the PE array shown by FIGS. 3A and 3C, arithmetic logical units (hereinafter called “ALU”) and multipliers are assumed to be arrayed, respectively, as shown. The ALU and the multiplier are distinguished because they are physically entirely different circuits.
In the case shown in FIG. 3A, since the ALUs and the multipliers are suitably arrayed for computing the expression (a+b)*(c−d), the operation is completed in 2 cycles and the above described data transfer in 3 cycles, and the entire processing is finished in 6 cycles. By comparison, in the case shown in FIG. 3C, the ALUs and the multipliers are not suitably arrayed for computing the given expression: 5 cycles are required for the operation itself because the input data −c and −d must be transferred to the processor elements used for computing, and 4 cycles are required to transfer data for the output because the processor element performing the processing is remotely located from the output switch element, resulting in a total of 10 cycles for the entire processing.
The allocation of processor elements in a PE array is fixed at the time of production and cannot be changed afterwards. Therefore, if a PE array is built from disparate ALU modules in a type of operation apparatus that transfers data between adjacent switch elements, the processing efficiency will vary a great deal depending on the algorithm, because with the conventional techniques it is impossible to select in advance, at the production stage, a set of ALU modules suitable for arbitrary algorithms.
As such, it is difficult to use disparate operation elements in a reconfigurable PE array operation apparatus that transfers data between operation elements by way of a two-dimensional array of switch elements.
FIGS. 4A and 4B show the way a feedback processing is done in the conventional operation apparatus. Assume that the following processing 1 is performed.
D[0] = 1;
for (i = 0; i < 5; i++) {
    aa[i] = D[i] + a[i];
    B[i] = b[i] + aa[i];
    cc[i] = c[i] + d[i];
    D[i+1] = B[i] - cc[i];
}
(Called processing 1)
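For reference, processing 1 can be transcribed into a runnable form as follows. This is only a sketch: the function name and the sample input arrays are hypothetical, since the source does not specify the contents of a, b, c and d. The point it illustrates is the loop-carried dependency, in which each D[i+1] cannot be computed before D[i] is available.

```python
def processing_1(a, b, c, d, iterations=5):
    """Run processing 1 and return the list D[0..iterations]."""
    D = [0] * (iterations + 1)
    aa = [0] * iterations
    B = [0] * iterations
    cc = [0] * iterations

    D[0] = 1
    for i in range(iterations):
        aa[i] = D[i] + a[i]     # depends on D[i] from the previous iteration
        B[i] = b[i] + aa[i]
        cc[i] = c[i] + d[i]     # independent of D, so it needs no feedback
        D[i + 1] = B[i] - cc[i]  # result fed back to the next iteration
    return D


# Hypothetical sample data (not from the source).
a = [1, 1, 1, 1, 1]
b = [2, 2, 2, 2, 2]
c = [1, 1, 1, 1, 1]
d = [1, 1, 1, 1, 1]
print(processing_1(a, b, c, d))
```

Only the chain D[i] to aa[i] to B[i] to D[i+1] is serial; the term cc[i] depends solely on the inputs.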
For ease of understanding, the vertical columns of the switch elements constituting the PE array are numbered sequentially from left to right as 0, 1, 2, . . . , M (with (M+1) being the number of columns), the horizontal rows are numbered sequentially from top to bottom as 0, 1, 2, . . . , N (with (N+1) being the number of rows), and the switch element located at column j and row k is denoted as S (j, k). Also, the processor element located at column j and row k (denoted as PE (j, k)) is accessible by the switch elements S (j, k), S (j, k−1), S (j−1, k) and S (j−1, k−1), all of which are located adjacent to PE (j, k).
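The indexing convention above can be expressed as a small helper. This is a minimal sketch: the function name is hypothetical, and discarding coordinates that fall outside the array at its edges is an assumption, since the source does not describe the boundary behavior.

```python
def adjacent_switch_elements(j, k, num_columns, num_rows):
    """Return the switch elements S(j, k) that can access PE(j, k).

    num_columns corresponds to (M+1) and num_rows to (N+1) in the text.
    Coordinates outside the array are dropped (assumed edge behavior).
    """
    candidates = [(j, k), (j, k - 1), (j - 1, k), (j - 1, k - 1)]
    return [
        (cj, ck)
        for cj, ck in candidates
        if 0 <= cj < num_columns and 0 <= ck < num_rows
    ]


# An interior processor element is reached by four switch elements.
print(adjacent_switch_elements(1, 1, num_columns=5, num_rows=3))
# A processor element in column 0 has no switch elements to its left.
print(adjacent_switch_elements(0, 1, num_columns=5, num_rows=3))
```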
To perform the processing 1 above, the data b[0] is inputted from the switch element S (0, 0), the data D[0] and a[0] are inputted from S (0, 1), and the data d[0] and c[0] are inputted from S (0, 2). In the cycle 1, the add operation aa[0] (=D[0]+a[0]) is executed at the processor element PE (0, 1), followed by the add operation cc[0] (=c[0]+d[0]) at the processor element PE (1, 1), as shown in FIG. 4. Then, in the cycle 2, the add operation B[0] (=b[0]+aa[0]) is executed at the processor element PE (1, 0). Then, in the cycle 3, the subtract operation D[0+1] (=B[0]−cc[0]) is performed at the processor element PE (2, 1), together with the data transfer 1 for the value B[0] from the switch element S (2, 1) to S (3, 1). Subsequently, still in the cycle 3, the data transfer 1 for the value B[0] from the switch element S (3, 1) to S (4, 1) is performed, and at the same time the processing result D[1] at the processor element PE (2, 1) is fed back to PE (0, 1). As such, a feedback cycle is required for a loop processing. While there is one feedback cycle in the example above, as the number of cycles prior to a loop-back increases, the return distance of the feedback becomes longer, and the loss in process efficiency becomes correspondingly worse.
As described, the process control architecture also closely affects the process efficiency of a reconfigurable PE array operation apparatus that transfers data between processor elements by way of the two-dimensional array of switch elements, and a further loss in process efficiency can therefore be incurred, depending on the loop processing.
There is also a problem associated with the time taken to reconfigure the ALU modules, which occurs when the processing content is changed, for example upon detection of a conditional branch. The accumulation of such switching time over the number of reconfigurations can therefore degrade overall performance.
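The cost of the feedback cycle over the whole loop can be estimated with a simple model. This is a rough sketch under stated assumptions: three computation cycles per iteration and one feedback cycle are read from the description above, while the assumption that successive iterations do not overlap, and the function name itself, are hypothetical simplifications.

```python
def loop_cycles(iterations, compute_cycles, feedback_cycles):
    """Total cycles when each iteration must wait for the previous result."""
    return iterations * (compute_cycles + feedback_cycles)


ITERATIONS = 5       # loop count of processing 1
COMPUTE_CYCLES = 3   # cycles 1 to 3 in the description above
FEEDBACK_CYCLES = 1  # one feedback cycle, as stated for this example

with_feedback = loop_cycles(ITERATIONS, COMPUTE_CYCLES, FEEDBACK_CYCLES)
without_feedback = loop_cycles(ITERATIONS, COMPUTE_CYCLES, 0)
overhead = (with_feedback - without_feedback) / with_feedback

print("cycles with feedback:", with_feedback)
print("cycles without feedback:", without_feedback)
print("fraction of cycles spent on feedback:", overhead)
```

In this model the overhead grows linearly with the feedback distance, which matches the observation that a longer return path worsens the loss in process efficiency.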
Among the several methods disclosed in the Japanese patent laid-open application publication No. 2001-312481, there is one for shortening the switching time, as shown in FIG. 5. FIG. 5 illustrates the part of the operation apparatus in which a reconfiguration is done for a group of ALU modules consisting of a plurality of ALU module units 990a arranged in two dimensions. Each ALU module unit 990a consists of an ALU module 996, an instruction memory 994 storing a plurality of instructions issued to the ALU module, and an instruction decoder 995 decoding a selected instruction. The part reconfiguring the group of the ALU modules consists of a state control manager 997, a state transition table 998 and a selector 999. The state transition table 998 is searched by a current state number, and a next state number and an instruction address common to all the ALU modules within the group are selected. In each ALU module unit 990a, an instruction is read out from the instruction memory 994 by the instruction address received from outside of the group of the ALU modules, the instruction is interpreted by the instruction decoder 995, and the processing content is established for the ALU module 996 so as to execute the instructed processing. FIG. 6 shows an operation timing of the operation apparatus shown in FIG. 5. In FIG. 6, the cycle defined as the “arithmetic and logical operations” actually includes an “instruction memory read-out,” an “instruction decoding,” and a genuine “operation.” Note that FIG. 6 shows the case in which the current state number is used when the instruction address is issued.
In the operation timing for “the case in which the next state number is used when the instruction is issued,” although the timing of the instruction address issue itself can be advanced, the “instruction memory read-out,” the “instruction decoding,” and the genuine “operation” must still all be performed within the same timing as the arithmetic and logical operations in the above described Japanese patent laid-open application publication No. 2001-312481. Therefore, the delay between the search of the state transition table and the actual arithmetic and logical operation becomes problematic. In addition, each state transition table entry holds both a next state number and an instruction address. The state transition table is a memory whose size grows with the number of entries, which in turn causes the problem of an increased circuit area.
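The control flow described for FIG. 5 can be modeled roughly as follows. This is an illustrative sketch only: the table contents, the instruction mnemonics, and all class and function names are hypothetical and do not come from the cited publication. It follows the case of FIG. 6, in which the current state number is used to look up the instruction address.

```python
import operator
from dataclasses import dataclass


@dataclass(frozen=True)
class TableEntry:
    next_state: int           # next state number
    instruction_address: int  # address common to all ALU module units


# Hypothetical state transition table, indexed by the current state number.
STATE_TRANSITION_TABLE = {
    0: TableEntry(next_state=1, instruction_address=0),
    1: TableEntry(next_state=2, instruction_address=1),
    2: TableEntry(next_state=0, instruction_address=2),
}

# Hypothetical instruction decoder: mnemonic to ALU function.
DECODER = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul}


class AluModuleUnit:
    """One ALU module unit with its own instruction memory."""

    def __init__(self, instruction_memory):
        self.instruction_memory = instruction_memory

    def execute(self, instruction_address, x, y):
        instruction = self.instruction_memory[instruction_address]  # read-out
        alu_function = DECODER[instruction]                         # decoding
        return alu_function(x, y)                                   # operation


def step(current_state, units, operands):
    """Look up the table, broadcast the address, and run every unit once."""
    entry = STATE_TRANSITION_TABLE[current_state]
    results = [
        unit.execute(entry.instruction_address, x, y)
        for unit, (x, y) in zip(units, operands)
    ]
    return entry.next_state, results


units = [AluModuleUnit(["ADD", "SUB", "MUL"]), AluModuleUnit(["SUB", "ADD", "ADD"])]
next_state, results = step(0, units, [(5, 3), (5, 3)])
print(next_state, results)
```

The same instruction address selects a different instruction in each unit because every unit holds its own instruction memory; the read-out, decoding and operation all happen inside `execute`, mirroring the three sub-steps folded into one cycle in FIG. 6.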
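The growth of the state transition table can be illustrated with a back-of-the-envelope estimate. This sketch assumes one table entry per state and minimum-width binary fields; both assumptions, along with the function names and the example sizes, are hypothetical and not taken from the cited publication.

```python
def bits_to_encode(num_values):
    """Minimum number of bits needed to distinguish num_values values."""
    return max(1, (num_values - 1).bit_length())


def state_transition_table_bits(num_states, num_instruction_addresses):
    """Estimate table size: every entry stores a next state and an address."""
    entry_bits = bits_to_encode(num_states) + bits_to_encode(
        num_instruction_addresses
    )
    return num_states * entry_bits


small = state_transition_table_bits(num_states=16, num_instruction_addresses=64)
large = state_transition_table_bits(num_states=32, num_instruction_addresses=64)
print("16 states:", small, "bits")
print("32 states:", large, "bits")
```

Because each entry carries both fields, doubling the number of entries more than doubles the storage in this model, as the next state field itself becomes wider.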