1. Technical Field
The present disclosure relates to a processing apparatus of an electronic product. More particularly, the present disclosure relates to a configurable processing apparatus having a plurality of processing units (each referred to as a PU), and a system thereof.
2. Description of Related Art
Presently, computers and other high-end electronic products all have central processors. The central processor is a processing apparatus used for processing data and executing instructions. With the development of fabrication techniques, processing apparatuses have been miniaturized, and a plurality of processing units can be configured in a single processing apparatus to simultaneously process data and execute instructions, for example, a dual-core or a quad-core central processing unit provided by Intel Corporation.
Referring to FIG. 1A, FIG. 1A is a system block diagram of a conventional processing apparatus 10. The processing apparatus 10 uses a single instruction single data (SISD) structure. As shown in FIG. 1A, the processing apparatus 10 has a processing unit 101, a data buffer 102 and an instruction buffer 103. During each cycle, the processing apparatus 10 fetches one batch of data from a plurality of data stored in the data buffer 102 to the processing unit 101, and fetches one instruction from a plurality of instructions stored in the instruction buffer 103 to the processing unit 101. The processing unit 101 executes the received instruction and processes the received data according to the received instruction.
Referring to FIG. 1B, FIG. 1B is a system block diagram of a conventional processing apparatus 11. The processing apparatus 11 uses a multiple instruction single data (MISD) structure. As shown in FIG. 1B, the processing apparatus 11 has a plurality of processing units 111 and 112, a data buffer 113 and an instruction buffer 114. During each cycle, the processing apparatus 11 fetches one batch of data from a plurality of data stored in the data buffer 113 to the processing units 111 and 112, and fetches multiple instructions from a plurality of instructions stored in the instruction buffer 114 to the processing units 111 and 112. The processing units 111 and 112 respectively execute the received instructions, and process the received data according to the received instructions. The processing apparatus 11 of such an MISD structure can therefore process the same data according to multiple instructions during each cycle.
Referring to FIG. 1C, FIG. 1C is a system block diagram of a conventional processing apparatus 12. The processing apparatus 12 uses a single instruction multiple data (SIMD) structure. As shown in FIG. 1C, the processing apparatus 12 has a plurality of processing units 121, 122 and 123, a data buffer 124 and an instruction buffer 125. During each cycle, the processing apparatus 12 fetches multiple data from a plurality of data stored in the data buffer 124 to the processing units 121-123, and fetches one instruction from a plurality of instructions stored in the instruction buffer 125 to the processing units 121-123. The processing units 121-123 respectively execute the received instruction, and process the received data according to the received instruction.
Referring to FIG. 1D, FIG. 1D is a system block diagram of a conventional processing apparatus 13. The processing apparatus 13 uses a multiple instruction multiple data (MIMD) structure. As shown in FIG. 1D, the processing apparatus 13 has a plurality of processing units 131-134, a data buffer 137 and an instruction buffer 138. During each cycle, the processing apparatus 13 fetches multiple data from a plurality of data stored in the data buffer 137 to the processing units 131-134, and fetches multiple instructions from a plurality of instructions stored in the instruction buffer 138 to the processing units 131-134. The processing units 131-134 respectively execute the received instructions, and process the received data according to the received instructions.
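The four structures above differ only in how many instructions and how many data are delivered to the processing units during one cycle. The following is a minimal Python sketch of that distinction, not a model of any particular apparatus. The names run_cycle and classify, and the representation of an instruction as a function of a single datum, are assumptions made purely for illustration.

```python
from typing import Callable, List

# Assumption: an "instruction" is modeled as a function applied to one datum.
Instruction = Callable[[int], int]


def classify(num_instructions: int, num_data: int) -> str:
    """Name the structure from the per-cycle instruction and data counts."""
    instr = "S" if num_instructions == 1 else "M"
    data = "S" if num_data == 1 else "M"
    return f"{instr}I{data}D"


def run_cycle(instructions: List[Instruction], data: List[int],
              num_units: int) -> List[int]:
    """Simulate one cycle: each processing unit executes one instruction on one datum.

    A single instruction (or a single datum) is broadcast to every unit;
    otherwise unit k receives instructions[k] (or data[k]).
    """
    for name, items in (("instructions", instructions), ("data", data)):
        if len(items) not in (1, num_units):
            raise ValueError(f"{name} must have length 1 or {num_units}")
    results = []
    for unit in range(num_units):
        instr = instructions[0] if len(instructions) == 1 else instructions[unit]
        datum = data[0] if len(data) == 1 else data[unit]
        results.append(instr(datum))
    return results


def double(x: int) -> int:
    return 2 * x


def negate(x: int) -> int:
    return -x


if __name__ == "__main__":
    print(classify(1, 1), run_cycle([double], [3], 1))            # one PU
    print(classify(2, 1), run_cycle([double, negate], [3], 2))    # same data
    print(classify(1, 3), run_cycle([double], [1, 2, 3], 3))      # same instruction
    print(classify(2, 2), run_cycle([double, negate], [1, 2], 2)) # independent
```

In this sketch the MISD case applies two different instructions to the same datum, while the SIMD case applies the same instruction to three different data, mirroring the descriptions of FIG. 1B and FIG. 1C.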
Referring to FIG. 2A, FIG. 2A is a system block diagram of a conventional processing apparatus 14. The processing apparatus 14 uses a very long instruction word (VLIW) structure. The processing apparatus 14 has a plurality of processing units 141, 142 and 143, a data buffer 144, an instruction buffer 145 and a shared resource buffer 146. An instruction word executed by the processing apparatus 14 of the VLIW structure is relatively long and contains a plurality of instructions, and the whole instruction word can be processed during one cycle.
Referring to FIG. 2B, FIG. 2B is a schematic diagram illustrating instructions stored in the instruction buffer 145. The processing apparatus 14 of the VLIW structure fetches instructions from the instruction buffer 145. The instructions stored in the instruction buffer 145 are assembly language codes or other types of machine codes generated by software compiling. During a first cycle, the instructions corresponding to addresses 41x00-41x04 in the instruction buffer 145 are read out as one instruction word, and the processing units 141-143 respectively receive the instructions in the instruction word, i.e. respectively receive the instructions of the addresses 41x00-41x04. Then, the processing units 141-143 respectively process the received instructions. In detail, the processing unit 141 adds the contents of registers r5 and l3 (r5 is a global register in the shared resource buffer 146, and l3 is a local register in the processing unit 141), and stores the sum in the register r5. The processing unit 142 adds the contents of the registers r6 and r5, and stores the sum in the register r6 of the shared resource buffer 146. The processing unit 143 performs a logic AND operation on the contents of the registers r7 and r8, and stores the operation result in the register r7 of the shared resource buffer 146.
Thereafter, during a second cycle, the instructions corresponding to addresses 41x06-41x0A in the instruction buffer 145 are read out as one instruction word, and the processing units 141-143 respectively receive the instructions in the instruction word, i.e. respectively receive the instructions of the addresses 41x06-41x0A. Then, the processing units 141-143 respectively process the received instructions. In detail, the processing unit 141 performs a logic OR operation on the contents of the registers r1 and r2, and stores the operation result in the register r1 of the shared resource buffer 146. The processing unit 142 performs a subtraction operation on the contents of the registers r4 and r5, and stores the operation result in the register r4. The processing unit 143 performs a logic OR operation on the contents of the registers r9 and r7, and stores the operation result in the register r9.
It should be noted that the content of the register r5 is updated during the first cycle, and during the second cycle, the processing unit 142 obtains the updated content of the register r5 through the shared resource buffer 146. Therefore, the shared resource buffer 146 can share the updated content with each of the processing units 141-143.
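The two cycles described above can be traced with a short Python sketch. The tuple encoding of an instruction, the function name execute_word, the initial register values, and the rule that every processing unit reads its operands before any result is written back within a cycle are all assumptions made for illustration; the description above only states which operation each processing unit performs and that the updated r5 is visible in the second cycle.

```python
import operator

# Hypothetical mnemonics for the four operations named in the description.
OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "and": operator.and_,
    "or": operator.or_,
}


def execute_word(word, shared, local_regs):
    """Execute one instruction word, one (op, dest, src) tuple per processing unit.

    Each tuple means dest = dest <op> src. Registers named "l..." are local to
    the issuing unit; all others live in the shared register set. Destinations
    are assumed to be shared registers, as in the described example.
    """
    def read(unit, name):
        if name.startswith("l"):
            return local_regs[unit][name]
        return shared[name]

    # Assumption: all operands are read before any result is written back.
    results = [
        (dest, OPS[op](read(unit, dest), read(unit, src)))
        for unit, (op, dest, src) in enumerate(word)
    ]
    for dest, value in results:
        shared[dest] = value


# Illustrative initial values (not taken from the source).
shared = {"r1": 5, "r2": 3, "r4": 20, "r5": 1, "r6": 10,
          "r7": 12, "r8": 10, "r9": 1}
local_regs = [{"l3": 4}, {}, {}]  # l3 belongs to the first processing unit

first_word = [("add", "r5", "l3"), ("add", "r6", "r5"), ("and", "r7", "r8")]
second_word = [("or", "r1", "r2"), ("sub", "r4", "r5"), ("or", "r9", "r7")]

execute_word(first_word, shared, local_regs)   # first cycle: r5 becomes 1 + 4
execute_word(second_word, shared, local_regs)  # second cycle: r4 uses the new r5

if __name__ == "__main__":
    print(shared)
```

With these illustrative values, the subtraction in the second word computes 20 - 5 rather than 20 - 1, which shows the second processing unit picking up the value of r5 written during the first cycle through the shared registers.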
In an image processing system, applying a processing apparatus that uses the VLIW or the SIMD structure saves considerable processing time. However, for a file processing program, the processing apparatus using the VLIW or the SIMD structure probably cannot achieve the same advantage. Therefore, if the configuration of the processing apparatus can be changed according to different demands, the performance of the processing apparatus can be improved.
A situation in which an unexpected stall occurs in the processing unit is described below. Referring to FIG. 3, FIG. 3 is a schematic diagram illustrating a situation in which the processing unit is stalled due to a data hazard. In this example, the processing unit has a pipeline structure whose five pipeline stages are, in sequence, instruction fetch, instruction decode, instruction execution, data access and write back. The shared resource registers r0-r15 have a hazard detecting circuit for detecting the occurrence of a hazard and controlling the stall of the appropriate pipeline stage. Moreover, the processing unit has a forwarding circuit for forwarding data to the preceding pipeline stages, so that updated data can be used by other instructions before it is written to the register.
During a time cycle t1, an instruction Ld r5, @x3 is fetched. During a time cycle t2, an instruction Sub r6, r5 is fetched, and meanwhile the instruction Ld r5, @x3 is decoded. During a time cycle t3, the instruction Ld r5, @x3 is in the instruction execution pipeline stage, and meanwhile an instruction And r7, r5 is fetched, and the instruction Sub r6, r5 is decoded. During a time cycle t4, data of the address @x3 is read into the processing unit, and the data is written into the register r5 during a time cycle t5. In this example, a programmer or a compiler expects the content of the register r5 used by the instruction Sub r6, r5 to be the new data read from the address @x3 by the instruction Ld r5, @x3, so the hazard detecting circuit detects the occurrence of a hazard. Therefore, during the time cycle t4, the instruction Sub r6, r5 is stalled in the instruction execution pipeline stage until the data of the memory address @x3 is available in the processing unit in the time cycle t5. During the time cycle t5, the data of the memory address @x3 is directly forwarded through the forwarding circuit to the instruction Sub r6, r5 in the instruction execution pipeline stage before being written into the register r5, so that the instruction can be executed immediately. Moreover, during the time cycle t4 in which the execution stall occurs, the instructions And r8, r5 and Or r9, r5, which are respectively in the instruction decode and the instruction fetch pipeline stages, have to be stalled at the same time.
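The stall described above can be reproduced with a small timing model in Python. This is a sketch under stated assumptions rather than a description of the hazard detecting circuit: the function name schedule, the tuple encoding of an instruction, and the rule that a loaded value becomes available for forwarding in the cycle after the load's data access stage are all illustrative choices. The instruction list follows the instructions named in the stall description (Ld, Sub, And r8, Or).

```python
STAGES = ("IF", "ID", "EX", "MEM", "WB")  # fetch, decode, execute, data access, write back


def schedule(program):
    """Return, per instruction, the (first, last) cycle spent in each stage.

    program is a list of (text, dest, sources, is_load) tuples. Only the
    load-use hazard against the immediately preceding instruction is modeled;
    all other results are assumed to be forwarded without a stall.
    """
    timeline = []
    for i, (_text, _dest, sources, _is_load) in enumerate(program):
        prev = timeline[i - 1] if i else None
        enter = prev[0][1] + 1 if prev else 1  # IF is free once the predecessor leaves it
        spans = []
        for s, stage in enumerate(STAGES):
            ready = enter  # earliest cycle in which this stage's work can complete
            if stage == "EX" and prev is not None:
                _, p_dest, _, p_is_load = program[i - 1]
                if p_is_load and p_dest in sources:
                    # Assumed rule: the loaded value is forwarded in the cycle
                    # after the load finishes its data access stage.
                    ready = max(ready, prev[3][1] + 1)
            if s + 1 < len(STAGES):
                next_free = prev[s + 1][1] + 1 if prev else 0
                next_enter = max(ready + 1, next_free)  # wait while the next stage is busy
                spans.append((enter, next_enter - 1))
                enter = next_enter
            else:
                spans.append((enter, ready))
        timeline.append(spans)
    return timeline


program = [
    ("Ld r5, @x3", "r5", (), True),
    ("Sub r6, r5", "r6", ("r6", "r5"), False),
    ("And r8, r5", "r8", ("r8", "r5"), False),
    ("Or r9, r5", "r9", ("r9", "r5"), False),
]
timeline = schedule(program)

if __name__ == "__main__":
    for (text, *_), spans in zip(program, timeline):
        row = ", ".join(f"{st} t{a}-t{b}" for st, (a, b) in zip(STAGES, spans))
        print(f"{text:12s} {row}")
```

In this model the Sub instruction occupies the execution stage during t4 and t5, while the And and Or instructions are held in the decode and fetch stages for the same two cycles, which matches the stall pattern described for FIG. 3.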
Next, referring to FIG. 4A and FIG. 4B, FIG. 4A and FIG. 4B are schematic diagrams respectively illustrating an instruction fetch stall and a data access stall occurring in the processing unit. In the example of FIG. 4A, the instruction data is not ready when the processing unit fetches the instruction during the time cycle t1, so that during the time cycles t1 and t2, the instruction fetch pipeline stage of the processing unit is stalled to wait for the instruction data to enter the processing unit. During the time cycle t3, the instruction data successfully enters the processing unit. Then the processing unit can perform the instruction decode during the time cycle t4, and a next instruction can enter the instruction fetch pipeline stage of the processing unit. In the example of FIG. 4B, the data is not ready when the processing unit reads the data of the memory address @x3 during the time cycles t4 and t5, so that during the time cycles t4 and t5, the data access pipeline stage of the processing unit is stalled, and the preceding pipeline stages are also stalled. For example, the instruction Sub r6, r4 is stalled at the instruction execution pipeline stage until the data of the address @x3 is successfully read into the processing unit during the time cycle t6, and then execution of the pipeline stages resumes.
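Both stalls above follow the same pattern: one stage holds an instruction for extra cycles, and every instruction behind it waits. The Python sketch below models this with a per-stage wait table. The function name schedule and the extra_wait mapping are hypothetical names chosen for illustration, and the model ignores data hazards so that only the fetch and data access delays are visible.

```python
STAGES = ("IF", "ID", "EX", "MEM", "WB")  # fetch, decode, execute, data access, write back


def schedule(num_instructions, extra_wait):
    """Return, per instruction, a dict mapping stage -> (first, last) cycle.

    extra_wait maps (instruction index, stage name) to the number of extra
    cycles that stage must wait, e.g. for instruction data or memory data
    that is not ready yet.
    """
    timeline = []
    for i in range(num_instructions):
        prev = timeline[-1] if timeline else None
        enter = prev["IF"][1] + 1 if prev else 1
        spans = {}
        for s, stage in enumerate(STAGES):
            done = enter + extra_wait.get((i, stage), 0)  # cycle in which the stage finishes
            if s + 1 < len(STAGES):
                nxt = STAGES[s + 1]
                next_free = prev[nxt][1] + 1 if prev else 1
                # An instruction stays in its stage while the next stage is occupied.
                next_enter = max(done + 1, next_free)
            else:
                next_enter = done + 1
            spans[stage] = (enter, next_enter - 1)
            enter = next_enter
        timeline.append(spans)
    return timeline


# Instruction fetch stall: the first instruction waits two extra cycles in IF.
fetch_stall = schedule(2, {(0, "IF"): 2})

# Data access stall: the first instruction waits two extra cycles in MEM.
access_stall = schedule(3, {(0, "MEM"): 2})

if __name__ == "__main__":
    for name, result in (("fetch stall", fetch_stall), ("data access stall", access_stall)):
        print(name)
        for i, spans in enumerate(result):
            print(f"  instruction {i}:", spans)
```

In the fetch stall case the first instruction stays in the fetch stage from t1 to t3 and is decoded in t4, when the next instruction enters the fetch stage. In the data access stall case the first instruction stays in the data access stage from t4 to t6, and the following instruction is held in the execution stage over the same cycles.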