1. Field of the Invention
Apparatuses and methods consistent with the present invention relate to a loop accelerator and a data processing system having the same, and more particularly, to a loop accelerator capable of simplifying a connection structure between a configuration memory and processing elements (PEs) so that the structure can be easily modified and cost can be saved, and to a data processing system having the same.
2. Description of the Related Art
In general, a program includes a part that must be repeatedly executed with a predetermined routine. When a data processing system executes the program, an additional loop accelerator separately executes the predetermined routine in order to rapidly process the program.
FIG. 1 is a view illustrating a configuration of a conventional data processing system.
The conventional data processing system includes a processor core 1, a central register file 2, and a loop accelerator 3.
The processor core 1 processes the parts of a program other than the repeatedly executed loop part, and the loop accelerator 3 processes the loop part. The processor core 1 and the loop accelerator 3 share the central register file 2, and the central register file 2 serves to transmit data between the processor core 1 and the loop accelerator 3.
The loop accelerator 3 includes an array part 5 and a configuration memory 4.
A plurality of PEs 6 are arrayed in the array part 5 so as to form a matrix. Each of the PEs 6 performs word-level operations and includes a functional unit (FU) for processing data and a distributed register file (RF) for storing operation results.
The configuration memory 4 stores configuration bits provided to the PEs 6 of the array part 5.
In the conventional data processing system, the configuration memory 4 is connected to the PEs 6 by wires so as to transmit the configuration bits from the configuration memory 4 to the PEs 6. Thus, each wire must span the distance between the configuration memory 4 and the corresponding PE 6, and the longest wire must reach the PE 6 located farthest from the configuration memory 4.
Due to the length of the wires, designing the layout of the wires is complicated, and cost increases. In addition, the cycle of the clock signal must be set based on the longest one of the wires. The cycle of the clock signal is therefore long, which lowers the speed at which the configuration bits are transmitted.
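The dependence of the clock cycle on the longest wire can be illustrated with a minimal sketch. The grid placement, the Manhattan wire-length measure, the linear delay model, and the names `wire_length` and `min_clock_period` are all assumptions made for illustration only; none of them is taken from the conventional system described above.

```python
# Hypothetical model: PEs sit on a rows x cols grid, and the configuration
# memory sits just outside the grid, next to PE (0, 0). Wire length is
# measured as Manhattan distance in grid units (an assumption).

def wire_length(row, col):
    """Length, in grid units, of the wire from the configuration memory to PE (row, col)."""
    return row + (col + 1)

def min_clock_period(rows, cols, delay_per_unit=1.0):
    """Shortest usable clock period: it must cover the delay of the longest wire."""
    longest = max(wire_length(r, c) for r in range(rows) for c in range(cols))
    return longest * delay_per_unit

if __name__ == "__main__":
    # The bound is set by the farthest PE, so it grows with the array size.
    for n in (2, 4, 8):
        print(f"{n}x{n} array: clock period >= {min_clock_period(n, n)} delay units")
```

Under these assumptions, every PE is clocked at the rate dictated by the farthest PE, even though most wires are shorter.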
To solve these problems, an eXtreme processing platform (XPP) processor 10 has been proposed, in which configuration bits are transmitted from a configuration memory to the PEs 6 through a structure designed in a tree form, as shown in FIG. 2.
The XPP processor 10 is based on a hierarchical coarse-grained array (CGA) and includes one or more processing array clusters (PACs) 20. Each of the PACs 20 includes a plurality of processing array elements (PAEs) 50 each performing an operation on each word, and the PAEs 50 are arrayed in a matrix form so as to form rectangular blocks.
The XPP processor 10 includes a supervising configuration manager (SCM) 5 and configuration managers (CMs) 40 to transmit the configuration bits from the configuration memory to the PACs 20. The SCM 5 receives the configuration bits from the configuration memory through an external interface, and the CMs 40 connect the PACs 20 to the SCM 5 to transmit the configuration bits from the configuration memory to the PACs 20.
The CMs 40 include random access memories (RAMs) 41 storing the configuration bits received from the SCM 5 and sub-configuration managers (SMs) 43 of the CMs 40 providing the configuration bits to the PAEs 50 of the PACs 20.
A plurality of horizontal bus lines 31 are arrayed in a lattice form in each of the PACs 20 to transmit the configuration bits to the PAEs 50 arrayed in the matrix, and vertical bus lines 35 cross the intersecting points of the lattice formed by the horizontal bus lines 31. Switches 33 are installed between the intersecting points of those horizontal bus lines 31 that are arrayed in one direction. The configuration bits are transmitted from the CMs 40 to the PAEs 50 through a configuration bus 37.
The PAEs 50 include arithmetic logic unit (ALU) objects 51, forward register (FREG) objects 53, and backward register (BREG) objects 55, to which vertical data and event bus lines 57 are connected. The ALU objects 51 include ALUs 51b and configuration registers 51a temporarily storing the configuration bits transmitted to input ports, output ports, and the ALUs 51b.
In the XPP processor 10, the configuration bits must pass through the SCM 5, the CMs 40, the SMs 43, and the configuration registers 51a to be transmitted from the configuration memory to the ALUs 51b of the PAEs 50. Thus, the XPP processor 10 has a complicated structure. Also, the SCM 5, the CMs 40, the SMs 43, and the configuration registers 51a each require storage space for the configuration bits. As a result, hardware overhead occurs. In addition, the structure of the XPP processor 10 is hierarchical. Thus, in a case where the number or the structure of the PAEs 50 is changed, the tree structure including the SCM 5, the CMs 40, the SMs 43, and the configuration registers 51a must be modified. As a result, modifying the design of the tree structure is complicated, and the extensibility of the XPP processor 10 is reduced.
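The storage overhead of the hierarchical path can be illustrated with a minimal sketch. It assumes that each intermediate stage named above buffers one full copy of the configuration bits. The list `XPP_PATH`, the helper names, and the 32-bit width used in the example are hypothetical and are not taken from the XPP processor itself.

```python
# Stages the configuration bits traverse, in the order given in the text.
# The first entry is the source and the last entry is the destination.
XPP_PATH = [
    "configuration memory",
    "SCM",
    "CM",
    "SM",
    "configuration register",
    "ALU",
]

def intermediate_stages(path):
    """Stages between the source and the destination; each needs its own storage."""
    return path[1:-1]

def hops(path):
    """Number of transfers needed to move the bits from source to destination."""
    return len(path) - 1

def buffered_bits(path, bits):
    """Extra storage, in bits, if every intermediate stage holds one copy (an assumption)."""
    return len(intermediate_stages(path)) * bits

if __name__ == "__main__":
    width = 32  # hypothetical configuration width per ALU, for illustration only
    print("intermediate stages:", intermediate_stages(XPP_PATH))
    print("hops:", hops(XPP_PATH))
    print("extra storage (bits):", buffered_bits(XPP_PATH, width))
```

In this simplified model, the storage overhead and the number of transfers both scale with the depth of the hierarchy, which is the cost the paragraph above attributes to the tree structure.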