1. Field of the Invention
Methods and apparatuses consistent with the present invention relate to high-performance memory systems, and more particularly, to efficiently calculating a memory bank and an offset of an array variable included in a loop.
2. Description of the Related Art
Technologies using memory parallelism have been actively proposed as a way of realizing a high-performance memory system. Memory parallelism is generally achieved by increasing the number of data items that can be accessed simultaneously, using memory interleaving in a system having multiple memory banks. Memory interleaving is a technology for improving access performance by distributing data across the multiple memory banks so that the banks can be accessed in parallel. The number of memory banks used in interleaving is designated as an interleaving factor (IF).
FIGS. 1A and 1B illustrate loop unrolling and memory interleaving. Referring to FIG. 1A, a memory includes four banks. Accordingly, data stored in a memory bank 0, a memory bank 1, a memory bank 2, and a memory bank 3 may be simultaneously accessed by a processor. Therefore, in comparison to a case where memory interleaving is not used, the memory access speed may improve by as much as a factor of four. Since the number of memory banks used in the memory interleaving in FIGS. 1A and 1B is four, the interleaving factor (IF) is four.
Also, as shown in FIGS. 1A and 1B, in the case of an array used in a loop, the effect of memory parallelism may be improved by performing loop unrolling. Loop unrolling is a method of reducing the number of iterations of a loop by copying the body of the loop several times so that the copied bodies may be performed at the same time. In FIG. 1A, the loop is repeated 32 times. In FIG. 1B, however, the original code 110 shown in FIG. 1A is converted into new code 120 by loop unrolling. With the new code 120, the total calculation is finished when the loop is repeated eight times. Due to the memory interleaving, one iteration of the unrolled loop, including its four array accesses, may be performed at a time, so the new code 120 may be performed more quickly than the original code 110. An array element calculation in the loop included in the original code 110 is reproduced as four array element calculations in the new code 120. As described above, the number of times an array element calculation included in the loop is reproduced by loop unrolling is designated as an unrolling factor (UF). In FIG. 1B, since the operation “A[i]=”, which is included in the loop, is unrolled into the four operations “A[i+0]=”, “A[i+1]=”, “A[i+2]=”, and “A[i+3]=”, the unrolling factor is four.
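The loop unrolling described above may be sketched as follows. This is a minimal illustration only and does not reproduce the code of FIGS. 1A and 1B: the function names, the constants N and UF, and in particular the loop body (the source shows only “A[i]=” without a right-hand side) are assumptions chosen for the example.

```python
N = 32   # iterations of the original loop, as stated for FIG. 1A
UF = 4   # unrolling factor, as stated for FIG. 1B


def body(i):
    # Hypothetical right-hand side; the source does not specify it.
    return 2 * i


def original_loop():
    # One array element calculation per iteration, N iterations in total.
    a = [0] * N
    for i in range(N):
        a[i] = body(i)
    return a


def unrolled_loop():
    # The body is copied UF times, so the loop runs N / UF times.
    a = [0] * N
    iterations = 0
    for i in range(0, N, UF):
        a[i + 0] = body(i + 0)
        a[i + 1] = body(i + 1)
        a[i + 2] = body(i + 2)
        a[i + 3] = body(i + 3)
        iterations += 1
    return a, iterations
```

Under these assumptions, unrolled_loop() produces the same array as original_loop() while iterating eight times instead of 32.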
In FIGS. 1A and 1B, to locate the value A[i], the memory bank storing A[i] and the offset within that memory bank have to be calculated. Generally, the memory bank is obtained by a modulo operation on the array index by the number of memory banks. For example, to determine the memory bank in which A[7] is located, the modulo of 7 (the index of A[7]) by the number of memory banks, which is 4 in the given example, is calculated. The result is 3, so A[7] is located in the memory bank 3. The position of A[7] within the memory bank 3 also has to be determined. Generally, the offset in the memory bank is obtained by dividing the array index by the number of memory banks and taking the quotient. For example, when 7 (the index of A[7]) is divided by 4 (the number of memory banks in FIGS. 1A and 1B), the quotient is 1, so the offset of A[7] in the memory bank 3 is 1.
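The conventional calculation described above may be sketched as follows. The function name bank_and_offset and the constant NUM_BANKS are hypothetical names introduced for illustration; the sketch assumes four memory banks, as in FIGS. 1A and 1B, and elements distributed to the banks in index order.

```python
NUM_BANKS = 4  # interleaving factor (IF) assumed from FIGS. 1A and 1B


def bank_and_offset(index, num_banks=NUM_BANKS):
    # Memory bank: remainder of the array index by the number of banks.
    bank = index % num_banks
    # Offset within the bank: quotient of the same division.
    offset = index // num_banks
    return bank, offset


# A[7] is located in memory bank 3 at offset 1.
print(bank_and_offset(7))

# The four elements touched by one unrolled iteration (here i = 4)
# fall in four different banks, so they may be accessed in parallel.
print([bank_and_offset(4 + k)[0] for k in range(4)])
```

Note that both a modulo operation and a division are needed for every array access, which is the per-access overhead at issue here.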
As described above, an array used in a loop incurs the overhead of calculating the memory bank to be accessed and the corresponding offset on every access. In a conventional architecture, the memory address calculation is performed in real time by software running on a processor or by special-purpose hardware, for example, an address calculation unit. However, when software is used, execution is slow, and when special-purpose hardware is used, the hardware cost is high. Also, the modulo and division operations used for calculating the memory address in memory interleaving are costly regardless of whether special-purpose hardware or software is used.
Particularly, in the case of previously proposed reconfigurable architectures, since the memory address calculation is directly mapped to reconfigurable hardware, the hardware cost of the memory address calculation is very high. A reconfigurable architecture can be customized to solve any problem after device fabrication, or can exploit a large degree of spatially customized calculation in order to perform its calculation. A field programmable gate array (FPGA), which includes lines connecting a plurality of arithmetic logic units (ALUs), may embody the reconfigurable architecture. For example, if the FPGA is customized to calculate the operation “A*x*x+B*x+C”, the operation may be repeated very quickly. Accordingly, the reconfigurable architecture is well suited to processing a loop operation. Also, the configuration of the lines connecting the ALUs may be changed by applying a certain current. As described above, an architecture that can perform a new operation by changing its hardware configuration after fabrication is designated as a reconfigurable architecture. A reconfigurable architecture in which data is input to one array element one bit at a time is designated as a fine grained array (FGA). A reconfigurable architecture in which data is input to one array element one word at a time is designated as a coarse grained array (CGA).
Accordingly, when processing an array in a loop, a method for efficiently calculating the location and/or position in a memory at which the array is stored is required. The term position is used throughout this specification and also encompasses a location in a memory.