FIELD OF THE INVENTION
The present invention relates to a compilation system for compiling memory access data at high speed in order to enhance the memory access optimization performed by a compiler having an optimization function. More particularly, the present invention relates to a compile process apparatus suitable for increasing the execution performance of a compiled program and for decreasing the number of memory accesses. The apparatus resolves conflicts between data in a cache memory through analysis performed when compiling the data.
Recently developed RISC processors are mainly superscalar processors, capable of executing more than one instruction per clock cycle. The size of the cache memory installed within a processor tends to increase year by year, so that a high-performance compiler is required to extract optimum performance from a RISC processor having such a cache memory. Among various optimization methods, those that take the cache memory into account have recently attracted attention.
(Reference: The Cache Performance and Optimizations of Blocked Algorithms, Monica S. Lam, 1991, ASPLOS-IV Proceedings).
Before explaining the conventional technology, the terminology related to the present invention is first explained.
1) Vector load/store
A general vector load/store operation loads or stores, as a group, data of a predetermined vector length. For a scalar calculation, the data loaded or stored as a group is at most 16 bytes, in an architecture capable of performing a quadruple-precision memory access.
2) Direct-mapped cache memory
A direct-mapped cache memory is a cache memory in which the correspondence between a main memory address and a cache memory address is uniquely determined. The remainder obtained by dividing an address of the main memory by the cache size corresponds to the address in the cache memory. Data whose addresses differ by an integer multiple of the cache size cannot reside in the cache memory simultaneously and conflict with each other, resulting in a decrease in processor performance.
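By way of illustration only, the mapping rule described above may be sketched as follows. The 256 KB cache size and the function names `cache_address` and `conflicts` are assumptions chosen for the example and do not appear in the original text.

```python
CACHE_SIZE = 256 * 1024  # bytes; assumed example value


def cache_address(main_address: int, cache_size: int = CACHE_SIZE) -> int:
    """Cache address = remainder of the main memory address divided by the cache size."""
    return main_address % cache_size


def conflicts(addr_a: int, addr_b: int, cache_size: int = CACHE_SIZE) -> bool:
    """True when two distinct main memory addresses map to the same cache address."""
    return addr_a != addr_b and (
        cache_address(addr_a, cache_size) == cache_address(addr_b, cache_size)
    )
```

Under these assumptions, `conflicts(0x1000, 0x1000 + 3 * CACHE_SIZE)` is true because the address difference is an integer multiple of the cache size, whereas `conflicts(0x1000, 0x1008)` is false.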
3) N-way set associative cache memory
This cache memory is an extension of the direct-mapped cache memory. Simply put, it can be regarded as N sets of direct-mapped cache memory.
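The following toy model, given purely as a sketch, illustrates why two addresses that conflict in a direct-mapped cache memory can coexist when N is 2. The class name, the 64-byte line, and the least-recently-used replacement policy are assumptions made for the example; the original text does not specify a replacement policy.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy model of N direct-mapped banks sharing one index (LRU eviction is assumed)."""

    def __init__(self, cache_size: int, line_size: int, ways: int):
        self.line_size = line_size
        self.ways = ways
        self.num_indices = cache_size // (line_size * ways)
        # One ordered tag store per index; insertion order tracks recency of use.
        self.entries = [OrderedDict() for _ in range(self.num_indices)]

    def access(self, address: int) -> bool:
        """Return True on a hit; on a miss, load the line and return False."""
        line = address // self.line_size
        index = line % self.num_indices
        tag = line // self.num_indices
        slot = self.entries[index]
        if tag in slot:
            slot.move_to_end(tag)  # mark as most recently used
            return True
        if len(slot) >= self.ways:
            slot.popitem(last=False)  # evict the least recently used line
        slot[tag] = None
        return False


def count_misses(cache: SetAssociativeCache, addresses) -> int:
    """Number of accesses in the sequence that miss in the given cache."""
    return sum(0 if cache.access(a) else 1 for a in addresses)
```

For the access sequence 0, 256 KB, 0, 256 KB, a 256 KB direct-mapped model (`ways=1`) misses on every access, while a 2-way model of the same total size misses only on the first two.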
4) 64-byte cache line length
This refers to a method of performing data transfer between a main memory and a cache memory in units of 64 bytes. When a first unit of data is stored in the cache memory, the adjacent data in the same line is stored in the cache memory together with it. Thus the number of memory accesses decreases as the line length increases, thereby speeding up the processor operation. However, in the case of a direct-mapped cache memory whose size is 256 KB, data with an address difference of 256 KB+32 B may map to the same line, in which case they cannot reside in the cache memory simultaneously, thereby causing a decrease in performance. Thus, it is not recommended to simply increase the line length. When the line length increases, the number of lines decreases, and it becomes more likely that data whose addresses differ by approximately the cache size map to the same line and cannot be stored in the cache memory together. The performance of the processor operation thereby decreases.
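As a sketch of the 256 KB+32 B case, the following computes the cache line to which an address maps. Whether two such addresses fall on the same cache line depends on where the first address lies within its 64-byte line. The function names and the 32-byte comparison case are assumptions introduced for illustration.

```python
CACHE_SIZE = 256 * 1024  # bytes, direct-mapped


def line_index(address: int, line_size: int = 64) -> int:
    """Index of the cache line that holds the given main memory address."""
    num_lines = CACHE_SIZE // line_size
    return (address // line_size) % num_lines


def same_line_conflict(addr_a: int, addr_b: int, line_size: int = 64) -> bool:
    """True when two different memory lines compete for the same cache line."""
    different_memory_lines = addr_a // line_size != addr_b // line_size
    return different_memory_lines and (
        line_index(addr_a, line_size) == line_index(addr_b, line_size)
    )


DIFF = CACHE_SIZE + 32  # the 256 KB + 32 B address difference from the text
```

With 64-byte lines, `same_line_conflict(0, 0 + DIFF)` is true, while `same_line_conflict(32, 32 + DIFF)` is false because the second address spills into the next line. With a hypothetical 32-byte line, `same_line_conflict(0, DIFF, line_size=32)` is false, which is consistent with the observation that longer lines make such conflicts more likely.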
5) The number of entries in the cache memory.
The number of entries refers to how many lines are included within one set of the cache memory. In the case of a direct-mapped cache memory of 256 KB with a line length of 64 bytes, the number of entries in the cache memory is equal to 4096.
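The arithmetic above can be checked with a one-line helper. This is a sketch only; the function name and the `ways` parameter (the N of an N-way cache memory, with each "set" read as one direct-mapped bank) are assumptions.

```python
def number_of_entries(cache_size: int, line_size: int, ways: int = 1) -> int:
    """Lines per set: total size divided by the line length and by the number of sets."""
    return cache_size // (line_size * ways)


# Direct-mapped, 256 KB, 64-byte lines, as in the text.
entries = number_of_entries(256 * 1024, 64)
```

Here `entries` evaluates to 4096; under the same assumptions a 2-way cache memory of equal total size would have 2048 entries per set.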
6) 8-byte/16-byte alignment
When the area assigned to memory access data begins at an actual address whose remainder upon division by 8 or 16 is 0, the memory access data is said to have 8-byte or 16-byte alignment, respectively. When a RISC processor accesses an area for which the remainder obtained by dividing the actual address of the area by 8 or 16 is not equal to 0, an alignment error may occur.
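The alignment test described above may be sketched as follows. The helper `align_up`, which rounds an address up to the next aligned boundary, is a hypothetical addition for illustration and is not part of the original description.

```python
def is_aligned(address: int, alignment: int) -> bool:
    """True when the remainder of address divided by alignment (8 or 16) is 0."""
    return address % alignment == 0


def align_up(address: int, alignment: int) -> int:
    """Hypothetical helper: smallest aligned address that is >= address."""
    return (address + alignment - 1) // alignment * alignment
```

For example, address 0x1008 has 8-byte alignment but not 16-byte alignment, so a 16-byte access at that address could raise an alignment error on a processor that enforces it.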
7) Source information
When a compiler converts a program to an intermediate language, the program information retained at the same time is called the source information. The most general type of source information is a subscript of a matrix element. Because a memory access is usually expressed by using a base register, an offset and an index register, the subscript information of the matrix element no longer has any meaning once a hardware instruction is processed.
According to recent optimization technologies for a compiler, such as the blocking technology, the execution of a program is sped up by keeping data in a cache memory. However, the conventional optimization technology has the following drawbacks.
1) The cases to which these optimization technologies can be applied are rare, and thus these optimization technologies are not always effectively used in actual programs.
2) When these optimization technologies are not applied, cache misses occur frequently, thereby decreasing the performance.
3) According to another conventional technology, a gap is inserted between the actual addresses of the memory access data, to prevent conflicts on the same line of the cache memory upon accessing the memory. However, as exemplified by a COMMON block in FORTRAN, it is sometimes impossible to insert a gap in a continuous area, due to the language specification. In this case, a conflict occurs in the cache memory, thereby decreasing the performance.
A method of instruction scheduling that detects the data subject to a cache miss and hides the cache miss is disclosed in the Japanese patent disclosure Hei 3-28273 publication (An Instruction Arrangement Optimization). This publication does not disclose a method of eliminating the cache miss itself.
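The effect of inserting a gap, as in drawback 3) above, can be sketched with a simple count. The array sizes, the 64-byte gap, and all names below are assumptions chosen for the example; the original text does not give concrete values.

```python
CACHE_SIZE = 256 * 1024  # bytes, direct-mapped (assumed)
LINE_SIZE = 64           # bytes (assumed)


def line_index(address: int) -> int:
    """Cache line to which a main memory address maps."""
    return (address // LINE_SIZE) % (CACHE_SIZE // LINE_SIZE)


def count_conflicting_pairs(base_a: int, base_b: int, n_elems: int, elem_size: int = 8) -> int:
    """Count i for which A[i] and B[i] lie in different memory lines but the same cache line."""
    count = 0
    for i in range(n_elems):
        a = base_a + i * elem_size
        b = base_b + i * elem_size
        if a // LINE_SIZE != b // LINE_SIZE and line_index(a) == line_index(b):
            count += 1
    return count


N = CACHE_SIZE // 8  # two arrays of 8-byte elements, each exactly one cache size long
without_gap = count_conflicting_pairs(0, CACHE_SIZE, N)
with_gap = count_conflicting_pairs(0, CACHE_SIZE + LINE_SIZE, N)
```

Under these assumptions every pair A[i], B[i] conflicts when the arrays are laid out back to back, and none conflicts once a one-line gap separates them. When the arrays belong to a single COMMON block, such a gap cannot be inserted, and the conflicting layout remains.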
Methods of reducing the number of memory accesses for a continuous area have the following problems.
4) The process of identifying a continuous area and accessing, in a single operation using a register pair, an area that would otherwise require two memory accesses is not performed by existing compile process apparatuses.
5) In order to realize a reduction in the number of memory accesses for a continuous area, the loop of the program should be expanded (unrolled) several times to increase the number of continuous accesses or to allow the continuous area to be detected. Such optimization is not used due to the complexity encountered upon expanding an instruction loop or the complexity of the interface.
6) When the double precision type data frequently used in FORTRAN is stored in a register, a 32 bit register is required. In this case, a 64 bit register and quadruple-precision memory access instructions are required to access the double precision type data at one time. Most current RISC processors are designed for 32 bit operation, and at the present stage of transition from 32 bit to 64 bit RISC processors, such an implementation has not yet been realized.
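The reduction in the number of memory accesses referred to in problems 4) and 5) can be modeled, as a sketch only, by counting load instructions for a loop expanded twice. The model assumes that two adjacent 8-byte elements can be fetched by one paired load only when the first lies on a 16-byte boundary; this condition and the function name are assumptions, not a description of any particular processor or of the claimed apparatus.

```python
def count_loads(base: int, n: int, elem_size: int = 8, pair_size: int = 16) -> int:
    """Loads needed to read n contiguous elements when aligned neighbours are paired."""
    loads = 0
    i = 0
    while i < n:
        address = base + i * elem_size
        if i + 1 < n and address % pair_size == 0:
            i += 2  # one paired load covers elements i and i + 1
        else:
            i += 1  # fall back to a single-element load
        loads += 1
    return loads
```

Under this model, eight elements starting at a 16-byte aligned address need four loads instead of eight, while a start address offset by 8 bytes needs five, which suggests why both continuity and alignment must be established at compile time.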