The present invention relates in general to a compiler that targets a processor including a cache memory, and more particularly to a memory access optimizing method that optimizes the load method used to refer to elements of an array.
As the processing speed of microprocessors has increased, the latency of an access to main storage has become relatively longer, and its influence on the execution performance of a program has grown. Many processors are therefore provided with a cache memory, which has a shorter access time than main storage and a relatively small capacity, in order to reduce the number of accesses to main storage with its long latency. In other words, a memory access instruction accesses the cache memory with a short latency on a cache hit, and accesses main storage only on a cache miss.
One method of hiding the latency of a memory access on a cache miss is prefetching (i.e., software prefetching). A prefetch instruction loads data from main storage into the cache memory without blocking. The compiler issues the prefetch instruction in advance, using a scheduling technique such as software pipelining, so that other operations are executed until the prefetch has completed. Thereafter, the data in the cache memory is accessed by a load instruction. By adopting this method, it is possible to hide the latency resulting from a memory access. Such a prefetch method is described, for example, in an article by Todd C. Mowry et al.: “Design and Evaluation of a Compiler Algorithm for Prefetching”, Architectural Support for Programming Languages and Operating Systems, pp. 62 to 73, 1992.
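The scheduling idea in the preceding paragraph can be sketched as follows. This is a minimal illustration, not the method of the cited article: the function names are hypothetical, and the rule that the prefetch is issued ceil(latency / cycles per iteration) iterations ahead is a commonly used heuristic assumed here, not something stated in the text above.

```python
import math

def prefetch_distance(memory_latency_cycles, cycles_per_iteration):
    """Number of iterations ahead a prefetch is issued (assumed heuristic)."""
    return math.ceil(memory_latency_cycles / cycles_per_iteration)

def schedule(n_iterations, distance):
    """Return the operation sequence of a software-pipelined loop.

    A prologue prefetches the elements used by the first `distance`
    iterations. In the steady state, iteration i prefetches element
    i + distance (if it exists) and then loads element i.
    """
    ops = [("prefetch", i) for i in range(min(distance, n_iterations))]
    for i in range(n_iterations):
        if i + distance < n_iterations:
            ops.append(("prefetch", i + distance))
        ops.append(("load", i))
    return ops
```

In this sketch every load of element i is preceded by a prefetch of the same element issued `distance` iterations earlier, which is the period during which other operations can execute.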
Another mechanism for hiding the latency of a memory access is a method called the preload. This method is described in an article by Sawamoto et al.: “Data Bus Technology of RISC-based Massively Parallel Supercomputer”, Journal of IPSJ (Information Processing Society of Japan), Vol. 38, No. 6, pp. 485 to 492, 1997. A preload instruction writes data from main storage directly to a register, bypassing the cache memory. Using a scheduling technique such as software pipelining, the compiler separates the preload instruction from the arithmetic instruction that uses the data by at least the memory latency, thereby hiding the memory latency.
The preload has the following advantages over the prefetch. Since the data can be loaded from main storage into the register by a single preload instruction, no instruction needs to be added as with the prefetch, and hence the number of issued instructions does not increase. In addition, since no data is written to the cache memory, the memory throughput is excellent. Also, while prefetched data may in some cases be evicted from the cache memory before it is used, the preload writes the data directly to the register and is therefore free from this risk.
On the other hand, the prefetch has the following advantages. Since the prefetch instruction does not occupy a register, it does not increase the register pressure. In addition, a prefetch instruction writes the data for one cache line to the cache memory collectively in response to one memory request, so that the fetched data can be used effectively when continuous data is accessed.
The architecture described in the above-mentioned article by Sawamoto et al. includes both the prefetch mechanism and the preload mechanism. The preload can be applied to floating-point data, while the prefetch can be applied to both fixed-point data and floating-point data. The article describes generating code in which fixed-point data is loaded into the cache memory by the prefetch, while floating-point data is loaded directly into the register by the preload. However, the article does not describe at all using the two methods, i.e., the prefetch and the preload, selectively for floating-point data within one loop.
With respect to prefetching, a method has been studied in which each memory access is analyzed to determine whether a prefetch is necessary, and any redundant prefetch instructions are deleted. This method is described in the above-mentioned article by Mowry et al. and is based on the reuse analysis of a loop nest. Reuse is said to exist when data in the same cache line is referred to two or more times. Reuse is usually classified into self-reuse, in which one reference accesses the same cache line in different loop iterations, and group reuse, which occurs among a plurality of references. In the reuse analysis, the subscript expression of an array is expressed as a linear expression of the loop control variables and is analyzed in that form. The method of the reuse analysis is described in detail in an article by M. E. Wolf and M. S. Lam: “A Data Locality Optimizing Algorithm”, Programming Language Design and Implementation, pp. 30 to 44, 1991. In the deletion of redundant prefetch instructions, attention is paid to group reuse. Of a plurality of references having group reuse, the reference that first refers to new data is taken to be the leading reference. The prefetch is applied to the leading reference, and the other references use the data that has been written to the cache memory by that prefetch, so the prefetch is omitted for them. In this way, the prefetch instruction is issued only for the data that requires it.
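The selection of the leading reference can be illustrated with a minimal sketch. It rests on simplifying assumptions that are not part of the cited articles: every reference has the form array[i + offset] with the loop variable i increasing by one, all references to the same array are treated as one group-reuse set, and the function names are hypothetical. Under these assumptions the reference with the largest offset touches new data first and is the leading reference.

```python
def leading_references(references):
    """Map each array name to the offset of its leading reference.

    `references` is a list of (array_name, offset) pairs, each standing
    for an access array_name[i + offset] in a loop with increasing i.
    """
    leaders = {}
    for name, offset in references:
        if name not in leaders or offset > leaders[name]:
            leaders[name] = offset
    return leaders

def assign_access(references):
    """Prefetch only the leading reference of each group; load the rest."""
    leaders = leading_references(references)
    return [
        (name, offset, "prefetch" if leaders[name] == offset else "load")
        for name, offset in references
    ]
```

For a loop body that reads a[i], a[i+1], and b[i], the sketch marks a[i+1] and b[i] for prefetch, while a[i] is served by the data that the prefetch for a[i+1] brought into the cache in an earlier iteration.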
An optimization that carries out the reuse analysis between loops in order to delete redundant prefetches is described in an article by Keith Cooper et al.: “Cross-loop Reuse Analysis and its Application to Cache Optimizations”, Workshop on Languages and Compilers for Parallel Computing, pp. 1 to 15, 1996. According to this article, the referenced parts of the arrays within each loop are obtained and propagated as data flow, thereby obtaining the data sets that arrive at the loop entry and at the loop exit, respectively. For data that arrives at both the loop entry and the loop exit, the prefetch is unnecessary and is deleted.
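A much simplified sketch of this kind of propagation is given below. It is not the algorithm of the cited article: it assumes a straight-line sequence of loops, tracks whole arrays rather than referenced parts of arrays, models only the data set arriving at each loop entry, and assumes that data survives to the next loop only if the footprint of the current loop fits in a hypothetical cache capacity. All names are illustrative.

```python
def redundant_prefetches(loops, cache_capacity):
    """For each loop, return the arrays whose prefetch can be deleted.

    `loops` is a list of dicts mapping an array name to the number of
    bytes of that array referenced in the loop. An array is treated as
    already cached at a loop entry if the immediately preceding loop
    referenced it and that loop's total footprint fit in the cache.
    """
    reaching = set()   # arrays assumed to be in the cache at loop entry
    result = []
    for refs in loops:
        result.append({name for name in refs if name in reaching})
        if sum(refs.values()) <= cache_capacity:
            reaching = set(refs)
        else:
            # The loop overflows the cache, so nothing is assumed to survive.
            reaching = set()
    return result
```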
As described above, the prefetch and the preload each have advantages and disadvantages. With the conventional approach of prefetching all data or preloading all data, the disadvantages of the chosen method appear. If, instead, the more suitable of the prefetch and the preload is selected in accordance with the characteristics of each memory reference, and code is generated in which both are used together, the advantages of both methods can be utilized.
An object of the present invention is to provide an optimizing method in which, for a memory reference to which both the preload and the prefetch can be applied, the access method more suitable for that reference is selected, and code is generated in which the preload and the prefetch are used together, thereby generating code having higher execution performance.
The object of the present invention is attained by providing: a memory access method judgement step of determining which of the access methods, prefetch, preload, or load, is selected for each memory reference; a preload optimization step of optimizing each memory access that has been judged to be a preload and generating preload code; and a prefetch optimization step of generating prefetch code for each memory access that has been judged to be a prefetch.
A first method of the memory access method judgement step includes: a step of analyzing whether a designation of a memory access method, given by a description in the source program or by a compiler option, is present for the memory access; and a step of determining the memory access method in accordance with the analysis result.
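The first method can be sketched as a simple lookup. The representation of the designations as a dictionary and a single option string, the function name, and the rule that a description in the source program takes priority over the compiler option are all assumptions made for illustration; the text above does not specify them.

```python
VALID_METHODS = {"prefetch", "preload", "load"}

def designated_method(array_name, source_directives, compiler_option=None):
    """Return the designated access method for a reference, or None.

    `source_directives` maps an array name to a method designated in the
    source program; `compiler_option` is a method designated for all
    references by a compiler option. The source designation is assumed
    to take priority over the option.
    """
    method = source_directives.get(array_name, compiler_option)
    if method is not None and method not in VALID_METHODS:
        raise ValueError(f"unknown memory access method: {method}")
    return method
```

A return value of None means that no designation is present, in which case the access method has to be determined by other means.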
A second method of the memory access method judgement step includes: a step of judging whether the data is already present in the cache memory; a step of judging whether the data competes with other data for the cache; a step of judging whether the data will be referred to again later; and a step of judging whether the restriction on register resources is fulfilled. The step of judging whether the data is already present in the cache memory and the step of judging whether the data will be referred to again later each include an intraloop analysis and an interloop analysis.
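The four judgements of the second method can be combined as in the following sketch. The text above lists the judgements but not how each outcome maps to an access method, so the mapping used here is an assumption chosen to be consistent with the advantages described earlier: data already in the cache is simply loaded, data that would compete for the cache bypasses it by preload, data that will be referred to again is kept in the cache by prefetch, and otherwise preload is used if registers permit. The class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ReferenceFacts:
    """Results of the four judgements for one memory reference."""
    already_in_cache: bool       # data is already present in the cache
    conflicts_in_cache: bool     # data competes with other data for the cache
    referenced_again: bool       # data will be referred to again later
    registers_available: bool    # restriction on register resources is fulfilled

def judge_access_method(facts):
    """Return "load", "preload", or "prefetch" (assumed decision order)."""
    if facts.already_in_cache:
        return "load"
    if facts.conflicts_in_cache:
        return "preload"
    if facts.referenced_again:
        return "prefetch"
    if facts.registers_available:
        return "preload"
    return "prefetch"
```

For example, a reference that misses the cache, does not conflict, is not referred to again, and leaves enough registers free would be judged to be a preload under this ordering.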