The present invention relates to a nested-loop data prefetching method, a processor and a program generating method. More particularly, the invention relates to a data prefetching method, a processor and a program generating method for nested loops in which the wait time caused by reference to a main memory can be reduced sufficiently even when the loop length of the innermost loop is short and the loop length of the outer loops is long.
In a computer, a cache memory having a higher speed than a main memory is disposed between a processor and the main memory so that recently referenced data is placed in the cache memory to reduce the wait time caused by reference to the main memory.
In calculation using a large quantity of data, such as numerical calculation, however, cache misses occur frequently because the locality of reference to data is low. Accordingly, there arises a problem that the wait time caused by reference to the main memory cannot be reduced sufficiently.
To cope with cache misses on such a large quantity of data, there has been proposed a prefetching method in which prefetch instructions for moving data from the main memory to the cache memory before use of the data are provided in the processor, and the prefetch instructions are inserted into a program by a compiler, as described in the paper by T. C. Mowry et al., "Design and Evaluation of a Compiler Algorithm for Prefetching", Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 62-73, 1992.
Specifically, with respect to a loop 201 shown in FIG. 13A, an offset .alpha. between elements required for prefetching data is calculated on the basis of the number of cycles required for prefetching data from the main memory to the cache memory and the predicted number of execution cycles of the loop (.alpha. is a positive integer). As represented by a loop 202 shown in FIG. 13B, prefetch instructions "PREFETCH" are first inserted so that data are prefetched .alpha. repetitions before the repetition that uses the data. With this measure alone, however, the data used in the first to .alpha.-th repetitions are not prefetched.
Further, in the last .alpha. repetitions, that is, the (N-.alpha.+1)-th to N-th repetitions, only data not used in any arithmetic operation are prefetched (N is a positive integer indicating the number of repetitions of the innermost loop, that is, its loop length).
Therefore, as shown in FIG. 13C, a loop 203 of .alpha. repetitions for prefetching the data used in the first to .alpha.-th repetitions is inserted before the start of the innermost loop. Further, index set splitting is applied to split the original loop 201 into a front half loop 204 for executing the first to (N-.alpha.)-th repetitions and a rear half loop 205 for executing the remaining repetitions, and no prefetch instruction is inserted into the rear half loop 205.
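The three-loop structure described for FIG. 13C can be sketched in C as follows. This is only an illustrative sketch and not the code of the figure: the function name `sum_with_prefetch`, the summation used as the loop body, and the use of the GCC/Clang builtin `__builtin_prefetch` in place of the "PREFETCH" instruction are all assumptions, and the offset .alpha. is written `alpha`. One prefetch is issued per element, as in the description, although an actual compiler would normally issue one per cache line.

```c
#include <stddef.h>

/* Hypothetical example: sums a[0..n-1] using the loop structure described
 * for FIG. 13C. The summation body is an assumption made for illustration. */
double sum_with_prefetch(const double *a, size_t n, size_t alpha)
{
    double sum = 0.0;

    /* Assumption: clamp the offset so that short loops remain valid. */
    if (alpha > n)
        alpha = n;

    /* Corresponds to loop 203: prefetch the data used in the
     * first to alpha-th repetitions before the main loop starts. */
    for (size_t i = 0; i < alpha; i++)
        __builtin_prefetch(&a[i]);

    /* Corresponds to loop 204: repetitions 1 to (N - alpha). Each
     * repetition prefetches the data used alpha repetitions later,
     * so the memory transfer overlaps with the arithmetic operation. */
    for (size_t i = 0; i + alpha < n; i++) {
        __builtin_prefetch(&a[i + alpha]);
        sum += a[i];
    }

    /* Corresponds to loop 205: the remaining alpha repetitions.
     * No prefetch is issued because no later data will be used. */
    for (size_t i = n - alpha; i < n; i++)
        sum += a[i];

    return sum;
}
```

The result is the same as that of the unsplit loop; only the timing of the memory references changes. When `alpha` is equal to or larger than `n`, the middle loop executes no repetitions at all.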
In the aforementioned prefetching method, cache misses are reduced so that the wait time caused by reference to the main memory can be reduced.
Incidentally, the essence of prefetching lies in the loop 204 of FIG. 13C, in which the movement of data from the main memory to the cache memory and the arithmetic operation are performed so as to overlap each other. If the value of the offset .alpha. is relatively large compared with the loop length N of the innermost loop, the proportion of the essential loop 204 becomes small and the proportion of the inessential loops 203 and 205 becomes large. Accordingly, there arises a problem that the wait time caused by reference to the main memory cannot be reduced sufficiently on the whole.
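The relation between the offset .alpha. and the loop length N can be made concrete with a small calculation. The following sketch rests on assumptions not stated in the text: the offset is taken as the memory latency divided by the predicted cycles of one repetition, rounded up, and the function names `prefetch_offset` and `overlapped_fraction` are hypothetical. The offset .alpha. is written `alpha`.

```c
#include <stddef.h>

/* Assumed formula: alpha = ceil(latency / cycles per repetition).
 * The text only says that alpha is derived from these two quantities. */
size_t prefetch_offset(size_t latency_cycles, size_t cycles_per_repetition)
{
    if (cycles_per_repetition == 0)
        return 0; /* degenerate input: no estimate is available */
    return (latency_cycles + cycles_per_repetition - 1) / cycles_per_repetition;
}

/* Fraction of the N repetitions that run in the overlapped loop
 * (loop 204), which covers repetitions 1 to (N - alpha). */
double overlapped_fraction(size_t n, size_t alpha)
{
    if (n == 0 || alpha >= n)
        return 0.0; /* the whole loop falls into loop 205 */
    return (double)(n - alpha) / (double)n;
}
```

For example, with an offset of 8, a loop of length 16 spends only half of its repetitions in the loop 204, and a loop of length 8 or less spends none there, which is the situation the paragraph above describes.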
That is, conventionally, there was a problem that the wait time caused by reference to the main memory could not be reduced sufficiently in nested loops in which the loop length of the innermost loop was short and the loop length of the outer loops was long because only the innermost loop was a subject of application of the prefetching method.
FIG. 14 is a schematic view showing a state of prefetching in the case where prefetching is performed by the conventional method using the innermost loop as a subject. Here, the loop length is N. In execution of the innermost loop, data to be referred to in the (1+.alpha.)-th repetition is prefetched in the first loop repetition (1401). Similarly, data to be referred to in the (N+.alpha.)-th repetition is prefetched in the N-th loop repetition. Prefetching for the (N+1)-th to (N+.alpha.)-th repetitions is, however, wasteful because the loop has only N repetitions. Further, the data referred to in the first to .alpha.-th repetitions is not prefetched.
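The cost of this behavior in a nested loop can be estimated by counting prefetches over all executions of the innermost loop. The sketch below is an illustration under assumptions: each execution of the innermost loop is treated independently (a wasted prefetch is assumed never to cover data of the next outer repetition), and the type `prefetch_stats`, the function `count_innermost_prefetch` and the parameter `outer_len` (the loop length of the outer loop) are hypothetical names. The offset .alpha. is written `alpha`.

```c
#include <stddef.h>

/* Hypothetical tally of prefetch behavior for the conventional method. */
typedef struct {
    size_t useful;       /* prefetches whose data is later referred to */
    size_t wasted;       /* prefetches for repetitions beyond N        */
    size_t unprefetched; /* repetitions whose data was never prefetched */
} prefetch_stats;

prefetch_stats count_innermost_prefetch(size_t outer_len, size_t n, size_t alpha)
{
    prefetch_stats s = {0, 0, 0};

    for (size_t j = 0; j < outer_len; j++) {
        for (size_t i = 1; i <= n; i++) {
            /* Repetition i prefetches data for repetition i + alpha. */
            if (i + alpha <= n)
                s.useful++;
            else
                s.wasted++;

            /* Data for repetition i was prefetched only if i > alpha. */
            if (i <= alpha)
                s.unprefetched++;
        }
    }
    return s;
}
```

Under these assumptions, with `n` = 8 and `alpha` = 4, half of the prefetches in every execution of the innermost loop are wasted and half of the repetitions find no prefetched data, and these losses are repeated once per outer repetition.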