The present invention relates to a compiling method capable of reducing the execution time of an object program in the field of computer utilization techniques. More specifically, the present invention is directed to an optimizing-designation method used when a source program is compiled for an architecture equipped with a plurality of memory hierarchies.
With improvements in the operating speeds of microprocessors, the latency of main storage accesses has increased relative to processor speed. Most current processors are provided with cache memories that have relatively small capacities and whose access speeds are faster than those of the main storage. Furthermore, in some processors, the cache memories are arranged in a hierarchical form, such as a primary cache and a secondary cache. Since the memories are arranged hierarchically, the processor accesses data in a memory hierarchy whose latency is small as often as possible, so that the total number of accesses by the processor to data in a memory hierarchy whose latency is large can be reduced. In other words, a processor which executes a memory access instruction can access the data with a short latency when the data hits the primary cache. When the data misses the primary cache, the processor accesses the secondary cache. When the data hits the secondary cache, the processor can still access the data with a relatively short latency. Only when the data misses all of the cache hierarchies does the processor access the main storage.
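The lookup order described above can be sketched as follows. This is a minimal model only: the latency values are hypothetical, each cache level is represented simply as a set of addresses, and eviction is not modeled.

```python
# Hypothetical latencies in cycles; real values depend on the target machine.
L1_LATENCY = 2
L2_LATENCY = 10
MAIN_LATENCY = 100


def access_latency(addr, l1, l2):
    """Return the latency of one load that searches L1, then L2, then main storage.

    l1 and l2 are sets of addresses currently held at each cache level.
    On a miss, the address is filled into every level that missed.
    Eviction is deliberately not modeled in this sketch.
    """
    if addr in l1:
        return L1_LATENCY          # primary cache hit
    if addr in l2:
        l1.add(addr)               # fill the primary cache from the secondary cache
        return L2_LATENCY
    l2.add(addr)                   # miss in all cache levels: go to main storage
    l1.add(addr)
    return MAIN_LATENCY
```

For example, with an empty primary cache and a secondary cache holding address 8, the first load of address 8 costs the secondary cache latency and the second load costs the primary cache latency.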
As a method capable of hiding (namely, not revealing) the latency incurred on a cache miss, an instruction scheduling method may be employed in which the distance between a load instruction and an instruction which uses the loaded data (referred to as a “use instruction” hereinafter) is made sufficiently long. The memory latency is employed as the reference for determining by how many cycles the two instructions are to be separated.
A method for “hiding” the latency of referring to the main storage on a cache miss is described in, for instance, Todd C. Mowry et al., “Design and Evaluation of a Compiler Algorithm for Prefetching”, Architectural Support for Programming Languages and Operating Systems, pp. 62 to 73, 1992 (referred to as “Publication 1” hereinafter). Publication 1 discloses so-called “software prefetching (prefetch optimization)”. In this software prefetching method, the processor is provided with a prefetch instruction which instructs that data be moved from the main storage to a cache in advance, and the compiler inserts this prefetch instruction into the object program. If the prefetch instruction is utilized, the latency of referring to the main storage can be “hidden”. That is, while the data to which the processor refers in a succeeding loop iteration is being moved in advance from the main storage to the cache, the processor can carry out another calculation at the same time.
In this software prefetching method, when a prefetch instruction is produced for a data reference made by the processor within a loop, first of all, the number of execution cycles “C” required for one iteration of this loop is estimated. Next, the value α=CEIL(L/C) is calculated, where “L” is the number of cycles required to move data from the main storage to a cache (the memory latency). The symbol “CEIL” indicates rounding any fractional part up to the next integer. Since the data to which the processor refers “α” loop iterations later is prefetched in advance, the data has already reached the cache when the processor refers to it at least “L” cycles later, so that the processor hits the cache and can execute the program at a high speed. In a case where data is prefetched to the primary cache, if the data has already been stored in the primary cache, then the prefetching operation is not required. Also, in a case where the data is present in the secondary cache, the number of cycles required to move the data from the secondary cache to the primary cache may be used as the memory latency “L”, whereas in a case where the data is present in the main storage, the number of cycles required to move the data from the main storage to the primary cache may be used as the memory latency “L”. However, it is normally unclear in which memory hierarchy the data is present. As a consequence, the processing is carried out on the assumption that the data is present in the main storage.
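The calculation of “α” and the placement of the prefetch within the loop may be illustrated by the following minimal sketch. The function names and the `prefetch` callback are hypothetical; the callback merely stands in for a prefetch instruction, which Python itself does not provide.

```python
def prefetch_distance(latency_cycles, cycles_per_iteration):
    """Return alpha = CEIL(L / C) using integer arithmetic."""
    if latency_cycles < 0 or cycles_per_iteration <= 0:
        raise ValueError("L must be >= 0 and C must be > 0")
    # -(-L // C) rounds the quotient up without going through floating point.
    return -(-latency_cycles // cycles_per_iteration)


def sum_with_prefetch(a, alpha, prefetch):
    """Sum the list a, requesting element i + alpha while element i is used.

    prefetch is a hypothetical callback that stands in for a prefetch
    instruction; it receives the index whose data should be fetched early.
    """
    total = 0
    n = len(a)
    for i in range(n):
        if i + alpha < n:
            prefetch(i + alpha)    # data needed alpha iterations from now
        total += a[i]              # use of data requested alpha iterations ago
    return total
```

Under the hypothetical values C = 12 and L = 100 (data assumed to be in the main storage), `prefetch_distance(100, 12)` gives 9; if instead L = 10 is assumed (data assumed to be in the secondary cache), the distance is 1. The assumed memory hierarchy therefore changes how far ahead the prefetch is issued.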
Another memory optimizing method is known which reduces the number of cache misses by way of a program transformation that improves data locality. As specific program transformations, a loop tiling method, a loop interchange method, and a loop unrolling method have been proposed.
The loop tiling method is a loop transformation having the following purpose. That is, in a case where data to which a processor refers within a multiply nested loop is reused, the loop is restructured so that the processor refers again to data which has once been loaded into a cache before this data is evicted from the cache by references to other data. The loop tiling method is described in Michael Edward Wolf, “Improving Locality and Parallelism in Nested Loops”, Technical Report CSL-TR-92-538, 1992 (referred to as “Publication 2” hereinafter).
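As an illustration of loop tiling, the following sketch applies the transformation to a matrix multiplication. It is not taken from Publication 2; the function names and the tile size are hypothetical, and in practice the tile size would be chosen from the cache size of the target machine. Python is used only to show the loop structure and does not demonstrate the cache effect itself.

```python
def matmul_naive(a, b, n):
    """Reference triple loop: c = a * b for n x n matrices (lists of lists)."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile):
    """Tiled version: the j and k loops are split into blocks of size tile.

    For each (jj, kk) block, a tile x tile region of b is reused for every
    row i before the loops move on to the next region of b.
    """
    c = [[0] * n for _ in range(n)]
    for jj in range(0, n, tile):
        for kk in range(0, n, tile):
            for i in range(n):
                for j in range(jj, min(jj + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        c[i][j] += a[i][k] * b[k][j]
    return c
```

Both functions perform the same multiplications and additions; only the order of the iterations differs, which is what allows the reused region of `b` to remain in a cache on a real machine. The two extra loops also show the added loop overhead that the transformation introduces.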
The loop interchange method and the loop unrolling method, which aim to optimize the memory reference pattern, are described in Kevin Dowd, “High Performance Computing”, O'Reilly & Associates, Inc., Section 11.1 (referred to as “Publication 3” hereinafter).
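The two transformations may be sketched as follows on a two-dimensional array stored in row-major order in a flat list. The example is illustrative only and is not taken from Publication 3; the function names and the unroll factor of 4 are hypothetical.

```python
def sum_column_order(flat, rows, cols):
    """Before interchange: the inner loop strides through memory by cols."""
    total = 0
    for j in range(cols):
        for i in range(rows):
            total += flat[i * cols + j]
    return total


def sum_row_order(flat, rows, cols):
    """After interchange: the inner loop visits consecutive elements."""
    total = 0
    for i in range(rows):
        for j in range(cols):
            total += flat[i * cols + j]
    return total


def sum_unrolled_by_4(flat):
    """Loop unrolling: four elements are processed per iteration."""
    total = 0
    n = len(flat)
    main = n - n % 4                      # largest multiple of 4 not above n
    for i in range(0, main, 4):
        total += flat[i] + flat[i + 1] + flat[i + 2] + flat[i + 3]
    for i in range(main, n):              # remainder loop for leftover elements
        total += flat[i]
    return total
```

All three functions compute the same sum. The interchange changes the stride of the inner loop from `cols` to 1, and the unrolling reduces the number of loop-control operations per element at the cost of a remainder loop.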
In order to realize the above-described latency hiding optimization and the above-described data locality optimization, information which depends upon attributes of the target machine is required, for instance, the number of cycles required for a memory reference and the cache size. Normally, information as to the target machine is held as internal information in the compiler. There is also a method in which the user designates this information. Hitachi, Ltd., “Optimizing FORTRAN90 User's Guide” (HI-UX/MPP for SR8000), Section 6.2 (referred to as “Publication 4” hereinafter) describes that the user can designate that the number of cycles required to read from the memory is “N” by specifying the option “-mslatency=N” (“N” being a positive integer). IBM, “XL Fortran for AIX User's Guide Version 7 Release 1”, Chapter 5 (referred to as “Publication 5” hereinafter) describes that the user can designate the cache size, the line size, and the associativity of each level of the hierarchical cache by specifying the “-qcache” option.
With respect to the conventional latency hiding optimization and the conventional data locality optimization, the appropriate optimizing method differs depending upon which memory hierarchy the data is located in when the object program is executed.
For example, in the instruction scheduling method, if the distance between a load instruction and a use instruction is increased, then the total number of registers to be used is also increased. Also, in the prefetch optimizing method, when the prefetch instruction is issued excessively early, there is a possibility that the data is evicted from the cache again before the use instruction is executed. As a result, when the memory latency assumed in these optimizing methods is excessively large, the optimizing operations cannot achieve sufficient effects. In other words, when the optimizing operation using the main storage latency is applied to data which hits the L2 cache (secondary cache), there is a possibility that the execution performance is lowered, as compared with a case where the optimizing operation using the L2 cache latency is applied to the data. However, since the prior art has no means of clearly identifying in which memory hierarchy the subject data of the load instruction is located, the following problem has occurred. That is, the main storage latency has had to be assumed when the optimizing operation is applied.
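The effect of issuing a prefetch too early can be illustrated with the following minimal simulation. The model is an assumption made for illustration only: the cache is fully associative with least-recently-used replacement, each array element occupies its own cache line, and the function name and parameters are hypothetical.

```python
from collections import OrderedDict


def count_hits(n, alpha, capacity):
    """Count how many of n sequential uses hit a small LRU cache.

    In iteration i the loop first prefetches line i + alpha (if it exists)
    and then uses line i. capacity is the number of lines the cache holds.
    """
    cache = OrderedDict()                  # keys ordered oldest -> most recent

    def touch(line):
        hit = line in cache
        if hit:
            cache.move_to_end(line)        # refresh recency on a hit
        else:
            cache[line] = True             # fill the line on a miss
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used line
        return hit

    hits = 0
    for i in range(n):
        if i + alpha < n:
            touch(i + alpha)               # prefetch for a later iteration
        if touch(i):                       # the use instruction
            hits += 1
    return hits
```

In this model, with 100 iterations and a cache of 8 lines, a prefetch distance of 2 lets every use after the first two hit the cache, whereas a prefetch distance of 16 causes every prefetched line to be evicted before it is used.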
Also, in the data locality optimizing method, if the loop structure is converted into a complex loop structure, the overhead of the loop execution is increased. As a result, there is a possibility that the execution performance is lowered. In a case where the data to which the processor refers within the loop mainly causes cache misses, an effect may be achieved by applying the data locality optimizing method since the total number of cache misses is reduced. However, when cache hits occur, there is nothing to be gained by reducing the total number of cache misses, so it is better not to apply the data locality optimizing method. Since the prior art could not determine whether or not cache hits occur, the loop transformation has been applied even when cache hits occur. As a consequence, there has been a problem that the execution performance may be lowered.