A computer has multi-level hierarchical cache memories between a central processing unit (CPU) core and a main memory. These caches are intended to conceal access latency and to compensate for insufficient throughput of the lower-level memories closer to the main memory side. As CPU cores become faster and multi-core designs advance, improving the hit rate of the cache memories and concealing cache-miss latency become increasingly important.
As a solution to such problems, prefetching is used: data expected to be needed in the near future is read into a cache memory in advance, so as to reduce the occurrence of cache misses. Methods of achieving the prefetch include a method by software (software prefetch) and a method by hardware (hardware prefetch).
In the software prefetch, a compiler or a programmer explicitly inserts prefetch instructions into the instruction stream in advance. Because the prefetch instructions are inserted explicitly, the conventional software prefetch can be controlled flexibly. However, it is difficult for the software prefetch to insert the necessary prefetch instructions according to dynamic behavior, such as whether cache hits or cache misses actually occur or the result of an address calculation.
With the software prefetch, the instructions for address calculation and the prefetch instructions themselves consume hardware resources for instruction issue and instruction execution, so a performance overhead may occur. It is also difficult for software to insert exactly one prefetch instruction per cache line to be prefetched, so the software prefetch has a problem of inserting redundant prefetch instructions.
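As a concrete illustration of such explicit insertion (our sketch, not an example from the cited documents), the GCC/Clang builtin `__builtin_prefetch` can be placed in a loop so that one prefetch is issued per cache line rather than per element; the 64-byte line size and the prefetch distance of eight lines are assumptions chosen for the sketch.

```c
#include <stddef.h>

/* Sum an array while issuing one software prefetch per assumed
 * 64-byte cache line, eight lines ahead of the current position.
 * Both the line size and the distance are illustrative values. */
long sum_with_prefetch(const long *a, size_t n)
{
    const size_t line = 64 / sizeof(long);  /* elements per cache line */
    const size_t dist = 8 * line;           /* prefetch distance in elements */
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        /* one prefetch per cache line, not per element */
        if (i % line == 0 && i + dist < n)
            __builtin_prefetch(&a[i + dist], 0 /* read */, 3 /* keep in cache */);
        s += a[i];
    }
    return s;
}
```

Note that even this simple loop spends instruction-issue slots on the modulo test, the bounds check, and the prefetch itself, which is the overhead the paragraph above describes.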
On the other hand, in the hardware prefetch, patterns of past memory access addresses and cache miss addresses are stored in hardware, regularities are extracted from the past access patterns, and the next access addresses are predicted in order to perform the prefetch. For example, for sequential access in units of cache lines, or for fixed stride access at a constant address interval, there are methods that perform the hardware prefetch according to the detected access pattern.
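To make the stride-detection idea concrete, the following sketch (our illustration, not circuitry from any cited document) models a minimal fixed-stride hardware prefetcher in C: it remembers the last observed address and the last stride, and once the same stride is observed twice in a row it predicts the next access address.

```c
#include <stdint.h>

/* Minimal model of a fixed-stride hardware prefetcher: remember the
 * last address and stride; once the same stride repeats, set the
 * confidence bit and predict the next address. */
typedef struct {
    uint64_t last_addr;
    int64_t  stride;
    int      confident;  /* same stride seen twice in a row */
    int      valid;      /* at least one address observed */
} stride_detector;

/* Observe one access address; return 1 and set *prefetch_addr when a
 * prediction (current address + trained stride) can be issued. */
int observe(stride_detector *d, uint64_t addr, uint64_t *prefetch_addr)
{
    if (d->valid) {
        int64_t s = (int64_t)(addr - d->last_addr);
        d->confident = (s == d->stride);
        d->stride = s;
    }
    d->last_addr = addr;
    d->valid = 1;
    if (d->confident) {
        *prefetch_addr = addr + (uint64_t)d->stride;
        return 1;
    }
    return 0;
}
```

For a sequential cache-line access pattern (stride 64 bytes), the detector begins predicting after the third access; the same state machine handles any constant stride.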
In the hardware prefetch, the more complicated the access pattern to be detected, the more complicated the detection algorithm becomes, raising the difficulty of implementation and increasing hardware costs such as circuit area and power consumption. Moreover, because the hardware prefetch operates according to an algorithm implemented in a hardware circuit, the timing at which a prefetch is issued and the prefetch distance, which indicates how far ahead of the current address the prefetch is performed, are fixed.
As an example of implementing the hardware prefetch for fixed stride access, a method has been proposed that limits pattern detection to memory access instructions having the same program counter value, thereby reducing the difficulty of access pattern extraction. However, for the program counter value to be usable by a hardware prefetch control unit, a hardware cost is incurred to propagate the program counter value to that unit. Furthermore, when an access pattern is detected for every program counter value, the number of entries in the tables to be retained becomes large. In particular, when loop unrolling is applied to expand loop processing as part of software optimization, the number of access streams and the stride width tend to increase, and it is conceivable that effective hardware prefetches cannot be generated.
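The benefit of keying detection by program counter can be sketched as follows (again our illustration; the table size and direct-mapped indexing are assumptions). Two loads with different program counter values that interleave in time would confuse a single global stride detector, but each trains its own table entry cleanly:

```c
#include <stdint.h>

#define RPT_ENTRIES 16  /* illustrative table size */

/* One entry of a reference prediction table keyed by the program
 * counter (PC) of the memory access instruction. */
typedef struct {
    uint64_t pc;
    uint64_t last_addr;
    int64_t  stride;
    int      confident;
    int      valid;
} rpt_entry;

typedef struct { rpt_entry e[RPT_ENTRIES]; } rpt;

/* Look up the entry for this PC (direct-mapped); train its stride
 * and, once the stride repeats, return a predicted prefetch address. */
int rpt_observe(rpt *t, uint64_t pc, uint64_t addr, uint64_t *prefetch_addr)
{
    rpt_entry *e = &t->e[pc % RPT_ENTRIES];
    if (!e->valid || e->pc != pc) {          /* allocate or replace entry */
        e->pc = pc;
        e->last_addr = addr;
        e->stride = 0;
        e->confident = 0;
        e->valid = 1;
        return 0;
    }
    int64_t s = (int64_t)(addr - e->last_addr);
    e->confident = (s == e->stride);
    e->stride = s;
    e->last_addr = addr;
    if (e->confident) {
        *prefetch_addr = addr + (uint64_t)s;
        return 1;
    }
    return 0;
}
```

The sketch also makes the costs visible: the PC must reach this logic, and every distinct load instruction consumes a table entry, which is the entry-count growth the paragraph above describes. If the loop is unrolled four times, one logical stream becomes four PCs, each with a four-times-larger apparent stride.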
A technique has been proposed in which a single software prefetch instruction, specifying in its instruction code the block size, the number of blocks, and the stride width between blocks, activates a hardware prefetch circuit to perform a plurality of prefetches (see, for example, Patent Documents 1, 2). This technique reduces the consumption of hardware resources for instruction issue and instruction execution by software prefetch instructions. However, because the stride width and other parameters are specified in the instruction code, instruction-code space is consumed instead.
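At a sketch level (parameter names are ours, not taken from Patent Documents 1 or 2), the effect of such an instruction can be modeled as hardware expanding one request, carrying a base address, block size, block count, and inter-block stride, into a sequence of per-cache-line prefetch addresses:

```c
#include <stddef.h>
#include <stdint.h>

#define LINE 64  /* assumed cache line size in bytes */

/* Expand one block-prefetch request (base address, block size in
 * bytes, number of blocks, stride in bytes between block starts)
 * into per-cache-line prefetch addresses.  Returns the number of
 * addresses written into out[] (at most max_out). */
size_t expand_block_prefetch(uint64_t base, uint64_t block_size,
                             uint64_t nblocks, uint64_t stride,
                             uint64_t *out, size_t max_out)
{
    size_t n = 0;
    for (uint64_t b = 0; b < nblocks; b++) {
        uint64_t start = base + b * stride;
        for (uint64_t off = 0; off < block_size; off += LINE) {
            if (n >= max_out)
                return n;
            out[n++] = start + off;
        }
    }
    return n;
}
```

A single such request replaces many individual prefetch instructions, which is where the issue/execution savings come from; the cost, as noted above, is the instruction-code bits needed to encode the four parameters.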
With respect to the hardware prefetch, techniques have been proposed that add hint information to a memory access instruction to inhibit registration in a prefetch address queue and issuance of a hardware prefetch, thereby preventing generation of unnecessary hardware prefetches (see, for example, Patent Documents 3, 4). A technique has also been proposed that allows the operating system to set prohibition or authorization of prefetch based on the behavior observed when each application is executed (see, for example, Patent Document 5). Further, a technique has been proposed that specifies, by a prefetch hint instruction or the like, the access pattern to be detected by the hardware prefetch, thereby controlling the operation of the hardware prefetch (see, for example, Patent Document 6).
[Patent Document 1] U.S. Pat. No. 6,578,130
[Patent Document 2] U.S. Pat. No. 6,915,415
[Patent Document 3] U.S. Pat. No. 3,166,250
[Patent Document 4] European Patent Application Publication No. 2204741
[Patent Document 5] U.S. Pat. No. 7,318,125
[Patent Document 6] U.S. Pat. No. 7,533,242
In a conventional hardware prefetch, the timing for issuing a prefetch and the prefetch distance are determined by the hardware implementation and hence are fixed.