1. Field of the Invention
This invention pertains generally to software prefetching algorithms. More particularly, the invention is a heuristic algorithm for identifying memory operations guaranteed to hit in the processor cache.
2. The Prior Art
Current computer systems include, among other things, a memory system and a processing unit (or processor or central processing unit (CPU)). A memory system serves as a repository of information, while the CPU accesses information from the memory system, operates on it, and stores it back.
However, it is well known that CPU clock speeds are increasing at a faster rate than memory speeds. When a processor attempts to read a memory location from the memory system, the request is “very urgent”. That is, in most computer systems, the processor stalls or waits while the memory system provides the data requested to the CPU. The “latency” of the memory is the delay from when the CPU first requests data from memory until that data arrives and is available for use by the CPU.
A cache is a special high-speed memory in addition to the conventional memory (or main memory). FIG. 1 depicts a conventional hierarchical memory system, where a CPU is operatively coupled to a cache, and the cache is operatively coupled to the main memory. By placing the cache (very fast memory) in front of the main memory (large, slow memory), the memory system is able to satisfy most requests from the CPU at the speed of the cache, thereby reducing the overall latency of the system.
When the data requested by the CPU is in the cache (known as a “hit”), the request is satisfied at the speed of the cache. However, when the data requested by the CPU is not in the cache (known as a “miss”), the CPU must wait until the data is provided from the slower main memory to the cache, and then to the CPU, resulting in greater latency.
To address the problem of latency and to increase the “hit” to “miss” ratio associated with cache memory, many modern computer systems have introduced instructions for prefetching data from memory to cache. For example, instruction set architectures (ISAs), such as SPARC™ V9, support software data prefetch instructions. The details of how to use prefetch instructions to reduce the frequency of cache misses have been left to the designers of optimizing compilers.
However, prefetch instructions are costly to execute. Compiler algorithms, which typically implement prefetch insertion for code optimization, should also consider whether insertion of a prefetch for a given memory operation likely to miss the cache would be profitable. For example, in certain processor architectures, the number of in-flight memory operations is typically limited by the size of a “memory queue”. Since prefetch instructions occupy space in the memory queue, it would be advantageous to avoid scheduling/inserting prefetches unnecessarily.
Furthermore, in certain processor chip designs, a limited number of memory operations may be issued during an operation cycle. For example, in the UltraSPARC III chip, only one memory operation may be initiated during a given operation cycle. Thus, it would be beneficial to minimize unnecessary memory prefetch scheduling/insertion during program compilation.
To overcome these and other deficiencies found in the prior art, disclosed herein is a heuristic algorithm which identifies loads guaranteed to hit the processor cache and which further provides a “minimal” set of prefetches which are scheduled/inserted during compilation of a program. The invention further relates to machine readable media on which are stored embodiments of the present invention. It is contemplated that any media suitable for retrieving instructions is within the scope of the present invention. By way of example, such media may take the form of magnetic, optical, or semiconductor media.
The present invention also relates to a method and use of prefetch instructions to load data from memory into a cache. It is contemplated that the invention may be used for loading data from conventional main memory as well as other “slow” data storage structures such as a disk storage or a network storage, for example. Although the invention is described herein with respect to a single cache, it is contemplated that any suitable cache arrangement (e.g., various levels of cache) is within the scope of the present invention.
In its most general terms, the invention comprises software for scheduling memory operations to provide adequate prefetch latency. The invention is generally used in conjunction with, and incorporated into, compilation software (a compiler), which converts source code into a compiled program (or executable file). During compilation, the source code is converted into an intermediary “program code” which is processed by the compiler. After the compiler has completed processing the program code, a compiled program is generated from the program code.
More particularly, the invention is embodied in a heuristic prefetch scheduler component having a program code parser module and a prefetch scheduler module.
The program code parser module first examines the program code to sort memory operations (such as loads) into two groups: loads that are likely not to miss the cache and loads that are likely to miss the cache. Various algorithms known in the art may be used for carrying out this classification. For the group of memory operations likely to miss the cache, the program code parser module then creates sets of offsets corresponding to “related” memory operations, each related memory operation including a base address and an offset value (i.e., a constant) which may be zero, and each related memory operation further associated with a prefetch instruction.
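By way of illustration only, the creation of sets of offsets may be sketched as follows. The sketch assumes, as one possible reading, that memory operations are “related” when they share the same base address, and that each likely-to-miss operation is represented as a (base, offset) pair. The function name `group_offsets_by_base` and this representation are hypothetical and are not taken from the disclosure.

```python
from collections import defaultdict


def group_offsets_by_base(likely_miss_ops):
    """Group (base, offset) pairs into one sorted offset set per base.

    Assumption: operations sharing a base are treated as "related".
    """
    groups = defaultdict(set)
    for base, offset in likely_miss_ops:
        groups[base].add(offset)  # duplicate offsets collapse into one entry
    return {base: sorted(offsets) for base, offsets in groups.items()}


# Hypothetical loads: three through base "p" (one repeated), one through "q".
sets_of_offsets = group_offsets_by_base([("p", 0), ("q", 4), ("p", 72), ("p", 0)])
```

Each resulting set of offsets would then be handed to the prefetch scheduler module.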
The prefetch scheduler module operates on the sets of offsets established by the program code parser module to generate a “minimal” number of prefetches associated therewith. The heuristic algorithm of the present invention, which is carried out by the prefetch scheduler module, utilizes the concept of a “cache line”. When data is retrieved from memory, the data is retrieved in chunks called cache lines, as is known in the art. The size of the cache line varies from platform to platform. However, in general, the size of the cache line is larger than the size of the data requested during a memory instruction. Accordingly, there may be cases where two or more related memory instructions reference data in the same cache line, in which case it may be unnecessary to prefetch data for some memory instructions when an earlier prefetch for a “related” memory instruction has already been carried out. The present invention provides a heuristic algorithm for determining which prefetches are unnecessary for such related memory operations, thereby generating a minimal number of prefetches for related memory operations.
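The cache line observation above may be illustrated with a small sketch. It assumes cache lines are aligned to multiples of the line size and uses a hypothetical 64-byte line; the function name `same_cache_line` is likewise an assumption made for illustration.

```python
def same_cache_line(base_address, offset_a, offset_b, cache_line_size):
    """Return True if base+offset_a and base+offset_b lie in one cache line.

    Assumption: lines are aligned to multiples of cache_line_size.
    """
    line_a = (base_address + offset_a) // cache_line_size
    line_b = (base_address + offset_b) // cache_line_size
    return line_a == line_b


# With a line-aligned base, offsets 0 and 32 share a 64-byte line.
aligned = same_cache_line(0, 0, 32, 64)
# With a base 48 bytes into a line, the same two offsets straddle two lines.
unaligned = same_cache_line(48, 0, 32, 64)
```

Because the base address is generally known only at run time, a compiler cannot evaluate such a test directly, which is why a heuristic is needed.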
In particular, the prefetch scheduler module sorts the set of offsets established by the program code parser module, typically in ascending order. The prefetch scheduler then generates a prefetch for the lowest (minimum) offset and the highest (maximum) offset. The prefetch scheduler then iterates through the set to determine whether a given offset in the set requires a prefetch. According to the algorithm of the present invention, a prefetch is created for a given (current) offset if the distance between the current offset and the offset for which a prefetch was previously generated exceeds a threshold value. The threshold value is a heuristically chosen value and is typically one-fourth (¼) to one-half (½) of the cache line size, although any threshold value not exceeding twice the cache line size or twice the prefetch data size is contemplated for use with the present invention. A threshold is used because the run-time location of the memory address within a cache line is not usually predictable.
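The selection procedure just described may be sketched as follows. This is one possible reading rather than a definitive implementation: it assumes that “exceeds” means strictly greater than, that the distance is measured from the offset most recently selected for a prefetch, and that the maximum offset always receives a prefetch. The function name `select_prefetch_offsets` is hypothetical.

```python
def select_prefetch_offsets(offsets, threshold):
    """Return the offsets that receive a prefetch under the threshold heuristic."""
    ordered = sorted(set(offsets))  # ascending order, duplicates removed
    if not ordered:
        return []
    selected = [ordered[0]]  # the minimum offset always gets a prefetch
    for current in ordered[1:-1]:
        # Prefetch only if far enough from the last prefetched offset.
        if current - selected[-1] > threshold:
            selected.append(current)
    if ordered[-1] != selected[-1]:
        selected.append(ordered[-1])  # the maximum offset always gets a prefetch
    return selected


# Hypothetical 64-byte cache line with a threshold of one-fourth of the line.
chosen = select_prefetch_offsets([0, 8, 16, 40, 48, 64], threshold=16)
# chosen == [0, 40, 64]
```

In this example, the loads at offsets 8, 16, and 48 receive no prefetch of their own because each lies within the threshold of an offset that has already been prefetched.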
According to an alternative embodiment of the algorithm, prefetches are generated at regular intervals of one cache line size between the minimum and maximum offset values (ignoring what the actual intervening offsets are).
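The alternative embodiment may be sketched in the same manner. The sketch assumes the intervals start at the minimum offset and that the maximum offset itself is always included, a point on which the description above is silent. The function name `regular_interval_prefetch_offsets` is hypothetical.

```python
def regular_interval_prefetch_offsets(offsets, cache_line_size):
    """Return prefetch offsets spaced one cache line apart from min to max."""
    if not offsets:
        return []
    lowest, highest = min(offsets), max(offsets)
    # Step from the minimum in cache-line strides, ignoring intervening offsets.
    selected = list(range(lowest, highest, cache_line_size))
    selected.append(highest)  # assumption: the maximum is always prefetched
    return selected


# Hypothetical 64-byte cache line; the intervening offset 8 plays no role.
spaced = regular_interval_prefetch_offsets([0, 8, 200], cache_line_size=64)
# spaced == [0, 64, 128, 192, 200]
```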
The process of the program code parser module and the prefetch scheduler module together identifies loads guaranteed to hit the processor cache (i.e., those memory operations related to other memory operations are “guaranteed” to hit the cache because the prefetches associated with the earlier memory operations bring the data required by the later related memory operations into the cache line).
It is noted that the present invention is suitable for use during one or more processes during the compilation process to optimize the program code.