The present invention relates generally to the field of data cache memories, and more specifically to an apparatus and method for prefetching data into a cache memory.
The recent trend in computer systems has been toward faster and more efficient microprocessors. However, the speed with which the processors are able to access their related memory devices has not increased at the same rate as the processors' execution speed. Consequently, memory access delays have become a bottleneck to increasing overall system performance.
Generally, the faster data can be retrieved from a memory device, the more expensive the device is per unit of storage. Due to this cost, it is not feasible to have enough register (i.e., fast memory device) capacity in the microprocessor's on-chip main memory to hold all of the program instructions and data needed for many applications. Consequently, most of the data and instructions are kept on large, relatively slow storage devices. Only the instructions and data that are currently needed are brought into registers.
To reduce the time it takes to retrieve data from the slower bulk storage memories, specialized memories are placed between the registers and the bulk storage devices. These memories are known as cache memories in the industry. Cache memories exploit the "principle of locality," which holds that all programs favor a particular segment of their address space at any instant in time. This hypothesis has two dimensions. First, locality can be viewed in time (temporal locality), meaning that if an item is referenced, it will tend to be referenced again soon. Second, locality can be viewed in space (spatial locality), meaning that if an item is referenced, nearby items will also tend to be referenced. By bringing a block or subblock of data into the cache when it is referenced, the system can take advantage of both of these principles to reduce the time it takes to access the data the next time it is referenced.
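By way of illustration only, the effect of the two kinds of locality can be sketched with a small simulation of a direct-mapped cache. The block size, the number of lines, and the function name `count_hits` are assumptions chosen for the example and do not describe any particular device.

```python
BLOCK_SIZE = 4   # words per cache block (assumed for illustration)
NUM_LINES = 8    # lines in the direct-mapped cache (assumed for illustration)

def count_hits(addresses):
    """Return (hits, misses) for a sequence of word addresses."""
    lines = [None] * NUM_LINES  # each line remembers which block it holds
    hits = misses = 0
    for addr in addresses:
        block = addr // BLOCK_SIZE
        index = block % NUM_LINES
        if lines[index] == block:
            hits += 1
        else:
            misses += 1
            lines[index] = block  # the whole block is brought into the cache
    return hits, misses

# Spatial locality: neighboring words are referenced one after another.
print(count_hits(list(range(16))))  # (12, 4)

# Temporal locality: the same word is referenced repeatedly.
print(count_hits([0, 0, 0, 0]))     # (3, 1)
```

In the sequential case only the first reference to each four-word block misses, because the remaining three words arrive with the block. In the repeated case only the first reference misses.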
Data may be brought into the cache as it is requested, or sometimes before it is requested. If data is brought into the cache memory before it is requested, it is said to be "prefetched." Prefetching may be initiated by software or hardware. In software prefetching, the compiler inserts specific prefetch instructions at compile time. The memory system retrieves the requested data into the cache memory when it receives a software prefetch instruction, just as it would for a normal memory request. However, nothing is done with the data beyond that point until another software instruction references the data.
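The behavior of a software prefetch instruction can be modeled, as a rough sketch, by a cache whose `prefetch` operation fills a block but returns nothing to the program. The class `SimpleCache`, the function `scan`, the unbounded cache capacity, and the prefetch distance are all hypothetical simplifications introduced for this example.

```python
class SimpleCache:
    """Toy cache with unlimited capacity (an assumption made for brevity)."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = set()
        self.misses = 0

    def prefetch(self, addr):
        # Fetch the block into the cache; no data is delivered to the program.
        self.blocks.add(addr // self.block_size)

    def load(self, addr):
        # A normal memory request: count a miss if the block is absent.
        block = addr // self.block_size
        if block not in self.blocks:
            self.misses += 1
            self.blocks.add(block)

def scan(n, prefetch_distance=None):
    """Load addresses 0..n-1 and return the number of demand misses."""
    cache = SimpleCache()
    for i in range(n):
        if prefetch_distance is not None and i + prefetch_distance < n:
            cache.prefetch(i + prefetch_distance)  # stands in for a compiler-inserted prefetch
        cache.load(i)
    return cache.misses

print(scan(16))                       # 4
print(scan(16, prefetch_distance=4))  # 1
```

With a prefetch distance of one block, only the very first load misses. Every later block has already been requested by an earlier prefetch. The sketch does not model the extra instruction cycles that the prefetch operations themselves consume.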
Hardware prefetching dynamically decides during operation of the software which data will most likely be needed in the future, and prefetches it without software intervention. If it makes the correct decision on what data to prefetch, the data is ready when the software requests it. Decisions on when to prefetch are often made with the assistance of a history buffer. A history buffer retains information related to individual software instructions. It maintains a set of entries cataloguing what has taken place in previous iterations of the instructions.
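One commonly described way to use such a history buffer is stride detection: each entry records, for one instruction, the last data address it referenced and the last observed stride. The sketch below assumes that scheme purely for illustration. The class name `HistoryBuffer`, its entry layout, and the rule of predicting after two equal strides are hypothetical and are not taken from any specific design.

```python
class HistoryBuffer:
    """Per-instruction history of (last data address, last stride)."""

    def __init__(self):
        self.entries = {}  # instruction address -> (last_addr, last_stride)

    def observe(self, pc, addr):
        """Record an access by the instruction at pc.

        Returns an address to prefetch, or None if no prediction is made.
        """
        prediction = None
        if pc in self.entries:
            last_addr, last_stride = self.entries[pc]
            stride = addr - last_addr
            # Predict only when the same nonzero stride was seen twice in a row.
            if stride != 0 and stride == last_stride:
                prediction = addr + stride
            self.entries[pc] = (addr, stride)
        else:
            self.entries[pc] = (addr, None)  # first time this instruction is seen
        return prediction

hb = HistoryBuffer()
predictions = [hb.observe(0x40, a) for a in (100, 108, 116, 124)]
print(predictions)  # [None, None, 124, 132]
```

The first two accesses only establish the history. From the third access onward, the buffer predicts the next address in the sequence without any involvement from the software.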
Each method has its advantages and disadvantages. Software is often more efficient in deciding when to prefetch data. However, extra instruction cycles are required to execute the prefetch instructions. On the other hand, hardware may make more mistakes in deciding when to prefetch, but does not require the extra instruction cycles. Hardware prefetching is also often advantageous for speeding up older code and binaries that were not compiled with software prefetching.
Another architectural feature implemented in some of today's microprocessor architectures is the use of multiple caches. FIG. 1 is a diagram showing some previously known uses of multiple caches in a memory system 105. A processor 110 is connected to registers within a main memory 120. Processor 110 has direct access to the registers. If an instruction or data is needed by processor 110, it is loaded into the registers from a storage device 125.
Multiple caches may be placed between storage device 125 and main memory 120 in a variety of ways. For example, two caches may be placed hierarchically. In modern processors, it is common to have a first level of cache, L1 cache 140, on the same integrated circuit as the processor and main memory 120. A second level of cache, L2 cache 150, is commonly located between L1 cache 140 and storage device 125. Generally, L1 cache 140 is more quickly accessible than L2 cache 150 because L1 cache 140 resides on the same integrated circuit as the processor.
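The lookup order in such a hierarchy can be sketched as follows. The cycle counts are invented placeholders chosen only to make the relative costs visible, and the function name `access` and the use of Python sets to stand in for the caches are likewise assumptions of the example.

```python
# Illustrative latencies only; real values depend on the implementation.
L1_CYCLES, L2_CYCLES, STORAGE_CYCLES = 1, 10, 100

def access(addr, l1, l2):
    """Return the cycles spent locating addr, filling the caches on a miss."""
    if addr in l1:
        return L1_CYCLES
    if addr in l2:
        l1.add(addr)  # promote the item into the first-level cache
        return L1_CYCLES + L2_CYCLES
    # Missed in both levels: fetch from the storage device and fill both caches.
    l2.add(addr)
    l1.add(addr)
    return L1_CYCLES + L2_CYCLES + STORAGE_CYCLES

l1, l2 = set(), set()
first = access(0x1000, l1, l2)   # misses both levels
second = access(0x1000, l1, l2)  # hits in the first level
l1.clear()                       # pretend the item was evicted from L1 only
third = access(0x1000, l1, l2)   # hits in the second level
print(first, second, third)      # 111 1 11
```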
Another way that multiple cache systems are implemented is with parallel caches. This allows multiple memory operations to be done simultaneously. A second cache, L1 cache 142, is located in parallel with L1 cache 140 at the first level. In some applications, L1 cache 142 is a specialized cache for fetching a certain type of data. For example, first L1 cache 140 may be used to fetch data, and second L1 cache 142 may be used to fetch instructions. Alternatively, second L1 cache 142 may be used for data that is referenced by certain instructions that commonly reuse the same data repeatedly throughout a calculation. This often occurs with floating point or graphics operations.
Another approach for using parallel caches is taught in commonly assigned U.S. Pat. No. 5,898,852, issued Apr. 27, 1999, entitled "Load Steering for Dual Data Cache", which is incorporated herein by reference for all purposes. It teaches the use of first L1 cache 140 as a standard data cache and second L1 cache 142 as a prefetch cache for prefetching data as described above.
Additional hardware features may also be included in a cache system to increase the performance of the system. A translation lookaside buffer (TLB) 160 may be added to speed up the access to storage device 125 in the case of a cache miss. Generally, processor 110 references an item of data by a virtual address. A line of data in the cache may be referenced by a tag that is related to the virtual address. However, the data is stored on storage device 125 according to a physical address. If a cache miss occurs, a translation must be done by cache miss handling logic (not shown) to calculate the physical address from the virtual address. This translation may take several clock cycles and cause a performance penalty. TLB 160 is used to hold a list of virtual-to-physical translations, and if the translation is found in the TLB, time is saved in subsequent accesses to the same data.
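As a minimal sketch of the role of the TLB, the model below caches virtual-to-physical page translations and counts how often the slower full translation is needed. The page size, the class name `TLB`, the dictionary used as a page table, and the `walks` counter are assumptions introduced for illustration. The sketch also omits TLB capacity and replacement.

```python
PAGE_SIZE = 4096  # bytes per page (assumed for illustration)

class TLB:
    def __init__(self, page_table):
        self.page_table = page_table  # virtual page number -> physical frame number
        self.entries = {}             # translations already held in the TLB
        self.walks = 0                # number of slow, full translations performed

    def translate(self, vaddr):
        """Return the physical address corresponding to a virtual address."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.entries:
            # TLB miss: perform the full translation and remember the result.
            self.walks += 1
            self.entries[vpn] = self.page_table[vpn]
        return self.entries[vpn] * PAGE_SIZE + offset

tlb = TLB({2: 7})                         # virtual page 2 maps to physical frame 7
print(tlb.translate(2 * PAGE_SIZE + 12))  # 28684
print(tlb.translate(2 * PAGE_SIZE + 100)) # 28772
print(tlb.walks)                          # 1
```

The second access falls in the same page as the first, so its translation is found in the TLB and no further full translation is required.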
A limitation of currently available devices is that significant delays result when an instruction directed toward a parallel cache, such as L1 cache 142, causes a cache miss. These delays occur because the instruction must be recycled to the main cache system to determine the physical address.
Consequently, it is desirable to provide an improved apparatus and method for implementing parallel caches that reduces the instances in which instruction recycling must occur, and for deciding when and what type of instructions to send to the parallel cache. Further, it is desirable to provide an improved architecture and method for prefetching data into a cache memory.