The continuing development of computer systems has driven efforts to increase their performance and efficiency. One approach has been the creation and use of cache systems. The purpose of a cache system is to bring the speed of accessing computer system memory as close as possible to the speed of the central processing unit (CPU) itself. By making instructions and data available to the CPU at a rapid rate, it is possible to increase the performance of the processor. Cache memories are fast memory storage devices: a cache system has an access time that approaches that of CPU components, and is often 5 to 10 times faster than the access time of main memory components. When the CPU makes a data request, the data can be found in one of the processor caches, in main memory, or in a physical storage system (such as a hard disk). Each level consists of progressively slower components. There are usually several levels of cache. The L1 cache, which usually exists on the CPU, is the smallest in size. The larger L2 cache (second-level cache) may also be on the CPU or be implemented off the CPU with SRAM. Main memory is much larger and consists of DRAM, and the physical storage system is much larger again but is also much slower than the other storage areas. A cache system increases the performance of a computer system by predicting what data will be requested next and having that data already stored in the cache, thus speeding execution. The data search begins in the L1 cache, then moves out to the L2 cache, then to DRAM, and then to physical storage.
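By way of illustration only, the search order described above (L1 cache, then L2 cache, then DRAM, then physical storage) may be sketched as follows. The function name `access`, the `contents` mapping, and the latency figures are hypothetical values chosen for the example; they are not taken from any particular system.

```python
# Hypothetical per-level latencies in CPU cycles (illustrative assumptions only).
LEVELS = [("L1", 1), ("L2", 10), ("DRAM", 100), ("storage", 100_000)]


def access(address, contents):
    """Search each level in order and return (level_found, total_cycles).

    `contents` maps a level name to the set of addresses that level holds.
    The cost of every level that was searched is added to the total.
    """
    cycles = 0
    for name, latency in LEVELS:
        cycles += latency
        if address in contents.get(name, set()):
            return name, cycles
    raise KeyError(f"address {address:#x} not present at any level")


if __name__ == "__main__":
    contents = {
        "L1": {0x10},
        "L2": {0x10, 0x20},
        "DRAM": {0x10, 0x20, 0x30},
        "storage": {0x10, 0x20, 0x30, 0x40},
    }
    for addr in (0x10, 0x20, 0x30, 0x40):
        print(hex(addr), access(addr, contents))
```

Under these assumed latencies, a request satisfied by the L1 cache costs 1 cycle, while a request that falls through to DRAM costs 111 cycles, which is the gap that prefetching attempts to hide.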
A process called “prefetching” is known in the art. Prefetching is used to supply memory data to the CPU caches ahead of time to reduce microprocessor access time. By fetching data from a slower storage system and placing it in a faster access location, such as the L1 or L2 cache, the data can be retrieved more quickly. Ideally, a system would prefetch data and instructions far enough in advance that a copy of whatever the CPU needs next would always be in the L1 cache when the CPU needed it. However, prefetching involves a speculative retrieval of data that is anticipated to be needed by the microprocessor in subsequent cycles. Data prefetch mechanisms can be software controlled, by means of software instructions, or hardware controlled, using pattern detection hardware. Each of these prefetch mechanisms has certain limitations.
Software prefetch mechanisms typically use instructions such as Data Stream Touch (DST) to prefetch a block of data. Once the prefetch is started by the software command, hardware is used to prefetch the entire block of data into the cache. If the block of data fetched is large relative to the size of the L1 cache, it is probable that data currently being used by the CPU will be displaced from the L1 cache. The needed displaced lines will have to be refetched by the CPU, resulting in slower performance. In addition, software prefetch instructions may generate access patterns that do not use caches efficiently when prefetching larger lines, such as 128 bytes. For example, a DST instruction can specify a starting address, a block size (1 to 32 vectors, where a vector is 16 bytes), a number of blocks to prefetch (1 to 256 blocks), and a signed stride in bytes (−32768 to +32768). An access pattern that specifies blocks that span cache lines and are irregularly spaced relative to the cache lines will waste cache space and, due to the sparse use of the data in each cache line, will lower performance. Additionally, large amounts of hardware may be required to implement the full scope of the software prefetch instruction.
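By way of illustration only, the cache-space waste described above may be estimated with the following sketch. It models a DST-style request using the four parameters named in the text (starting address, block size in 16-byte vectors, block count, and signed byte stride) and counts how many cache lines must be fetched to cover the requested bytes. The function name `dst_utilization` and the default 128-byte line size are assumptions made for the example; the sketch does not model any actual DST hardware.

```python
VECTOR_BYTES = 16  # a vector is 16 bytes, per the description above


def dst_utilization(start, block_vectors, block_count, stride, line_bytes=128):
    """Return (lines_fetched, bytes_requested, utilization) for a DST-style pattern.

    utilization is the fraction of fetched cache-line bytes that the
    access pattern actually requested.
    """
    block_bytes = block_vectors * VECTOR_BYTES
    requested = set()  # byte addresses requested; a set tolerates overlapping blocks
    for i in range(block_count):
        base = start + i * stride
        requested.update(range(base, base + block_bytes))
    # Every requested byte pulls in the whole cache line that contains it.
    lines = {addr // line_bytes for addr in requested}
    fetched_bytes = len(lines) * line_bytes
    return len(lines), len(requested), len(requested) / fetched_bytes


if __name__ == "__main__":
    # Line-aligned, contiguous blocks: every fetched byte is used.
    print(dst_utilization(start=0, block_vectors=8, block_count=4, stride=128))
    # Small blocks, irregularly spaced relative to the cache lines.
    print(dst_utilization(start=96, block_vectors=2, block_count=4, stride=200))
```

In the first call, four 128-byte blocks map onto exactly four lines and all fetched bytes are used. In the second call, four 32-byte blocks touch five lines (one block straddles a line boundary), so only 128 of the 640 fetched bytes are used.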
Hardware mechanisms prefetch a stream of data and generally can be designed to prefetch only as far ahead as the cache and memories require. Because hardware mechanisms detect a stream, the stream logic has to generate enough prefetches to get the designated number of lines ahead of the actual processor accesses. Once the hardware is far enough ahead, the lines are prefetched at the rate at which the processor consumes them. Often, however, especially when a hardware prefetch is first started, there is a delay in the prefetch process, because the hardware has to detect the access pattern before it can start prefetching. Additionally, if the hardware does not know the length of the access pattern, it can fetch beyond the end of the required data block. These inefficiencies are amplified when the data stream being prefetched is a short stream. The wasted memory bandwidth due to unused fetches becomes a larger problem in systems that prefetch data from a plurality of L1 and L2 caches, as is becoming more common in larger, faster systems having multiple processors.
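By way of illustration only, the start-up delay and overshoot described above may be seen in the following simplified simulation of an ascending-stream prefetcher. The function name `simulate_stream`, the detection threshold `detect_after`, and the prefetch `depth` are hypothetical parameters chosen for the example. The sketch further assumes that a prefetched line arrives before the processor's next access, which real hardware does not guarantee.

```python
def simulate_stream(stream_lines, detect_after=2, depth=4):
    """Simulate sequential accesses to cache lines 0 .. stream_lines - 1.

    Returns (demand_misses, useful_prefetches, wasted_prefetches).
    The prefetcher does not know the stream length, so it keeps running
    `depth` lines ahead of the processor until the accesses stop.
    """
    prefetched = set()
    next_prefetch = None  # next line the prefetcher will request
    misses = 0
    for line in range(stream_lines):
        if line not in prefetched:
            misses += 1  # the processor had to wait for this line
        # Assumed detection rule: the stream is confirmed once
        # `detect_after` sequential accesses have been observed.
        if line + 1 >= detect_after:
            if next_prefetch is None:
                next_prefetch = line + 1
            # Ramp up to `depth` lines ahead, then keep pace with the processor.
            while next_prefetch <= line + depth:
                prefetched.add(next_prefetch)
                next_prefetch += 1
    useful = sum(1 for p in prefetched if p < stream_lines)
    wasted = len(prefetched) - useful
    return misses, useful, wasted


if __name__ == "__main__":
    print(simulate_stream(4))    # short stream
    print(simulate_stream(100))  # long stream
```

With these assumed parameters, a 4-line stream incurs 2 demand misses and issues 6 prefetches, of which 4 fall beyond the end of the stream. A 100-line stream incurs the same 2 misses and the same 4 wasted prefetches, but against 98 useful ones, which is why the inefficiency is proportionally greater for short streams.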
Therefore, what is needed is a system and method of efficiently utilizing prefetch logic so as to maximize CPU performance without requiring additional hardware.