1. Field of Invention
This invention relates to prefetching data from computer memory. Specifically, this invention relates to a technique for improving the data bandwidth of the processing unit of a computer by using information extracted from previous data load instructions to predict which data will be requested by subsequent instructions.
2. Description of Related Art
Modern computer systems utilize a hierarchy of memory elements in order to realize an optimum balance between the speed, size, and cost of computer memory. Most such computer systems employ one or more DRAM arrays as primary memory and typically include a larger, but much slower, secondary memory such as, for instance, a magnetic storage device or CD-ROM. A small, fast SRAM cache memory is typically provided between the central processing unit (CPU) and primary memory. This fast cache memory increases the data bandwidth of the computer system by storing information most frequently needed by the CPU. In this manner, information most frequently requested during execution of a computer program may be rapidly provided to the CPU from the SRAM cache memory, thereby eliminating the need to access the slower primary and secondary memories. Although fast, the SRAM cache memory is very expensive and should thus be of minimal size in order to reduce cost. Accordingly, it is advantageous to maximize the frequency with which information requested by the CPU is stored in cache memory.
FIG. 1 is an illustration of a general purpose computer 10 including a CPU 12 having an on-board, or internal, cache memory 14. Typically, the internal cache 14 is divided into an instruction cache (I$), in which the most frequently requested instructions are stored, and a data cache (D$), in which the most frequently requested data is stored. The computer also includes an external cache (E$) 16 and a primary memory 18. During execution of a computer program, the computer program instructs the CPU 12 to fetch instructions by incrementing a program counter within the CPU 12. In response thereto, the CPU 12 fetches the instructions identified by the program counter. If the instruction requests data, an address request specifying the location of that data is issued. The CPU 12 first searches the internal cache 14 for the specified data. If the specified data is found in the internal cache 14, hereafter denoted as a cache hit, that data is immediately provided to the CPU 12 for processing.
If, on the other hand, the specified data is not found in the internal cache 14, the external cache 16 is then searched. If the specified data is not found in the external cache 16, then the primary memory 18 is searched. The external cache 16 and primary memory 18 are controlled by an external cache controller 20 and a primary memory controller 22, respectively, both of which may be housed within the CPU 12. If the specified data is not found in the primary memory 18, access is requested to system bus 24 which, when available, routes the address request to a secondary memory 26 via an I/O controller 28.
When the specified data is located in memory external to the CPU 12, i.e., in the external cache 16, the primary memory 18, or the secondary memory 26, the data specified by the address request is routed to the CPU 12 for processing and, in addition, a corresponding row of data is loaded into the internal cache 14. In this manner, subsequent address requests identifying other information in that row will result in an internal cache hit and, therefore, will not require access to the much slower external memory. As a result, latencies associated with accessing primary memory may be hidden, thereby increasing the data bandwidth of the CPU 12.
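The effect of loading a whole row on a miss may be illustrated with the following minimal sketch. It is not taken from the described system: the class name `RowCache`, the row size of four words, and the stand-in memory contents are assumptions chosen only for illustration, and the toy cache never evicts rows.

```python
ROW_SIZE = 4  # words per cache row (illustrative assumption)


class RowCache:
    """Toy internal cache that stores whole rows, keyed by row number.

    Hypothetical model: unlimited capacity, no eviction.
    """

    def __init__(self):
        self.rows = {}
        self.hits = 0
        self.misses = 0

    def load(self, address, memory):
        row_number = address // ROW_SIZE
        offset = address % ROW_SIZE
        if row_number in self.rows:
            self.hits += 1
        else:
            self.misses += 1
            start = row_number * ROW_SIZE
            # On a miss, load the entire row containing the address,
            # not just the single requested word.
            self.rows[row_number] = memory[start:start + ROW_SIZE]
        return self.rows[row_number][offset]


memory = list(range(100, 116))  # 16 words of stand-in external memory
cache = RowCache()
values = [cache.load(address, memory) for address in (4, 5, 6, 7, 8)]
```

In this sketch, the request for address 4 misses and loads the row holding addresses 4 through 7, so the requests for addresses 5, 6, and 7 hit in the cache; address 8 lies in the next row and misses again.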
The processing of an address request through a memory hierarchy is illustrated in FIG. 2. First, the CPU program counter (PC) is incremented to specify a new address and, in response thereto, a corresponding instruction is fetched (step 40). Where, for instance, the instruction requests data, an address request specifying that data is provided to the data cache (D$) of the internal cache 14 for searching (step 42). If the specified data is in the data cache (a D$ hit), as tested at step 44, the specified data is immediately provided to the CPU (step 46). If the specified data is not in the data cache (a D$ miss), the external cache is searched for the specified data (step 48).
If the specified data is found in the external cache (an E$ hit), as tested at step 50, then the specified data is loaded into the data cache (step 52) and processing proceeds to step 44. If the specified data is not found in the external cache, then primary memory is searched (step 54). If the specified data is found in primary memory, as tested at step 56, it is loaded into the data cache (step 52) and provided to the CPU for processing; otherwise the specified data is retrieved from secondary memory (step 58) and loaded into the data cache and provided to the CPU.
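The search order of FIG. 2 may be sketched as follows. This is a simplified illustration under stated assumptions, not the described hardware: each memory level is modeled as a Python dictionary, the function name `fetch_data` is hypothetical, a single word rather than a row is copied into the data cache, and the return to step 44 after a fill is collapsed into a direct return.

```python
def fetch_data(address, d_cache, e_cache, primary, secondary):
    """Return (value, source), searching levels in the order of FIG. 2.

    Each level is a dict mapping address -> value (an assumption made
    for illustration only).
    """
    # Steps 42 and 44: search the data cache first.
    if address in d_cache:
        return d_cache[address], "D$"  # step 46: D$ hit
    # Steps 48 and 50: on a D$ miss, search the external cache.
    if address in e_cache:
        value, source = e_cache[address], "E$"
    # Steps 54 and 56: on an E$ miss, search primary memory.
    elif address in primary:
        value, source = primary[address], "primary"
    # Step 58: otherwise retrieve the data from secondary memory.
    else:
        value, source = secondary[address], "secondary"
    # Step 52: load the data into the data cache for later requests.
    d_cache[address] = value
    return value, source


d_cache = {}
e_cache = {0x10: "a"}
primary = {0x10: "a", 0x20: "b"}
secondary = {0x10: "a", 0x20: "b", 0x30: "c"}

first = fetch_data(0x20, d_cache, e_cache, primary, secondary)
second = fetch_data(0x20, d_cache, e_cache, primary, secondary)
```

Here the first request for address 0x20 is satisfied from primary memory, and the repeated request is satisfied from the data cache because step 52 has already loaded the value.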
As shown in FIG. 1, there are additional devices connected to the system bus 24. For example, FIG. 1 illustrates an input/output controller 30 operating as an interface between a graphics device 32 and the system bus 24. In addition, the figure illustrates an input/output controller 34 operating as an interface between a network connection circuit 36 and the system bus 24.
Since the access speeds of primary memory, e.g., of DRAM, are not increasing as quickly as the processing speeds of modern CPUs, it is becoming increasingly important to hide primary memory latencies. As discussed above, primary memory latencies are hidden every time there is an internal cache hit, because the requested information is then immediately provided to the CPU for processing without accessing primary memory.
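The benefit of internal cache hits may be approximated with a simple average access time estimate. The sketch below is illustrative only: the function name `average_access_cycles` and the cycle counts are assumptions, not figures from the described system, and the model considers a single cache level in front of primary memory.

```python
def average_access_cycles(hit_rate, cache_cycles, memory_cycles):
    """Estimate the average cycles per data request.

    Every request pays the cache lookup time; only misses additionally
    pay the primary memory latency. All inputs are hypothetical.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_rate = 1.0 - hit_rate
    return cache_cycles + miss_rate * memory_cycles


# Illustrative numbers only: a 1-cycle cache and a 50-cycle primary memory.
low_hit_rate = average_access_cycles(0.50, 1, 50)
high_hit_rate = average_access_cycles(0.90, 1, 50)
```

Under these assumed numbers, raising the hit rate from 50% to 90% reduces the estimated average from about 26 cycles to about 6 cycles per request, which is why a higher hit rate hides more of the primary memory latency.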
The data bandwidth of a computer system may also be increased by providing an additional parallel pipeline such that, for instance, two data requests may be performed per cycle. To accommodate the additional pipeline, the existing data cache may be dual ported or an additional data cache may be provided in parallel to the existing data cache. Each of these options, however, effectively doubles the cost of data cache memory. For instance, dual porting the existing data cache, while not significantly increasing the total size of the data cache, results in halving the effective data cache memory available for each of the pipelines. On the other hand, providing in parallel an additional data cache similar in size to the existing data cache, while preserving the effective cache memory available for each pipeline, undesirably results in a doubling of the effective size of the data cache. As a result, there is a need to accommodate an additional parallel pipeline without doubling the cost of data cache memory.