1. Field of the Invention
The present invention relates to a method of and a circuit for prefetching instructions/data using a non-referenced prefetch (NRP) cache, capable of not only reducing the number of cache misses and decreasing the access latency to the lower memory, but also reducing the memory traffic by storing blocks prefetched from the lower memory, but not referenced by the central processing unit (CPU) in an NRP cache.
2. Description of the Prior Art
Although CPUs are progressing very rapidly in performance, the progression of memories in performance is not as fast as that of such CPUs. Such a performance gap between the CPU and the memory will become more severe as time goes by. Efficiently constructing memory hierarchy is highlighted as an important factor that affects the overall performance of the computer system.
In order to obtain an efficient memory system, most modern computer systems use a cache memory. Basically, such a cache memory is adapted to utilize the locality of reference exhibited when the computer executes a program. During the execution of a program in a computer, the central processing unit (CPU) of the computer references the lower memory. In a certain limited time period, such references are concentrated in some portion of the lower memory. If the CPU references an item, there is a tendency for the items having adjacent addresses to also he referenced. This is referred to as "spatial locality". In case of general programs. several loops are executed which require most of the entire program's execution time. In one loop, the same instructions are repeatedly executed. In particular, it is often the case where the currently referenced instructions will be referenced again shortly thereafter. This is referred to as "temporal locality". The principle of the cache memory is to utilize such reference localities, namely, to store the frequently used blocks of the lower memory in a memory with a very high reference speed interposed between the lower memory and the CPU. Since most of the application programs have both the above-mentioned features, the cache memory can process 90% or more of the entire references to the lower memory by the CPU even when it has a small memory capacity of several kilobytes.
In the use of the cache memory, if an instruction block which the CPU is to reference exists within the cache memory, the CPU readily refers to that block. This status is called "cache hit". On the other hand, the status that the cache memory has no instruction block to be referenced by the CPU is called "cache miss". A hit ratio is used as a measure of the performance of the cache memory. The hit ratio is expressed by the following equation: ##EQU1##
Meanwhile, cache misses are classified into three kinds, namely, compulsory miss, conflict miss, and capacity miss. The compulsory miss is a cache miss occurring when a block is initially referenced. The conflict miss is a cache miss occurring when a block mapped in the corresponding cache memory location, but subsequently replaced by another block is referenced. On the other hand, the capacity miss is a cache miss occurring when the working set, that is, a set of pages frequently referenced by the CPU in execution of an application program, has a larger capacity than the cache memory.
These kinds of cache misses have different ratios thereof to the entire cache misses as the capacity of the cache memory increases. Although the number of compulsory misses is approximately constant irrespective of a variation in the capacity of the cache memory, numbers of conflict misses and capacity misses tend to decrease as the capacity of the cache memory increases. Since the on-chip cache memory capacity has an increasing trend owing to the development of very large scale integration (VLSI), the portion of compulsory misses over the entire cache misses also tends to increase. Under such circumstances, methods for effectively reducing the number of compulsory misses become more important because most modern computer systems use a large capacity on-chip cache memory.
Of the methods for reducing the number of compulsory misses, the simplest one is to increase the size of a cache block. This is because more instructions are brought into the cache memory on a cache miss and further cache misses caused by sequential access of memory references can be reduced with a larger cache block. However, this approach involves an increase in the CPU cycle taken to fetch a block from the lower memory to the cache memory and an increase in the memory traffic. Moreover, cache pollution may occur due to the large cache block size because the entire block should be replaced on a cache miss even when only a portion of the block is referenced. Such cache pollution results in a degradation in performance. Because the size of a cache block should be selected based oil architectural features such as memory latency and transfer rate, it can not be simply increased only for the purpose of instruction prefetching.
In order to overcome the above-mentioned disadvantages, various prefetching techniques halve been proposed. Prefetchlinig is to fetch a memory block expected to be referenced from the lower memory to the upper memory prior to a reference to the memory block by the CPU. The sequential prefetching is the simplest form of prefetching. In sequential prefetching, the sequential next block of the block currently referenced by the CPU is prefetched. This sequential prefetching technique can provide a great increase in performance for an application program involving a memory reference with a large sequentiality. Generally, the reference to instructions has a large sequentiality as compared to the reference to data. By virtue of such a feature, the sequential prefetching provides a relatively superior performance. Also, the sequential prefetching requires less additional hardware to achieve its prefetch purpose. When the reference to instructions does not follow sequential execution flows, however, no performance improvement is expected by the sequential prefetching. In other words, the sequential prefetching has a disadvantage that the performance gain obtained by the sequential prefetching is not so high when the memory reference is executed along non-sequential execution flows, as in the case of conditional branch instructions or unconditional branch instructions.
Target prefetching is to determine a block to be prefetched by using history information about previous control flows stored in a prediction table. This target prefetching is based on the fact that branch instructions, even if they are conditional branches, tend to follow the previous control flows again in many cases. If a sequential block of a particular block was referenced in previous execution, the sequential block will be prefetched. On the other hand, if a non-sequential block of a particular block was referenced to in previous execution, the non-sequential block will be prefetched. For example, where a block B was referenced after block A in previous execution, the block B will be prefetched upon referencing the block A at later times. Target prefetching has a higher prefetch accuracy thin sequential prefetching because it utilizes the characteristic of branch instructions. However it is difficult to expect a great improvement in performance by the target prefetchinig because branch instructions do not always follow the previous control flows and especially in cases that the memory reference to branch instructions follows sequential and non-sequential control flows in an alternating manner.
A hybrid prefetching is a prefetch method to prefetch both sequential and non-sequential blocks to overcome these shortcomings. But the hybrid prefetching can used in supercomputers which have less memory bandwidth restriction. In the case of microprocessors, the hybrid prefetching has not been considered due to the limited bandwidth. Only a modified form of the hybrid prefetching such as a prefetch method to execute a prefetch on the basis of the priority among blocks to be prefetched can be performed in microprocessor-based systems. The priority among blocks to be prefetched can be determined using information about the previous control flows. This hybrid prefetching is a mechanism modified from existing hybrid prefetch mechanisms. This mechanism will be described in more detail.
Where the memory reference was executed to follow sequential control flows in the previous execution, the sequential next block of the block currently referenced (the sequential next block has a block address corresponding to "the address of the current block +1") is determined as the first one of the candidate blocks to be prefetched. Also, the target block of the current block is determined as the second one of the candidate blocks to be prefetched. On the other hand, where the memory reference was executed to follow non-sequential control flows in the previous execution, the target block of the current block is determined as the first one of the candidate blocks to be prefetched. Also, the sequential next block of the block currently referenced is determined as the second one of the candidate blocks to be prefetched. After determining the first and second candidate blocks to be prefetched, a prefetching operation for the determined candidate blocks is executed. If the first candidate block is not resident in the on-chip memory, it is prefetched from the lower memory to the on-chip memory. After completing this prefetching, the prefetching operation for the candidate blocks is completed. When the first candidate block is resident in the on-chip memory, the second candidate block is prefetched from the lower memory to the on-chip memory if it is not resident in the on-chip memory. On the other hand, when both the first and second candidate blocks are resident in the on-chip memory, no prefetching operation is executed. Thus, the modified hybrid prefetch mechanism can prefetch both the sequential block and the target block in many cases. However, this scheme can not achieve much improvement in performance. Although the improvement in performance achieved by the existing hybrid prefetching technique is not so little, the technique has a disadvantage that it can be implemented only in computer systems that have less restriction on memory bandwidth such as supercomputers.