1. Field of the Invention
The present invention relates to data and instruction access in a computer system and, more particularly, to a method and an architecture capable of accessing data and instructions using store and forward.
2. Description of Related Art
The processing speed of the CPU of a modern computer has increased significantly, and this trend is continuing. A corresponding increase in memory access speed is required to raise the overall data and/or instruction access efficiency of the computer. In other words, a relatively slow memory is a bottleneck for the efficiency of the computer. To solve this problem, the cache memory was developed, in which a memory access unit is defined to have a constant length composed of a predetermined number of instructions or data, and such a unit is called a cache line. The length of the unit is critical. For example, in a memory having a burst transfer capability, multiple data accesses can be performed by giving only one address and the associated settings, so that a data string having the assigned burst length is transferred continuously. As a result, the initial delay prior to data transfer is incurred once per burst rather than once per access. In such a memory, the length of the cache line is related to the burst length.
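The effect of the burst length on the initial delay can be illustrated with a rough cost model. The following is only a sketch under stated assumptions: the function name `transfer_cycles`, the one-cycle-per-word transfer time, and the five-cycle initial delay are hypothetical values chosen for illustration and are not taken from any particular memory device.

```python
def transfer_cycles(words, burst_length, init_delay, cycles_per_word=1):
    """Cycles needed to move `words` words in bursts of `burst_length`.

    Assumption: every burst pays `init_delay` once (address and setup),
    then each word of the burst takes `cycles_per_word` cycles.
    """
    bursts = -(-words // burst_length)  # ceiling division
    return bursts * init_delay + words * cycles_per_word


# Hypothetical numbers: 8 words, initial delay of 5 cycles.
for burst_length in (1, 4, 8):
    print(burst_length, transfer_cycles(8, burst_length, init_delay=5))
# prints:
# 1 48
# 4 18
# 8 13
```

Under these assumed numbers, a longer burst pays the initial delay fewer times, which is why the cache line length is tied to the burst length.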
FIG. 1 schematically presents a conventional processor architecture having the above cache capability. As shown, when a cache line holding the required data or instructions is in the cache module 11, the processor kernel 14 can fetch the required data or instructions from the cache module 11 directly with little or no time delay. However, if the required data or instructions are not in the cache module 11, a cache miss is encountered. At this moment, the processor kernel 14 has to command the cache module 11 to read the required data or instructions from a memory device 13. Such an operation is called a cache refill. A significant system delay (called the cache miss penalty) thus occurs, since the required cache lines have to be stored in the cache module 11 before they can be accessed.
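The hit, miss, and refill behavior described above can be sketched in a few lines. This is a minimal illustration and not the circuit of FIG. 1: the class name `CacheModule`, the four-word line length, and the use of a Python list as the memory device are all assumptions made for the example.

```python
class CacheModule:
    """Toy model of a cache that is refilled one whole line at a time."""

    def __init__(self, memory, line_words=4):
        self.memory = memory          # stands in for the memory device
        self.line_words = line_words  # assumed cache line length in words
        self.lines = {}               # line index -> list of words

    def read(self, addr):
        """Return (word, hit). On a miss, refill the whole line first."""
        index, offset = divmod(addr, self.line_words)
        hit = index in self.lines
        if not hit:
            # Cache refill: the entire line is copied before the
            # requested word can be returned.
            base = index * self.line_words
            self.lines[index] = self.memory[base:base + self.line_words]
        return self.lines[index][offset], hit


cache = CacheModule(memory=list(range(100, 116)))
print(cache.read(1))  # (101, False): cache miss, line 0 is refilled
print(cache.read(2))  # (102, True): same line, cache hit
print(cache.read(4))  # (104, False): next line, cache miss again
```

Note that the miss on word 1 loads words 0 to 3 even though only one word was requested, which is the source of the cache miss penalty.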
The cache miss penalty often occurs repeatedly when the processor kernel 14 accesses a certain section of program code or data for the first time. This can adversely affect the performance of the computer system. To solve this problem, prefetching has been proposed. As shown in FIG. 2, a prefetch module 12 is provided between the cache module 11 and the memory device 13. The prefetch module 12 predicts the sections of program code or data likely to be used next by the processor kernel 14 and reads them into the prefetch module 12. Once the processor kernel 14 finds that it is unable to get the required data or instructions from the cache module 11 (i.e., a cache miss has occurred), the prefetch module 12 is searched for the data or instructions. If the required data or instructions are already in the prefetch module 12, the access succeeds, and the required cache lines are read from the prefetch module 12 and stored in the cache module 11. As a result, the cache miss penalty is greatly reduced. However, a prefetch miss may still occur if the required data or instructions are not in the prefetch module 12. In that case, the required cache lines must still be obtained from the external memory device 13. Thus, a significant system delay (called the prefetch miss penalty) occurs.
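The lookup order described above (cache module first, then prefetch module, then memory device) can be sketched as follows. This is an illustrative model only. The class name `PrefetchingCache`, the single-line prefetch buffer, and in particular the next-sequential-line prediction are assumptions; the text does not specify how the prefetch module 12 makes its prediction.

```python
class PrefetchingCache:
    """Toy model of a cache module backed by a one-line prefetch buffer."""

    def __init__(self, memory, line_words=4):
        self.memory = memory
        self.line_words = line_words  # assumed cache line length in words
        self.cache = {}               # line index -> words (cache module)
        self.prefetched = {}          # line index -> words (prefetch module)

    def _load_line(self, index):
        base = index * self.line_words
        return self.memory[base:base + self.line_words]

    def read(self, addr):
        """Return (word, source), source in {'cache', 'prefetch', 'memory'}."""
        index, offset = divmod(addr, self.line_words)
        if index in self.cache:
            source = "cache"                      # cache hit
        elif index in self.prefetched:
            self.cache[index] = self.prefetched.pop(index)
            source = "prefetch"                   # cache miss, prefetch hit
        else:
            self.cache[index] = self._load_line(index)
            source = "memory"                     # cache miss, prefetch miss
        if source != "cache":
            # Assumed predictor: prefetch the next sequential line and
            # discard whatever the prefetch buffer held before.
            nxt = index + 1
            if nxt not in self.cache and nxt * self.line_words < len(self.memory):
                self.prefetched = {nxt: self._load_line(nxt)}
        return self.cache[index][offset], source


pc = PrefetchingCache(memory=list(range(200)))
print(pc.read(0))   # (0, 'memory'): cold miss, line 1 is prefetched
print(pc.read(4))   # (4, 'prefetch'): served from the prefetch buffer
print(pc.read(80))  # (80, 'memory'): prediction was wrong, prefetch miss
```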
Conventionally, the architecture of the prefetch module 12 is configured to be the same as that of the cache module 11, and thus the cache line is employed as the data length of the prefetch module 12. In other words, the length of a burst transfer in a dynamic random access memory (DRAM) is taken as the data transfer unit. However, neither the interface between the prefetch module 12 and the cache module 11 nor the interface between the cache module 11 and the processor kernel 14 is a DRAM interface. Hence, it is not necessary to take the cache line as the data transfer unit on these interfaces. In practice, the data transfer rate may be significantly lowered if the cache line is used as the data transfer unit.
Specifically, three interfaces are provided in the processor structure having the cache module 11 and the prefetch module 12. The first interface 15 is an external interface between the prefetch module 12 and the external memory device 13. The second interface 16 is provided between the prefetch module 12 and the cache module 11. The third interface 17 is provided between the cache module 11 and the processor kernel 14 for transferring data/instructions from the cache module 11 to the processor kernel 14. Conventionally, the data transfer unit used on each of the first and second interfaces 15 and 16 is the same as the data length of the cache line. As for a data access via the third interface 17, if it depends on a data access over either the first or the second interface, it can be performed only after the access of the entire cache line has been completed. However, the data length of the cache line is not an optimum data transfer unit between the prefetch module 12 and any one of the memory device 13, the cache module 11, and the processor kernel 14. This is because the length of the cache line is determined by the structure of the cache module 11. In principle, the length of the cache line is fixed during the working cycles of the processor kernel 14. However, the processor kernel 14 accesses data/instructions dynamically during execution. Hence, optimum performance of the processor kernel 14 is not obtained if a cache line of fixed length is taken as the data transfer unit. As a result, resources are wasted.
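The cost of using the cache line as the transfer unit on the first and second interfaces can be made concrete with a simple estimate. The sketch below is only an approximation under loudly stated assumptions: the function name `wait_cycles_line_unit`, the one-cycle-per-word rate on each interface, the three-cycle initial delay, and the absence of any overlap between the two transfers are all hypothetical.

```python
def wait_cycles_line_unit(line_words, init_delay, cycles_per_word=1):
    """Cycles the kernel waits for ANY word of a line missing everywhere.

    Assumption: the whole line must first cross the first interface
    (memory device to prefetch module), then the whole line must cross
    the second interface (prefetch module to cache module), and only
    then may the kernel read over the third interface.
    """
    first_interface = init_delay + line_words * cycles_per_word
    second_interface = line_words * cycles_per_word
    return first_interface + second_interface


# The kernel asks for a single word, yet the wait grows with line length.
for line_words in (4, 8, 16):
    print(line_words, wait_cycles_line_unit(line_words, init_delay=3))
# prints:
# 4 11
# 8 19
# 16 35
```

In this model the wait is the same whether the kernel needs the first word of the line or the last one, which illustrates why a fixed line-sized transfer unit is not matched to the dynamic access pattern of the processor kernel 14.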
With reference to the timing diagram of FIG. 3, it is assumed that the instruction/data sequence required by the processor kernel 14 is read sequentially starting from instruction/data #0. As shown, “Init” refers to the initial delay. Instructions/data read by the prefetch module 12 are first sent to the cache module 11 and then sent from there to the processor kernel 14. A cache miss occurs when instruction/data #0 is read. Hence, before obtaining the required instruction/data, the processor kernel 14 must wait for the read from the memory device 13 to the prefetch module 12 to complete and for the read data to be transferred from the prefetch module 12 to the cache module 11. A cache miss also occurs when instruction/data #4 is read. In this case, however, thanks to the prefetch module 12, the processor kernel 14 does not have to wait for a read from the memory device 13 to the prefetch module 12. Instead, the instruction/data is sent directly from the prefetch module 12 to the cache module 11 before being accessed by the processor kernel 14.
FIG. 4 shows a timing diagram of another example. It is assumed that the instruction/data sequence required by the processor kernel 14 jumps from instruction/data #2 to instruction/data #80 after the reading of instruction/data #2 has been completed. This in turn causes a cache miss when instruction/data #80 is read. Further, a prefetch miss occurs, since instruction/data #80 is not any of the instructions/data #4, #5, #6, and #7 read by the prefetch module 12. As such, the instructions/data already read into the prefetch module 12 must be discarded. In response, the prefetch module 12 must be activated again to read the required instruction/data from the memory device 13 and transfer it to the cache module 11. Only then can the processor kernel 14 access the instruction/data. Such waiting by the processor kernel 14 may adversely affect the data transfer rate of the computer.
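The sequential case of FIG. 3 and the jump case of FIG. 4 can be compared with a small stall-cycle simulation. This is a rough sketch and does not reproduce the actual timing diagrams. The function name `stall_cycles`, the four-word line, the three-cycle initial delay, the one-cycle-per-word rate, the next-sequential-line prediction, and the assumption that each prefetch completes in the background before it is needed are all hypothetical.

```python
def stall_cycles(sequence, line_words=4, init_delay=3):
    """Return the cycles the kernel stalls on each access in `sequence`."""
    cached = set()      # line indices held by the cache module
    prefetched = None   # line index held by the prefetch module
    stalls = []
    for addr in sequence:
        index = addr // line_words
        if index in cached:
            stalls.append(0)                 # cache hit
            continue
        if index == prefetched:
            # Prefetch hit: only the prefetch-to-cache transfer is paid.
            stalls.append(line_words)
        else:
            # Prefetch miss: the prefetched line is discarded and the
            # full memory-to-prefetch-to-cache path is paid.
            stalls.append(init_delay + 2 * line_words)
        cached.add(index)
        prefetched = index + 1               # assumed sequential predictor
    return stalls


print(stall_cycles(range(8)))       # [11, 0, 0, 0, 4, 0, 0, 0]
print(stall_cycles([0, 1, 2, 80]))  # [11, 0, 0, 11]
```

Under these assumptions, the sequential run pays the full penalty only on #0 and a shorter stall on #4, whereas the jump to #80 pays the full penalty a second time because the prefetched line #4 to #7 is of no use.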