A typical central processing unit (CPU) architecture consists of many pipeline stages. The CPU executes instructions in a pipelined fashion: consecutive instructions in the program flow are processed simultaneously in consecutive pipeline stages of the CPU, much like a manufacturing assembly line. This process can greatly increase the number of instructions the CPU executes per second. While the logic pipeline depth can be increased by splitting the logic across multiple pipeline stages, memory access in the data processing system often proves to be the bottleneck.
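The throughput benefit of pipelining can be illustrated with a minimal sketch. It assumes an idealized pipeline with no stalls or hazards; the function names and the example figures (100 instructions, 5 stages) are hypothetical and chosen only for illustration.

```python
def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles to finish all instructions in an ideal pipeline with no stalls."""
    if num_instructions == 0:
        return 0
    # The first instruction needs num_stages cycles to pass through the
    # pipeline; each later instruction completes one cycle after the previous.
    return num_stages + (num_instructions - 1)


def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles when each instruction finishes every stage before the next starts."""
    return num_instructions * num_stages


# Hypothetical example: 100 instructions on a 5-stage design.
print(pipelined_cycles(100, 5), unpipelined_cycles(100, 5))
```

Under these assumptions the pipelined design approaches one completed instruction per cycle as the instruction count grows, which is the assembly-line effect described above.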
Many central processing unit architectures employ cache memory to speed memory access. A cache memory is a small, fast memory located close to the central processing unit core. The cache memory stores shadow copies of data from a much larger and more distant main memory. Access time to data stored in the cache is faster than access time to data stored in main memory. Cache memory is advantageous due to locality of memory access: once data at a particular address location has been used, it is likely to be used again in the near future. Making the initial shadow copy of data in the cache has no advantage in itself, because that first access must still wait for main memory. However, additional accesses to that data in the near future can be serviced from the cache without waiting the longer time for access to main memory.
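The effect of locality can be sketched with a toy model. The following assumes a direct-mapped cache holding one word per line, and the cycle costs (1 cycle per hit, 20 cycles per miss) are hypothetical values, not figures from any particular design.

```python
def simulate(trace, num_lines=8, hit_cycles=1, miss_cycles=20):
    """Replay an address trace through a toy direct-mapped cache.

    Returns (hits, misses, total_cycles). All parameters are illustrative.
    """
    tags = [None] * num_lines  # one stored tag per cache line; None = empty
    hits = misses = 0
    for addr in trace:
        index = addr % num_lines   # which cache line the address maps to
        tag = addr // num_lines    # identifies which address occupies the line
        if tags[index] == tag:
            hits += 1              # shadow copy already present
        else:
            misses += 1            # fetch from main memory, keep a shadow copy
            tags[index] = tag
    return hits, misses, hits * hit_cycles + misses * miss_cycles


# The same address used four times: one slow first access, three fast repeats.
print(simulate([10, 10, 10, 10]))
```

In this sketch the first access to address 10 pays the full main memory cost and only the repeated accesses benefit. Without a cache, the same four accesses would cost 80 cycles under the assumed miss cost.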
There is a tension between cache size and access speed. A larger cache may be advantageous because there is a higher probability that a particular data access will hit within the cache. However, a larger cache typically requires more area on the integrated circuit holding the central processing unit, and this larger size also tends to make accesses slower. Thus any cache size selection is a compromise between contradictory goals: as the amount of on-chip memory increases, access times increase and can cause performance issues.
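One common way to reason about this compromise is the average memory access time: hit time plus miss rate times miss penalty. The sketch below applies that formula to three cache configurations; every number in it is a hypothetical value picked to show the shape of the trade-off, not a measurement.

```python
def average_access_cycles(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average memory access time: hit time + miss rate * miss penalty."""
    return hit_cycles + miss_rate * miss_penalty_cycles


# Hypothetical configurations: larger caches miss less often but hit slower.
MISS_PENALTY = 16  # assumed cycles to reach main memory
configs = {
    "small":  {"hit_cycles": 1, "miss_rate": 0.25},
    "medium": {"hit_cycles": 2, "miss_rate": 0.125},
    "large":  {"hit_cycles": 4, "miss_rate": 0.0625},
}

for name, cfg in configs.items():
    cycles = average_access_cycles(cfg["hit_cycles"], cfg["miss_rate"], MISS_PENALTY)
    print(name, cycles)
```

With these assumed numbers the medium configuration gives the lowest average, while the largest cache loses the gain from its lower miss rate to its slower hits.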
This invention is directed at the problem of performance bottlenecks created in memory-to-memory data paths. Memory-to-memory data paths often occur in cache systems where cache memory misses are serviced by other on-chip main memory. To ensure minimum cache miss latency, a data fetch operation needed to service a cache miss in the cache memory from the main memory should use the smallest possible number of cycles. A single instruction cycle is best. Because the main memory access time is large, the delay of the data path from the main memory to the cache memory is typically longer than the target clock period. The prior art typically responded to this situation in one of two ways. The designer may reduce the clock frequency to permit single-cycle access to main memory, but this clock frequency reduction reduces the performance of the entire data processing system. Alternatively, the designer may insert an extra latency of one cycle in this memory-to-memory data path, which increases the cache fill latency. Thus there is a need in the art to provide reduced data access latency.
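The two prior art responses can be compared with a small worked example. The sketch assumes a hypothetical target clock period of 2.0 ns and a hypothetical main-memory-to-cache path delay of 3.0 ns; the function names are illustrative, and the use of a ceiling to count cycles is a generalization of the single extra cycle described above.

```python
import math


def slow_clock_option(target_period_ns, path_delay_ns):
    """Stretch the clock so the memory-to-cache path fits in one cycle."""
    period = max(target_period_ns, path_delay_ns)
    return {"period_ns": period, "fill_cycles": 1, "fill_latency_ns": period}


def extra_cycle_option(target_period_ns, path_delay_ns):
    """Keep the target clock and spread the path over whole cycles."""
    cycles = math.ceil(path_delay_ns / target_period_ns)
    return {
        "period_ns": target_period_ns,
        "fill_cycles": cycles,
        "fill_latency_ns": cycles * target_period_ns,
    }


# Hypothetical numbers: 2.0 ns target period, 3.0 ns path delay.
print(slow_clock_option(2.0, 3.0))
print(extra_cycle_option(2.0, 3.0))
```

Under these assumptions the first option keeps a single-cycle fill but slows every operation in the system to a 3.0 ns clock, while the second keeps the 2.0 ns clock but raises the fill latency to two cycles (4.0 ns). Neither achieves both goals at once.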