In modern microprocessor systems, processor cycle time continues to decrease as fabrication technology improves. Design improvements, such as speculative execution, deeper pipelines, and more execution elements, increase the performance of processing systems and put a heavier burden on the memory interface, since the processor demands data and instructions more rapidly from memory. In order to keep pace with the heightened speed of the processing systems, cache memories are often implemented in microprocessors.
The basic operation of cache memories is well-known. When a processor ("CPU") needs to access memory, the cache is examined. If the word addressed by the CPU is found in the cache, it is read from the "fast" cache memory. If the word addressed by the CPU is not found in the cache, the main memory is accessed to read the word. A block of words containing the one just accessed is then transferred from main memory to cache memory. In this manner, blocks of data are transferred to the cache so that future references to memory are likely to find the required words in the fast cache memory.
Processing systems employing cache memories are well known in the art. Cache memories are very high-speed devices that increase the speed of a data processing system by making current programs and data available to a CPU with minimal latency. Large on-chip caches (L1 caches) are implemented to reduce memory latency and often are augmented by larger off-chip caches (L2 caches). Although cache memory is only a small fraction of the size of main memory, a large fraction of memory requests are successfully found in the fast cache memory because of the "locality of reference" property of programs. This property holds that memory references during any given time interval tend to be confined to a few localized areas of memory. Cache memories improve system performance by keeping the most frequently accessed instructions and data in the fast cache memory, thereby allowing the average memory access time of the overall processing system to approach the access time of the cache.
It has therefore become important to reduce the amount of latency in each cache access in order to meet the memory access demands resulting from the decrease in machine cycle times and from the large volume of instructions issued by superscalar machines. A cache access normally involves the generation of an address by adding two numbers, decoding this address to select a particular row of locations in the cache, reading those locations and selecting the desired part of the row, and, often, reordering the data read from the cache to a suitable format. These steps are generally performed in a sequential manner. First, the addition of the address operands is normally completed before the sum is presented to the decoder. Next, full decoding of the row selection portion of the address must be done to select one of the memory wordlines. Finally, the required bytes within the cache line are selected and possibly reordered only after data from the chosen row are impressed on the bitlines. Thus, the latency for a load operation is the sum of delays for addition, decoding, cache array access, byte selection, and byte reordering.
There is therefore a need for cache memories capable of performing at least some of the steps involved in a cache access in parallel with one another in order to reduce cache latency.
In particular, there is a need for cache memories capable of reordering bytes in parallel with the decoding and access of the data bytes being read from the cache.