Microprocessor performance may be increased by enabling multiple cache load operations to be executed within the same cycle. One method of increasing the load bandwidth of a microprocessor is to support additional cache ports that may be accessed in parallel. However, supporting additional cache ports within cache memories, such as a Level 1 (L1) cache, can be expensive in terms of die area and cycle time.
Other techniques to increase load bandwidth include interleaving, replication, time-division multiplexing, and line buffering. Interleaving involves dividing a cache into a number of sub-banks and using low-order address bits to select the bank to be accessed. However, loads that map to the same bank in the same cycle conflict with one another, and interleaving requires additional die area for the crossbar switching that directs loads and retired stores to the proper cache bank.
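The bank selection described above can be illustrated with a minimal sketch. The line size, the bank count, and the function names are assumptions chosen for illustration; they do not come from any particular design.

```python
from typing import List

# Hypothetical parameters, chosen for illustration only.
LINE_BYTES = 64   # assumed cache line size in bytes
NUM_BANKS = 4     # assumed number of sub-banks

def bank_index(address: int) -> int:
    """Select a bank from the low-order address bits just above the line offset."""
    return (address // LINE_BYTES) % NUM_BANKS

def has_bank_conflict(addresses: List[int]) -> bool:
    """Return True if two loads issued in the same cycle map to the same bank."""
    banks = [bank_index(a) for a in addresses]
    return len(set(banks)) != len(banks)
```

Under these assumptions, addresses 0 and 64 fall in different banks and can be serviced together, while addresses 0 and 256 both select bank 0 and therefore conflict.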
Replication involves emulating an N-port cache by replicating an M-port data cache array N/M times. While replication eliminates the bank conflict problem of interleaving, it may be expensive in terms of die area. Furthermore, while replication addresses the load-bandwidth problem, it exacerbates the store-bandwidth problem since store traffic must be broadcast to all of the replicated arrays simultaneously in order to ensure that each array has an updated copy of the data.
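The store-broadcast cost of replication can be sketched as follows. The class name, the use of a dictionary per array, and the write counter are hypothetical modeling choices, not a description of actual hardware.

```python
class ReplicatedCache:
    """Toy model: an N-port cache emulated by N/M copies of an M-port array."""

    def __init__(self, n_ports: int, m_ports: int):
        if n_ports % m_ports != 0:
            raise ValueError("n_ports must be a multiple of m_ports")
        self.m_ports = m_ports
        # One dictionary per replicated array (assumed model of an array).
        self.copies = [dict() for _ in range(n_ports // m_ports)]
        self.array_writes = 0  # counts writes performed across all arrays

    def store(self, address: int, value: int) -> None:
        # Every store is broadcast so that each array holds an updated copy.
        for array in self.copies:
            array[address] = value
            self.array_writes += 1

    def load(self, port: int, address: int):
        # Each group of m_ports load ports reads from its own copy.
        return self.copies[port // self.m_ports].get(address)
```

In this model, a single store to a four-port cache built from one-port arrays performs four array writes, which is the store-bandwidth penalty noted above.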
Time-division multiplexing involves emulating an N-port data cache by decreasing the cycle time of an M-port array by a factor of N/M. However, time-division multiplexing is difficult and expensive to implement and scale to higher frequencies.
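The timing requirement can be made concrete with a short calculation. The function name and the example figures are assumptions for illustration only.

```python
from fractions import Fraction

def required_array_cycle_time(core_cycle_ns, n_ports: int, m_ports: int) -> Fraction:
    """Cycle time an M-port array needs in order to emulate N ports.

    The array must complete N/M accesses per processor cycle, so its cycle
    time is the processor cycle time divided by N/M.
    """
    factor = Fraction(n_ports, m_ports)
    return Fraction(core_cycle_ns) / factor
```

For example, emulating four ports with a two-port array under an assumed 1 ns processor cycle requires the array to cycle in 1/2 ns, which suggests why the approach becomes harder to sustain as processor frequency rises.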
Line buffering involves adding a small line buffer that holds cache lines recently read from the cache by load operations. Subsequent loads may obtain data from this buffer, which can be multi-ported due to its small size. However, line buffering is complex and expensive in terms of cycle time, because loads that miss in the buffer must go to the cache, thereby increasing latency.
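The hit and miss behavior of a line buffer can be sketched as follows. The least-recently-used replacement policy, the buffer capacity, the line size, and all names are assumptions made for illustration; the text above does not specify them.

```python
from collections import OrderedDict

LINE_BYTES = 64  # assumed cache line size in bytes

class LineBuffer:
    """Toy model of a small buffer holding recently read cache lines."""

    def __init__(self, backing_cache, capacity: int = 4):
        self.backing_cache = backing_cache  # maps line address -> line data
        self.capacity = capacity            # assumed number of buffered lines
        self.lines = OrderedDict()          # ordered oldest to most recent
        self.hits = 0
        self.misses = 0

    def load(self, address: int) -> int:
        line_addr = address // LINE_BYTES
        if line_addr in self.lines:
            # Hit: the load is served from the buffer.
            self.hits += 1
            self.lines.move_to_end(line_addr)
        else:
            # Miss: the load must go to the cache, which adds latency.
            self.misses += 1
            self.lines[line_addr] = self.backing_cache[line_addr]
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least recently used line
        return self.lines[line_addr][address % LINE_BYTES]
```

In this model, repeated loads to the same line hit in the buffer, while a load to a line that has been evicted counts as a miss and falls through to the backing cache.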