As integrated circuit technology progresses to smaller feature sizes, faster central processing units (CPUs) are being developed. Unfortunately, access times of main memory, in the form of random access memory (RAM), where instruction data is typically stored, have not yet matched those of the CPU. The CPU must access these slower devices in order to retrieve instructions therefrom for processing. In retrieving these instructions, a bottleneck is realized between the CPU and the slower RAM. Typically, in order to reduce the effect of this bottleneck, a cache memory is implemented between the main memory and the CPU to provide most recently used (MRU) instructions and data to the processor with lower latency.
It is known to those of skill in the art that cache memory is typically smaller in size and provides faster access times than main memory. These faster access times are facilitated by the cache memory typically residing within the processor, or very close by. Cache memory is typically of a different physical type than main memory. Main memory utilizes capacitors for storing data, where refresh cycles are necessary in order to maintain charge on the capacitors. Cache memory, on the other hand, does not require refreshing like main memory. Cache memory is typically in the form of static random access memory (SRAM), where each bit is stored without refreshing using approximately six transistors. Because more transistors are utilized to represent the bits within SRAM, the size per bit of this type of memory is much larger than that of dynamic RAM, and as a result SRAM is also considerably more expensive than dynamic RAM. Therefore cache memory is used sparingly within computer systems, where this relatively smaller high-speed memory is typically used to hold the contents of the blocks of main memory most recently utilized by the processor.
The purpose of the cache memory is to increase instruction and data bandwidth of information flowing from the main memory to the CPU, as well as to decrease latency involved in transferring of information from main memory. Bandwidth determines the amount of information that can be transferred in a certain amount of time; latency determines the time it takes to retrieve a certain information item, in other words the amount of time it takes from the time of request to the time of delivery of the item. There are different configurations of cache memory that provide for this increased bandwidth and decreased latency, such as direct mapped and cache way set-associative. Of course, both latency and bandwidth are important in cache memory design and to many of skill in the art it is a fact that the cache way set-associative cache structure is preferable.
In the set-associative structure, as well as in direct mapped cache structures, the cache memory is typically configured into two parts, a data array and a tag array. The tag array is for storing a tag address for corresponding data bytes stored in the data array. Typically, each tag entry is associated with a data array entry, where each tag entry stores index information relating to each data array entry. Both arrays are two-dimensional and are organized into rows and columns. A column within either the data array or the tag array is typically referred to as a cache way, where there can be more than one cache way in a same cache memory. Thus a four-cache way set-associative cache memory would be configured with four columns, or cache ways, where both the data and tag arrays also have four columns each.
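The organization described above can be sketched as follows. This is a minimal illustration only, assuming a hypothetical geometry (32-byte cache lines, 128 rows, four cache ways) and a hypothetical helper named split_address; none of these values or names come from the description itself.

```python
# Hypothetical geometry, chosen only for illustration.
LINE_BYTES = 32   # bytes stored in one data array entry
NUM_ROWS = 128    # rows in both the tag array and the data array
NUM_WAYS = 4      # columns (cache ways) in both arrays

OFFSET_BITS = LINE_BYTES.bit_length() - 1   # 5 bits select a byte in the line
INDEX_BITS = NUM_ROWS.bit_length() - 1      # 7 bits select a row


def split_address(addr):
    """Split a memory access address into (tag, row index, byte offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_ROWS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


def make_array():
    """Build an empty two-dimensional array: NUM_ROWS rows by NUM_WAYS columns."""
    return [[None] * NUM_WAYS for _ in range(NUM_ROWS)]


tag_array = make_array()    # tag_array[row][way] holds a tag address
data_array = make_array()   # data_array[row][way] holds the data bytes

# Place one line in cache way 2 of the row selected by the address.
tag, index, offset = split_address(0x12345)
tag_array[index][2] = tag
data_array[index][2] = bytes(LINE_BYTES)
```

Each tag entry and its data entry share the same row and column, which is what allows a tag match in a given cache way to identify the data entry to return.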
The processor executes instructions contained in the program stream in a manner similar to that of an assembly line. A processor pipeline is provided to facilitate processor operation on a number of instructions contained in the program stream. The term processor pipeline refers to a number of processing steps, or stages, it takes to accomplish a task; the fewer the processing steps, the shorter and more efficient the pipeline. The depth of a stage within the processor pipeline is defined as being the time delay through the stage. Thus, the depth of a stage affects the operating frequency of the processor. As a result, deeper pipelines having more stages have been implemented in processors. Having a large number of stages results in each stage having a shorter time delay, which facilitates a higher operating frequency of that stage. Efficient processor design balances the number of stages and the delay through the stages. Thus, both the number of stages and the depth of each stage determine the processing efficiency of a processor. Within the processor pipeline, each processor pipeline stage performs a specific task before passing execution to a next processing stage within the pipeline. Within the processor pipeline, instructions are loaded sequentially, in that a portion of one instruction is processed after a previous one. Because the processor works on different stages of different instructions at the same time, more instructions can be executed in a shorter period of time using the pipeline, thus speeding up execution of the program stream. Once an instruction has been fully processed, it leaves the pipeline.
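The trade-off between the number of stages and the delay through each stage can be illustrated with a short calculation. The sketch below assumes an idealized pipeline with no stalls, a clock period set by the slowest stage, and hypothetical stage delays; the function names are illustrative only.

```python
def pipelined_time_ns(num_instructions, stage_delays_ns):
    """Ideal pipeline: the clock period equals the slowest stage delay, and
    after the pipeline fills, one instruction completes every cycle."""
    period = max(stage_delays_ns)
    cycles = len(stage_delays_ns) + num_instructions - 1
    return cycles * period


def unpipelined_time_ns(num_instructions, stage_delays_ns):
    """Without pipelining, each instruction passes through every stage
    before the next instruction starts."""
    return num_instructions * sum(stage_delays_ns)


# Hypothetical five-stage pipeline; the delays are assumptions.
delays = [2.0, 3.0, 2.5, 3.0, 2.0]
print(pipelined_time_ns(100, delays))     # 104 cycles * 3.0 ns = 312.0
print(unpipelined_time_ns(100, delays))   # 100 * 12.5 ns = 1250.0
```

Under these assumptions the slowest stage limits the operating frequency, which is why balancing the delay through the stages matters as much as the number of stages.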
During CPU execution of instructions within the pipeline, both main memory and cache memory are accessed. Cache memory is accessed first to see if corresponding data bytes fulfill the memory access request; if the memory access request is unfulfilled, then main memory is accessed. In a cache memory access operation, the set-associative cache is accessed by a load/store unit, which searches the data cache for data residing at the memory access address. A cache way prediction array is a memory that stores an index into the cache way where the data blocks are most likely to reside in the data array. The cache way prediction array typically stores a number of bits in each entry, where each entry indicates which cache way has been most recently used (MRU) with respect to the memory access address. Rows of the cache way prediction array are correlated to the memory access address. Providing a cache way prediction array potentially decreases processing time since a potential cache way is accessed prior to tag comparison. When a cache access is performed, the cache way prediction array is used to access a cache way in which data may potentially reside, thus reducing memory access times if a correct cache way prediction is realized. Of course, this reduction applies relative to a design in which a sequential tag lookup is followed by a data structure lookup. Approaches that access tag and data structures simultaneously have the same access time as the approach based on cache way prediction. The advantage of using cache way prediction lies in that only a single predicted cache way is accessed from the data array. Therefore, it allows for reduced power consumption and a more optimal physical layout in SRAM memory circuits.
After cache way prediction, a cache way represented by the prediction bits is enabled in the data array, whereas all cache ways are enabled in the tag array. Enabling of a single cache way in the data array provides for decreased power consumption. All tag array ways are enabled because later on the prediction is validated using tag addresses retrieved from the tag array. Typically the tag addresses are smaller in bit size than data retrieved from an entry in the data array. As a result, power consumption is less of an issue when all cache ways are enabled in the tag array. The reduced power consumption is especially beneficial in multi-issue superscalar or Very Long Instruction Word (VLIW) processors, where a lot of instruction information is retrieved in every cycle. Of course, to those of skill in the art it is appreciated that this is typical for an instruction cache only and not for data caches, which are discussed hereinbelow. The tag addresses within the tag array are examined to determine if any match the memory access address. In order to provide feedback as to the accuracy of the cache way prediction, a tag comparison logic circuit is used to perform tag comparison in order to verify whether the cache was “hit” or “missed.” If a match is found, the access is said to be a cache hit and the cache memory provides the associated data bytes from the data array. If a match is not found, the access is said to be a cache miss. When a cache miss occurs, the processor experiences stall cycles. During the stall cycles the load/store unit retrieves required data from the main memory and provides this data to the processor so that it can resume its execution of the instruction stream, ending the stall cycles. Additionally, the load/store unit causes the requested data bytes to be transferred from the main memory into the data array. Furthermore, a portion of the memory access address associated with the retrieved data bytes is stored in the tag array.
Finally, the cache way prediction array is updated to reflect the cache way to which the data bytes were transferred from main memory. Having a tag comparison logic circuit in association with the cache way prediction array provides for a cache memory that offers a greater cache hit percentage.
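The access sequence described above (predict a cache way, enable that single way in the data array, verify against all tag ways, and update the prediction) can be sketched as a small behavioral model. This is an illustration under stated assumptions, not a description of any particular circuit: the class name, the use of a line address instead of a full byte address, the outcome labels, and the round-robin replacement policy are all hypothetical.

```python
class WayPredictedCache:
    """Behavioral model of a set-associative cache with a way prediction array."""

    def __init__(self, num_rows=4, num_ways=4):
        self.num_rows = num_rows
        self.num_ways = num_ways
        self.tags = [[None] * num_ways for _ in range(num_rows)]
        self.data = [[None] * num_ways for _ in range(num_rows)]
        self.predicted_way = [0] * num_rows  # most recently used way per row
        self.next_victim = [0] * num_rows    # assumed round-robin replacement

    def access(self, line_addr, main_memory):
        """Return (data, outcome) for one access to the given line address."""
        row = line_addr % self.num_rows
        tag = line_addr // self.num_rows
        guess = self.predicted_way[row]

        # Only the predicted way of the data array is enabled.
        speculative = self.data[row][guess]

        # All ways of the tag array are enabled to validate the prediction.
        matches = [w for w in range(self.num_ways) if self.tags[row][w] == tag]

        if matches:
            way = matches[0]
            self.predicted_way[row] = way
            if way == guess:
                return speculative, "hit-predicted"
            # Wrong way predicted: read the correct way in a further access.
            return self.data[row][way], "hit-mispredicted"

        # Cache miss: fetch from main memory, fill the data and tag arrays,
        # then update the prediction array to point at the filled way.
        way = self.next_victim[row]
        self.next_victim[row] = (way + 1) % self.num_ways
        self.tags[row][way] = tag
        self.data[row][way] = main_memory[line_addr]
        self.predicted_way[row] = way
        return self.data[row][way], "miss"


memory = {addr: "line%d" % addr for addr in range(64)}
cache = WayPredictedCache()
print(cache.access(0, memory))  # ('line0', 'miss')
print(cache.access(4, memory))  # ('line4', 'miss'), same row, fills the next way
print(cache.access(0, memory))  # ('line0', 'hit-mispredicted')
print(cache.access(0, memory))  # ('line0', 'hit-predicted')
```

The model separates a cache hit in the wrong way from a true cache miss, since only the latter requires an access to main memory.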
Cache way prediction is well known to those of skill in the art, an example of which is disclosed in prior art U.S. Pat. No. 5,752,069, entitled “Superscalar microprocessor employing away [sic] prediction structure.” In an n-cache way set-associative cache memory, a requested cache element resides in any, or none, of the n cache ways within the cache memory in dependence upon the memory access address.
By relying on cache way prediction to identify the cache way in which the requested element is likely to reside, it is possible to access only the identified cache way rather than accessing all cache ways, thereby advantageously reducing power consumption. In such a manner, data bytes are transferred from the selected cache way in order to make up a cache line, or cache word, for use by the processor thereafter. Cache way prediction therefore not only offers reduced power consumption but also offers improved access times to data contained in cache memory. A misprediction unfortunately causes processor stall cycles, thereby reducing processor performance and causing increased power consumption. Fortunately, the cache way prediction is typically correct at least 90% of the time, which generally improves system performance by allowing for minimal wait state operation in at least 90% of the potential cases.
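The effect of prediction accuracy can be estimated with a simple expected-value calculation. The sketch below is illustrative only; the one-cycle access time, the two-cycle misprediction penalty, and the assumption that a misprediction costs exactly one additional data way read are hypothetical figures, not values taken from the description.

```python
def average_access_cycles(accuracy, predicted_cycles, mispredict_penalty_cycles):
    """Expected cycles per access: every access pays the predicted-way access
    time, and mispredicted accesses additionally pay the stall penalty."""
    return predicted_cycles + (1.0 - accuracy) * mispredict_penalty_cycles


def average_data_ways_read(accuracy):
    """Expected number of data array ways read per access, assuming one extra
    way is read after each misprediction."""
    return 1.0 + (1.0 - accuracy)


# With 90% accuracy, a 1-cycle access and a 2-cycle penalty (all assumed),
# the average comes to roughly 1.2 cycles and roughly 1.1 data ways read,
# compared with 4 ways read per access when all ways of a four-way data
# array are enabled.
print(average_access_cycles(0.9, 1, 2))
print(average_data_ways_read(0.9))
```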
For instruction caches, the prediction accuracy of cache memories unfortunately decreases as non-sequential instructions are added into the program stream. Non-sequential instructions cause the CPU to access memory locations that are difficult to predict using the prior art cache way prediction array methodology. In some cases non-sequential instructions result in branching of a program to memory locations that are dependent upon the outcomes of prior instructions within the program stream, and therefore a step size of these branches may vary as the program stream is executed. Therefore prediction accuracy for non-sequential program execution is reduced and, as a result, more cache misses occur. This causes increased processor stall cycles, thus leading to inefficient program stream execution and resulting in increased power consumption. For sequential instructions a look-ahead program counter is used to access the prediction memory. Unfortunately, when executing branch instructions within a program stream, the look-ahead program counter is not available for use in cache way prediction.
There exists a need to provide a more efficient cache way prediction scheme, for use in an instruction cache, that offers decreased power consumption while providing improved cache way prediction for non-sequential program execution.