The invention is generally related to data processing systems and processors therefor, and in particular to retrieval of data from a data cache in a multi-level memory architecture.
Computer technology continues to advance at a remarkable pace, with numerous improvements being made to the performance of both microprocessors (the "brains" of a computer) and the memory that stores the information processed by a computer.
In general, a microprocessor operates by executing a sequence of instructions that form a computer program. The instructions are typically stored in a memory system having a plurality of storage locations identified by unique memory addresses. The memory addresses collectively define a "memory address space," representing the addressable range of memory addresses that can be accessed by a microprocessor.
Both the instructions forming a computer program and the data operated upon by those instructions are often stored in a memory system and retrieved as necessary by the microprocessor when executing the computer program. The speed of microprocessors, however, has increased relative to that of memory devices to the extent that retrieving instructions and data from a memory can often become a significant bottleneck on performance. To decrease this bottleneck, it is desirable to use the fastest memory devices available. However, both memory speed and memory capacity are typically directly related to cost, and as a result, many computer designs must balance memory speed and capacity with cost.
A predominant manner of obtaining such a balance is to use multiple "levels" of memories in a memory architecture to attempt to decrease costs with minimal impact on system performance. Often, a computer relies on a relatively large, slow and inexpensive mass storage system such as a hard disk drive or other external storage device, an intermediate main memory that uses dynamic random access memory devices (DRAMs) or other volatile memory storage devices, and one or more high speed, limited capacity cache memories, or caches, implemented with static random access memory devices (SRAMs) or the like. In some instances, instructions and data are stored in separate instruction and data cache memories to permit instructions and data to be accessed in parallel. One or more memory controllers are then used to swap the information from segments of memory addresses, often known as "cache lines," between the various memory levels to attempt to maximize the frequency with which requested memory addresses are stored in the fastest cache memory accessible by the microprocessor. Whenever a memory access request attempts to access a memory address that is not cached in a cache memory, a "cache miss" occurs. As a result of a cache miss, the cache line for a memory address typically must be retrieved from a relatively slow, lower level memory, often with a significant performance penalty.
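The multi-level lookup described above can be illustrated with a minimal sketch (the class, names, and line size here are hypothetical, for illustration only): a memory access checks the cache first, and on a miss fetches the entire enclosing cache line from the slower backing memory before completing the request.

```python
CACHE_LINE_SIZE = 64  # bytes per cache line (hypothetical size)

class TwoLevelMemory:
    """Minimal model of a fast cache backed by a slower main memory."""

    def __init__(self, main_memory):
        self.main = main_memory   # dict: byte address -> byte value
        self.cache = {}           # dict: line base address -> list of bytes
        self.misses = 0

    def read(self, address):
        line_addr = address - (address % CACHE_LINE_SIZE)
        if line_addr not in self.cache:
            # Cache miss: retrieve the whole cache line from main memory.
            self.misses += 1
            self.cache[line_addr] = [
                self.main.get(line_addr + i, 0) for i in range(CACHE_LINE_SIZE)
            ]
        return self.cache[line_addr][address - line_addr]

mem = TwoLevelMemory({100: 42})
assert mem.read(100) == 42   # first access misses and fills the line
assert mem.misses == 1
assert mem.read(101) == 0    # nearby address falls in the same line: a hit
assert mem.misses == 1
```

Note how the second access hits because the miss filled the entire cache line, not just the requested byte; this spatial locality is what makes line-granularity swapping effective.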
Data cache misses in particular have been found to significantly limit processor performance. In some designs, for example, it has been found that over 25% of a microprocessor's time is spent waiting for data cache misses to complete. Therefore, any mechanism that can reduce the frequency and/or latency of data cache misses can have a significant impact on overall performance.
One conventional approach for reducing the impact of data cache misses is to increase the size of the data cache to in effect reduce the frequency of misses. However, increasing the size of a data cache can add significant cost. Furthermore, oftentimes the size of the data cache is limited by the amount of space available on an integrated circuit device. Particularly when the data cache is integrated onto the same integrated circuit device as a microprocessor to improve performance, the amount of space available for the data cache is significantly restricted.
Other conventional approaches include decreasing the miss rate by increasing the associativity of a cache, and/or using cache indexing to reduce conflicts. While each approach can reduce the frequency of data cache misses, each still incurs an often substantial performance penalty whenever a data cache miss does occur.
Yet another conventional approach for reducing the impact of data cache misses incorporates value prediction to attempt to predict what data will be returned in response to a data cache miss prior to actual receipt of such data. In particular, it has been found that the result of practically any instruction can be predicted approximately 50% of the time based upon the result of the last execution of the instruction.
To implement value prediction, it has been proposed to store the result of each instruction in a lookup table after the instruction is executed. The result would be indexed by the memory address of the instruction. Subsequently, whenever the same instruction was executed again, the lookup table would be accessed to attempt to locate the result at the same time that the data cache was accessed. If the data cache access missed, the predicted result would be used, and subsequent instructions would be executed speculatively using the predicted result while the data cache miss was processed. Then, when the data in the data cache was returned, it would be compared to the predicted result to verify the prediction. If the prediction was correct, a performance benefit would be obtained since the subsequent instructions were executed sooner than would otherwise occur if the processor waited for the data from the data cache to be returned. On the other hand, if the prediction was incorrect, the processor would need to be "rolled back" to essentially undo the results of the speculatively-executed instructions. Assuming a relatively reliable prediction, however, the benefits of prediction would exceed the penalties of misprediction, resulting in an overall performance improvement.
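The last-value prediction scheme just described can be sketched as follows (a simplified model with hypothetical names; a real implementation would be a fixed-size hardware table, and table sizing and replacement policy are omitted): the table is indexed by instruction address, supplies a predicted result when history exists, and is updated with the actual result after each execution so the prediction can be verified.

```python
class ValuePredictor:
    """Last-value predictor: remembers each instruction's previous result."""

    def __init__(self):
        self.table = {}   # instruction address -> last observed result

    def predict(self, inst_addr):
        # Return the predicted value, or None if no history exists yet.
        return self.table.get(inst_addr)

    def update(self, inst_addr, actual_value):
        # Record the actual result for use by the next execution; if a
        # prediction was consumed and differs, speculative work must be
        # rolled back by the processor.
        self.table[inst_addr] = actual_value

vp = ValuePredictor()
assert vp.predict(0x400) is None   # first execution: nothing to predict
vp.update(0x400, 7)                # instruction at 0x400 produced 7
assert vp.predict(0x400) == 7      # next execution predicts the last value
vp.update(0x400, 9)                # actual value changed: a misprediction
assert vp.predict(0x400) == 9      # table now holds the corrected value
```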
One problem associated with proposed value prediction implementations is that a relatively large lookup table would be required to achieve a significant performance improvement. Specifically, with proposed implementations, predicted values are stored for either all static instructions or all static load instructions. However, it has been found that most commercial workloads have relatively large instruction working sets; that is, a relatively large number of instructions are typically executed before any particular instruction is repeated. Since value prediction relies on the results from previous executions of instructions, a lookup table would need to be relatively large to ensure that predicted data was available on a relatively frequent basis.
However, given that space on a processor integrated circuit device is often at a premium, it is often desirable to minimize the space occupied by all components on the device, including any lookup tables. Consequently, the size of a value prediction lookup table is often constrained, which by necessity limits its effectiveness. Increasing the size of a lookup table often increases costs and/or requires other compromises to be made in other areas of a processor design. Therefore, a need still exists in the art for improving the effectiveness of value prediction in a more compact and cost effective manner.
The invention addresses these and other problems associated with the prior art by providing a data processing system, circuit arrangement, integrated circuit device, program product, and method that implement value prediction in a data cache miss lookaside buffer that maintains predicted values only for load instructions that miss the data cache. It has been found that a large proportion of data cache misses, e.g., as many as 80-90% or more, are caused by a relatively small number of instructions. Moreover, it has been found that the predictability of load instructions that miss a data cache is often greater than that of other instructions. As a result of both of these factors, limiting value prediction to load instructions that miss the data cache enables significantly more effective value prediction for a given size buffer or lookup table.
A circuit arrangement consistent with the invention includes a control circuit and a buffer coupled to an execution circuit. The execution circuit processes a load instruction by initiating retrieval of a value requested by the load instruction from a memory having a cache. The buffer stores a predicted value for the load instruction, and the control circuit provides the predicted value to the execution circuit when retrieval of the value requested by the load instruction misses the cache.
One method for executing instructions in a computer consistent with the invention includes determining whether a value requested by a load instruction being executed is stored in a cache, and in response to determining that the value is not stored in the cache, predicting the value requested by the load instruction. Another method for executing instructions in a computer consistent with the invention includes storing a value requested by a load instruction in response to a cache miss for the value, and retrieving the value during a subsequent execution of the load instruction.
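The methods just described can be sketched in simplified form (the buffer organization and names here are hypothetical; an actual embodiment would be a hardware structure of fixed capacity): values are recorded in the lookaside buffer only when a load misses the cache, so the buffer covers the small set of miss-causing loads rather than all static load instructions, and a subsequent miss by the same load can be served with a predicted value while the miss is processed.

```python
class MissLookasideBuffer:
    """Value prediction restricted to load instructions that miss the cache."""

    def __init__(self):
        self.buffer = {}   # load instruction address -> last missed value

    def on_load(self, inst_addr, cache, address, main_memory):
        """Return (actual value, predicted value or None) for one load."""
        if address in cache:
            return cache[address], None          # cache hit: no prediction
        # Cache miss: supply a predicted value (if any) so dependent
        # instructions can execute speculatively while the miss is serviced.
        predicted = self.buffer.get(inst_addr)
        actual = main_memory[address]            # slow lower-level retrieval
        self.buffer[inst_addr] = actual          # remember for the next miss
        cache[address] = actual                  # fill the cache
        return actual, predicted

mlb = MissLookasideBuffer()
cache, main = {}, {0x1000: 5}
actual, predicted = mlb.on_load(0x400, cache, 0x1000, main)
assert (actual, predicted) == (5, None)   # first miss: no prediction yet
cache.clear()                             # simulate eviction of the line
actual, predicted = mlb.on_load(0x400, cache, 0x1000, main)
assert (actual, predicted) == (5, 5)      # second miss: correct prediction
```

Because only miss-causing loads occupy entries, a small buffer captures most of the predictable misses, which is the space advantage the invention claims over tables covering all static loads.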
These and other advantages and features, which characterize the invention, are set forth in the claims annexed hereto and forming a further part hereof. However, for a better understanding of the invention, and of the advantages and objectives attained through its use, reference should be made to the Drawings, and to the accompanying descriptive matter, in which there are described exemplary embodiments of the invention.