The present invention relates to accessing memory, and more particularly to reducing latency while accessing memory.
Prior art FIG. 1 illustrates one exemplary prior art architecture that relies on conventional techniques of accessing information in memory. As shown, a processor 102 is provided which is coupled to a Northbridge 104 via a system bus 106. The Northbridge 104 is in turn coupled to dynamic random access memory (DRAM) 108. In use, the processor 102 sends requests to the Northbridge 104 for information stored in the DRAM 108. In response to such requests, the Northbridge 104 retrieves information from the DRAM 108 for delivering the same to the processor 102 via the system bus 106. Such process of calling and waiting for the retrieval of information from the DRAM 108 often causes latency in the performance of operations by the processor 102. One solution to such latency involves the utilization of high-speed cache memory 110 on the Northbridge 104 or the processor 102 for storing instructions and/or data.
Cache memory has long been used in data processing systems to improve the performance thereof. A cache memory is a relatively high speed, relatively small memory in which active portions of program instructions and/or data are placed. The cache memory is typically faster than main memory by a factor of up to ten or more, and typically approaches the speed of the processor itself. By keeping the most frequently accessed and/or predicted information in the high-speed cache memory, the average memory access time approaches the access time of the cache.
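The statement that the average access time approaches that of the cache can be illustrated with a simplified two-level model. The symbols are introduced here for illustration and do not appear in the original text: h is the fraction of accesses that hit the cache, t_cache is the cache access time, and t_main is the main memory access time. The model assumes a hit costs t_cache and a miss costs t_main.

```latex
t_{\text{avg}} = h \, t_{\text{cache}} + (1 - h) \, t_{\text{main}},
\qquad
\lim_{h \to 1} t_{\text{avg}} = t_{\text{cache}}
```

For example, under this model, with t_main = 10 t_cache and h = 0.95, the average is 0.95 t_cache + 0.05 (10 t_cache) = 1.45 t_cache.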
The need for cache memory continues even as the speed and density of microelectronic devices improve. In particular, as microelectronic technology improves, processors are becoming faster. Every new generation of processors is about twice as fast as the previous generation, due to the shrinking features of integrated circuits. Unfortunately, memory speed has not increased concurrently with microprocessor speed. Although DRAM technology rides the same technological curve as microprocessors, technological improvements yield denser DRAMs, but not substantially faster DRAMs. Thus, while microprocessor performance has improved by a factor of about one thousand in the last ten to fifteen years, DRAM speeds have improved by only 50%. Accordingly, there is currently about a twenty-fold gap between the speed of present day microprocessors and DRAM. In the future, this speed discrepancy between the processor and memory will likely increase.
Caching reduces this large speed discrepancy between processor and memory cycle times by using a fast static memory buffer to hold a small portion of the instructions and/or data that are currently being used. When the processor needs a new instruction and/or data, it first looks in the cache. If the instruction and/or data is in the cache (referred to as a cache “hit”), the processor can obtain the instruction and/or data quickly and proceed with the computation. If the instruction and/or data is not in the cache (referred to as a cache “miss”), the processor must wait for the instruction and/or data to be loaded from main memory.
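The hit and miss behavior described above can be sketched as follows. This is a minimal illustration, not a model of any particular hardware: the class name SimpleCache, the first-in-first-out eviction policy, and the dictionary standing in for main memory are all assumptions made for the example.

```python
class SimpleCache:
    """Toy cache in front of a slow backing store (hypothetical example)."""

    def __init__(self, main_memory, capacity):
        self.main_memory = main_memory  # dict: address -> value (slow memory)
        self.capacity = capacity        # maximum number of cached entries
        self.lines = {}                 # dicts preserve insertion order
        self.hits = 0
        self.misses = 0

    def read(self, address):
        if address in self.lines:
            # Cache hit: the value is returned without touching main memory.
            self.hits += 1
            return self.lines[address]
        # Cache miss: the value must be loaded from main memory.
        self.misses += 1
        value = self.main_memory[address]
        if len(self.lines) >= self.capacity:
            # Assumed policy: evict the oldest entry (first in, first out).
            oldest = next(iter(self.lines))
            del self.lines[oldest]
        self.lines[address] = value
        return value


main_memory = {addr: addr * 2 for addr in range(16)}
cache = SimpleCache(main_memory, capacity=4)
values = [cache.read(a) for a in (0, 1, 0, 1, 2)]
print(values, cache.hits, cache.misses)
```

In this run, the repeated reads of addresses 0 and 1 are served from the cache, while the first read of each address goes to main memory.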
Cache performance relies on the phenomenon of “locality of reference”. The locality of reference phenomenon recognizes that most computer program processing proceeds in a sequential fashion with multiple loops, and with the processor repeatedly accessing a set of instructions and/or data in a localized area of memory. In view of the phenomenon of locality of reference, a small, high speed cache memory may be provided for storing data blocks containing data and/or instructions from main memory which are presently being processed. Although the cache is only a small fraction of the size of main memory, a large fraction of memory requests will locate data or instructions in the cache memory, because of the locality of reference property of programs.
Unfortunately, many programs do not exhibit sufficient locality of reference to benefit significantly from conventional caching. For example, many large scale applications, such as scientific computing, Computer-Aided Design (CAD) applications and simulation, typically exhibit poor locality of reference and therefore suffer from high cache miss rates. These applications therefore tend to run at substantially lower speed than the processor's peak performance.
In an attempt to improve the performance of a cache, notwithstanding poor locality of reference, “predictive” caching has been used. In predictive caching, an attempt is made to predict where a next memory access will occur, and the potential data block of memory is preloaded into the cache. This operation is also referred to as “prefetching”. In one prior art embodiment, prefetching includes retrieving serially increasing addresses from a current instruction. Serial prefetchers such as this are commonly used in a number of devices where there is a single data stream with such serially increasing addresses.
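A serial prefetcher of the kind described above can be sketched in a few lines. The function name, the block size of 64 bytes, and the prefetch depth are hypothetical choices for illustration; the text does not specify them.

```python
def serial_prefetch_addresses(current_address, block_size, depth):
    """Return the next `depth` block-aligned addresses after the current block.

    Hypothetical helper: assumes memory is fetched in fixed-size blocks and
    that the stream advances through serially increasing addresses.
    """
    # Align the current address down to the start of its block.
    base = (current_address // block_size) * block_size
    # Candidate prefetches are the blocks that immediately follow.
    return [base + block_size * i for i in range(1, depth + 1)]


prefetched = serial_prefetch_addresses(current_address=0x1004, block_size=64, depth=3)
print([hex(a) for a in prefetched])
```

Such a scheme works well only while a single stream keeps advancing through consecutive blocks.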
Unfortunately, predictive caching schemes often perform poorly because of the difficulty in predicting where a next memory access will occur. Performance may be degraded for two reasons. First, the predicting system may inaccurately predict where a next memory access will occur, so that incorrect data blocks of memory are prefetched. Prefetching mechanisms are frequently defeated by the existence of multiple streams of data. Second, the prediction computation itself may be so computationally intensive as to degrade overall system response.
One predictive caching scheme attempts to dynamically detect “strides” in a program in order to predict a future memory access. See, for example, International Patent Application WO 93/18459 to Krishnamohan et al. entitled “Prefetching Into a Cache to Minimize Main Memory Access Time and Cache Size in a Computer System” and Eickemeyer et al. “A Load Instruction Unit for Pipeline Processors”, IBM Journal of Research and Development, Vol. 37, No. 4, July 1993, pp. 547-564. Unfortunately, as described above, prediction based on program strides may only be accurate for highly regular programs. Moreover, the need to calculate a program stride during program execution may itself decrease the speed of the caching system.
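The general idea of stride detection can be illustrated with a minimal sketch. It is not the method of the cited references; it simply assumes that a stride is confirmed when the two most recent address differences agree, and the function name is hypothetical.

```python
def predict_next_by_stride(history):
    """Predict the next address from a list of past addresses, or return None.

    Illustrative rule (an assumption): a stride is trusted only when the
    last two address deltas are equal and nonzero.
    """
    if len(history) < 3:
        return None  # not enough accesses to confirm a stride
    older_delta = history[-2] - history[-3]
    newer_delta = history[-1] - history[-2]
    if older_delta == newer_delta and newer_delta != 0:
        return history[-1] + newer_delta
    return None  # irregular pattern: no prediction is made


regular = predict_next_by_stride([100, 116, 132])
irregular = predict_next_by_stride([100, 116, 140])
print(regular, irregular)
```

The second call shows the limitation noted above: once the access pattern stops being regular, no prediction is available.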
Another attempt at predictive caching is described in U.S. Pat. No. 5,305,389 to Palmer entitled “Predictive Cache System”. In this system, prefetches to a cache memory subsystem are made from predictions which are based on access patterns stored by context. An access pattern is generated from prior accesses of a data processing system processing in a like context. During a training sequence, an actual trace of memory accesses is processed to generate unit patterns which serve in making future predictions and to identify statistics such as pattern accuracy for each unit pattern. Again, it may be difficult to accurately predict performance for large scale applications. Moreover, the need to provide training sequences may require excessive overhead for the system.
A system, method and article of manufacture are provided for retrieving information from memory. Initially, processor requests for information from a first memory are monitored. A future processor request for information is then predicted based on the monitored requests. Thereafter, one or more speculative requests are issued for retrieving information from the first memory in accordance with the prediction. The retrieved information is subsequently cached in a second memory for being retrieved in response to processor requests without accessing the first memory. By allowing multiple speculative requests to be issued, throughput of information in memory is maximized.
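The sequence of monitoring, predicting, issuing speculative requests, and caching can be sketched as below. This is a simplified software illustration under stated assumptions, not the disclosed implementation: the class name, the pluggable predict callable, the next-sequential predictor, and the unbounded dictionary used as the second memory are all hypothetical.

```python
class SpeculativeFetcher:
    """Toy model of the monitor / predict / speculate / cache flow."""

    def __init__(self, first_memory, predict, max_speculative=2):
        self.first_memory = first_memory    # dict standing in for the first memory
        self.predict = predict              # callable: history -> candidate addresses
        self.max_speculative = max_speculative
        self.second_memory = {}             # holds speculatively fetched data
        self.history = []                   # monitored processor requests
        self.demand_reads = 0               # first-memory reads caused by the processor
        self.speculative_reads = 0          # first-memory reads caused by predictions

    def request(self, address):
        self.history.append(address)        # monitor the processor request
        if address in self.second_memory:
            # Served from the second memory without accessing the first memory.
            value = self.second_memory[address]
        else:
            self.demand_reads += 1
            value = self.first_memory[address]
        # Predict future requests and issue a bounded number of speculative reads.
        for predicted in self.predict(self.history)[: self.max_speculative]:
            if predicted in self.first_memory and predicted not in self.second_memory:
                self.speculative_reads += 1
                self.second_memory[predicted] = self.first_memory[predicted]
        return value


first_memory = {addr: addr * 10 for addr in range(8)}
# Assumed predictor: the next two sequential addresses.
fetcher = SpeculativeFetcher(first_memory, lambda h: [h[-1] + 1, h[-1] + 2])
results = [fetcher.request(a) for a in (0, 1, 2)]
print(results, fetcher.demand_reads, fetcher.speculative_reads)
```

In this run only the first request reaches the first memory on demand; the following two are satisfied from the second memory.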
In one embodiment of the present invention, a total number of the prediction and/or processor requests may be determined. As such, the speculative requests may be conditionally issued if the total number of the requests exceeds a predetermined amount. As an option, the speculative requests may be conditionally issued based on a hold or cancel signal. To this end, the hold and cancel signals serve as a regulator for the speculative requests. This is important for accelerating, or “throttling,” operation when the present invention is under-utilized, and preventing the number of speculative requests from slowing down operation when the present invention is over-utilized.
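One possible reading of this regulation is sketched below. The text does not spell out the exact semantics of the hold and cancel signals or how the request count gates issuance, so everything here is an assumption: hold defers pending speculative requests, cancel discards them, and otherwise requests are issued only up to a limit on the total outstanding requests. All names are hypothetical.

```python
def gate_speculative_requests(pending, outstanding_total, max_outstanding,
                              hold=False, cancel=False):
    """Return (issued, still_pending) for a list of pending speculative addresses.

    Assumed semantics (not taken from the source):
      cancel -> drop every pending speculative request
      hold   -> issue nothing now, keep the requests pending
      else   -> issue only as many as the remaining request budget allows
    """
    if cancel:
        return [], []
    if hold:
        return [], list(pending)
    budget = max(0, max_outstanding - outstanding_total)
    return list(pending[:budget]), list(pending[budget:])


issued, deferred = gate_speculative_requests([10, 11, 12], outstanding_total=2,
                                             max_outstanding=4)
held = gate_speculative_requests([10, 11, 12], 0, 4, hold=True)
cancelled = gate_speculative_requests([10, 11, 12], 0, 4, cancel=True)
print(issued, deferred, held, cancelled)
```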
In another embodiment of the present invention, the step of predicting may include determining whether the future processor request has occurred. Accordingly, the predicting may be adjusted if the future processor request has not occurred. More particularly, the predicting may be adjusted by replacing the predicted future processor request. As an option, a miss variable may be used to track whether the future processor request has occurred. By tracking the miss variable and replacing the predicted future processor request accordingly, a more efficient prediction system is afforded.
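The role of a miss variable can be illustrated with a short sketch. The replacement rule shown is an assumption made for the example: after a hypothetical limit of two consecutive misses, the stale prediction is replaced by one derived from the latest observed request. The class name, the limit, and the fixed stride are not taken from the source.

```python
class TrackedPrediction:
    """Tracks one predicted address and replaces it after repeated misses."""

    def __init__(self, predicted_address, miss_limit=2, stride=1):
        self.predicted_address = predicted_address
        self.miss_limit = miss_limit  # assumed threshold for replacement
        self.stride = stride          # assumed distance to the next prediction
        self.misses = 0               # the miss variable

    def observe(self, actual_address):
        """Record a processor request; return True if the prediction occurred."""
        if actual_address == self.predicted_address:
            self.misses = 0
            self.predicted_address = actual_address + self.stride
            return True
        self.misses += 1
        if self.misses >= self.miss_limit:
            # The prediction did not occur: replace it using the latest request.
            self.predicted_address = actual_address + self.stride
            self.misses = 0
        return False


tracker = TrackedPrediction(predicted_address=100)
outcomes = [tracker.observe(a) for a in (100, 500, 501, 502)]
print(outcomes, tracker.predicted_address)
```

Here the stream jumps from address 100 to 500; after two misses the tracker abandons the old prediction and follows the new stream.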
In still another embodiment of the present invention, a confidence associated with the future processor request may be determined. Further, the speculative requests may be issued based on the confidence. More particularly, the confidence may be compared to a confidence threshold value, and the speculative requests may be issued based on the comparison. As an option, a confidence variable may be used to track the confidence. Moreover, the confidence threshold value may be programmable. As such, a user may throttle the present invention by manipulating the confidence threshold value. In other words, the total amount of speculative requests issued may be controlled by selectively setting the confidence threshold value.
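Confidence-based issuance might look like the following sketch. The saturating counter used for the confidence variable, its range, and the class name are assumptions for illustration; the text states only that a confidence is compared to a programmable threshold.

```python
class ConfidenceGate:
    """Issues speculative requests only when confidence meets a threshold."""

    def __init__(self, threshold, max_confidence=3):
        self.threshold = threshold            # programmable threshold value
        self.max_confidence = max_confidence  # assumed saturation limit
        self.confidence = 0                   # the confidence variable

    def update(self, prediction_was_correct):
        # Assumed update rule: count up on a correct prediction, down otherwise,
        # saturating at 0 and at max_confidence.
        if prediction_was_correct:
            self.confidence = min(self.max_confidence, self.confidence + 1)
        else:
            self.confidence = max(0, self.confidence - 1)

    def should_issue(self):
        return self.confidence >= self.threshold


gate = ConfidenceGate(threshold=2)
decisions = []
for correct in (True, True, False, True):
    gate.update(correct)
    decisions.append(gate.should_issue())
print(decisions)
```

Raising the threshold makes the gate issue fewer speculative requests, and lowering it makes the gate issue more, which is the throttling effect described above.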
In still yet another embodiment of the present invention, it may be determined whether the information in the second memory has been retrieved by the processor requests. Subsequently, the information may be replaced if the information in the second memory has not been retrieved. Similar to before, a variable may be used to track whether the information in the second memory has been retrieved by the processor requests.
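Tracking whether cached information has been retrieved can be sketched as below. The per-entry flag, the fixed capacity, and the choice to replace the oldest unretrieved entry first are assumptions for the example, and the class name is hypothetical.

```python
class PrefetchBuffer:
    """Second memory that prefers to replace entries never retrieved."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # address -> [value, retrieved_flag]

    def insert(self, address, value):
        if address not in self.entries and len(self.entries) >= self.capacity:
            del self.entries[self._pick_victim()]
        self.entries[address] = [value, False]

    def _pick_victim(self):
        # Assumed policy: replace the oldest entry that was never retrieved;
        # if every entry has been retrieved, fall back to the oldest entry.
        for address, (_, retrieved) in self.entries.items():
            if not retrieved:
                return address
        return next(iter(self.entries))

    def read(self, address):
        entry = self.entries.get(address)
        if entry is None:
            return None
        entry[1] = True  # mark the information as retrieved by the processor
        return entry[0]


buffer = PrefetchBuffer(capacity=2)
buffer.insert(1, "a")
buffer.insert(2, "b")
first = buffer.read(1)   # address 1 is now marked as retrieved
buffer.insert(3, "c")    # replaces address 2, which was never retrieved
print(first, sorted(buffer.entries))
```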
In various aspects of the present invention, the processor requests may be monitored in multiple information streams. Further, the information may include graphics information. Still yet, the first memory may include dynamic random access memory (DRAM), and the second memory may include cache memory.
These and other advantages of the present invention will become apparent upon reading the following detailed description and studying the various figures of the drawings.