The field of the invention relates to the area of data pre-fetching from a computer memory and more specifically to the area of pre-fetching data from a computer memory in a manner to minimize processor stall cycles.
As microprocessor speeds increase, processor performance is more and more affected by data access operations. When a processor, in operation, needs to await data due to slow data retrieval times, this is termed a processor stall and, in quantitative terms, is referred to as processor stall cycles. A larger number of processor stall cycles is indicative of a longer delay.
Early computer systems suffered from the limited speed of magnetic storage media. As such, caching of disk drive data is well known to enhance data access performance. In a typical caching operation, data is fetched or pre-fetched from its storage location to a cache (a temporary but faster memory) for more rapid access by the processor. Thus, the speed limitations of bulk storage media are obviated if the entire stored data is cached in RAM memory, for example.
Presently, processors are so fast that processor stall cycles occur even when retrieving data from RAM memory. The processor stall cycles extend the time available for data access operations to complete. As would be anticipated, pre-fetching of data from RAM memory is now performed to reduce processor stall cycles. Thus, different levels of cache memory supporting different memory access speeds are used for storing different pre-fetched data. When incorrect data is pre-fetched into the cache memory, a cache miss condition occurs which is resolvable through processor stall cycles. Incorrect data pre-fetched into the cache memory may result in cache pollution, i.e. removal of useful cache data to make room for non-useful pre-fetched data. This may result in an unnecessary cache miss when the replaced data is needed again by the processor.
Memory is moved in data blocks to allow for faster transfer of larger blocks of memory. A data block represents a basic unit of data for transfer into or between different levels of cache memory hierarchy; typically, a data block contains multiple data elements. By fetching a data block into a higher level of cache memory hierarchy before the data block is actually required by the processor, the processor stall cycles due to a cache miss are avoided. Preferably, the highest level of cache memory hierarchy is such that a data block pre-fetched into said level of cache memory hierarchy is retrieved by the processor without any stall penalty; this yields peak processor performance. Of course, data blocks that are to be retrieved and that are not yet present in the highest level of the cache memory hierarchy are either subject to pre-fetching before they are needed or reduce overall processor performance.
A goal of pre-fetching in a processor-based system is to reduce the processing time penalty incurred by processor cache misses. This goal has been addressed in the prior art. For example, in U.S. Pat. No. 6,272,516, a method is disclosed where the use of multiple processors reduces cache misses. U.S. Pat. No. 5,761,506, entitled "Method and apparatus for handling cache misses in a computer system", also discloses a manner in which cache misses are reduced.
In the paper entitled "Improving Processor Performance by Dynamically Pre-Processing the Instruction Stream," Dundas J. D., The University of Michigan 1997, multiple dynamic pre-fetching techniques are disclosed as well as methods for their use. State of the art pre-fetching techniques usually rely on certain regularity in references to data stored in RAM made by the instructions executed by the processor. For example, successive executions of a memory reference instruction, such as a processor load instruction, may refer to memory addresses separated by a constant value, known as the stride. This stride is used to direct a pre-fetch of a data block contained in an anticipated future referenced memory address. Thus, pre-fetching exploits a spatial correlation between memory references to improve processor performance. In some cases, within cache memory, spatial locality of the data blocks is useful to improve performance. Prior art U.S. Pat. No. 6,079,006, entitled "Stride-based data address prediction structure," discloses a data prediction structure that stores base addresses and stride values in a prediction array.
Pre-fetching may be directed by software, by means of programming, by compiler inserted pre-fetch instructions, or may be directed by means of hardware. In the case of hardware directed pre-fetching, the hardware tries to detect regularity in memory references and automatically, without the presence of explicit pre-fetch instructions in the program stream, generates pre-fetching of data blocks. Combined hardware/software based techniques are also known in the prior art. Although the prior art pre-fetching techniques are intended to improve processor performance, there are some downsides to using them.
For example, successive references to memory addresses A, A+200, A+400, and A+600, may direct the prior art pre-fetch mechanism to pre-fetch the data block containing address A+800, assuming a stride of 200, when the data block is not yet present in the higher level of cache memory hierarchy and has not yet been requested.
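The stride detection described in this example can be illustrated with a minimal sketch in C. It is not taken from any of the cited references; the type stride_predictor, the functions predictor_init and predictor_observe, and the rule that a stride must repeat once before a prediction is issued are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state kept for one memory reference instruction. */
typedef struct {
    uint64_t last_addr;      /* most recent address referenced */
    int64_t  stride;         /* last observed difference between references */
    unsigned confirmations;  /* times in a row the same stride was seen again */
    bool     valid;          /* false until the first reference is recorded */
} stride_predictor;

void predictor_init(stride_predictor *p)
{
    assert(p != NULL);
    p->last_addr = 0;
    p->stride = 0;
    p->confirmations = 0;
    p->valid = false;
}

/* Record one reference. Returns true and writes the anticipated next address
 * to *next once the same non-zero stride has been observed twice in a row
 * (an assumed confidence rule). */
bool predictor_observe(stride_predictor *p, uint64_t addr, uint64_t *next)
{
    assert(p != NULL && next != NULL);
    if (!p->valid) {
        p->last_addr = addr;
        p->valid = true;
        return false;
    }
    int64_t delta = (int64_t)(addr - p->last_addr);
    if (delta != 0 && delta == p->stride) {
        p->confirmations++;
    } else {
        p->stride = delta;       /* new candidate stride, not yet confirmed */
        p->confirmations = 0;
    }
    p->last_addr = addr;
    if (p->confirmations >= 1) {
        *next = addr + (uint64_t)p->stride;
        return true;
    }
    return false;
}
```

With references to A, A+200, A+400 and A+600, this sketch first issues a prediction at the third reference and, after the fourth, predicts A+800, matching the example above.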
The process of pre-fetching data blocks uses a bus, which provides for communication between the memory, in the form of RAM, and cache memory. As a result, pre-fetching of data blocks from the memory increases bus utilization and decreases the available bus bandwidth. This process of pre-fetching may also result in the pre-fetching of data blocks that will not be used by the processor, thereby adding an unnecessary load to the bus, where another fetch may be necessary for the processor to obtain the required data. Fetching a data block into a certain level of the cache memory hierarchy requires replacing an existing cache data block, and the replacing of such a data block may result in extra bus utilization. Often, the cache data blocks are re-organized such that the block being replaced is moved to a lower level of the cache memory hierarchy. Furthermore, the moved data block is no longer available at the highest level of cache memory hierarchy for future reference, which may result in other cache misses.
On the other hand, pre-fetching of extra data blocks, in anticipation of their use by the processor, may also result in bursty bus utilization, where the pre-fetches are not spread in time but follow each other in rapid succession. This problem is most apparent when a series of pre-fetches is initiated to fetch multiple data blocks that hold, for example, data relating to a two dimensional sub-structure of a larger two dimensional structure, such as in the case of a cut and paste operation, where a sub graphic image is fetched from a larger graphic image laid out in memory in row-order format. Bursty bus utilization may cause temporary starvation of other processor components that require the shared bus resource, which may result in other types of processor stall cycles, thus having a degrading effect on processor performance. Software directed pre-fetching typically requires insertion of pre-fetch instructions into the program stream being executed by the processor, thereby decreasing processor instruction bandwidth. Hardware directed pre-fetching usually requires a non-negligible amount of chip area to detect regularity in memory references. In the prior art, the use of memories of several kilobytes to monitor memory references is not unknown for hardware based techniques. Such hardware techniques are employed such that pre-fetching of data blocks is initiated early enough so that the pre-fetch is completed by the time the pre-fetched data is actually required by the processor; otherwise the processor will stall to resolve the cache miss condition.
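To make the two dimensional sub-structure case concrete, the following C sketch computes where the rows of a sub-image lie within a larger image stored in row-order format, and how many data blocks a fetch of the sub-image touches. The function names, the parameters (pitch, bytes_per_pixel, block_size) and the assumption that the blocks touched by different rows do not overlap are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Address of the first byte of row r of a sub-image whose top-left pixel is
 * at (x0, y0) in a row-order image. pitch is the byte length of one full
 * image row (an assumed parameter name). */
uint64_t subimage_row_addr(uint64_t base, size_t pitch, size_t bytes_per_pixel,
                           size_t x0, size_t y0, size_t r)
{
    return base + (uint64_t)(y0 + r) * pitch + (uint64_t)x0 * bytes_per_pixel;
}

/* Number of data blocks touched when fetching a w-by-h sub-image.
 * Assumes pitch is large enough that blocks touched by different rows do not
 * overlap; otherwise shared blocks would be counted more than once. */
size_t subimage_blocks(uint64_t base, size_t pitch, size_t bytes_per_pixel,
                       size_t x0, size_t y0, size_t w, size_t h,
                       size_t block_size)
{
    assert(w > 0 && bytes_per_pixel > 0 && block_size > 0);
    size_t blocks = 0;
    for (size_t r = 0; r < h; r++) {
        uint64_t first = subimage_row_addr(base, pitch, bytes_per_pixel,
                                           x0, y0, r);
        uint64_t last = first + (uint64_t)w * bytes_per_pixel - 1;
        blocks += (size_t)(last / block_size - first / block_size) + 1;
    }
    return blocks;
}
```

Successive rows of the sub-image are separated by the full image pitch rather than by the sub-image width, so a fetch of the whole sub-image issues a short run of block requests per row, which is one way the bursty pattern described above can arise.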
Efficient pre-fetching significantly improves processor performance, while attempting to limit the downsides. Therefore, it would be advantageous to have a pre-fetching technique which does not rely on dynamically detected regularity in data memory references made by instructions within the program stream, as well as having a pre-fetching technique that supports a low level of occurrences of stall cycles by the processor attributable to cache misses.
There exists a need to provide a hardware and software directed approach to pre-fetching of data in such a manner that the occurrence of processor stall cycles is reduced.
In accordance with the invention there is provided a processor for processing of instruction data including memory access instructions for accessing an external RAM memory comprising: a region stride memory location for storing of a pre-fetch operation stride; a memory region identifier for storing data indicative of a memory region within the external RAM memory within which to apply the stored pre-fetch operation stride; a pre-fetch circuit for pre-fetching of data from the external RAM memory, the pre-fetching circuit responsive to the pre-fetch memory stride when accessing data within the pre-fetch memory region for fetching from said memory using said stride.
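By way of illustration only, the relationship between the region stride memory location, the memory region identifier and the pre-fetch circuit may be modeled in software as in the following C sketch. The structure region_stride_regs, the function prefetch_target, the representation of a region as a base and a size, and the choice not to pre-fetch beyond the end of the region are assumptions of this sketch and are not stated in the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of the programmable storage elements. */
typedef struct {
    uint64_t region_base;    /* memory region identifier: start address */
    uint64_t region_size;    /* memory region identifier: size in bytes */
    uint64_t region_stride;  /* stored pre-fetch operation stride in bytes */
} region_stride_regs;

/* Model of the pre-fetch circuit's decision for one memory access.
 * Returns true and writes the pre-fetch address to *target when the access
 * falls inside the region and the strided address is still inside it. */
bool prefetch_target(const region_stride_regs *regs, uint64_t access_addr,
                     uint64_t *target)
{
    assert(regs != NULL && target != NULL);
    /* Accesses outside the region do not use the stored stride. */
    if (access_addr < regs->region_base ||
        access_addr - regs->region_base >= regs->region_size) {
        return false;
    }
    uint64_t candidate = access_addr + regs->region_stride;
    /* Assumption: never pre-fetch past the end of the region. */
    if (candidate - regs->region_base >= regs->region_size) {
        return false;
    }
    *target = candidate;
    return true;
}
```

In this model no history of past references is needed to choose the pre-fetch address, since the stride is supplied for the region rather than detected dynamically.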
In accordance with the invention there is also provided a method of pre-fetching data from external RAM memory, comprising the steps of: providing to a processor for use in storing within memory locations a stride and data determinative of a memory region within the external RAM, the stride for use in pre-fetching of data from within the memory region; determining a region within which the data is stored for being pre-fetched therefrom; determining the pre-fetch stride stored within the processor in association with the region; defining a data block having a size based on the pre-fetch stride and a start location based on a previously pre-fetched data block memory location within the external memory; and, copying the data block located at the start location to a destination memory location.
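The steps of defining a data block and copying it may be sketched in C as follows. This is one possible reading only: the structure prefetch_state, the function prefetch_next_block, the choice of a block size equal to the stride, and the choice of a start location one stride after the previously pre-fetched block are assumptions. An ordinary byte array stands in for the external RAM.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical state for pre-fetching within one region. */
typedef struct {
    const unsigned char *region;  /* stand-in for the region in external RAM */
    size_t region_size;           /* size of the region in bytes */
    size_t stride;                /* pre-fetch stride for the region in bytes */
    size_t prev_start;            /* offset of the previously pre-fetched block */
} prefetch_state;

/* Copy the next data block to dest, which must hold at least stride bytes.
 * Returns the number of bytes copied, or 0 when the region is exhausted. */
size_t prefetch_next_block(prefetch_state *s, unsigned char *dest)
{
    assert(s != NULL && dest != NULL && s->stride > 0);
    /* Assumption: the next block starts one stride after the previous one. */
    size_t start = s->prev_start + s->stride;
    if (start >= s->region_size) {
        return 0;
    }
    /* Assumption: block size equals the stride, clipped at the region end. */
    size_t size = s->stride;
    if (size > s->region_size - start) {
        size = s->region_size - start;
    }
    memcpy(dest, s->region + start, size);
    s->prev_start = start;
    return size;
}
```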
In accordance with the invention there is also provided a method of pre-fetching a data block from data memory into buffer memory, comprising the steps of: providing a processor having memory therein and a pre-fetch circuit; providing random access memory; providing within the memory within the processor a lookup table having stored therein a relation between at least a region and a corresponding region stride; comparing a data memory reference instruction data memory access address to the at least a region stored within the lookup table to determine a region within which the data memory reference instruction data memory access address is located; and, providing the region stride associated with the region within which the data memory reference instruction data memory access address is located to the pre-fetch circuit of the processor.
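The lookup table and comparison steps may be illustrated by the following minimal C sketch. The entry layout (base, size, stride), the linear search, the function names lookup_region and region_stride_for, and the use of a stride of 0 to mean that no region matched are all assumptions made for illustration; a hardware implementation would likely compare all entries in parallel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookup table entry relating a region to its region stride. */
typedef struct {
    uint64_t base;    /* first address of the region */
    uint64_t size;    /* size of the region in bytes */
    uint64_t stride;  /* region stride in bytes */
} region_entry;

/* Compare an access address against each region in the table.
 * Returns the matching entry, or NULL when the address is in no region. */
const region_entry *lookup_region(const region_entry *table, size_t n,
                                  uint64_t addr)
{
    assert(table != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (addr >= table[i].base && addr - table[i].base < table[i].size) {
            return &table[i];
        }
    }
    return NULL;
}

/* Stride to hand to the pre-fetch circuit for this access address.
 * Assumption: 0 signals that no region stride applies. */
uint64_t region_stride_for(const region_entry *table, size_t n, uint64_t addr)
{
    const region_entry *e = lookup_region(table, n, addr);
    return e != NULL ? e->stride : 0;
}
```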
In accordance with another aspect of the invention there is provided a storage medium having data stored thereon, the data indicative of executable instructions for performing the steps of: storing within memory within the processor data indicative of a first pre-fetch stride; and storing within memory within the processor data indicative of a first region of memory within which to employ the first pre-fetch stride.
In accordance with yet another embodiment of the invention there is provided a process for performing the step of: performing a memory allocation operation, the memory allocation operation dependent on a data type for storage within the allocated memory, the memory allocation operation including the steps of: allocating a memory region within memory, storing within memory within the processor data indicative of a first pre-fetch stride for the allocated memory region and storing within memory within the processor data indicative of the allocated memory region location and size within which to employ the first pre-fetch stride.
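A memory allocation operation of this kind might look like the following C sketch. The enumeration data_kind, the function alloc_with_stride, the global g_region that stands in for the storage within the processor, and the stride chosen for each data type (one image row, or an assumed 64-byte block for sequential data) are hypothetical. On real hardware the region and stride would presumably be written through a processor-specific mechanism that is not modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical data types that influence the choice of stride. */
typedef enum { DATA_BYTE_STREAM, DATA_IMAGE_ROWS } data_kind;

/* Stand-in for the region and stride storage within the processor. */
typedef struct {
    uintptr_t base;   /* location of the allocated region */
    size_t    size;   /* size of the allocated region in bytes */
    size_t    stride; /* pre-fetch stride to employ within the region */
} region_record;

static region_record g_region;

/* Assumed policy: images step one row at a time, streams one 64-byte block. */
static size_t stride_for_kind(data_kind kind, size_t row_bytes)
{
    switch (kind) {
    case DATA_IMAGE_ROWS:
        return row_bytes;
    case DATA_BYTE_STREAM:
    default:
        return 64;
    }
}

/* Allocate a region and record its location, size and stride.
 * Returns NULL, leaving the record unchanged, if allocation fails. */
void *alloc_with_stride(size_t size, data_kind kind, size_t row_bytes)
{
    assert(size > 0);
    void *p = malloc(size);
    if (p == NULL) {
        return NULL;
    }
    g_region.base = (uintptr_t)p;
    g_region.size = size;
    g_region.stride = stride_for_kind(kind, row_bytes);
    return p;
}
```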