1. Field of the Invention
The current invention generally relates to computer memory systems. More particularly, the current invention relates to computer memory systems where an application writes a block of data into memory in a first order and then reads the block of data in a second order.
2. Description of the Related Art
Computing system hardware and software designers are constantly addressing the problem of how to get data transferred between storage and a processor as quickly as possible. Processor speeds have dramatically increased over the years as processor technology has improved, leading to faster processor cycle times and increased processor densities. Although memory density has also dramatically improved, performance of memory technology has lagged behind the performance of the processor.
Caching techniques have greatly helped in matching processor performance to memory performance for many applications. Caching involves a hierarchy of memories that depends on a computer application's tendency to reuse data at a particular address, or data that is “near” the particular address. This tendency is called “locality of reference”. Typically, a very fast but relatively small first level cache is provided on the same chip as the processor. For example, a first level cache might be 64 KB (kilobytes) and provide data to the processor in one or two processor cycles. A first level cache is built using SRAM (static random access memory) in virtually all current designs. Many processors have two first level caches, one for instructions and one for data. Second level caches are typically larger and slower than first level caches. In current technology, second level caches are also constructed on the same chip as the processor. A second level cache is typically 128 KB to a megabyte (MB). A second level cache, like a first level cache, is typically built using SRAM technology. Third level caches are larger and slower than second level caches, often using DRAM (dynamic random access memory) technology, although some current computer systems utilize SRAM for the third level cache.
Cache implementations read and write blocks of memory called cache lines. Cache lines typically contain from 32 bytes to 256 bytes. When the processor needs data at address “x”, the first level cache is checked to see if it holds a cache line containing address “x”. If so, the data at address “x” is read from the first level cache and is made available to the processor. If the first level cache does not hold the cache line containing address “x” (called a cache miss), the second level cache is checked. If the second level cache holds the cache line containing address “x”, that cache line is read and typically moved to the first level cache, with the data at address “x” made available to the processor. Similarly, if a cache miss occurs in the second level cache, the cache hierarchy is further checked until the cache line containing the requested address is found, perhaps in main memory, and the cache line containing the requested address is then copied down the cache hierarchy. Cache hierarchies work well as long as data addressing patterns have a high degree of locality of reference. Locality of reference means that, in many applications, a particular data element that has been referenced is likely to be referenced again in the near future. Also, in many applications, if a particular data element has been referenced, another data element having an address very close to that of the particular data element is likely to be used in the near future.
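The lookup sequence described above can be sketched as follows. This is a minimal illustration only, not a model of any particular processor: the 128 byte line size, the use of unbounded sets for each cache level, and the names `line_of` and `lookup` are assumptions introduced for the example, and cache capacity and line replacement are deliberately omitted.

```python
LINE_SIZE = 128  # bytes per cache line (assumed; the text gives 32 to 256 bytes)


def line_of(addr):
    """Return the index of the cache line that contains a byte address."""
    return addr // LINE_SIZE


def lookup(addr, levels):
    """Check each cache level in order and fill the faster levels on a miss.

    `levels` is a list of sets of line indices, fastest level first.
    Returns the index of the level that supplied the line; a return value
    of len(levels) means the line came from main memory.
    """
    line = line_of(addr)
    for i, level in enumerate(levels):
        if line in level:
            hit = i
            break
    else:
        hit = len(levels)  # missed in every cache level
    # Copy the line into every level that was checked and missed.
    for level in levels[:hit]:
        level.add(line)
    return hit


l1, l2 = set(), set()
first = lookup(0x1000, [l1, l2])   # misses in both caches, served by main memory
second = lookup(0x1008, [l1, l2])  # same 128 byte line, now a first level hit
```

The second access hits in the first level cache only because address 0x1008 lies in the same cache line as address 0x1000, which is the spatial form of locality of reference described above.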
Some applications do not have a high degree of locality of reference. For example, a particular scientific application reads data in a sequential manner. In the example, the data comprises a block of readings from a sensor, the block of data making up a mathematical matrix. Mathematical operations are subsequently performed on the data in the mathematical matrix, often not in the same order in which the data was written into the mathematical matrix. To illustrate further, consider a two-dimensional matrix, x(32,32), in the notation of the FORTRAN programming language. Data is written sequentially into the matrix (in FORTRAN) as x(1,1), x(2,1), x(3,1) . . . x(32,1), x(1,2), x(2,2), and so on. Many matrix operations will address elements in the matrix in a different order, such as making sequential reads to every 32nd data element in the matrix in the example, for instance x(1,1), x(1,2), x(1,3), and so on. Making such regular access to relatively widely separated data elements is called “striding”. In the example, the “stride” is 32 data elements. (Note that in various computer systems a data element could be a byte, a 32-bit word, a 64-bit double word, or any other suitably defined piece of data used by the processor).
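The stride in the x(32,32) example follows directly from FORTRAN's column-by-column storage order. The short sketch below computes element offsets for the two access orders; the function name `offset` is an assumption introduced for the illustration, and offsets are counted in data elements rather than bytes.

```python
ROWS = 32  # first dimension of x(32,32)


def offset(i, j):
    """Element offset of x(i,j) using 1-based, column-by-column storage."""
    return (i - 1) + (j - 1) * ROWS


# Order in which the data is written: x(1,1), x(2,1), x(3,1), x(4,1)
write_order = [offset(i, 1) for i in range(1, 5)]

# Order in which a row is read: x(1,1), x(1,2), x(1,3), x(1,4)
row_walk = [offset(1, j) for j in range(1, 5)]

# Distance between consecutive reads along the row, in data elements
strides = {b - a for a, b in zip(row_walk, row_walk[1:])}
```

Here `write_order` holds consecutive offsets, while every step of `row_walk` advances by 32 data elements, which is the stride named in the example.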
Striding in a computer system often makes a cache hierarchy counterproductive. For example, assuming the block of data (e.g., the matrix) is too big to be contained in a level of cache (in particular, the first level cache), entire cache lines must be moved into, and later moved out of, the level of cache to satisfy a need for a small portion of the content of the cache lines. Suppose that a mathematical matrix operation needs an 8 byte data element in a 128 byte cache line. Due to a long stride, the cache line will not be accessed again for some time. The entire 128 byte cache line is moved into the cache, the 8 byte data element is used, and the cache line is later replaced before any of the other 120 bytes are used. This operation places sixteen times (i.e., 128/8) as much traffic as is required on the data bus supplying the cache.
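The sixteen-fold figure can be checked with a small simulation. The sketch below is an illustration under stated assumptions, not a model of a real cache: it assumes a 32 by 32 matrix of 8 byte elements stored column by column, a hypothetical cache that holds only four 128 byte lines with least-recently-used replacement, and reads that walk each row in turn. The name `traffic_ratio` is introduced for the example.

```python
from collections import OrderedDict

ELEM = 8       # bytes per data element
LINE = 128     # bytes per cache line
N = 32         # matrix is N by N, stored column by column
CAPACITY = 4   # cache lines held at once (assumed, deliberately small)


def traffic_ratio():
    """Return bytes moved into the cache divided by bytes actually used."""
    cache = OrderedDict()  # line index -> True, ordered oldest to newest
    moved = used = 0
    for i in range(N):          # walk each row in turn...
        for j in range(N):      # ...so consecutive reads are N elements apart
            line = ((i + j * N) * ELEM) // LINE
            used += ELEM
            if line in cache:
                cache.move_to_end(line)        # hit: mark as most recently used
            else:
                moved += LINE                  # miss: the whole line is transferred
                cache[line] = True
                if len(cache) > CAPACITY:
                    cache.popitem(last=False)  # evict the least recently used line
    return moved / used


ratio = traffic_ratio()
```

Under these assumptions every read misses, because the 32 distinct lines touched by one row walk evict each line before the next row returns to it, so `ratio` equals 128/8.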
Modern memories in computer systems are usually made up of groups and banks. A group comprises a plurality of banks. In modern DDR-2 DRAM, each DRAM module can have four banks of memory. In addition to making the cache hierarchy counterproductive, striding often causes sequential reads to a single group or even a single bank in a particular group. A bank within a group typically requires a significant time interval between a first access and a following access to that bank. Typically there is a single data bus from a memory control unit to a particular group; therefore, repeatedly accessing the same group, and particularly the same bank within a group, can dramatically delay data needed by the processor.
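How a stride can concentrate accesses on one group or one bank depends on how addresses are mapped to groups and banks, which the description above does not specify. The sketch below assumes a purely hypothetical mapping: four groups of four banks, interleaved in 64 byte units, with the group selected by the low-order unit bits and the bank by the next bits. The names `locate` and `targets` are likewise introduced for the example.

```python
INTERLEAVE = 64  # bytes mapped to one bank before moving on (assumed)
GROUPS = 4       # number of groups (assumed)
BANKS = 4        # banks per group (four, as in the DDR-2 example)


def locate(addr):
    """Return the (group, bank) pair for a byte address under the assumed mapping."""
    unit = addr // INTERLEAVE
    return unit % GROUPS, (unit // GROUPS) % BANKS


def targets(stride_bytes, count=8):
    """Set of (group, bank) pairs touched by `count` accesses at a fixed stride."""
    return {locate(k * stride_bytes) for k in range(count)}


sequential = targets(64)   # consecutive units spread over groups and banks
strided = targets(256)     # 32 elements of 8 bytes: every access hits one group
worst = targets(1024)      # a longer stride: every access hits one bank
```

With this mapping, sequential accesses reach eight different group and bank pairs, the 256 byte stride keeps every access in group 0, and the 1024 byte stride sends every access to bank 0 of group 0. A different mapping would shift which strides collide, but some strides would still collide.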
Therefore, there is a need for a method and apparatus that provide for more efficient handling of striding requirements.