1. Field of the Invention
The present invention relates to a cache device for coupling to a memory device, and to a method of operation of such a cache device. In particular the present invention relates to a cache device for coupling to a memory device arranged as a plurality of blocks of memory locations, where each block has to be activated before any data value stored in that block can be accessed.
2. Description of the Prior Art
Many system memories (including, for example, DRAM) feature a property where sequential accesses are processed faster than non-sequential accesses. Many system memories (especially DRAM) also have the concept of page access, where access to a previously opened page is faster than to one that has never been accessed. Further, memories typically have a limit on the number of pages being open at one time, having to close a page in order to open a new one.
For performance and energy consumption reasons, it is therefore beneficial to extract as much page reuse as possible from a stream of memory transactions. One type of operation that will generate a stream of memory transactions is a block copy operation (for example memory copy and set operations). Due to the volume of memory transactions associated with such operations, the efficiency of block copy operations is important to system performance. In principle a memory copy operation (also referred to herein as a memcpy operation) features two transaction streams, namely one read transaction stream and one write transaction stream. However, often the master device (e.g. a central processing unit (CPU)) performing the memcpy operation will not be directly coupled to the system memory, but instead will be coupled to a memory controller (that interfaces to the memory) via one or more levels of cache. When filtered through a modern cache design, these two transaction streams are transformed into three fundamental streams, namely a read stream, an eviction stream due to read allocations in the cache, and an eviction stream due to write allocations in the cache. Furthermore the schemes used to determine which cache line is evicted often introduce a pseudo-random property and as a result the addresses of the two eviction streams tend to have little correlation. This leads to poor page reuse in the memory controller interfacing to the memory device.
The above problem will be discussed in more detail with reference to FIGS. 1 to 3. FIG. 1 illustrates a data processing system 10 taking the form of a System-on-Chip (SoC). Within the SoC, a CPU 15 is provided that is coupled via a cache 20 to a SoC bus 25. A memory controller 30 is also connected to the SoC bus for interfacing with a memory device 50 via a memory interface bus 35. In this example the memory device takes the form of a DRAM. A DRAM memory is typically organized in banks (typically 4-8 per rank), and in the example of FIG. 1 four banks 80, 85, 90, 95 are shown. Each bank contains an array of memory bit cells 82, and has an associated array of sense amplifiers 60, 65, 70, 75 which serve to sense signal values upon bit lines 84 passing through the memory banks and representing bit values stored within the bit cells 82 of a row of bit cells. The row of bit cells is selected for access by wordline signal WL asserted across the memory banks 80, 85, 90 and 95. This type of arrangement of a memory 50 will be familiar to those in this technical field and will not be described further herein.
The output from the sense amplifiers 60, 65, 70, 75 is stored into a buffer store 55 within the memory 50. This buffer store 55 has a number of row buffers, each row buffer containing the bit values from a complete row of bit cells as selected by the wordline signal WL. While the bit values continue to be stored in the buffer store 55, subsequent accesses to those same bit values will take place via the buffer store 55 rather than requiring a fresh read of the row of bit cells selected by the wordline signal WL. When a new row of bit cells is to be accessed, then to make space in the buffer store for that new row, the contents of a row buffer within the buffer store 55 will be written back into the row of bit cells concerned if necessary, for instance if any of the bit values have been changed or if the initial read of the bit cells concerned was destructive.
Each bank 80, 85, 90, 95 will comprise a plurality of rows and a plurality of columns, each row holding a single page of data, and the width of a column being equal to the width of the data bus on the DRAM chip. Each bank has a row buffer associated with it that can hold data from a single row (page) in the bank. Data is fetched from a DRAM bank in page-granularity (typically of size 4 KB, 8 KB or 16 KB). The data in the DRAM is stored as charge on a leaky capacitor (thus requiring regular refreshes). The charge on the individual capacitors is so small that a single read of the data discharges the capacitor completely, i.e. a single read actually erases the data that was stored in the capacitance. This data is however then captured in the row buffer associated with the bank. This operation of fetching a page of data from the leaky capacitors and putting it in a row-buffer is called an ACTIVATE (RAS command). This operation is illustrated schematically in FIG. 2A, where in response to a RAS command, an 8 Kbyte page of data is read from a DRAM bank into the row buffer 100.
Thereafter, individual columns of data (in the FIG. 2A example each column 110 being 32 bits wide) can then be read from/written to the page that has been fetched into the buffer using read/write (RD/WR) CAS commands, causing 32-bit wide blocks of data to pass over the memory interface 35 between the row buffer 100 and the memory controller 30.
The total capacity of the DRAM far exceeds the total capacity of row buffers available in the buffer store 55. If a particular cache line needs to be read from/written to DRAM and the bank that the cache line's page would map to has another row/page in its row buffer, then the page that is currently open needs to be written back to its location on the capacitance store, before the desired page can be fetched. This operation of putting a page back on the capacitance store is called a PRECHARGE operation. This situation, where the bank that data needs to be read from has a different page open, is called a PAGE MISS condition. If the row buffer was empty (holding no data) then no data needs to be written back to the capacitance store (i.e. no precharge operation is required). However the appropriate page does need to be fetched into the row buffer using an ACTIVATE command. This situation is called a PAGE EMPTY condition. If the page that needs to be accessed is already in the row buffer then it is termed a PAGE HIT condition. To summarize, in case of a PAGE MISS condition, three DRAM operations need to be issued, namely precharge, activate (RAS), and RD/WR CAS operations. In case of a PAGE EMPTY condition, two DRAM operations need to be issued, namely activate (RAS) and RD/WR CAS operations. In the event of a PAGE HIT condition only one DRAM operation is needed, namely a RD/WR CAS operation.
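The three conditions summarized above can be sketched as a small lookup. This is a minimal illustration only: the function name classify_access and the open_rows dictionary (mapping a bank number to the row currently held in its row buffer) are hypothetical names introduced here, not part of any described device.

```python
from typing import Dict, List, Tuple


def classify_access(open_rows: Dict[int, int], bank: int, row: int) -> Tuple[str, List[str]]:
    """Return the page condition and the DRAM commands needed for one access.

    open_rows maps a bank number to the row currently in its row buffer;
    a bank with no entry has an empty row buffer (assumed representation).
    """
    current = open_rows.get(bank)
    if current == row:
        # Desired page already in the row buffer: only the column access.
        result = ("PAGE HIT", ["CAS"])
    elif current is None:
        # Row buffer empty: the page must be fetched first.
        result = ("PAGE EMPTY", ["ACTIVATE", "CAS"])
    else:
        # Another page is open: write it back, then fetch the new one.
        result = ("PAGE MISS", ["PRECHARGE", "ACTIVATE", "CAS"])
    open_rows[bank] = row  # the requested row is now the open row
    return result
```

Calling the function repeatedly with the same dictionary tracks the open row per bank, so a second access to the same row of the same bank is reported as a PAGE HIT.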
It should be noted that no data is on the data bus when a precharge or activate command is issued and each command takes multiple cycles to complete. To achieve 100% bus utilization it is necessary to have 100% page hits. There are various known techniques that try to maximize page hits/bus bandwidth, in particular increasing the number of banks per rank, increasing the page size, and/or mapping pages to banks (using appropriate address bits) such that there is minimum interference and thus fewer page misses. Further, as illustrated in FIG. 2B, the memory controller 30 will typically have a buffer 150 for storing pending memory accesses, and the memory access manager 160 may perform some reordering of the memory accesses having regard to a list of open rows 170 in order to try and optimise page hits within the DRAM. However, such techniques do not avoid inefficiency problems when handling large sequences of accesses such as can occur when performing block copy operations such as memory copy and set operations.
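By way of illustration only, reordering having regard to a list of open rows might be sketched as follows. The policy shown (serve the oldest pending access that hits an open row, otherwise the oldest pending access) is an assumption chosen for simplicity, and pick_next, pending and open_rows are hypothetical names; the actual behaviour of the memory access manager 160 is not specified here.

```python
from typing import Dict, List, Tuple

Access = Tuple[int, int]  # (bank, row), an assumed representation


def pick_next(pending: List[Access], open_rows: Dict[int, int]) -> Access:
    """Remove and return the next access to issue from the pending buffer."""
    for i, (bank, row) in enumerate(pending):
        if open_rows.get(bank) == row:
            # Oldest pending access that hits an already open row.
            chosen = pending.pop(i)
            break
    else:
        # No pending access hits an open row: fall back to the oldest.
        chosen = pending.pop(0)
    open_rows[chosen[0]] = chosen[1]  # the issued access leaves its row open
    return chosen
```

Such a policy can only exploit page hits that are already present in the pending buffer, which is why it does not help when the buffered accesses are to uncorrelated addresses.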
Memory copy and set operations are critical to system performance. It is common for these to use standard C library functions such as memcpy and memset that have been optimized to take advantage of architecture features; in the case of designs developed by ARM Limited, features such as preload (PLD) have been used to create optimized ARM memcpy routines.
A memcpy routine copies a block of data from a source region of memory to a destination region. The routine is provided with pointers of where to read the source data from and where to write the destination data to. In terms of memory requests this results in one read operation and then one write operation to transfer a single item of data (typically a word of data). In a simple system with no cache this generates two address streams into memory and the DRAM devices. With DRAM devices typically having the capability of having up to 4 banks open at any one time, the read and write streams can occupy a bank without necessarily conflicting with each other (note however that they can still conflict depending on actual address of the source and destination). Hence, if the source and destination address each occupy a different DRAM bank, then every sequential transfer can happen and achieve an open page hit in their respective banks, and hence achieve high memory utilization rates.
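The effect of the source and destination addresses on page hits in a cacheless system can be sketched as below. The address mapping is an assumption made for illustration (8 KB pages, four banks, bank number taken from the bits directly above the page offset, one open row per bank), and the names bank_and_row and memcpy_hit_rate are hypothetical.

```python
PAGE_BYTES = 8 * 1024  # assumed page size
NUM_BANKS = 4          # assumed number of banks
WORD_BYTES = 4         # one word transferred per read/write pair


def bank_and_row(addr: int):
    """Map an address to (bank, row) under the assumed interleaving."""
    page = addr // PAGE_BYTES
    return page % NUM_BANKS, page // NUM_BANKS


def memcpy_hit_rate(src: int, dst: int, nbytes: int) -> float:
    """Fraction of accesses that are page hits for an uncached word-by-word copy."""
    open_rows = {}
    hits = total = 0
    for off in range(0, nbytes, WORD_BYTES):
        # Each word is one read from the source then one write to the destination.
        for addr in (src + off, dst + off):
            bank, row = bank_and_row(addr)
            hits += open_rows.get(bank) == row
            open_rows[bank] = row
            total += 1
    return hits / total
```

Under these assumptions, memcpy_hit_rate(0, PAGE_BYTES, PAGE_BYTES) places the two streams in different banks and only the first access of each stream misses, whereas memcpy_hit_rate(0, PAGE_BYTES * NUM_BANKS, PAGE_BYTES) places both streams in different rows of the same bank and every access misses.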
However in a cached memory subsystem the address patterns to memory are not as sequential and predictable. If the cache is already full of data items then the current data in the cache has to be evicted to make room for the memcpy read data and similarly the write data causes evictions to make room for the memcpy write data. The read data still has to be fetched from external memory but with modern cache allocation policies the write data is usually allocated directly into the cache, and with optimized allocation policies, if sequential write activity is detected, then the line being written to does not have to be fetched from external memory (if it is known that the entire line will be written to, then it does not need to be fetched as all the data will be replaced).
As illustrated in FIG. 3A, the address streams that are sent to DRAM are as follows:
Read address stream to get data to be copied;
Write eviction stream to make space for read data; and
Write eviction stream to make space for write data.
In this example it is noted that the write data itself does not go to DRAM as this is write allocated directly into the cache.
The memory traffic is hence one read and two write eviction streams. The data being evicted is effectively random data out of the cache so may have no correlation from one eviction to the next. A stream of evictions to random addresses will likely cause the DRAM to have page misses that will cause the DRAM to precharge and activate, which can cost typically 10 cycles of DRAM time where no useful work can be undertaken. This results in reduced efficiency of the DRAM interface. The random write eviction data may actually cause more conflicts (and hence reduced efficiency) with the sequential read stream too. The problem is compounded by the fact that there are two random write eviction streams (one caused by the memcpy reads and one caused by the memcpy writes). These tend to hide the regular nature of the read stream due to the three streams being multiplexed together when creating the stream of accesses to the DRAM.
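The three streams can be reproduced with a toy cache model. The following is a sketch under stated assumptions, not a model of any particular cache: the cache is fully associative with random replacement, it starts full of dirty lines at unrelated addresses, the copy is modelled at cache line granularity, read-allocated lines are clean, and write-allocated lines are dirty and are not fetched. The names memcpy_traffic, LINE and OLD_BASE are hypothetical.

```python
import random

LINE = 64            # assumed cache line size in bytes
OLD_BASE = 1 << 30   # assumed region holding the lines already in the cache


def memcpy_traffic(src, dst, nbytes, cache_lines, seed=0):
    """Return the list of (stream, address) transactions sent to DRAM."""
    rng = random.Random(seed)
    # Cache starts full of dirty lines at uncorrelated addresses (addr -> dirty).
    cache = {OLD_BASE + n * LINE: True for n in rng.sample(range(1 << 20), cache_lines)}
    traffic = []

    def allocate(addr, dirty, cause):
        # Random replacement: pick any resident line as the victim.
        victim = rng.choice(list(cache))
        if cache.pop(victim):
            traffic.append((cause, victim))  # dirty victim is written back
        cache[addr] = dirty

    for off in range(0, nbytes, LINE):
        traffic.append(("read", src + off))           # fetch the source line
        allocate(src + off, False, "evict-for-read")  # make space for read data
        allocate(dst + off, True, "evict-for-write")  # make space for write data
    return traffic
```

In the returned list the "read" entries form a sequential address sequence, while the two kinds of eviction entries are interleaved with them and carry addresses that depend only on the victim choice.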
FIG. 3B shows the pin activity of a DRAM during a memcpy operation. This shows the read transaction and the write eviction streams. The two read data transactions 180 are sequential so get a page hit, and fortunately the corresponding write evictions 185 happen to be sequential in this case (by chance). However, the other write transactions 187, 189 each require a precharge and activate. The time 190 occupied by the precharge and activate operations prevents any useful work being done on the pins and therefore reduces the utilization of the pins. In this case the utilization is 58%. If the write eviction traffic exhibited more sequential behaviour then some of the precharge and activate operations could be removed, thus increasing pin utilization.
Several areas of research have attempted to address the problem of optimizing DRAM memory access latency and bandwidth utilization. Rixner et al have discussed various policies to reorder accesses to memory to improve bandwidth utilization. In general prioritizing accesses that hit on open rows helps improve bandwidth utilization, as discussed for example in Rixner, S., Dally, W. J., Kapasi, U. J., Mattson, P., and Owens, J. D. “Memory access scheduling” in Proc. of the 27th Intl. Symp. on Comp. Arch. (Vancouver, British Columbia, Canada). ISCA '00. 128-138, and Rixner, S. “Memory Controller Optimizations for Web Servers” in Proc. of the 37th Intl. Symp. on Microarchitecture (Portland, Oreg., Dec. 4-8, 2004). Micro '04. 355-366. Rixner also considers the tradeoffs of using Virtual channel SDRAMs (where a small SRAM cache formed from partial segments of row buffers is managed separately by the memory controller) in the latter article.
Lin, Reinhardt and Burger have proposed prioritizing prefetch data such that those that hit in row buffers are pushed out first and prefetch requests are sent out only when the data bus is idle, as discussed in Lin, W.-F., Reinhardt, S. K. and Burger, D. “Reducing DRAM latencies with an integrated memory hierarchy design” in Proceedings of the 7th International Symposium on High Performance Computer Architecture, pages 301-312, January 2001. McKee and Wulf developed analytical timing models for blocked and streaming access patterns and analysed the tradeoffs of prefetching various sizes of these patterns with a view to hiding miss latencies on DRAM row buffers, as discussed in McKee, S. A. and Wulf, W. A. “Access ordering and memory-conscious cache utilization” in Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture (Jan. 22-25, 1995).
Ordering of write-destages from the write-cache in a storage controller has been looked into to improve the overall throughput, as discussed in Gill, B. S. and Modha, D. S. “WOW: wise ordering for writes—combining spatial and temporal locality in non-volatile caches” in Proceedings of the 4th Conference on USENIX Conference on File and Storage Technologies—Volume 4 (San Francisco, Calif., Dec. 13-16, 2005).
Performance trade-offs of various DRAM architectures like EDO DRAM, SDRAM, ESDRAM, Rambus etc have been studied by Cuppu et al, as discussed in Cuppu, V., Jacob, B., Davis, B., and Mudge, T. 1999. “A performance comparison of contemporary DRAM architectures” in Proceedings of the 26th Annual International Symposium on Computer Architecture (Atlanta, Ga., United States, May 1-4, 1999). The differences between these architectures are largely in the handshake and interface between the memory controller and DRAM, the row (page) buffer organization and the data bus width. They show the row buffer hit rates against SPEC benchmarks for various organizations for read only traffic and read-write traffic. In read only traffic the write traffic is ignored by the memory controller; read only traffic shows a lot more row buffer hits. To quote the authors, “Write traffic, due to writebacks, disrupts the locality of the address stream for architectures with fewer internal banks”. The paper is somewhat dated, so the DRAM sizes and cache sizes considered are also dated. In general as long as the last level cache (LLC) is larger than the total size of row buffers on the DRAM, random write-back traffic is bound to disrupt locality.
Wong discusses swapping row address and column address bits in an attempt to improve row (page) hit rate, as discussed in Wayne A. Wong “Techniques Utilizing Memory Reference Characteristics for Improved Performance”, PhD Thesis. University of Wisconsin. Zhang et al show that when a cache line is being evicted due to a read to a conflicting cache line, the line being evicted is bound to conflict with the newly fetched line in the row buffers as well (i.e. cause a page miss), as discussed in the article Zhang, Z., Zhu, Z., and Zhang, X. “A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality”, in Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture (Monterey, Calif., United States). They propose a scheme to reduce conflicts by XOR-ing the middle order bits that are normally used for bank id with upper order bits of the row address (page id) to determine the actual bank id.
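The XOR-based bank selection described above can be sketched as follows. The bit positions are assumptions chosen for illustration (8 KB pages, four banks, and a hypothetical TAG_SHIFT selecting which upper order row address bits are used); they are not taken from the cited article.

```python
PAGE_BITS = 13   # assumed 8 KB page, so bank bits start at bit 13
BANK_BITS = 2    # assumed four banks
TAG_SHIFT = 20   # hypothetical position of the upper order row address bits
BANK_MASK = (1 << BANK_BITS) - 1


def conventional_bank(addr: int) -> int:
    """Bank id taken directly from the middle order address bits."""
    return (addr >> PAGE_BITS) & BANK_MASK


def permuted_bank(addr: int) -> int:
    """Bank id formed by XOR-ing the middle order bits with upper order bits."""
    upper = (addr >> TAG_SHIFT) & BANK_MASK
    return conventional_bank(addr) ^ upper
```

With these values, addresses 0 and 1 << 20 share bank 0 under the conventional mapping but are placed in banks 0 and 1 by the permuted mapping, so an evicted line and the conflicting line that replaces it in the cache need not compete for the same row buffer. Because XOR with a fixed value is a bijection, consecutive pages with the same upper order bits are still spread across all banks.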
Nesbit et al propose to schedule requests coming from different threads in a multi-core system by estimating the finish times of various requests in a system where each thread has a bus of its own with a fraction of the actual bandwidth. Requests that have earlier finish times are prioritized, as discussed in Nesbit, J., Aggarwal, N., Laudon, J. and Smith, J. E. “Fair Queuing Memory Systems”, in Proc. of the 39th ACM/IEEE Symposium on Microarchitecture (MICRO), 2006. Wong and Baer consider organizing the row buffers as a regular multiway-associative cache and present hit rates for various line sizes and associativities, as discussed in Wayne A. Wong and Jean-Loup Baer. “Dram caching”, Technical Report 97-03-04, Department of Computer Science and Engineering, University of Washington, 1997. They also present a set of ‘early’ precharge policies such as ‘precharge after one read’, ‘precharge after two reads’ etc to hide the precharge latency and hence improve access latency.
As will be apparent from the above discussion, most of the previously proposed schemes to improve DRAM bandwidth utilization tackle the problem by suggesting optimizations in the memory controller, for example scheduling requests to improve hit rate, determining bank id and row/column address by selecting the right combination of physical address bits, proactively closing pages before a row buffer conflict is encountered to hide the latency of precharge commands, etc. A few researchers have proposed aligning prefetches with an awareness of the DRAM organization.
The Cortex-A9 processor developed by ARM Limited has a feature whereby on detection of three sequential write operations, the level 1 cache is reconfigured into a non-write-allocate mode (regardless of requests in cache allocation bits). This means that the following sequential write traffic is not cached in the level 1 cache. If there were no intervening cache levels between the level 1 cache and the memory, this would mean that a series of sequential writes would be passed in their original order to memory. However, typically there will be intervening cache levels, such as a level 2 cache, which would cache such accesses, and later evict them to memory, whereby the above described problems would still arise. Further, operations such as memcpy operations involve a stream of sequential reads as well as a stream of sequential writes, and the sequential reads will still be cached in the level 1 cache, giving rise to problems due to the random nature of the write eviction stream used to make way for the read data in the cache.
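The general idea of switching to a non-write-allocate mode after a run of sequential writes can be sketched as below. This is a hypothetical illustration and not the Cortex-A9 implementation: the class name SequentialWriteDetector, the 4-byte stride used to define "sequential", and the rule for leaving the mode (any non-sequential write) are all assumptions.

```python
class SequentialWriteDetector:
    """Track runs of sequential writes and decide whether to allocate."""

    def __init__(self, threshold: int = 3, stride: int = 4) -> None:
        self.threshold = threshold  # run length that triggers the mode switch
        self.stride = stride        # assumed address step between sequential writes
        self.last_addr = None
        self.run = 0

    def observe_write(self, addr: int) -> bool:
        """Record one write; return True if the cache should allocate it."""
        if self.last_addr is not None and addr == self.last_addr + self.stride:
            self.run += 1
        else:
            self.run = 1  # non-sequential write starts a new run
        self.last_addr = addr
        # Writes beyond the first `threshold` of a run bypass allocation.
        return self.run <= self.threshold
```

In this sketch the first three writes of a sequential run are allocated and the following sequential writes are not, until a non-sequential write restarts the count.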
It is an object of the present invention to seek to improve processor efficiency and memory utilization when external memory devices are connected to the processor, when those memory devices contain blocks that need to be activated before they can be accessed.