Hierarchical memory systems are known for a variety of essentially digital systems which comprise, for example, a processor and memory for use with the processor. A conventional system is described in “VLSI Memory Chip Design”, Kiyoo Itoh, Springer-Verlag, 2001, especially chapter 6. A multi-level memory architecture for a personal computer is shown in FIG. 6.1 of this book. It comprises a processor with an on-chip cache memory L1, an off-chip cache memory L2 and a main memory controlled by a memory controller and connected to the processor by a processor bus. In addition, a magnetic hard disc memory is accessible via a system bus and is controlled by a hard disc controller. The level 1 on-chip cache L1 can be SRAM, the level 2 off-chip cache L2 can also be SRAM and the main memory can be DRAM. Since computer programs access a relatively small portion of their address space at any instant, items close to an already accessed item are likely to be accessed in the near future. To take advantage of this spatial locality, a cache memory must have a block size larger than one word. However, if the block size is increased too much, the time taken to load the block from a memory in a lower level increases. There is thus a trade-off between block size and the number of levels in the hierarchical memory. One proposed technique to reduce the cache miss penalty is to use a bank of memories and to interleave the words across the banks. This means that if an item is not available from one bank it is likely to be available from another bank, as adjacent banks hold the words adjacent to the last accessed word.
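The word interleaving mentioned above can be illustrated with a minimal sketch. It assumes low-order interleaving, in which the bank index is the word address modulo the number of banks; the function name `bank_and_offset` and the bank count of four are hypothetical and are not taken from the cited reference.

```python
NUM_BANKS = 4  # assumed number of memory banks (illustrative only)


def bank_and_offset(word_address, num_banks=NUM_BANKS):
    """Map a word address to (bank index, word offset inside that bank).

    Low-order interleaving: consecutive word addresses fall into
    consecutive banks, so neighbours of a word live in neighbouring banks.
    """
    return word_address % num_banks, word_address // num_banks


# Placement of the first eight consecutive words.
layout = [bank_and_offset(address) for address in range(8)]
banks_touched = [bank for bank, _ in layout]
```

With this mapping a run of consecutive words cycles through all banks before returning to the first one, which is why a word adjacent to the last accessed word is held by an adjacent bank.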
Hierarchical memories can also be used in embedded applications, as described for instance in “ARM System-on-Chip Architecture”, S. Furber, Addison-Wesley, 2nd Ed. 2000, especially chapter 10 on memory hierarchy. In particular, the ARM processors support paging. A page is usually a few kilobytes in size, but different architectures use different page sizes. The overhead of address translation can be reduced by using a look-aside buffer, which is a cache of recently accessed page translations. The spatial locality of a typical program enables a buffer of reasonable size to achieve a low miss rate.
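As an illustration of a look-aside buffer acting as a cache of recent page translations, the following sketch may help. It is not the ARM mechanism: the page size of 4096 bytes, the least-recently-used replacement policy, the class name `TinyTLB` and the dictionary-based page table are all assumptions made for the example.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class TinyTLB:
    """Hypothetical look-aside buffer with least-recently-used replacement."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table  # virtual page number -> physical frame number
        self.entries = OrderedDict()  # cached translations, oldest first
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_address):
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page in self.entries:
            self.hits += 1
            self.entries.move_to_end(page)  # mark as most recently used
        else:
            self.misses += 1  # fall back to the (slow) page table
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
            self.entries[page] = self.page_table[page]
        return self.entries[page] * PAGE_SIZE + offset


page_table = {0: 7, 1: 3}  # assumed mapping, illustrative only
tlb = TinyTLB(capacity=2, page_table=page_table)
physical = [tlb.translate(address) for address in (0, 4, 8, 4096, 4100)]
```

Because the five accesses touch only two pages, only the first access to each page misses in the buffer; the remaining accesses reuse the cached translation.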
Design-Time Data Assignment Techniques
For embedded systems, P. Panda, in “Memory Bank Customization and Assignment in Behavioral Synthesis”, Proc. ICCAD, pages 477-481, October 1999, presents assignment algorithms to improve the performance of SDRAM memories. Both algorithms distribute data with a high temporal locality over different banks. In this way the time/energy penalty of page-misses is minimized. These optimizations rely on the fact that the temporal locality in a single-threaded application is analyzable at design-time. This is not the case in dynamic multi-threaded applications, where the temporal locality between tasks depends on their actual schedule, which is only known at run-time. This renders the techniques less useful for such applications.
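The effect of distributing data with high temporal locality over different banks can be shown with a simplified model. The sketch below is not the algorithm of the cited paper; it assumes that each bank keeps exactly one page (row) open and that any access to a different page of the same bank is a page-miss. The names `count_page_misses`, `same_bank` and `split_banks` are hypothetical.

```python
def count_page_misses(accesses, placement):
    """Count page-misses for a trace of data-structure names.

    placement maps each name to a (bank, page) pair. Each bank is
    assumed to keep a single page open at a time.
    """
    open_page = {}  # bank -> currently open page
    misses = 0
    for name in accesses:
        bank, page = placement[name]
        if open_page.get(bank) != page:
            misses += 1
            open_page[bank] = page  # open the newly requested page
    return misses


# Two arrays accessed alternately, i.e. with high temporal locality.
trace = ["A", "B"] * 4

same_bank = {"A": (0, 0), "B": (0, 1)}    # both in bank 0, different pages
split_banks = {"A": (0, 0), "B": (1, 0)}  # one array per bank

misses_same = count_page_misses(trace, same_bank)
misses_split = count_page_misses(trace, split_banks)
```

In this model every access misses when both arrays share a bank, whereas only the first access to each array misses when they are placed in separate banks. The result depends on knowing the access order in advance, which is the design-time assumption discussed above.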
The vector and stream processing community has spent much time and effort in researching optimal placement schemes to improve the bandwidth of interleaved memories (see, e.g., L. Kurian, “Data Placement Schemes to Reduce Conflicts in Interleaved Memories”, Computer Journal, 43(2): 138-151, 2000). However, these techniques focus only on performance and do not discuss other cost issues.
V. Delaluz, M. Kandemir, N. Vijaykrishnan, A. Sivasubramaniam, and M. Irwin present, in “Hardware and Software Techniques for Controlling DRAM Power Modes”, IEEE Trans. Computers, 50(11):1154-1173, November 2001, techniques to reduce the static energy consumption of existing multi-banked SDRAMs in embedded systems. Their strategy consists of clustering data structures which have a large temporal affinity in the same memory bank. As a consequence, the periods when banks are idle are grouped, thereby creating more opportunities to transition more banks into a deeper low-power mode for a longer time. The impact of this technique on the dynamic energy consumption and the performance is ignored.
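The grouping of idle periods obtained by clustering can be illustrated with a coarse model. The sketch below is not the technique of the cited paper; it assumes that execution is divided into phases, that a bank touched at least once in a phase must stay powered during that phase, and that any untouched bank could be put into a low-power mode. The names `idle_bank_phases`, `clustered` and `scattered` are hypothetical.

```python
def idle_bank_phases(phases, placement, num_banks):
    """Count (bank, phase) pairs in which a bank is not accessed at all.

    phases is a list of sets of data-structure names used in each phase;
    placement maps each name to the bank that holds it.
    """
    idle = 0
    for phase in phases:
        active_banks = {placement[name] for name in phase}
        idle += num_banks - len(active_banks)
    return idle


# A and B are used together, then C and D are used together.
phases = [{"A", "B"}, {"C", "D"}]

clustered = {"A": 0, "B": 0, "C": 1, "D": 1}  # temporal affinity shares a bank
scattered = {"A": 0, "B": 1, "C": 0, "D": 1}  # affinity ignored

idle_clustered = idle_bank_phases(phases, clustered, num_banks=2)
idle_scattered = idle_bank_phases(phases, scattered, num_banks=2)
```

With the clustered placement one bank is idle in every phase, while the scattered placement keeps both banks active throughout. The model says nothing about page-misses inside the shared bank, which is the dynamic-energy and performance impact that the paragraph above notes is ignored.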
Run-Time Memory Management Techniques
A scalable and fast multi-processor memory manager is presented by, e.g., E. Berger, K. McKinley, R. Blumofe, and P. Wilson, in “Hoard: A Scalable Memory Allocator for Multithreaded Applications”, Proc. 8th ASPLOS, October 1998. It uses private heaps with a shared memory pool. However, the system is unaware of the cost of the underlying memory architecture.
In a typical application, the data structures which need to be allocated are only known at run-time, so fully design-time based solutions, as proposed earlier in the compiler and system synthesis literature, cannot solve the problem.
Run-time memory management solutions as found in conventional operating systems are too inefficient in terms of cost optimization (especially energy consumption). They are also not adapted to real-time constraints.
Low-power design is a key issue for future dynamic multi-media applications mapped on multi-processor platforms. On these architectures, multi-banked memories (e.g. SDRAMs) are big energy consumers, and their dynamic energy consumption is dominant. A crucial parameter which controls the energy consumption of these memories is the number of page-misses.