Traditional cache-based memory structures are hardware controlled. Although they help to increase the speed of an application program, they also have several drawbacks. Cache memories do not always fit into embedded systems, as they increase the system size and the energy cost. Because more data than required is transferred, and because a tag access and a comparison are needed for every access, cache memories are not particularly energy efficient.
Traditionally, cache memory is categorized in “levels” that describe its closeness and accessibility to the microprocessor. Level 1 (L1) cache is extremely fast but relatively small, and is usually embedded in the processor chip (CPU). Level 1 cache typically includes a data memory (DL1) and an instruction memory (IL1). Level 2 (L2) cache is often more capacious than L1; it may be located on the CPU or on a separate chip or coprocessor with a high-speed alternative system bus interconnecting the cache to the CPU, so as not to be slowed by traffic on the main system bus. Level 3 (L3) cache is typically specialized memory that works to improve the performance of L1 and L2. It can be significantly slower than L1 or L2, but usually operates at double the speed of RAM. In the case of multicore processors, each core may have its own dedicated L1 and L2 cache, but share a common L3 cache. When a memory location is referenced in the L3 cache, it is typically elevated to a higher tier cache.
FIG. 1 illustrates the block-based transfers in a cache memory. When looking for a required data word, the first cache level L1 is checked first (FIG. 1A). If the word is not found, there is an L1 cache miss. A complete data block (i.e. several words) is then fetched from the next cache level. FIG. 1B shows what happens while looking for the required L1 block in the second cache level. If the block is not present there either, there is an L2 cache miss. A complete L2 block (several words) is then fetched from the next level, i.e. the main memory. FIG. 1C finally shows how the L2 miss is resolved first, then the L1 miss, and how the word is eventually delivered.
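The two-level sequence described for FIG. 1 can be sketched as follows. This is a minimal illustration only: the block sizes, the function name read_word and the use of dictionaries with unlimited capacity (no replacement) are assumptions made for the example and are not taken from the figures.

```python
L1_BLOCK = 4    # words per L1 block (assumed value)
L2_BLOCK = 16   # words per L2 block (assumed value)


def read_word(addr, l1, l2, main_memory, log):
    """Return the word at addr, filling L2 and then L1 on misses.

    l1 and l2 map a block base address to a list of words. Capacity is
    unlimited here, so replacement is deliberately not modelled.
    """
    l1_base = addr - addr % L1_BLOCK
    if l1_base not in l1:
        log.append("L1 miss")
        l2_base = addr - addr % L2_BLOCK
        if l2_base not in l2:
            log.append("L2 miss")
            # The L2 miss is resolved first: a whole L2 block is fetched.
            l2[l2_base] = main_memory[l2_base:l2_base + L2_BLOCK]
        # Then the L1 miss is resolved from the L2 block.
        off = l1_base - l2_base
        l1[l1_base] = l2[l2_base][off:off + L1_BLOCK]
    # Finally the requested word is delivered from L1.
    return l1[l1_base][addr - l1_base]
```

Note that a single word request moves a whole L1 block and, on an L2 miss, a whole L2 block, which is the source of the extra transfers mentioned above.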
FIG. 2 shows the main blocks of an n-way associative cache. When a new address (of x bits) is presented to the cache controller, the m central bits are used to determine which set of blocks of the cache must be checked. Every tag associated with a block of the set is read and driven to a comparator (there are n tags per set in an n-way associative cache). Each of the read tags is compared with the x-m-k most significant bits of the address. If one (and at most one) of the comparisons returns true, the access is a cache hit. The data block associated with the matching tag is read and the required word (selected by the k least significant bits) is sent to the upper layer (possibly the CPU). If none of the tag comparisons succeeds, the access is a cache miss and the request is forwarded to the next level of the hierarchy.
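The address decomposition and tag comparison described for FIG. 2 can be sketched as follows. The names split_address and SetAssociativeCache, and the first-in-first-out replacement used in fill, are assumptions made for this example; the description above does not specify a replacement policy.

```python
def split_address(addr, x, m, k):
    """Split an x-bit address into (tag, set_index, word_offset)."""
    assert 0 <= addr < (1 << x)
    word_offset = addr & ((1 << k) - 1)        # k least significant bits
    set_index = (addr >> k) & ((1 << m) - 1)   # m central bits
    tag = addr >> (k + m)                      # x-m-k most significant bits
    return tag, set_index, word_offset


class SetAssociativeCache:
    def __init__(self, x, m, k, n):
        self.x, self.m, self.k, self.n = x, m, k, n
        # Each of the 2**m sets holds at most n (tag, block) pairs.
        self.sets = [[] for _ in range(1 << m)]

    def lookup(self, addr):
        """Return (True, word) on a hit and (False, None) on a miss."""
        tag, set_index, offset = split_address(addr, self.x, self.m, self.k)
        for stored_tag, block in self.sets[set_index]:
            if stored_tag == tag:          # one comparison per way
                return True, block[offset]
        return False, None

    def fill(self, addr, block):
        """Install a block of 2**k words fetched from the next level."""
        assert len(block) == 1 << self.k
        tag, set_index, _ = split_address(addr, self.x, self.m, self.k)
        ways = self.sets[set_index]
        if len(ways) == self.n:
            ways.pop(0)                    # assumed FIFO replacement
        ways.append((tag, list(block)))
```

In hardware the n tag comparisons of a set are performed in parallel; the loop in lookup only models their outcome.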
As an alternative to caches, scratchpad memories (SPMs) have been proposed. Scratchpad memory generally refers to a class of high-speed local memory typically used for temporary storage of data during application execution. Like caches, scratchpad memories comprise small, fast SRAM. The main difference is that SPMs are directly and explicitly managed at the software level, either by the developer or by the compiler, whereas caches require extra dedicated circuits. Hence, SPMs are software-controlled on-chip memories and do not include additional hardware logic for managing their content. Compared to a cache, an SPM requires up to 40% less energy and 34% less area. Additionally, the cost of an SPM is lower and its software management makes it more predictable, which is a desirable feature for real-time systems.
Scratchpad memories are commonly encountered in processors in embedded systems as an alternative or supplement to caches (e.g. in Nvidia GPUs). Data are commonly transferred between scratchpad locations and main memory using direct memory access (DMA) instructions, in contrast to being copied, as in the hardware coherence strategies of most caches. Only the data is kept while tag arrays and block-wise transfers are removed. It is up to the user or operating system to decide which data should be placed in the SPM and when they are to be transferred.
Data management at cache level is traditionally called the stack. The stack contains small data which is frequently exchanged with the processor. The stack is mainly required to enable function or procedure calls (and nesting of calls). Register spilling (i.e. copying register values to the stack) is also one of the roles of the stack.
Recently, a wide variety of approaches for software data management of the stack in a scratchpad memory complementary to the data cache has been proposed. The exploration space can be categorized according to five criteria: granularity, the amount of stack in SPM, the placement decision, stack migration and hardware support. These options are described in more detail below.
Various levels of granularity are possible.
Every local variable may be allocated in the SPM or main memory.
Stack frames are somehow partitioned (not at the variable level) and each part may be independently allocated to the SPM.
An allocation per stack frame is performed. At a given time, one stack frame is either in the SPM or in the main memory.
Allocation is done per fixed slot (a page, for example). One slot may contain more than a stack frame. A stack frame can be in more than one slot.
An allocation decision is taken on several stack frames at a time. The complete set is either in the SPM or in the main memory.
The second criterion relates to the amount of stack in the SPM. In one option, 100% of the stack accesses are to the SPM; the current stack frame always resides in the SPM. Alternatively, some stack frames can never be in the SPM.
The stack placement decision can be fully static, whereby the analysis and decisions are taken at compile time and nothing is left to run time. Alternatively, the placement decision can be fully dynamic, whereby both the analysis and the actual placement decision are performed at run time. As a third option, a hybrid scheme could be implemented where most of the analysis is done at compile or design time (i.e. any phase before execution), but the actual placement decision (if any) is taken at run time using both design-time and run-time information.
Stack migration can either be allowed or not. If it is not allowed, an allocation unit is not copied back to the main memory, once it is placed in the SPM. In case stack migration is allowed, a stack frame (e.g. a stack frame of a parent function) can be created in the SPM and later copied to the main memory to create room for other stack frames (e.g. stack frames of the child functions). Later, when coming back to the parent function, the stack frame could be transferred back to SPM (or not).
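The stack migration option can be illustrated with a minimal sketch. The class MigratingStack, its oldest-frame-first eviction order and the choice to always transfer the parent frame back on return are assumptions made for this example; they represent only one point in the exploration space described above.

```python
class MigratingStack:
    """Stack frames live in the SPM and may migrate to main memory."""

    def __init__(self, spm_capacity):
        self.spm_capacity = spm_capacity
        self.frames = []   # [name, size, location], bottom to top
        self.copies = 0    # words moved between SPM and main memory

    def spm_used(self):
        return sum(size for _, size, loc in self.frames if loc == "SPM")

    def locations(self):
        return [loc for _, _, loc in self.frames]

    def _make_room(self, needed):
        # Assumed policy: evict the oldest SPM-resident frames first.
        for frame in self.frames:
            if self.spm_used() + needed <= self.spm_capacity:
                break
            if frame[2] == "SPM":
                frame[2] = "MAIN"
                self.copies += frame[1]

    def call(self, name, size):
        if size > self.spm_capacity:
            raise ValueError("frame larger than the SPM")
        self._make_room(size)
        self.frames.append([name, size, "SPM"])

    def ret(self):
        self.frames.pop()
        if self.frames and self.frames[-1][2] == "MAIN":
            parent = self.frames[-1]
            self._make_room(parent[1])
            parent[2] = "SPM"              # transfer the parent back
            self.copies += parent[1]
```

The copies counter makes the cost of migration visible: every eviction and every transfer back moves a whole frame between the SPM and the main memory.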
Finally, in terms of hardware support, a pure software approach is an option, whereby at compile time, code is inserted/linked which enforces the placement decisions, so that hardware support may not be required. Another option is that no code (source or binary) modification is performed, nor libraries linked. This can be middleware enabled (the operating system or similar interact with custom hardware to decide/enforce decisions) or purely hardware. In a hybrid solution, part of the code inserted/linked may rely on specific hardware (from DMA to some other specialized devices).
U.S. Pat. No. 8,996,765 B2 relates to the management of workload memory allocation. A workload manager identifies a primary and a secondary memory associated with a platform. The secondary memory has performance metrics different from the performance metrics of the primary memory. A workload controller identifies access metrics associated with a set of data elements invoked by a workload during execution of the platform. A data element performance calculator prioritizes a list of the data elements based on the access metrics associated with the corresponding data elements. A memory manager reallocates a first data element of the set from the primary memory to the secondary memory based on the priority of that first data element.
U.S. Pat. No. 9,015,689 B2 discloses stack data management for software-managed multicore processors. Stack data management calls are inserted into software in accordance with an integer linear programming formulation and a smart stack data management heuristic. The calls may be inserted in an automated fashion by a compiler utilizing an optimized stack data management runtime library.
In “A novel technique to use scratchpad memory for stack management” (Soyoung Park et al, DATE 2007, pp. 1478-1483), the authors propose a circular-buffer management of the stack in the SPM that is entirely hardware controlled, using the Memory Management Unit (MMU). The stack virtual space is split into pages. The stack frame holding the top of the stack is always mapped to the SPM. Pages above the SPM virtual area are mapped as invalid, such that an exception is raised when the program tries to read from or write to them. In the exception handler, backup copies (of frames from the SPM to main memory) may be made to create room for the required stack variables. This technique has a granularity whereby the allocation decision is taken on several stack frames at a time. All stack accesses go to the SPM, stack migration is allowed, and no code is modified and no libraries are linked. The solution is entirely in hardware. The handling of pointer-to-stack problems is transparent (the virtual address never changes). All stack frames are allocated to the SPM, although this is likely not optimal for the first levels of the call graph, as the traffic between the main memory and the SPM increases due to copies. The size of the slot may be limited by the minimal virtual memory page size of the architecture. The authors state that 1 kByte slots are used for the stack by using 1 kByte pages for the stack region. This is not possible in ARM processors without (significant) MMU modifications.
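The page-based circular buffer idea can be modelled in a few lines. This is a behavioural sketch under stated assumptions, not the implementation of the cited paper: the class PagedStackSPM, the modulo mapping of virtual pages to SPM slots and the counters are hypothetical, and the MMU exception is modelled as a plain method call.

```python
class PagedStackSPM:
    """SPM holding num_slots stack pages, used as a circular buffer."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.slot_owner = [None] * num_slots  # virtual page in each slot
        self.backing = set()    # virtual pages saved in main memory
        self.faults = 0
        self.page_copies = 0    # pages moved between SPM and main memory

    def access(self, vpage):
        """Access a stack virtual page; return the SPM slot that holds it."""
        slot = vpage % self.num_slots         # assumed circular mapping
        if self.slot_owner[slot] != vpage:    # page is mapped as invalid
            self._fault(vpage, slot)
        return slot

    def _fault(self, vpage, slot):
        """Model of the exception handler: back up, restore, remap."""
        self.faults += 1
        victim = self.slot_owner[slot]
        if victim is not None:
            self.backing.add(victim)          # backup copy to main memory
            self.page_copies += 1
        if vpage in self.backing:
            self.backing.remove(vpage)        # restore an earlier backup
            self.page_copies += 1
        self.slot_owner[slot] = vpage
```

Because the program always uses the same virtual page numbers, pointers into the stack remain valid regardless of where a page physically resides, which corresponds to the transparent handling of pointer-to-stack problems noted above.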
In the paper “Implementation of Stack Data Placement and Run Time Management Using a ScratchPad Memory for Energy Consumption Reduction of Embedded Applications” (Lovic Gauthier et al, IEICE Transactions 94-A(12), pp. 2597-2608, 2011), a compiler-controlled strategy is adopted to place certain stack frames (or parts of them) in a scratchpad memory complementary to the data cache. An Integer Linear Programming (ILP) formulation is developed to decide which frames (or parts thereof) are to reside in the SPM. A given stack frame may reside in the SPM for certain invocations and elsewhere in the memory organization for others. The allocation of the stack frames is controlled by management code inserted before and after the function call. This approach comes with a performance penalty due to the execution of the inserted management code. Furthermore, there is almost no energy gain from moving stack frames at run time (compared with a fixed stack allocation).
A research group at the University of Maryland has published several papers on scratchpad exploitation. In “An optimal memory allocation scheme for scratch-pad based embedded systems” (O. Avissar et al., ACM Trans. Embedded Comput. Syst. 1(1), pp. 6-26, 2002), the placement of global and stack variables in the SPM is performed based on their frequency-per-byte (FPB), obtained by source code profiling. A distributed stack with two explicit stack pointers (one for the main memory and the other for the SPM) is maintained. The paper “Dynamic allocation for scratchpad memory using compile-time decisions” (S. Udayakumaran et al., ACM Trans. Embedded Comput. Syst. 5(2), pp. 472-511, 2006) addresses the placement of global variables, stack variables and code into the SPM. The program is divided into regions (namely functions, loops and if conditions) and potential transfers are included at the entry and exit points of the regions. Program profiling is used to gather variable usage information per region. The SPM contents can only change at the boundary between two regions (they remain constant during the execution of a region). The approaches of these two papers are very flexible (with variable granularity); however, they require a compiler.
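A placement driven by frequency-per-byte can be illustrated with a simple greedy sketch. This is a simplification for illustration and not necessarily the formulation used in the cited papers; the function place_by_fpb and the profile data in the usage example are hypothetical.

```python
def place_by_fpb(variables, spm_size):
    """Greedy placement by frequency-per-byte (FPB).

    variables maps a name to (access_count, size_in_bytes), as could be
    obtained by profiling. Returns a dict mapping name to "SPM" or "MAIN".
    """
    ranked = sorted(
        variables.items(),
        key=lambda item: item[1][0] / item[1][1],  # accesses per byte
        reverse=True,
    )
    placement = {}
    free = spm_size
    for name, (_accesses, size) in ranked:
        if size <= free:
            placement[name] = "SPM"
            free -= size
        else:
            placement[name] = "MAIN"
    return placement


# Hypothetical profile: name -> (access count, size in bytes)
profile = {"buf": (1000, 256), "i": (5000, 4), "tbl": (3000, 64), "tmp": (10, 8)}
result = place_by_fpb(profile, spm_size=72)
```

With this profile, the small and frequently accessed variables i and tbl are placed in the SPM, while buf and tmp remain in the main memory.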
Apart from stack data management, there is heap data management. Heap objects are allocated in programs by dynamic memory allocation routines, such as malloc in C and new in Java. They are often used to store dynamic data structures such as linked lists, trees and graphs. Many compiler techniques for heap analysis group all heap objects allocated at a single site into a single heap ‘variable’. Additional techniques such as shape analysis aim to identify logical heap structures, such as trees. Finally, in languages with pointers, pointer analysis is able to find all possible heap variables that a particular memory reference can access. Heap data is in general difficult to allocate in scratchpad memory. Heap variables usually have an unknown size at compile time, which makes it difficult to guarantee at compile time that they will fit into the scratchpad memory. Further, moving data at run time (as is required for any dynamic allocation to scratchpad) usually leads to the invalid pointer problem if the moved data is a heap object. Static methods avoid this problem, but lack the benefits of dynamic methods.
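The invalid pointer problem can be made concrete with a toy memory model. The class Memory, the base addresses and the function demo are hypothetical and serve only to illustrate the effect; a pointer is modelled as a plain integer address.

```python
MAIN_BASE = 0x8000_0000   # assumed main memory base address
SPM_BASE = 0x1000_0000    # assumed scratchpad base address


class Memory:
    """Toy address space: maps an address to the object stored there."""

    def __init__(self):
        self.cells = {}

    def store(self, addr, value):
        self.cells[addr] = value

    def load(self, addr):
        # None models undefined contents at an address that holds nothing.
        return self.cells.get(addr)

    def move(self, old_addr, new_addr):
        """Relocate an object, e.g. from main memory into the SPM."""
        self.cells[new_addr] = self.cells.pop(old_addr)


def demo():
    mem = Memory()
    node_addr = MAIN_BASE + 0x40
    mem.store(node_addr, "list node")
    saved_pointer = node_addr          # pointer taken before the move
    new_addr = SPM_BASE + 0x10
    mem.move(node_addr, new_addr)      # dynamic allocation to the SPM
    # The saved pointer still holds the old address and is now invalid.
    return mem.load(saved_pointer), mem.load(new_addr)
```

In this sketch demo() returns (None, "list node"): the object is reachable at its new address, but every pointer that was taken before the move no longer refers to it.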
The paper “Heap data allocation to scratch-pad memory in embedded systems” (Dominguez et al., J. Embedded Computing, Vol. 1, Issue 4, December 2005, pp. 521-540) discusses compile-time methods for allocating heap data to SPM. The proposed approach has similarities with the authors' compile-time method for placing global and stack data in the SPM. It allows for dynamic movement of heap data into and out of the SPM to better follow the program's behavior. Also, it does not need any additional instructions for address translation per memory access and it avoids extra tags. Source code information is needed. The program is partitioned into regions (based on loops, start/end of procedures etc.) and an analysis is then performed to find the time order of the regions. The compiler is used to insert code that copies portions of the heap into the SPM at the start of each region. The size of the copied portion (and the variables that are copied) is determined by a cost model and by profiling information about the frequency of accesses per region.
The paper “Adaptive Scratchpad Memory Management for Dynamic Behavior of Multimedia Applications” (Cho et al, IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 28, issue 4, pp. 554-567, 2009) tackles the issue of data reusability for applications. It is based on hardware-software cooperation. Profiling is performed to find the most heavily used addresses. The hardware component is a data access record table (DART) that records the run-time memory access history in order to support run-time decisions concerning which regions of a data block to place in the SPM. These memory locations (WML) are placed in the SPM. Different data layouts are created based on the different input sets and a layout is selected. During run time the selected layout can change thanks to the hardware component (DART). The analysis to extract the layout is made more complex by the calculation of iteration vectors based on the loop iteration numbers at which the regions are accessed.
Hence, there is a need for an energy efficient on-chip memory hierarchy for a system-in-package allowing flexible data allocation across the memory hierarchy.