1. Technical Field
The disclosure relates to memory allocation algorithms for embedded systems. In particular, the disclosure relates to a compiler-driven dynamic memory allocation methodology for scratch-pad based embedded systems.
2. Description of the Prior Art
In both desktops and embedded systems many different kinds of memory are available, such as SRAM, DRAM, ROM, EPROM, Flash, non-volatile RAM and so on. Among the writeable memories, namely SRAM and DRAM, SRAM is fast but expensive, while DRAM is slower (often by a factor of 10 or more) but less expensive (by a factor of 20 or more). To combine their advantages, the usual approach is to use a large amount of DRAM to build capacity at low expense and then, to speed up the program, to add a small amount of SRAM to store frequently used data. Using SRAM is critical to performance; for example, adding a small SRAM typically lowers runtime by an average of 2.5× in a typical embedded configuration as compared with using DRAM only. This gain from SRAM is likely to increase, since the speed of SRAM is increasing by 60% a year versus only 7% a year for DRAM.
In desktops, the usual approach to adding SRAM is to configure it as a hardware cache. The caching mechanism stores a subset of the frequently used memory in the cache. Caches have been a big success for desktops, a trend that is likely to continue in the foreseeable future. The alternative of using the SRAM as a scratch-pad under software control is not a serious competitor there.
For embedded systems, however, the overhead of caches comes with a more serious price. Caches incur a significant penalty in aspects like area cost, energy, hit latency and real-time guarantees. All these criteria, other than hit latency, are more important for embedded systems than desktops. Embedded computing systems refer to devices other than dedicated computers having computing processors, such as communication devices, consumer electronics, game machines, toys, industrial control systems, transportation systems, military equipment and health-care equipment.
A detailed recent study compares the tradeoffs of a cache and a scratch-pad: R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel, “Scratchpad Memory: A Design Alternative for Cache On-chip Memory in Embedded Systems,” Tenth International Symposium on Hardware/Software Codesign (CODES), Estes Park, Colo., May 6-8, 2002, ACM. The results are as follows: a scratch-pad memory has 34% smaller area and 40% lower power consumption than a cache memory of the same capacity. Scratch-pad memories are also often called tightly-coupled memories (TCMs) or static random-access memories (SRAMs).
The above-mentioned savings in area and power consumption are significant, since the on-chip cache typically consumes 25-50% of the processor's area and energy consumption, a fraction that is increasing with time. Even more surprisingly, the runtime in cycles measured by Banakar et al. was 18% better with a scratch-pad using a simple static knapsack-based allocation algorithm than with a cache. Thus, defying conventional wisdom, Banakar et al. found no advantage to using a cache, even in high-end embedded systems in which performance is important. With the superior dynamic allocation schemes proposed here, the runtime improvement will be significantly larger. Given the power, cost, performance and real-time advantages of a scratch-pad, and no advantages of a cache, it is expected that systems without caches will continue to dominate embedded systems in the future. Therefore, a need exists for an effective solution for scratch-pad based embedded systems.
Although many scratch-pad based embedded processors exist, utilizing them effectively has been a challenge. Central to the effectiveness of caches is their ability to maintain in fast memory, at each time during program execution, the subset of data that is frequently used at that time. The contents of the cache constantly change during runtime to reflect the changing working set of data across time. Unfortunately, both existing approaches for scratch-pad allocation, namely program annotations and compiler-driven approaches, are static data allocations. In other words, they are incapable of changing the contents of the scratch-pad at runtime. This problem is a serious limitation for existing approaches.
As an example, consider the following: let a program consist of three successive loops, the first of which makes repeated references to array A; the second to B; and the third to C. If only one of the three arrays can fit within the SRAM, any static allocation incurs DRAM accesses for two of the three arrays. In contrast, a dynamic strategy can place each of the three arrays in SRAM at a different time. Although this example is oversimplified, it intuitively illustrates the benefits of dynamic allocation.
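The three-loop example can be made concrete with a rough count of DRAM accesses. The following is only an illustrative sketch, not the disclosed allocation method: the array size, the number of references per loop, and the assumption that a dynamic scheme copies each array into SRAM before its loop and back to DRAM afterwards are all hypothetical values chosen for the illustration.

```python
# Hypothetical parameters, chosen only for illustration.
ARRAY_WORDS = 100      # size of each of the arrays A, B and C, in words
REFS_PER_LOOP = 1000   # references each loop makes to its own array
LOOPS = ["A", "B", "C"]  # the array referenced by each successive loop


def static_dram_accesses(resident):
    """DRAM accesses when one array stays in SRAM for the whole program."""
    # Every reference to a non-resident array goes to DRAM.
    return sum(REFS_PER_LOOP for array in LOOPS if array != resident)


def dynamic_dram_accesses():
    """DRAM accesses when each array is moved into SRAM for its own loop."""
    total = 0
    for _array in LOOPS:
        total += ARRAY_WORDS  # copy the array from DRAM into SRAM
        total += ARRAY_WORDS  # copy it back (pessimistic: assume it was modified)
    return total


if __name__ == "__main__":
    print("static :", static_dram_accesses("A"))
    print("dynamic:", dynamic_dram_accesses())
```

Under these assumed numbers the static placement pays for every reference in two of the loops, while the dynamic placement pays only the copy cost, which is independent of how often each loop touches its array.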
Attempts so far to capture dynamic behavior in scratch-pad based systems have focused on algorithms for software caching. See, for example, G. Hallnor and S. K. Reinhardt. “A Fully Associative Software-managed Cache Design,” Proc. of the 27th Int'l Symp. on Computer Architecture (ISCA), Vancouver, British Columbia, Canada, Jun. 2000; and Csaba Andras Moritz, Matthew Frank, and Saman Amarasinghe, “FlexCache: A Framework for Flexible Compiler Generated Data Caching,” The 2nd Workshop on Intelligent Memory Systems, Boston, Mass., Nov. 12, 2000.
This class of methods involving software caching emulates the behavior of a hardware cache in software. In particular, a tag consisting of the high-order bits of the address is stored along with each cache line. Before each load/store, additional instructions are inserted by the compiler to mask out the high-order bits of the address, access the tag, compare the tag with the high-order bits, and then branch conditionally to hit or miss code. Some methods are able to reduce the number of such inserted overhead instructions, but much of the overhead remains, especially for non-scientific programs. The inserted code adds significant overhead, including (i) additional run-time; (ii) larger code size, increasing dollar cost; (iii) larger data size from tags, also increasing cost; (iv) higher power consumption; and (v) memory latency that is just as unpredictable as that of hardware caches.
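The per-access tag check described above can be sketched as follows. This is a minimal direct-mapped illustration written for this discussion; it is not the design of either cited paper, and the class name, line size and line count are assumptions. Each call to `load` performs the mask, tag lookup, compare and conditional branch that a software-caching compiler would insert before every memory access.

```python
LINE_BITS = 4   # assumed 16-byte cache lines
INDEX_BITS = 3  # assumed 8 lines in the software-managed SRAM


class SoftwareCache:
    """Hypothetical direct-mapped software cache over a bytes-like 'DRAM'."""

    def __init__(self, backing):
        self.backing = backing
        self.tags = [None] * (1 << INDEX_BITS)   # one stored tag per line
        self.lines = [None] * (1 << INDEX_BITS)  # cached line contents
        self.hits = 0
        self.misses = 0

    def load(self, addr):
        # Inserted overhead: split the address into offset, index and tag.
        offset = addr & ((1 << LINE_BITS) - 1)
        index = (addr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)
        tag = addr >> (LINE_BITS + INDEX_BITS)  # high-order bits
        # Inserted overhead: fetch the stored tag, compare, and branch.
        if self.tags[index] == tag:
            self.hits += 1
        else:
            # Miss code: fetch the whole line from DRAM and record its tag.
            self.misses += 1
            base = addr - offset
            self.lines[index] = self.backing[base:base + (1 << LINE_BITS)]
            self.tags[index] = tag
        return self.lines[index][offset]


if __name__ == "__main__":
    cache = SoftwareCache(bytes(range(256)))
    for address in (0, 5, 128, 0):
        cache.load(address)
    print(cache.hits, cache.misses)
```

Even on a hit, every access executes the address-splitting and comparison steps, which is the source of the run-time and code-size overhead listed above; the extra `tags` array corresponds to the added data size.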
Some software caching schemes use dynamic compilation. The improvements from these schemes are small, but more importantly, in dynamic compilation the program resides in RAM and is changed at runtime. In most embedded systems, however, the program is in fixed-size and unchangeable ROM, so dynamic compilation schemes cannot be used. Accordingly, a need exists for alternative approaches which have low overhead, avoid dynamic compilation, and overcome the above-mentioned disadvantages and shortcomings.
A paper published in 2001 [M. Kandemir, J. Ramanujam, M. J. Irwin, N. Vijaykrishnan, I. Kadayif, and A. Parikh, “Dynamic Management of Scratch-Pad Memory Space,” Design Automation Conference, pages 690-695, 2001] describes a methodology for moving data back and forth between DRAM and scratch-pad. The methodology applies only to global and stack array variables, with the following three additional restrictions. (i) The programs should primarily access arrays through affine (linear) functions of enclosing loop induction variables. (ii) The loops must be well-structured and must not have any other control flow, such as if-else, break and continue statements. (iii) The codes must contain these constructs in a clean form, without the hand-optimizations often found in such codes, such as common sub-expression elimination and array accesses through pointer indirection, since with these features the needed affine analysis cannot succeed. Combining these three restrictions, the methodology described by Kandemir et al. applies only to well-structured scientific and multimedia codes. Unfortunately, most programs in embedded systems, including many of those in the control, automotive, network, communication and even DSP domains, do not fit within these strict restrictions. It has been observed that even many regular array-based codes in embedded systems violate the above restrictions, especially (ii) and (iii).
Hence, a need exists for a compiler-driven dynamic memory allocation methodology for scratch-pad based embedded systems which applies to global and stack variables, and is totally general, thus allowing codes with all kinds of accesses to variables, pointers and irregular control flow.
The methodology described by Kandemir et al. considers each loop nest independently. This has several consequences. One is that the methodology is only locally optimized for each loop. Another is that the methodology makes the entire scratch-pad available to each loop nest. As a result, the methodology does not exploit reuse across structures such as loops: a variable which could be retained in SRAM from one loop nest to the next is unnecessarily transferred between SRAM and DRAM. Accordingly, a need exists for a compiler-driven dynamic memory allocation methodology for scratch-pad based embedded systems which provides a whole-program analysis across all control structures and does not consider each loop nest independently. Such a methodology would be globally optimized for the entire program, and not locally optimized for each loop.
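The cost of treating each loop nest independently can be illustrated with a simple transfer count. The sketch below is a hypothetical model written for this discussion, not the method of Kandemir et al. nor the disclosed method: it assumes the scratch-pad is large enough for every array a loop nest uses, that every array is written back when it leaves SRAM, and that the function and parameter names are illustrative only.

```python
def transfer_words(loop_nests, sizes, retain_shared):
    """Count words moved between DRAM and SRAM over successive loop nests.

    loop_nests    -- list of sets; each set names the arrays one nest uses
    sizes         -- dict mapping array name to its size in words
    retain_shared -- if True, arrays also used by the next nest stay in SRAM
    """
    resident = set()  # arrays currently held in the scratch-pad
    moved = 0
    for i, used in enumerate(loop_nests):
        # Copy in only the arrays that are not already resident.
        for name in used - resident:
            moved += sizes[name]
        resident |= used
        # Decide which arrays survive into the next loop nest.
        is_last = i + 1 == len(loop_nests)
        keep = loop_nests[i + 1] if retain_shared and not is_last else set()
        # Write back everything that is not kept (assumed modified).
        for name in resident - keep:
            moved += sizes[name]
        resident &= keep
    return moved


if __name__ == "__main__":
    nests = [{"A", "B"}, {"A", "C"}]        # array A is shared by both nests
    sizes = {"A": 100, "B": 50, "C": 50}
    print(transfer_words(nests, sizes, retain_shared=False))
    print(transfer_words(nests, sizes, retain_shared=True))
```

In this toy case the saving from retention equals one write-back plus one reload of the shared array A, which is the redundant traffic a per-loop-nest scheme incurs.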
Based on the disadvantages and shortcomings of the prior art, a need also exists for a compiler-driven dynamic memory allocation methodology for scratch-pad based embedded systems which might choose to make available the entire scratch-pad for each loop nest, but which is not constrained to do so. Finally, a need exists for a compiler-driven dynamic memory allocation algorithm for scratch-pad based embedded systems which may choose to use part of the scratch-pad for data that is shared between successive control constructs, thus saving on transfer time to DRAM.