1. Technological Field
The present invention relates to techniques for deploying memory technologies in processor architectures to reduce leakage and dynamic energy consumption. More specifically, the present invention relates to the use of non-volatile memories in processor architectures to reduce the total leakage and dynamic energy, while meeting stringent performance requirements.
2. Description of the Related Technology
Modern processor architectures typically have at least two caches and a local memory (e.g. a scratch pad memory): an instruction cache to speed up executable instruction fetch, a data cache to speed up data fetch and store, and optionally a translation lookaside buffer (TLB) to speed up virtual-to-physical address translation for both executable instructions and data. The data cache is usually organized as a hierarchy of several cache levels (L1, L2, etc.).
The L1 data memory (L1D) in today's processors is based on SRAM, which is energy-inefficient from both a dynamic and a leakage energy perspective. For register-based implementations the challenge is even bigger. The active leakage contribution is especially an issue, because standby leakage can be largely mitigated by recent state-of-the-art techniques (‘localized’ soft or hard power gating approaches are promising solutions for the future). The L1D layer has to supply data at the processor clock speed (or at most 2× slower), i.e. within a cycle time of, for example, around 1 ns. This holds for both read and write operations. Moreover, sensitivity to substrate (e.g. silicon) area is still present, even though the memory sizes are quite limited. Given the inefficiency of SRAM, a further area reduction would be welcome.
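By way of illustration only, the timing constraint stated above can be expressed as a small calculation. The following is a minimal sketch: the function names and the candidate latencies in the usage note are hypothetical, and only the 1 ns cycle time (1 GHz) and the factor of 2 are taken from the text.

```python
def access_budget_ns(clock_ghz, max_slowdown=2.0):
    """Return (clock period, relaxed budget) in nanoseconds.

    clock_ghz    -- processor clock frequency in GHz
    max_slowdown -- how much slower than the clock the L1D may be
                    (the text allows at most 2x)
    """
    period_ns = 1.0 / clock_ghz
    return period_ns, period_ns * max_slowdown


def meets_budget(latency_ns, clock_ghz, max_slowdown=2.0):
    """True if a read or write latency fits the relaxed L1D budget."""
    _, relaxed_ns = access_budget_ns(clock_ghz, max_slowdown)
    return latency_ns <= relaxed_ns
```

For a 1 GHz clock, `access_budget_ns(1.0)` gives a 1 ns period and a 2 ns relaxed budget. A hypothetical memory with a 1.5 ns access would fit that budget, whereas one with a 3 ns access would not. The same check applies to reads and writes alike.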
To date, all industrial and practically realizable compute platforms have used SRAMs or register-based L1D memories. A feasible non-volatile memory solution, from an integrated technology-circuit-architecture-mapping point of view, in which SRAM is avoided for all vector data read and write operations has not been published. Academic work has focused on a partial replacement of SRAM only, as summarized below.
Jingtong Hu, et al. provide in “Towards Energy Efficient Hybrid On-chip Scratch Pad Memory with Non-Volatile Memory” (DATE conference 2011) a solution to the leakage energy consumption problems in scratch pad memories. In this publication, a novel hybrid scratch pad memory is proposed which consists of both non-volatile memories and SRAM. This solution takes advantage of the low leakage power and the high density of non-volatile memories and the energy efficient writes of SRAM. Apart from that, an optimal dynamic data management algorithm is proposed to realize the full potential of both the SRAM and the non-volatile memories.
The above-mentioned document by J. Hu et al. provides a technique to reduce the leakage energy consumption in memories. However, it does not provide a solution whereby all SRAM memory accesses for vector data (i.e. all memory accesses which are nested loop related) can be replaced by non-volatile memory accesses to remove the leakage energy problem of SRAM. Moreover, the hybrid solution requires area for a combination of SRAM and non-volatile memory (NVM). To date, the instruction background memory at the intermediate storage level (L1I) has always been implemented as SRAM, because the speed of NVM is not sufficient. Only for the (off-chip) program memory is a flash device typically selected.
A particularly interesting application field in which the above-mentioned issues are relevant relates to low power embedded systems for wireless/multimedia target applications. Embedded memories have been increasingly dominating System on Chip (SoC) designs in terms of chip area, performance, power consumption, and manufacturing yield. In many of the commercially available embedded systems today, the Instruction Memory Organization (IMO) consists of two levels: L1I and L0I. The L1I memory is larger than the L0I (about 8 to 16 times) and the L0I is closer to the data-path. The L0I is commonly implemented as a loop buffer/loop cache, as embedded instruction memories for low power wireless or multimedia applications typically hold loop-dominated code.
When envisaging wireless/multimedia target applications, the use of Coarse Grained Reconfigurable Architectures (CGRAs) is appealing. CGRAs exploit the data flow dominance and offer more parallel resources. These architectures usually include a general purpose processor (either RISC based or VLIW) along with a reconfigurable array of cells which speeds up data flow based computations significantly. Programming the cell matrix requires specific memory organizations that efficiently enforce compiler decisions for every cell. This usually implies reading/writing very wide words from memory.
The paper ‘Energy Efficient Many-core Processor for Recognition and Mining using Spin-based Memory’ (R. Venkatesan et al., IEEE Int'l Symp. on Nanoscale Architectures, June 2011, pp. 122-128) describes a specific processor whose cache memory consists entirely of non-volatile memory. The use of Spin Transfer Torque Magnetic RAM (STT-MRAM) is proposed for one of the L2 layer levels and Domain Wall Memory (DWM), a streaming access memory, for the L1 cache level. The latter memory requires additional shift operations to enable sharing of the read and write ports among multiple domains. However, for wireless/multimedia applications, such a memory organization is not efficient.
In ‘Relaxing Non-Volatility for Fast and Energy-Efficient STT-RAM Caches’ (Smullen et al., IEEE Int'l Symp. on HPCA, February 2011, pp. 50-61) a design is described that uses only non-volatile memory (NVM) for the cache memory. The NVM is STT-RAM. For optimal performance the properties of the STT-RAM are tuned, in particular by relaxing the non-volatility. A refresh policy may then be needed to retain the stored data. For wireless/multimedia applications, however, such a refresh policy would be detrimental.
The paper ‘Resistive Computation: Avoiding the Power Wall with Low-Leakage, STT-MRAM Based Computing’ (Xiaochen Guo et al.) presents a processor architecture in which most of the functionality is migrated from CMOS to STT-MRAM. Among others, the L1I cache and the L1D cache are replaced by STT-MRAM. The authors claim there are no write endurance problems with STT-MRAM. For the SRAM replacement, the write latency is assumed to be mitigated by a purely hardware-based solution that requires extra read and compare operations whenever a write happens. Such latency is not allowable for the applications envisaged in the present invention.
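The effect of performing extra read and compare operations on every write can be illustrated with a simple behavioral model. The following is a minimal sketch only and is not the circuit of the cited paper: the function name, the list-based memory and the read and write latencies (1 ns and 5 ns) are hypothetical values chosen for illustration.

```python
def write_with_read_compare(memory, addr, new_word, read_ns=1.0, write_ns=5.0):
    """Model one write that first reads and compares the stored word.

    Returns (bits_changed, latency_ns). The read and compare step is
    always paid; the slow write is paid only if at least one bit differs.
    read_ns and write_ns are assumed, illustrative latencies.
    """
    old_word = memory[addr]              # extra read before the write
    changed = old_word ^ new_word        # compare: 1-bits mark differing cells
    bits_changed = bin(changed).count("1")

    latency_ns = read_ns                 # read + compare overhead
    if bits_changed:
        memory[addr] = new_word          # write only when something changed
        latency_ns += write_ns
    return bits_changed, latency_ns
```

With these assumed latencies, writing an unchanged word costs only the 1 ns read, but any write that changes a bit costs 6 ns, which is longer than the write alone. This is the kind of added latency that the passage above identifies as problematic for a memory that must respond within about one clock cycle.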
U.S. Patent Publication No. US2010/0095057 discloses a non-volatile resistive sense memory on-chip cache. However, the document only describes the use of such memory for L2 or L3 cache. The L1 cache memory is not replaced by non-volatile memory.
The issues of leakage and dynamic energy consumption are of particular importance in, for example, energy-sensitive applications that have a high performance requirement (necessitating high clock speeds, e.g. around 1 GHz, in combination with so-called data-parallel processor solutions) and that are cost sensitive (area overhead is relevant). Application behavior also determines the leakage in SRAM. The leakage depends on the 0-1 sequence of the data and especially on how long the data needs to remain available. For example, some data needs to be kept only very temporarily, and the SRAM partitions that contain such data can then be powered down for part of the time to reduce the leakage.
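The dependence of leakage on data lifetime can be illustrated with a first-order estimate. The following is a minimal sketch under simplifying assumptions: leakage power is taken as constant while a partition is powered, power gating is treated as ideal (zero leakage and no wake-up cost when off), and the dependence on the 0-1 sequence of the data is not modeled. The function name and all numeric values are hypothetical.

```python
def leakage_energy_nj(leak_power_mw, lifetime_us, window_us, gated=True):
    """Estimate the leakage energy of one SRAM partition in nanojoules.

    leak_power_mw -- assumed constant leakage power while powered (mW)
    lifetime_us   -- how long the stored data must remain available (us)
    window_us     -- total observation window (us)
    gated         -- if True, the partition is powered down once the data
                     is no longer needed (ideal power gating assumed)
    """
    on_time_us = min(lifetime_us, window_us) if gated else window_us
    return leak_power_mw * on_time_us    # mW * us = nJ
```

For a partition with an assumed 2 mW leakage whose data is needed for 10 µs out of a 100 µs window, the estimate is 20 nJ with gating and 200 nJ without, which shows why short-lived data makes partial power-down attractive.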
Hence, there is a need for improving local (embedded) data and instruction memory structures with respect to leakage energy, while at the same time the dynamic energy consumption remains limited or is preferably even further reduced.